TREX Basics

April 11, 2001

J. David Eisenberg

Overview

In this article, we'll explore the TREX markup language for validating XML documents, focusing on validating a subset of the XMLNews-Story markup language. Although XMLNews-Story has been superseded by the News Industry Text Format, we use the old version because it's simple, it looks a great deal like HTML, and it lets us easily show some of TREX's features.

TREX's author, James Clark, says,

"A TREX pattern specifies a pattern for the structure and content of an XML document. A TREX pattern thus identifies a class of XML documents consisting of those documents that match the pattern. A TREX pattern is itself an XML document."

A TREX pattern's outer parts

A TREX pattern is enclosed within a <grammar> element, which contains a <start> element that describes the pattern. The root element of a news story is the <nitf> element, so that's where we begin our TREX specification.


<grammar>

   <start>

      <element name="nitf">

      <!-- specifications go here -->

      </element>

   </start>

</grammar>

The <grammar> and <start> elements are required only if you want to modularize your pattern by using definitions. (See below.) Since most non-trivial TREX patterns use definitions, we'll start using them right away.

Sub-Elements

A news story consists of a <head> element and a <body> element, in that order. The <head> contains a <title>, which has string data as its contents. All of these are required elements, and TREX specifies them as follows:


<grammar>

   <start>

      <element name="nitf">

         <element name="head">

            <element name="title">

               <anyString/>

            </element>

         </element>



         <element name="body">

            <!-- its specification goes here -->

         </element>

      </element>

   </start>

</grammar>

Multiple Occurrences of Elements

One of TREX's greatest strengths is the ease with which you can specify how many times an element or group of elements may occur. An <element> all by itself is required and must occur exactly once. To allow any other number of occurrences, wrap the pattern in one of the elements below.

Enclose elements in...    to specify...
<optional>                zero or one occurrence
<zeroOrMore>              zero or more occurrences
<oneOrMore>               one or more occurrences

We need this information to describe the <body> of a news story, as it starts with optional header information. This header information consists of a <body.head> element, which contains an optional <hedline> (yes, it's really spelled that way) and zero or more <byline> elements. Each of these has sub-elements, as shown below in the TREX pattern. This entire text would go between the <element name="body"> and its corresponding </element>.


<element name="body.head">

   <optional>

      <element name="hedline">

         <element name="hl1">

            <anyString/>

         </element>

      </element>

   </optional>



   <zeroOrMore>

      <element name="byline">

         <element name="bytag">

            <anyString/>

         </element>

      </element>

   </zeroOrMore>

</element>

The specification, as it currently stands, looks like this, and it will correctly validate this news story. (Both links open in a new window.)
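
As a concrete illustration, here is a minimal news story of the kind the pattern built so far should accept. The headline and byline text are invented for this example and are not taken from the article's sample file.

```xml
<?xml version="1.0"?>
<!-- Hypothetical sample story; all text content is made up. -->
<nitf>
   <head>
      <title>Town Council Approves New Library</title>
   </head>
   <body>
      <body.head>
         <!-- optional: may be omitted entirely -->
         <hedline>
            <hl1>Town Council Approves New Library</hl1>
         </hedline>
         <!-- zero or more bylines may appear here -->
         <byline>
            <bytag>By A. Reporter</bytag>
         </byline>
      </body.head>
   </body>
</nitf>
```

Removing the <hedline> or the <byline> should still leave a valid document. Removing <title> or <body.head> should not, because those elements are not wrapped in <optional> or <zeroOrMore>.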

Try It Yourself

You can download TREX and see that the document is, indeed, valid. If you're on a Windows system, you have an executable already available to you. If you're using Linux or Unix, download trex.jar, xp.jar, and sax2.jar, and use this shell script:


#!/bin/sh
# Usage: pass the TREX pattern file first, then the XML file to validate.
# Expects trex.jar, sax2.jar, and xp.jar in the current directory.

java -Dcom.thaiopensource.trex.util.Parser=com.jclark.xml.sax.Driver \
  -cp ./trex.jar:./sax2.jar:./xp.jar \
  com.thaiopensource.trex.util.Driver "$1" "$2"

Modularizing the Specification

As a grammar gets more complex, it makes less sense to have it all in one huge block. TREX lets you <define> a series of elements and then refer to the definitions by name with <ref>. We'll take the information for the body header of a news article and put it after the <start> element. Here's what the last part of the file now looks like:


    ...

         <element name="body">

            <ref name="body_header"/>

         </element>

      </element>

   </start>

   

    <define name="body_header">

        <element name="body.head">

           <optional>

               <element name="hedline">

                  <element name="hl1">

                     <anyString/>

                  </element>

               </element>

           </optional>

           <zeroOrMore>

              <element name="byline">

                 <element name="bytag">

                    <anyString/>

                 </element>

              </element>

           </zeroOrMore>

        </element>

    </define>

</grammar>

You can make recursive definitions. For example, we can say that a block_item is zero or more choices of a paragraph, <p>, the empty image element, <img>, or an unordered list <ul>. We use TREX's aptly named <choice> specifier in the pattern below.


<define name="block_item">

    <zeroOrMore>

        <choice>

            <element name="p"><anyString /></element>

            <element name="img"><empty/></element>

            <ref name="unordered_list"/>

        </choice>

    </zeroOrMore>

</define>

Note that TREX requires you to specify <empty/> explicitly for elements that have neither children nor attributes. Since we haven't learned about attributes yet, we've specified the image element as <empty/>.

Now we can define an unordered list, which can itself contain block items; i.e., paragraphs, images, and lists within lists.


<define name="unordered_list">

    <element name="ul">

        <oneOrMore>

            <element name="li">

               <ref name="block_item"/>

            </element>

        </oneOrMore>

    </element>

</define>
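
To see the recursion at work, consider the following fragment, which should match the unordered_list definition above. The item text is invented for illustration.

```xml
<!-- Hypothetical fragment; item text is made up. -->
<ul>
   <li><p>First item</p></li>
   <li>
      <p>Second item, followed by a nested list</p>
      <!-- a ul inside an li: block_item refers back to unordered_list -->
      <ul>
         <li><p>Nested item</p></li>
         <li><img/></li>
      </ul>
   </li>
</ul>
```

Because block_item allows only <p>, <img>, and <ul> elements, text placed directly inside an <li> without a surrounding <p> should not match this pattern.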

Mixed Content

Some information requires text that isn't between tags. For example, a news story <location> can look like


   The movie was filmed in

   <location>

    <city>Pekin</city>, a

    small city in

    <state>Illinois</state>

   </location>.

The text ", a small city in" above is inside the <location> element, but it isn't part of any sub-element. That makes <location> a mixed element, specified (in part) like


<define name="location_element">

    <element name="location">

        <mixed>

            <zeroOrMore>

                <choice>

                    <element name="city"><anyString/></element>

                    <element name="state"><anyString/></element>

                    <element name="region"><anyString/></element>

                </choice>

            </zeroOrMore>

        </mixed>

    </element>

</define>
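
The fragments below sketch a few other forms of <location> that this pattern should accept, since <mixed> permits text anywhere and the <choice> may repeat in any order or not occur at all. The place names are only examples.

```xml
<!-- Hypothetical fragments that the location_element pattern should accept. -->

<!-- text only, no child elements -->
<location>somewhere in the Midwest</location>

<!-- children in any order, with text between them -->
<location><state>Illinois</state>: <city>Pekin</city> and <city>Peoria</city></location>

<!-- empty content, since the choice may occur zero times -->
<location/>
```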

You can see the TREX and a sample XML document in new windows.

Of course, XML does not consist of elements alone, as we'll see in the next section.

Attributes

XML elements can have attributes, and TREX allows you to specify them in great detail. A news story, like HTML, can include an <img> tag which has a required src and optional align, width, and height attributes. The alignment can have only three possible values, so we specify them explicitly with the <string> element.

The width and height must be positive integers. Since TREX doesn't have any default type system, the current implementation of TREX reaches out to XML Schema and uses its type system. That means we need to specify a namespace when we write the pattern for an image element.


<element name="img" xmlns:xsd="http://www.w3.org/2000/10/XMLSchema">

    <attribute name="src">

       <anyString/>

    </attribute>

    <optional>

       <attribute name="align">

          <choice>

             <string>left</string>

             <string>center</string>

             <string>right</string>

          </choice>

       </attribute>

    </optional>

    <optional>

       <attribute name="width">

          <data type="xsd:positiveInteger"/>

       </attribute>

    </optional>

    <optional>

       <attribute name="height">

          <data type="xsd:positiveInteger"/>

       </attribute>

    </optional>

</element>

Notice that <optional> can be used with <attribute> to specify an optional attribute, just as it is used with <element> to specify an optional element. This uniform treatment of attributes and elements gives TREX the power to express complex grammars with a compact vocabulary. (For all the details, check out a complete TREX pattern that uses attributes and the XMLNews-Story document that uses an image.) In the TREX file, the xmlns:xsd specification has been placed in the outermost <grammar> tag so that it's available throughout the file.
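
As a rough check of the image pattern, the fragments below show which <img> elements it should accept and which it should reject. The file name is made up, and the results are what the pattern implies rather than output captured from a validator.

```xml
<!-- Hypothetical fragments checked against the img pattern above. -->

<!-- should be accepted: src present, optional attributes all well-formed -->
<img src="council.jpg" align="left" width="320" height="240"/>

<!-- should be accepted: only the required attribute -->
<img src="council.jpg"/>

<!-- should be rejected: no src attribute -->
<img align="center"/>

<!-- should be rejected: "top" is not one of left, center, right -->
<img src="council.jpg" align="top"/>

<!-- should be rejected: 0 is not a positive integer -->
<img src="council.jpg" width="0"/>
```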

Just as it was possible to create a reusable element specification, it's possible to create a set of attributes that can be reused by many tags. For example, both table body (<tbody>) and table header (<th>) elements have identical attributes for determining their horizontal and vertical alignment. This makes those attributes a perfect candidate for a definition,


<define name="alignment">

    <optional>

        <attribute name="align">

            <choice>

            <string>left</string>

            <string>center</string>

            <string>right</string>

            </choice>

        </attribute>

    </optional>

</define>

which may be used in different elements:


<element name="tbody">

    <ref name="alignment"/>

</element>



<element name="th">

    <anyString/>

    <ref name="alignment"/>

    <optional>

        <attribute name="rowspan">

           <data type="xsd:positiveInteger"/>

        </attribute>

    </optional>

    

    <optional>

        <attribute name="colspan">

           <data type="xsd:positiveInteger"/>

        </attribute>

    </optional>



</element>

Note that the <th> tag has attributes in addition to those included via the reference to the definition.

Merging Grammars

As with DTDs, TREX lets you write a grammar in one file and include it in another file. We could take the block_item definition that we wrote earlier and put it in a file named block_spec.trex:


<grammar>

    <define name="block_item">

        <zeroOrMore>

            <choice>

                <element name="p"><anyString /></element>

                <ref name="unordered_list"/>

                <ref name="location_element"/>

                <ref name="image_element"/>

            </choice>

        </zeroOrMore>

    </define>

</grammar>

Our main TREX file would use it like


<grammar>

   <include href="block_spec.trex"/>

   <start>

      <!-- remainder of specification -->

Now let's get a bit more advanced. Let's say that we want to use this block element specification to validate both XMLNews-Story and XHTML documents. The problem is that news stories have <location> and <copyrite> block elements and XHTML doesn't; XHTML has a <blockquote> element, but news stories don't. So, we'll modify our include file as follows.


<grammar>

    <define name="block_item">

        <zeroOrMore>

            <choice>

                <element name="p"><anyString /></element>

                <ref name="unordered_list"/>

                <ref name="image_element"/>

                <ref name="custom_elements"/>

            </choice>

        </zeroOrMore>

    </define>

    

    <define name="custom_elements">

       <notAllowed/>

    </define>

</grammar>

The <notAllowed/> element is a placeholder pattern that can never match anything. Inside a <choice> it simply contributes no alternative, so the include file used by itself would permit no custom elements at all. Our XMLNews-Story TREX pattern will replace this placeholder with the following pattern:


<define name="custom_elements" combine="replace">

    <choice>

        <element name="location">

            <mixed>

                <zeroOrMore>

                <choice>

                    <element name="city"><anyString/></element>

                    <element name="state"><anyString/></element>

                    <element name="region"><anyString/></element>

                </choice>

                </zeroOrMore>

            </mixed>

        </element>

        <element name="copyrite">

            <anyString/>

        </element>

    </choice>

</define>

A TREX pattern to validate a subset of XHTML would replace it as follows.


<define name="custom_elements" combine="replace">

    <element name="blockquote">

        <mixed>

            <ref name="block_item"/>

        </mixed>

    </element>

</define>
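
To show where such an override sits, here is a sketch of how a main pattern for the XHTML subset might be organized. The <html>/<body> structure and the simplified unordered_list and image_element definitions are assumptions made for this illustration; they are not copied from the article's actual pattern file.

```xml
<!-- Sketch only: root structure and helper definitions are assumptions. -->
<grammar>
   <include href="block_spec.trex"/>

   <start>
      <element name="html">
         <element name="body">
            <ref name="block_item"/>
         </element>
      </element>
   </start>

   <!-- overrides the notAllowed placeholder from the include file -->
   <define name="custom_elements" combine="replace">
      <element name="blockquote">
         <mixed>
            <ref name="block_item"/>
         </mixed>
      </element>
   </define>

   <!-- definitions the include file refers to but does not supply -->
   <define name="unordered_list">
      <element name="ul">
         <oneOrMore>
            <element name="li">
               <ref name="block_item"/>
            </element>
         </oneOrMore>
      </element>
   </define>

   <define name="image_element">
      <element name="img">
         <attribute name="src">
            <anyString/>
         </attribute>
      </element>
   </define>
</grammar>
```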

Please consult the include file, the TREX pattern for a subset of XMLNews-Story, a sample XMLNews story, the TREX pattern for a subset of XHTML, and a sample XHTML file for more details.

This include-and-override capability lets you develop a set of core patterns that can easily be modified for validating a wide variety of markup languages. Other options for the combine attribute are choice and group, which let you add to a definition without entirely replacing it.
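
As a hedged sketch of the choice option, the definition below could appear in a pattern that includes block_spec.trex. Assuming combine="choice" works as the TREX tutorial describes, it adds <copyrite> as one more alternative to whatever custom_elements already allows instead of discarding the existing definition.

```xml
<!-- Sketch: extend custom_elements with one more alternative
     rather than replacing the definition from the include file. -->
<define name="custom_elements" combine="choice">
   <element name="copyrite">
      <anyString/>
   </element>
</define>
```

Since the included definition is <notAllowed/>, which never matches, the combined result here should behave the same as a plain <copyrite> pattern. The difference from replace shows up only when the included definition already matches something.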

Another advanced feature of TREX is the <concur> element, which lets you verify that your XML satisfies all of a number of patterns.

Summary

TREX is a powerful markup language that permits you to specify how other XML documents are to be validated. As with other specification languages, you can

  • specify an element with an ordered sequence of sub-elements;
  • specify an element with a choice of sub-elements;
  • permit mixed content (text outside of tags); and
  • specify attributes for tags.

Advanced features of TREX allow you to combine externally-defined grammars in highly sophisticated ways. For more information, consult James Clark's extensive TREX tutorial or the formal specification.