TREX Basics

by J. David Eisenberg
April 11, 2001

Overview

In this article, we'll explore the TREX markup language for validating XML documents, focusing on validating a subset of XMLNews-Story Markup Language. Although the XMLNews-Story markup language has been superseded by the News Industry Text Format, we use the old version because it's simple, it looks a great deal like HTML, and it lets us easily show some of TREX's features.

TREX's author, James Clark, says,

"A TREX pattern specifies a pattern for the structure and content of an XML document. A TREX pattern thus identifies a class of XML documents consisting of those documents that match the pattern. A TREX pattern is itself an XML document."

A TREX pattern's outer parts

In its full form, a TREX pattern is enclosed within a <grammar> element, which contains a <start> element that describes the pattern. The root element of a news story is the <nitf> element, so that's where we begin our TREX specification.

<grammar>
   <start>
      <element name="nitf">
         <!-- specifications go here -->
      </element>
   </start>
</grammar>

The <grammar> and <start> elements are required only if you want to modularize your pattern by using definitions. (See below.) Since most non-trivial TREX patterns use definitions, we'll start using them right away.
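
For comparison, here is a minimal sketch of the same starting point without the wrappers. It assumes, as the paragraph above implies, that a pattern with no definitions may use the <element> itself as the outermost element of the TREX file.

```xml
<!-- No <grammar> or <start>: the pattern is just the root <element>. -->
<!-- This form cannot use <define> or <ref>. -->
<element name="nitf">
   <!-- specifications go here -->
</element>
```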

Sub-Elements

A news story consists of a <head> element and a <body> element, in that order. The <head> contains a <title>, which has string data as its contents. All of these are required elements, and TREX specifies them as follows:

<grammar>
   <start>
      <element name="nitf">
         <element name="head">
            <element name="title">
               <anyString/>
            </element>
         </element>

         <element name="body">
            <!-- its specification goes here -->
         </element>
      </element>
   </start>
</grammar>

Multiple Occurrences of Elements

One of TREX's greatest strengths is the ease with which you can specify how many times an element or group of elements must occur. An <element> specified by itself is required and must occur exactly once. To allow a different number of occurrences, wrap the element in one of the following:

Enclose elements in...    to specify...
<optional>                zero or one occurrence
<zeroOrMore>              zero or more occurrences
<oneOrMore>               one or more occurrences

We need this information to describe the <body> of a news story, as it starts with optional header information. This header information consists of a <body.head> element, which contains an optional <hedline> (yes, it's really spelled that way) and zero or more <byline> elements. Each of these has sub-elements, as shown below in the TREX pattern. This entire text would go between the <element name="body"> and its corresponding </element>.

<element name="body.head">
   <optional>
      <element name="hedline">
         <element name="hl1">
            <anyString/>
         </element>
      </element>
   </optional>

   <zeroOrMore>
      <element name="byline">
         <element name="bytag">
            <anyString/>
         </element>
      </element>
   </zeroOrMore>
</element>

The specification, as it currently stands, looks like this, and it will correctly validate this news story. (Both links open in a new window.)
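
As a rough illustration, a document of the kind this pattern is meant to accept might look like the sketch below. The headline and byline text are invented for the example; only the element names and their nesting come from the pattern above.

```xml
<nitf>
   <head>
      <title>Local Film Premieres</title>
   </head>
   <body>
      <body.head>
         <!-- optional: at most one hedline -->
         <hedline>
            <hl1>Local Film Premieres</hl1>
         </hedline>
         <!-- zero or more bylines; this example has two -->
         <byline>
            <bytag>By A. Reporter</bytag>
         </byline>
         <byline>
            <bytag>By B. Correspondent</bytag>
         </byline>
      </body.head>
   </body>
</nitf>
```

Removing the <hedline> or both <byline> elements should still leave a matching document, since those parts are optional; removing <body.head> itself should not, because it is specified as a plain, required <element>.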

Try It Yourself

You can download TREX and see that the document is, indeed, valid. If you're on a Windows system, you have an executable already available to you. If you're using Linux or Unix, download trex.jar, xp.jar, and sax2.jar, and use this shell script:

#!/bin/sh
# Usage: pass the TREX file first, then the XML file to validate.
# The classpath expects trex.jar, sax2.jar, and xp.jar in the current directory.

java -Dcom.thaiopensource.trex.util.Parser=com.jclark.xml.sax.Driver \
  -cp ./trex.jar:./sax2.jar:./xp.jar \
  com.thaiopensource.trex.util.Driver "$1" "$2"

Modularizing the Specification

As a grammar gets more complex, it makes less sense to have it all in one huge block. TREX lets you <define> a named group of elements and then refer to the definition by name with <ref>. We'll take the information for the body header of a news article and put it after the <start> element. Here's what the last part of the file now looks like:

    ...
         <element name="body">
            <ref name="body_header"/>
         </element>
      </element>
   </start>
   
    <define name="body_header">
        <element name="body.head">
           <optional>
               <element name="hedline">
                  <element name="hl1">
                     <anyString/>
                  </element>
               </element>
           </optional>
           <zeroOrMore>
              <element name="byline">
                 <element name="bytag">
                    <anyString/>
                 </element>
              </element>
           </zeroOrMore>
        </element>
    </define>
</grammar>
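
Putting the pieces together, the complete file might look like the following sketch. It is assembled from the fragments shown earlier in this article, on the assumption that the elided first part of the file is unchanged from the "Sub-Elements" section.

```xml
<grammar>
   <start>
      <element name="nitf">
         <element name="head">
            <element name="title">
               <anyString/>
            </element>
         </element>

         <element name="body">
            <!-- the body header now lives in its own definition -->
            <ref name="body_header"/>
         </element>
      </element>
   </start>

   <define name="body_header">
      <element name="body.head">
         <optional>
            <element name="hedline">
               <element name="hl1">
                  <anyString/>
               </element>
            </element>
         </optional>
         <zeroOrMore>
            <element name="byline">
               <element name="bytag">
                  <anyString/>
               </element>
            </element>
         </zeroOrMore>
      </element>
   </define>
</grammar>
```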

Definitions can be recursive. For example, we can say that a block_item is zero or more occurrences of any of the following: a paragraph, <p>; the empty image element, <img>; or an unordered list, <ul>. We use TREX's aptly named <choice> specifier in the pattern below.

<define name="block_item">
    <zeroOrMore>
        <choice>
            <element name="p"><anyString/></element>
            <element name="img"><empty/></element>
            <ref name="unordered_list"/>
        </choice>
    </zeroOrMore>
</define>

Note that TREX requires you to explicitly declare as empty any element that has neither children nor attributes. Since we haven't learned about attributes yet, we've specified the image element as <empty/>.

Now we can define an unordered list, which can itself contain block items; i.e., paragraphs, images, and lists within lists.

<define name="unordered_list">
    <element name="ul">
        <oneOrMore>
            <element name="li">
               <ref name="block_item"/>
            </element>
        </oneOrMore>
    </element>
</define>
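
To see the recursion at work, here is a sketch of a fragment that the two definitions above should accept. The text content is made up for the example; note the <ul> nested inside an <li>, which is allowed because each <li> refers back to block_item.

```xml
<ul>
   <li><p>First item</p></li>
   <li>
      <p>Second item, which contains a nested list:</p>
      <!-- block_item permits another unordered_list here -->
      <ul>
         <li><p>Nested item</p></li>
         <li><img/></li>
      </ul>
   </li>
</ul>
```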

Mixed Content

Some elements contain text that sits directly inside the element rather than inside one of its sub-elements. For example, a news story <location> can look like this:

   The movie was filmed in
   <location>
    <city>Pekin</city>, a
    small city in
    <state>Illinois</state>
   </location>.

The text ", a small city in" above is inside the <location> element, but it isn't part of any sub-element. That makes <location> a mixed element, specified (in part) like this:

<define name="location_element">
    <element name="location">
        <mixed>
            <zeroOrMore>
                <choice>
                    <element name="city"><anyString/></element>
                    <element name="state"><anyString/></element>
                    <element name="region"><anyString/></element>
                </choice>
            </zeroOrMore>
        </mixed>
    </element>
</define>
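
As an illustration, each of the following fragments should match the location_element pattern above. The text content is invented for this example; only the element names come from the pattern.

```xml
<!-- text mixed with two of the permitted sub-elements -->
<location><city>Pekin</city>, <state>Illinois</state></location>

<!-- text followed by a single sub-element -->
<location>somewhere in <region>the Midwest</region></location>

<!-- text only; every sub-element is optional -->
<location>an undisclosed site</location>
```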

You can see the TREX and a sample XML document in new windows.

Of course, XML does not consist of elements alone, as we'll see in the next section.
