XML.com: XML From the Inside Out

TREX Basics

April 11, 2001

Overview

In this article, we'll explore the TREX markup language for validating XML documents, focusing on validating a subset of XMLNews-Story Markup Language. Although the XMLNews-Story markup language has been superseded by the News Industry Text Format, we use the old version because it's simple, it looks a great deal like HTML, and it lets us easily show some of TREX's features.

TREX's author, James Clark, says,

"A TREX pattern specifies a pattern for the structure and content of an XML document. A TREX pattern thus identifies a class of XML documents consisting of those documents that match the pattern. A TREX pattern is itself an XML document."

A TREX pattern's outer parts

The TREX patterns in this article are enclosed within a <grammar> element, which contains a <start> element that describes the pattern to match. The root element of a news story is the <nitf> element, so that's where we begin our TREX specification.

<grammar>
   <start>
      <element name="nitf">
      <!-- specifications go here -->
      </element>
   </start>
</grammar>

The <grammar> and <start> elements are required only if you want to modularize your pattern by using definitions (see below). Since most non-trivial TREX patterns use definitions, we'll include <grammar> and <start> from the outset.

Sub-Elements

A news story consists of a <head> element and a <body> element, in that order. The <head> contains a <title>, which has string data as its contents. All of these are required elements, and TREX specifies them as follows:

<grammar>
   <start>
      <element name="nitf">
         <element name="head">
            <element name="title">
               <anyString/>
            </element>
         </element>

         <element name="body">
            <!-- its specification goes here -->
         </element>
      </element>
   </start>
</grammar>
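
Since this version of the pattern uses no definitions, the remark above about <grammar> and <start> suggests it could also be written without the wrappers. The following is only a sketch of that shorter form, assuming the validator accepts an <element> pattern as the document root:

<element name="nitf">
   <element name="head">
      <element name="title">
         <anyString/>
      </element>
   </element>

   <element name="body">
      <!-- its specification goes here -->
   </element>
</element>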

Multiple Occurrences of Elements

One of TREX's greatest strengths is the ease with which you can specify how many times an element or group of elements may occur. An <element> specified by itself is required and must occur exactly once. To allow a different number of occurrences, enclose it in one of the following elements.

Enclose elements in...   to specify...
<optional>               zero or one occurrence
<zeroOrMore>             zero or more occurrences
<oneOrMore>              one or more occurrences

We need this information to describe the <body> of a news story, as it starts with optional header information. This header information consists of a <body.head> element, which contains an optional <hedline> (yes, it's really spelled that way) and zero or more <byline> elements. Each of these has sub-elements, as shown below in the TREX pattern. This entire text would go between the <element name="body"> and its corresponding </element>.

<element name="body.head">
   <optional>
      <element name="hedline">
         <element name="hl1">
            <anyString/>
         </element>
      </element>
   </optional>

   <zeroOrMore>
      <element name="byline">
         <element name="bytag">
            <anyString/>
         </element>
      </element>
   </zeroOrMore>
</element>

The specification, as it currently stands, looks like this, and it will correctly validate this news story.
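
To make the occurrence rules concrete, here is a minimal, hypothetical story that should match the pattern as it stands so far. The title, headline, and byline text are invented for illustration, and the sketch assumes that the validator ignores the whitespace used for indentation between elements:

<nitf>
   <head>
      <title>Sample Story</title>
   </head>
   <body>
      <body.head>
         <hedline>
            <hl1>Local Team Wins</hl1>
         </hedline>
         <byline>
            <bytag>By A. Reporter</bytag>
         </byline>
         <byline>
            <bytag>By B. Correspondent</bytag>
         </byline>
      </body.head>
   </body>
</nitf>

Because <hedline> is wrapped in <optional> and <byline> in <zeroOrMore>, removing the headline or either byline should still leave a matching document. Adding a second <hedline> should not, since <optional> permits at most one.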

Try It Yourself

You can download TREX and see that the document is, indeed, valid. If you're on a Windows system, you have an executable already available to you. If you're using Linux or Unix, download trex.jar, xp.jar, and sax2.jar, and use this shell script:

#!/bin/sh
# Usage: the first argument is the TREX file; the second is the XML file to validate.
# The classpath below expects trex.jar, sax2.jar, and xp.jar in the current directory.

java -Dcom.thaiopensource.trex.util.Parser=com.jclark.xml.sax.Driver \
-cp ./trex.jar:./sax2.jar:./xp.jar \
com.thaiopensource.trex.util.Driver "$1" "$2"

Modularizing the Specification

As a grammar gets more complex, it makes less sense to have it all in one huge block. TREX lets you <define> a named pattern and then refer to the definition with <ref>. We'll take the information for the body header of a news article and put it in a definition after the <start> element. Here's what the last part of the file now looks like:

    ...
         <element name="body">
            <ref name="body_header"/>
         </element>
      </element>
   </start>
   
    <define name="body_header">
        <element name="body.head">
           <optional>
               <element name="hedline">
                  <element name="hl1">
                     <anyString/>
                  </element>
               </element>
           </optional>
           <zeroOrMore>
              <element name="byline">
                 <element name="bytag">
                    <anyString/>
                 </element>
              </element>
           </zeroOrMore>
        </element>
    </define>
</grammar>

Definitions can be recursive. For example, we can say that a block_item is zero or more items, each of which is a paragraph (<p>), an empty image element (<img>), or an unordered list (<ul>). The pattern below uses TREX's aptly named <choice> element to express the alternatives.

<define name="block_item">
    <zeroOrMore>
        <choice>
            <element name="p"><anyString /></element>
            <element name="img"><empty/></element>
            <ref name="unordered_list"/>
        </choice>
    </zeroOrMore>
</define>

Note that TREX requires you to explicitly define elements which have neither children nor attributes as empty. Since we haven't learned about attributes yet, we've specified the image element as <empty/>.

Now we can define an unordered list, which can itself contain block items; i.e., paragraphs, images, and lists within lists.

<define name="unordered_list">
    <element name="ul">
        <oneOrMore>
            <element name="li">
               <ref name="block_item"/>
            </element>
        </oneOrMore>
    </element>
</define>
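
As an illustration of the recursion, the following hypothetical fragment should match the unordered_list definition. The text content is invented, and the sketch again assumes that whitespace between elements is ignored:

<ul>
   <li>
      <p>First point</p>
      <img/>
   </li>
   <li>
      <ul>
         <li><p>A nested point</p></li>
      </ul>
   </li>
</ul>

Each <li> holds a block_item, which is zero or more paragraphs, images, or lists in any order. The inner <ul> is matched by the same unordered_list definition as the outer one.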

Mixed Content

Some elements contain text that isn't inside any sub-element. For example, a news story <location> can look like

   The movie was filmed in
   <location>
    <city>Pekin</city>, a
    small city in
    <state>Illinois</state>
   </location>.

The text ", a small city in" above is inside the <location> element, but it isn't part of any sub-element. That makes <location> a mixed element, specified (in part) like

<define name="location_element">
    <element name="location">
        <mixed>
            <zeroOrMore>
                <choice>
                    <element name="city"><anyString/></element>
                    <element name="state"><anyString/></element>
                    <element name="region"><anyString/></element>
                </choice>
            </zeroOrMore>
        </mixed>
    </element>
</define>
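
Two more hypothetical <location> fragments that this definition should also accept are sketched below. The first uses the <region> sub-element with surrounding text, and the second contains text only, which is allowed because the <choice> is wrapped in <zeroOrMore>. The place names are invented for illustration.

   <location>somewhere in <region>the Midwest</region></location>

   <location>an undisclosed site</location>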

You can see the TREX and a sample XML document in new windows.

Of course, XML does not consist of elements alone, as we'll see in the next section.
