TREX Basics
Overview
In this article, we'll explore the TREX markup language for validating XML documents, focusing on validating a subset of XMLNews-Story Markup Language. Although the XMLNews-Story markup language has been superseded by the News Industry Text Format, we use the old version because it's simple, it looks a great deal like HTML, and it lets us easily show some of TREX's features.
TREX's author, James Clark, says,
"A TREX pattern specifies a pattern for the structure and content of an XML document. A TREX pattern thus identifies a class of XML documents consisting of those documents that match the pattern. A TREX pattern is itself an XML document."
A TREX pattern's outer parts
A TREX pattern is enclosed within a <grammar>
element. Inside it, a <start> element
describes the pattern that the document must match. The root element of a news story is the
<nitf> element, so that's where we begin our TREX
specification.
<grammar>
<start>
<element name="nitf">
<!-- specifications go here -->
</element>
</start>
</grammar>
The <grammar> and <start>
elements are required only if you want to modularize your pattern by
using definitions. (See below.) Since most
non-trivial TREX patterns use definitions, we'll start using them
right away.
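For contrast, a pattern that uses no definitions can, per the statement above, drop the wrapper. A minimal sketch, assuming a toy document whose <nitf> element holds nothing but text (this toy content model is invented for illustration and is not part of the news story format):

```xml
<!-- Standalone pattern: no <grammar> or <start>, so no definitions can be used -->
<element name="nitf">
  <anyString/>
</element>
```

Like the other snippets in this article, the sketch omits any namespace declaration that a particular validator may expect on the root element.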
Sub-Elements
A news story consists of a <head> and a
<body> element, in that order. The
<head> contains a <title>, which
has string data as its contents. All of these are required elements,
and TREX specifies them as follows:
<grammar>
<start>
<element name="nitf">
<element name="head">
<element name="title">
<anyString/>
</element>
</element>
<element name="body">
<!-- its specification goes here -->
</element>
</element>
</start>
</grammar>
Multiple Occurrences of Elements
One of TREX's greatest strengths is the ease with which you can
specify how many times an element or group of elements may occur.
An <element> specified by itself is required and must occur exactly
once. To allow other counts, enclose it in one of the following:
| Enclose elements in... | to specify... |
|---|---|
| <optional> | zero or one occurrence |
| <zeroOrMore> | zero or more occurrences |
| <oneOrMore> | one or more occurrences |
We need this information to describe the <body>
of a news story, as it starts with optional header information. This
header information consists of a <body.head>
element, which contains an optional <hedline> (yes,
it's really spelled that way) and zero or more
<byline> elements. Each of these has sub-elements,
as shown below in the TREX pattern. This entire text would go between
the <element name="body"> and its corresponding
</element>.
<element name="body.head">
<optional>
<element name="hedline">
<element name="hl1">
<anyString/>
</element>
</element>
</optional>
<zeroOrMore>
<element name="byline">
<element name="bytag">
<anyString/>
</element>
</element>
</zeroOrMore>
</element>
The specification, as it currently stands, looks like this, and it will correctly validate this news story. (Both links open in a new window.)
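As an illustration, a document of roughly the following shape should match the pattern built so far. This is a made-up instance written for this walkthrough, not necessarily the linked news story; the title, headline, and byline text are invented.

```xml
<nitf>
  <head>
    <title>Local Team Wins Championship</title>
  </head>
  <body>
    <body.head>
      <!-- hedline is optional: zero or one occurrence -->
      <hedline>
        <hl1>Local Team Wins Championship</hl1>
      </hedline>
      <!-- byline may occur zero or more times; two are shown here -->
      <byline>
        <bytag>By A. Reporter</bytag>
      </byline>
      <byline>
        <bytag>By B. Writer</bytag>
      </byline>
    </body.head>
  </body>
</nitf>
```

Removing the <hedline> or both <byline> elements should leave the document valid. Adding a second <hedline>, or dropping the required <title>, should cause validation to fail.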
Try It Yourself
You can download TREX and see that the document is, indeed, valid.
If you're on a Windows system, you have an executable already
available to you. If you're using Linux or Unix, download
trex.jar, xp.jar, and sax2.jar,
and use this shell script:
#!/bin/sh
# first file is the TREX file; second file is the XML to validate
java -Dcom.thaiopensource.trex.util.Parser=com.jclark.xml.sax.Driver \
  -cp ./trex.jar:./sax2.jar:./xp.jar \
  com.thaiopensource.trex.util.Driver "$1" "$2"
Modularizing the Specification
As a grammar gets more complex, it makes less sense to have it all
in one huge block. TREX lets you <define> a series
of elements and then refer to the definitions. We'll take the
information for the body header of a news article and put it after the
<start> element. Here's what the last part of the
file now looks like:
...
<element name="body">
<ref name="body_header"/>
</element>
</element>
</start>
<define name="body_header">
<element name="body.head">
<optional>
<element name="hedline">
<element name="hl1">
<anyString/>
</element>
</element>
</optional>
<zeroOrMore>
<element name="byline">
<element name="bytag">
<anyString/>
</element>
</element>
</zeroOrMore>
</element>
</define>
</grammar>
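The same technique can be applied one level further down. The sketch below splits the headline and byline into their own definitions; the names hedline_element and byline_element are hypothetical, chosen only for this illustration, and the fragment is meant to sit inside the <grammar> element in place of the single body_header definition.

```xml
<define name="body_header">
  <element name="body.head">
    <optional>
      <ref name="hedline_element"/>
    </optional>
    <zeroOrMore>
      <ref name="byline_element"/>
    </zeroOrMore>
  </element>
</define>

<!-- hypothetical definition name; the element itself is unchanged -->
<define name="hedline_element">
  <element name="hedline">
    <element name="hl1">
      <anyString/>
    </element>
  </element>
</define>

<!-- hypothetical definition name; the element itself is unchanged -->
<define name="byline_element">
  <element name="byline">
    <element name="bytag">
      <anyString/>
    </element>
  </element>
</define>
```

The occurrence wrappers stay at the point of use, so each definition describes one element and the caller decides how often it may appear.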
You can make recursive definitions. For example, we can say
that a block_item is zero or more choices of a paragraph,
<p>, the empty image element,
<img>, or an unordered list
<ul>. We use TREX's aptly named
<choice> specifier in the pattern below.
<define name="block_item">
<zeroOrMore>
<choice>
<element name="p"><anyString /></element>
<element name="img"><empty/></element>
<ref name="unordered_list"/>
</choice>
</zeroOrMore>
</define>
Note that TREX requires you to explicitly define elements which have
neither children nor attributes as empty. Since we haven't learned
about attributes yet, we've specified the image element as
<empty/>.
Now we can define an unordered list, which can itself contain block items; i.e., paragraphs, images, and lists within lists.
<define name="unordered_list">
<element name="ul">
<oneOrMore>
<element name="li">
<ref name="block_item"/>
</element>
</oneOrMore>
</element>
</define>
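To see the recursion at work, here is an invented fragment that should match the unordered_list definition, assuming the block_item and unordered_list definitions exactly as given above:

```xml
<ul>
  <li>
    <p>First item, with a picture:</p>
    <img/>
  </li>
  <li>
    <p>Second item, with a nested list:</p>
    <!-- ul, li, block_item, ul again: the recursive reference -->
    <ul>
      <li><p>Nested item</p></li>
    </ul>
  </li>
</ul>
```

An empty <li/> should also match, because block_item allows zero items. An empty <ul/> should not, because at least one <li> is required.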
Mixed Content
Some elements contain text that sits directly inside them,
alongside their child elements. For example, a news story <location>
can look like
The movie was filmed in
<location>
<city>Pekin</city>, a
small city in
<state>Illinois</state>
</location>.
The text ", a small city in" above is inside the
<location> tag, but it isn't part of any
sub-element. That makes <location> a mixed
element, specified (in part) like this:
<define name="location_element">
<element name="location">
<mixed>
<zeroOrMore>
<choice>
<element name="city"><anyString/></element>
<element name="state"><anyString/></element>
<element name="region"><anyString/></element>
</choice>
</zeroOrMore>
</mixed>
</element>
</define>
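Two further invented fragments, shown here as separate examples, should also match this definition, since the child elements are wrapped in <zeroOrMore> and <choice> and may therefore be absent, repeated, or given in any order:

```xml
<!-- text only: zeroOrMore allows no child elements at all -->
<location>somewhere in the Midwest</location>

<!-- children in a different order, with text before and between them -->
<location>in <state>Illinois</state>, part of the <region>Midwest</region></location>
```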
You can see the TREX and a sample XML document in new windows.
Of course, XML does not consist of elements alone, as we'll see in the next section.