Pyxie

March 15, 2000

Introduction

Pyxie is an open source XML processing library that presents an alternative to current methods of handling XML. A central part of Pyxie is the simple, line-oriented notation it uses to describe the information communicated by an XML parser to an XML application. This notation is known as PYX.

PYX is based on a concept from the SGML world known as ESIS. ESIS was popularized by James Clark's SGML parsers. (Clark's first parser was sgmls, a C-based parser built on top of the arcsgml parser developed by Dr. Charles Goldfarb, the inventor of SGML. Then came the hugely popular nsgmls, which was a completely new SGML parsing application implemented in C++.)

The PYX notation facilitates a useful XML processing paradigm that presents an alternative to SAX or DOM based API programming of XML documents. PYX is particularly useful for pipeline processing, in which the output of one application becomes the input to another application. We will see an example of this later on.

In this article, we take a look at the origins and philosophy behind PYX. We then show how PYX can be combined with tools that know nothing about XML to do useful work with "one-liners."

The PYX notation for XML

PYX is a line-oriented notation. Each line of PYX contains the information for a single parsing event. The parsing events of interest in PYX are:

start-tags
end-tags
attributes
character data
processing instructions

The first character of each line of PYX tells you what type of event you are dealing with, as follows:

PYX Notation
(	start-tag
)	end-tag
A	attribute
-	character data
?	processing instruction
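
Because the event type is carried entirely by the first character of each line, a consumer can dispatch on it with a simple table lookup. The following Python fragment is a minimal sketch of that idea; the names EVENT_TYPES and classify are hypothetical and are not part of Pyxie.

```python
# Map the first character of a PYX line to the event it denotes.
EVENT_TYPES = {
    "(": "start-tag",
    ")": "end-tag",
    "A": "attribute",
    "-": "character data",
    "?": "processing instruction",
}


def classify(line):
    """Split one PYX line into (event type, payload)."""
    line = line.rstrip("\n")  # drop the real line terminator, if any
    return EVENT_TYPES[line[0]], line[1:]


for pyx_line in ["(e-mail", "Atype internet", "-sean@digitome.com", ")e-mail"]:
    print(classify(pyx_line))
```

Everything after the first character is the payload: an element name, an attribute name and value separated by a space, text, or a processing instruction.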

The Pyxie project contains two utilities to produce PYX notation from XML: xmln and xmlv.

xmln is a standalone utility program that generates PYX from any well-formed XML document. It is written in C, using James Clark's expat library.

xmlv is also a standalone utility program that generates PYX, but it performs a validating parse of the input XML. It is written in C, and uses Richard Tobin's RXP validating XML parser library.

Examples of PYX Notation

Example 1 shows a small XML file that contains at least one of each of the constructs relevant in PYX notation:

<Person>
<?A4TypeSetter PageBreak?>
<Surname>McGrath</Surname>
<Given>Sean</Given>
<e-mail type="internet">sean@digitome.com</e-mail>
</Person>

Example 1: Sample XML document

We generate PYX from this XML by using the xmln utility. This results in the following output:

(Person
-\n
?A4TypeSetter PageBreak
-\n
(Surname
-McGrath
)Surname
-\n
(Given
-Sean
)Given
-\n
(e-mail
Atype internet
-sean@digitome.com
)e-mail
-\n
)Person
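
To make the mapping from parser events to PYX lines concrete, here is a rough Python sketch that emits PYX-style lines using the standard xml.parsers.expat module, which binds to the same expat library that xmln is built on. This is not the xmln implementation: the function name xml_to_pyx is hypothetical, and only the newline escape visible in the output above is reproduced (any other escaping xmln performs is not covered).

```python
import xml.parsers.expat


def xml_to_pyx(xml_text):
    """Return PYX-style lines for a well-formed XML string (illustrative only)."""
    lines = []

    def escape(text):
        # Assumption: only the newline escape seen in the examples is handled.
        return text.replace("\n", "\\n")

    def start_element(name, attrs):
        lines.append("(" + name)
        # Each attribute becomes its own "A" line after the start-tag line.
        for attr_name, attr_value in attrs.items():
            lines.append("A" + attr_name + " " + escape(attr_value))

    def end_element(name):
        lines.append(")" + name)

    def character_data(data):
        lines.append("-" + escape(data))

    def processing_instruction(target, data):
        lines.append("?" + target + " " + data)

    parser = xml.parsers.expat.ParserCreate()
    # Merge adjacent text chunks so each run of character data yields one line.
    parser.buffer_text = True
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = character_data
    parser.ProcessingInstructionHandler = processing_instruction
    parser.Parse(xml_text, True)
    return lines


sample = '<e-mail type="internet">sean@digitome.com</e-mail>'
print("\n".join(xml_to_pyx(sample)))
```

Because the sketch buffers text, a run of character data that contains newlines comes out as a single line with embedded \n escapes, whereas xmln reports such runs as several lines.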

Example 2 is an XML document that contains some of the constructs that can appear within an XML document but are not exposed in PYX:

<!DOCTYPE foo SYSTEM "http://www.digitome.com/foo.dtd">
<!-- This document has a <foo> element -->
<foo not="not">
<![CDATA[
Although this looks like another <foo> start-tag
it is not.
]]>
&#x20;Hello
</foo>

Example 2: A more syntactically complex XML document

Parsing with the xmln utility produces the following PYX:

(foo
Anot not
-\n
-\n
-Although this looks like another <foo> start-tag
-\n
-it is not.
-\n
-\n
- 
-Hello
-\n
)foo

Some information present in the original XML document is clearly not present in the PYX. Information about the DTD, the contents of the comment, the presence of a CDATA section, and the numeric character reference (&#x20;) are either transparent or completely hidden from view.

Having said that, because we are using true XML parsers to generate the PYX, we know that the segregation of data into start-tags, end-tags, character data, attributes, and processing instructions is 100% reliable. In particular, the start-tags occurring in the comment and in the marked section are handled properly for us by xmln.

Intuitively speaking, PYX concentrates on the logical form of an XML document. It does not concern itself with the physical aspects of the document, such as its entity structure. This distinction was made explicit in SGML, but is left implicit in XML. To further understand the distinction, we need to look back at the SGML standard, which gave rise to the notion of PYX, in the form of the ESIS concept.

The Origins of PYX

XML is an application profile of SGML. The definitive reference to SGML is Dr. Charles F. Goldfarb's "The SGML Handbook" (ISBN 0-19-853737-9).

Attachment 1 of appendix B of this book defines ESIS, the Element Structure Information Set. ESIS is defined as the set of information that a "structure-controlled application" is permitted to act upon. To understand this, we must note that SGML distinguishes between two classes of SGML application:

Structure-Controlled Applications

These are applications that are concerned with the logical structure of a document: how the document is composed in terms of elements, attributes, character data, and so on. Structure-controlled applications are not concerned with the physical structure of a document, i.e., the entities that make up the document text.

Markup-Sensitive Applications

These are applications that are concerned with the physical structure of a document. That is, applications that care, on a character-by-character basis, about the structure of an XML document. In particular, applications that concern themselves with entity declarations and general entity references.

ESIS and James Clark's SGML parsers

As befits a truly generalized markup language, the SGML standard does not define a notation for ESIS. That is, you will not find a specified syntax for denoting the events that make up the ESIS information set.

Instead, the SGML standard describes the information contained in ESIS in abstract terms. It is up to an application to pick a notation for the ESIS. By far the most popular ESIS notation in the world is the form produced by James Clark's sgmls and nsgmls parsers.

Like ESIS before it, PYX defines a set of information that structure-controlled applications act upon. PYX borrows its notation from that generated by nsgmls. Naturally, as XML is so much simpler than full SGML, the PYX notation is a very small subset of the full ESIS notation generated by nsgmls.

Simple PYX Applications

Given some XML data and the xmln and xmlv utilities, what can you do? We will look at some simple examples:

Parsing
Element counting
XML-aware "grepping" (string searching)
Reporting

Parsing with xmln and xmlv

Trivially, you can use xmln and xmlv to check for well-formed and valid XML instances. On Windows, you can check a file foo.xml for well-formedness like this:

xmln foo.xml >nul

On Unix you might use something like this:

xmln foo.xml >/dev/null

To perform a validating, rather than a non-validating parse, the same syntax will work -- just use xmlv rather than xmln.
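
The same well-formedness check can be made from inside a program. The sketch below uses Python's standard xml.parsers.expat module (a binding to the expat library that xmln uses); the function name is_well_formed is hypothetical. Note that expat is non-validating, so this corresponds to the xmln check only, not to xmlv.

```python
import xml.parsers.expat


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(xml_text, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


print(is_well_formed("<foo>ok</foo>"))
print(is_well_formed("<foo>oops</bar>"))
```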

Element counting

It is tempting to perform simple element counting tasks on XML documents using standard text processing tools, such as the grep family of pattern matching utilities. A first shot at a grep command to count the number of foo elements in Example 2 might look like this:

grep "<foo" example2.xml

However, this will generate three hits, owing to the two false positives: one in the comment declaration and one in the CDATA section.

Parsing the document and converting to PYX with xmln can be used to resolve this problem:

xmln example2.xml | grep "^(foo$"

Note the use of the ^ and $ meta-characters in grep, which anchor the pattern to the start and end of the line respectively. This ensures that only foo start tags will match the pattern. There are no "false positives," as there can be if the parsing stage is skipped.
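
When counts are wanted for every element type at once, the same idea carries over to a few lines of Python that read PYX. This is a sketch only; count_elements is a hypothetical helper, not a Pyxie function, and the sample list is an abbreviated version of the Example 2 PYX.

```python
from collections import Counter


def count_elements(pyx_lines):
    """Count start-tag events per element name in an iterable of PYX lines."""
    counts = Counter()
    for line in pyx_lines:
        line = line.rstrip("\n")
        # Only lines beginning with "(" are start-tags; "<foo>" inside
        # character data arrives on a "-" line and is ignored.
        if line.startswith("("):
            counts[line[1:]] += 1
    return counts


example2_pyx = [
    "(foo",
    "Anot not",
    "-Although this looks like another <foo> start-tag",
    ")foo",
]
print(count_elements(example2_pyx)["foo"])  # prints 1
```

In a pipeline, the function could be fed sys.stdin so that it consumes the output of xmln directly.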

XML-aware grepping

In Example 2, the word "not" appears in three different forms:

As an attribute name
As an attribute value
As PCDATA

Differentiating between the three cases is easy with PYX. For example, the following command will match all occurrences of the word "not" as an attribute name:

xmln example2.xml | grep "^Anot "
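
All three cases can be told apart in one pass over the PYX. The Python sketch below assumes, as in the examples above, that an attribute line consists of the attribute name, a single space, and the value; find_word is a hypothetical helper name.

```python
import re


def find_word(pyx_lines, word):
    """Report where word occurs: attribute name, attribute value, or character data."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    hits = []
    for line in pyx_lines:
        line = line.rstrip("\n")
        kind, payload = line[:1], line[1:]
        if kind == "A":
            # Split the attribute line into its name and its value.
            name, _, value = payload.partition(" ")
            if name == word:
                hits.append("attribute name")
            if pattern.search(value):
                hits.append("attribute value")
        elif kind == "-" and pattern.search(payload):
            hits.append("character data")
    return hits


example2_pyx = [
    "(foo",
    "Anot not",
    "-Although this looks like another <foo> start-tag",
    "-it is not.",
    ")foo",
]
print(find_word(example2_pyx, "not"))
```

The word-boundary match keeps "another" from counting as an occurrence of "not".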

Reporting

Combined with a line-oriented text processing "little language" such as the venerable awk, the xmln and xmlv utilities are capable of doing a good deal of work -- even in modestly sized one-liners.

For example, take a look at this one-liner. What does it do? (An answer is given at the end of this article.)

xmln fig1.xml | grep "^-" | awk "{print substr($0,2)}" | wc -w

(With some Unix shells such as bash, you may need to escape the $ with a backslash \).

Conclusion

The PYX notation provides an alternative to purely API based XML processing as exemplified by SAX and DOM. The sheer simplicity of PYX is its greatest asset. Although it would be possible to add more and more "events" to the PYX information set, where would the process end? The ultimate consequence would be that PYX would provide enough information to recreate byte-for-byte the input XML instance.

This level of power would have a trickle-down effect on the complexity of simple PYX programs. A more flexible approach (again borrowing heavily from the SGML heritage) would be to be able to tell xmln and xmlv which events or sets of events are of interest, to tailor the information set on-the-fly. This is effectively the grove plan idea introduced in the SGML extended facilities annex.
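
As a rough illustration of tailoring the information set, the selection can be approximated downstream of the parser by filtering PYX lines on their first character. This Python sketch is not the grove plan mechanism and not a feature of xmln or xmlv; select_events is a hypothetical name.

```python
def select_events(pyx_lines, wanted):
    """Yield only the PYX lines whose event character appears in wanted."""
    for line in pyx_lines:
        if line and line[0] in wanted:
            yield line


pyx = ["(e-mail", "Atype internet", "-sean@digitome.com", ")e-mail"]
# Keep only the element structure: start-tags and end-tags.
print(list(select_events(pyx, "()")))
```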

I believe that the distinction in SGML between structure-controlled and markup-sensitive applications is a useful one, and one that deserves a place in the XML world. The DOM, for example, attempts to straddle both camps, and this shows in the complexity of the API.

SAX has become a sort of de facto expression of a structure-controlled information set in the XML world, but it is expressed purely as an API and has no associated notation.

In a forthcoming article, we will take a look at the PYX-based XML processing facilities provided in the Pyxie library.

The Pyxie library and the PYX notation are fully developed in my book XML Processing with Python, soon to be published by Prentice Hall.

Finally, time to reveal the answer to the one-liner awk quiz above: it counts the number of words (not markup) in the fig1.xml file.
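
For comparison, the same count can be sketched in Python over PYX lines. The helper name count_words is hypothetical. One assumption is worth flagging: a line such as -\n would reach wc as the two-character token \n and would presumably be counted as a word, so this sketch treats the escape as whitespace instead.

```python
def count_words(pyx_lines):
    """Count whitespace-separated words in the character data of PYX lines."""
    total = 0
    for line in pyx_lines:
        line = line.rstrip("\n")
        if line.startswith("-"):
            # Assumption: the two-character escape \n stands for a newline,
            # so it is treated as whitespace rather than as a word.
            text = line[1:].replace("\\n", " ")
            total += len(text.split())
    return total


example1_pyx = [
    "(Person", "-\\n",
    "(Surname", "-McGrath", ")Surname", "-\\n",
    "(Given", "-Sean", ")Given", "-\\n",
    ")Person",
]
print(count_words(example1_pyx))  # prints 2
```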