Introducing PyRXP

February 11, 2004

PyRXP is a DTD-validating XML parser developed by ReportLab. It is a Python wrapper around RXP, a C parser developed by Richard Tobin and Henry Thompson of the Edinburgh Language Technology Group as the core of LT XML, "an integrated set of XML tools and a developers' tool-kit, including a C-based API". ReportLab is a vendor of database reporting software and is very well known and respected in the Python community. PyRXP is a core component of many of ReportLab's open source and commercial products. PyRXP focuses on performance above all things by using a fast C parser and by strictly building a bare-bones Python structure of tuples and string buffers from XML source. RXP and PyRXP are both distributed under the GNU General Public License.

I downloaded the full tar/gzip distribution of PyRXP 0.9 for running on Python 2.3.2. Note: the archive does not create its own directory when unpacked, so you'll want to do so by hand:

$ mkdir pyRXP-0-9
$ cd pyRXP-0-9
$ tar zxvf ../pyRXP-0-9.tgz
[SNIP]
$ python setup.py install
[SNIP]

Source XML for the documentation comes in the distribution, but I didn't see an obvious way to build it so I just downloaded the PDF documentation.

Character trouble in tag land

PyRXP builds a bare bones tuple-based Python structure from an XML instance. To get a flavor of this structure, I tried to parse the same document I've been using in recent explorations of Python-XML tools (Listing 1).

Listing 1: Sample XML file (labels.xml) containing address labels

<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label added="2003-06-20">
    <quote>
      <!-- Mixed content -->
      <emph>Midwinter Spring</emph> is its own season&#8230;
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>

My attempt was the code in listing 2:

Listing 2: Simple parse of XML in a file

import pyRXP
parser = pyRXP.Parser()
fobj = open('labels.xml').read()
#Introspection doesn't reveal any "parseFile"-like method
doc = parser.parse(fobj)

The result of this attempt was rather hair raising:

$ python listing2.py
Traceback (most recent call last):
  File "listing2.py", line 4, in ?
    doc = parser.parse(fobj)
pyRXP.Error: Error: 0x2026 is not a valid 8-bit XML character
 in unnamed entity at line 6 char 61 of [unknown]
error return=1
0x2026 is not a valid 8-bit XML character
Parse Failed!

The problem, besides the fact that the parser seemed to fail parsing a perfectly well-formed XML document, is that the error message is unhelpful. The phrase "valid 8-bit XML character" is meaningless. The XML character set is Unicode, with the restriction that some characters are not allowed. But there is no concept of "bits" in the idea of an XML character. Each character is merely an abstract code point. A character can be encoded into a storage format associated with a standard bit length, such as UTF-8 (8-bit code units), but this really has nothing to do with the XML character model. To be fair, this and other concepts relating to Unicode can be rather arcane; but there are excellent resources to help clear things up, including Mike Brown's article "XML Tutorial--A reintroduction to XML with an emphasis on character encoding". For a very friendly discussion of Unicode focusing on the Python implementation there is "Unicode Support in Python (PDF)" by Marc-Andre Lemburg. I gather a lot of relevant notes on these matters in my Akara article "XML Character issues in Python".
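To make the distinction between a character and its encoded form concrete, here is a minimal sketch. It assumes a current Python 3 interpreter rather than the Python 2.3 used elsewhere in this article, and it relies only on built-in string methods. The variable names are illustrative.

```python
# One abstract character: HORIZONTAL ELLIPSIS, code point U+2026.
ellipsis = "\u2026"

# The code point is a property of the character, not of any encoding.
code_point = ord(ellipsis)

# The number of bytes depends entirely on the encoding chosen.
utf8_bytes = ellipsis.encode("utf-8")
utf16_bytes = ellipsis.encode("utf-16-be")
utf32_bytes = ellipsis.encode("utf-32-be")

# ISO-8859-1, the encoding declared in Listing 1, has no byte for it at all.
try:
    ellipsis.encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(hex(code_point), len(utf8_bytes), len(utf16_bytes), len(utf32_bytes), fits_in_latin1)
```

The same character takes three, two, or four bytes depending on the encoding, and none of those widths says anything about whether it is a legal XML character. Because ISO-8859-1 cannot hold it directly, the sample document has to spell it as a numeric character reference.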

At any rate, I pored over the PyRXP documentation expecting to find something I must have missed. I found a few properties that can be set on the parser, and the closest match was ExpandCharacterEntities. In effect it returns a character entity such as &#8230;, the one in the sample document, as the literal sequence of seven separate characters, starting with the ampersand and ending with the semicolon. This is a serious violation of the basic principles of XML, in which &#8230; is strictly one character rather than seven; further, it doesn't help me parse the sample file properly. I then checked the ReportLab mailing lists and found others who had run into the same problem. The responses from the developers were, more or less, that PyRXP raises a fatal error when presented with XML characters with a Unicode ordinal greater than 255 (U+00FF), regardless of how they are represented. The unfortunate upshot of this is that PyRXP 0.9 is not an XML parser.
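For comparison, here is a small sketch of what a conforming parser does with the same character reference. It does not use PyRXP. It assumes Python 3 and the standard library's xml.etree.ElementTree module, and the one-line document is a cut-down stand-in for the quote element in Listing 1.

```python
import xml.etree.ElementTree as ET

# A cut-down stand-in for the <quote> element in Listing 1.
source = "<quote>is its own season&#8230;</quote>"

# Seven characters of markup in the source text...
reference_length = len("&#8230;")

# ...become exactly one character in the parsed content.
text = ET.fromstring(source).text
last_char = text[-1]

print(reference_length, repr(last_char), len(last_char))
```

The reference is seven characters of syntax, but the application sees a single character with code point U+2026.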

I only cover XML processing tools in this column; and, frankly, such a fundamental case of non-conformance would have been to my mind more than enough to disqualify PyRXP from discussion. Nevertheless, there was no way I was going to throw up my hands at this point. I have heard a lot of good things about PyRXP, and I'd like to be sure there is fair coverage of as broad a selection of Python-XML tools as possible. I pored through the docs again and found a bit that I'd overlooked the first time. Earlier on, in searching on whether users of the core C RXP parser also had this problem, I came across Norm Walsh's simple instruction to one such user: "I think you need to rebuild or reconfigure RXP with Unicode support. XML isn't 8-bit."

It turns out that the PyRXP developers have provided a start toward this. From the manual, "PyRXPU is the 16-bit Unicode aware version of pyRXP. It is currently only available the source distribution of pyRXP, since it is still 'alpha' quality. Please report any bugs you find with it."

It's still odd to tie the idea of bit width of a character encoding to the foundation of an XML parser (the phrase "16-bit Unicode" is almost as meaningless as "8-bit XML character") but PyRXPU seems well worth a try.

A Conformant Version of PyRXP?

It appears that, contrary to the note in the manual, PyRXPU is only available in CVS. I grabbed and built the CVS version like so:

$ cvs -d :pserver:anonymous@cvs.reportlab.sourceforge.net:/cvsroot/reportlab login
[SNIP]
$ cvs -d :pserver:anonymous@cvs.reportlab.sourceforge.net:/cvsroot/reportlab co rl_addons/pyRXP
[SNIP]
$ cd rl_addons/pyRXP
$ python setup.py install
[SNIP]

I just hit "Enter" at the "CVS password" prompt.

Listing 3: Simple parse of XML in a file, reprise

import pyRXPU
parser = pyRXPU.Parser()
fobj = open('labels.xml').read()
#Introspection doesn't reveal any "parseFile"-like method
doc = parser.parse(fobj)

This time the parse succeeded, and I was able to start digging into the resulting data structure, as illustrated by this interpreter session started after running the script:

>>> import pprint
>>> pprint.pprint(doc)
(u'labels',
 None,
 [u'\n  ',
  (u'label',
   {u'added': u'2003-06-20'},
   [u'\n    ',
    (u'quote',
     None,
     [u'\n      \n      ',
      (u'emph', None, [u'Midwinter Spring'], None),
      u' is its own season\u2026\n    '],
     None),
    u'\n    ',
    (u'name', None, [u'Thomas Eliot'], None),
    u'\n    ',
    (u'address',
     None,
     [u'\n      ',
      (u'street', None, [u'3 Prufrock Lane'], None),
      u'\n      ',
      (u'city', None, [u'Stamford'], None),
      u'\n      ',
      (u'state', None, [u'CT'], None),
      u'\n    '],
     None),
    u'\n  '],
   None),
  u'\n  ',
  (u'label',
   {u'added': u'2003-06-10'},
   [u'\n    ',
    (u'name', None, [u'Ezra Pound'], None),
    u'\n    ',
    (u'address',
     None,
     [u'\n      ',
      (u'street', None, [u'45 Usura Place'], None),
      u'\n      ',
      (u'city', None, [u'Hailey'], None),
      u'\n      ',
      (u'state', None, [u'ID'], None),
      u'\n    '],
     None),
    u'\n  '],
   None),
  u'\n'],
 None)

I knew that the result would be a structure of Python primitives; thus, as in the last article, I used the pprint module to produce a representation I could follow easily. It's easy to see the basic pattern: elements become tuples with the node name as the first (Unicode) item, a dictionary of attributes or None as the second, and a list of contents or None as the third. The fourth is reserved for customized use. This data structure is quite simple, which is one of the attractions of PyRXPU; but it might be a bit cumbersome to navigate in order to extract patterns of data, especially in comparison to data binding tools.
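To give a feel for what navigating this structure involves, here is a minimal sketch. It assumes Python 3 and works on a hand-written fragment in the same four-item shape, so no parser is required to run it. The helper names iter_elements and text_of are hypothetical and are not part of PyRXP or PyRXPU.

```python
# A hand-written fragment in the (name, attributes, children, extra) shape
# shown above; it is not produced by a parser in this sketch.
doc = (
    "labels", None,
    [
        "\n  ",
        ("label", {"added": "2003-06-20"},
         ["\n    ", ("name", None, ["Thomas Eliot"], None), "\n  "],
         None),
        "\n  ",
        ("label", {"added": "2003-06-10"},
         ["\n    ", ("name", None, ["Ezra Pound"], None), "\n  "],
         None),
        "\n",
    ],
    None,
)


def iter_elements(node, name):
    """Yield every element tuple called `name` at or below `node`."""
    tag, attrs, children, extra = node
    if tag == name:
        yield node
    for child in children or []:
        # Text nodes are plain strings; only tuples are elements.
        if isinstance(child, tuple):
            yield from iter_elements(child, name)


def text_of(node):
    """Concatenate all text at or below an element tuple."""
    parts = []
    for child in node[2] or []:
        parts.append(text_of(child) if isinstance(child, tuple) else child)
    return "".join(parts)


names = [text_of(n) for n in iter_elements(doc, "name")]
dates = [label[1]["added"] for label in iter_elements(doc, "label")]
print(names)
print(dates)
```

Even this small task needs explicit checks for None and for string versus tuple children, which is the kind of bookkeeping a data binding tool would hide.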

As you can see, all strings are Unicode objects, which is very good. From my understanding, using the production version of PyRXP you only get "classic" string objects, which I do not recommend mixing into XML processing. You can see the character that was giving the production version such fits, that \u2026. Here it is properly treated. Nevertheless, the strange bit about "16-bit Unicode" made me wonder whether there were also any such conformance problems in PyRXPU. Certainly XML allows numerous characters above code point 65535. The following is the relevant production from the XML 1.0 spec:

Character Range

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
         | [#x10000-#x10FFFF]

The accompanying comment is "any Unicode character, excluding the surrogate blocks, FFFE, and FFFF." Note that this permissiveness will open up even more now that XML 1.1 has just become a full W3C recommendation. Some formerly forbidden characters including the range from #x1 through #x8 have been allowed, strictly in the form of character references.
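The XML 1.0 production above translates almost directly into code. The following sketch assumes Python 3, and the function name is_xml10_char is my own invention for illustration. It takes an integer code point rather than a string, which keeps it independent of how the interpreter stores text.

```python
def is_xml10_char(code_point):
    """Return True if code_point matches production [2] Char of XML 1.0."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


# An ellipsis, a Linear B syllable, a control character,
# a surrogate, and the excluded value FFFE.
samples = [0x2026, 0x10000, 0x8, 0xD800, 0xFFFE]
results = {hex(cp): is_xml10_char(cp) for cp in samples}
print(results)
```

Nothing in the test mentions bit widths: U+2026 and U+10000 are both accepted, while #x8, the surrogate #xD800, and #xFFFE are rejected.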

I tested the treatment of very high Unicode characters in PyRXPU, and it does seem to handle them well enough. If you're an archaeologist with an interest in the Mycenaean culture you might have an interest in Unicode character U+10000, "LINEAR B SYLLABLE B008 A", which is used in the XML document parsed in the following snippet:

>>> import pyRXPU
>>> p = pyRXPU.Parser()
>>> p.parse("<spam>Very high Unicode char: &#x10000;</spam>")
(u'spam', None, [u'Very high Unicode char: \U00010000'], None)

As you can see the character value becomes \U00010000 in Python. Python gets most Unicode matters right and deals with such high characters with aplomb whether you compile Python to store Unicode in 16 bits or 32 bits (again the bit width is not relevant to the Unicode character whatsoever but is merely a property of the chosen storage or encoding). It's good to have this confidence that PyRXPU is a conforming XML parser.
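The point about storage width can be checked without any XML parser. This sketch assumes Python 3.3 or later, where a string is always a sequence of code points, and it uses the standard unicodedata module. It shows one character taking different numbers of code units in different encodings.

```python
import unicodedata

high_char = "\U00010000"

# One character with one code point and one official name.
name = unicodedata.name(high_char)
length_in_code_points = len(high_char)

# In UTF-16 it is stored as two 16-bit code units, a surrogate pair.
utf16 = high_char.encode("utf-16-be")
utf16_units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

# In UTF-8 it takes four bytes; in UTF-32, a single 32-bit unit.
utf8 = high_char.encode("utf-8")
utf32 = high_char.encode("utf-32-be")

print(name, length_in_code_points, [hex(u) for u in utf16_units], len(utf8), len(utf32))
```

The surrogate values 0xD800 and 0xDC00 exist only at the UTF-16 storage level, which is why the XML character production excludes them as characters in their own right.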

Benchmarks: A Lawyer's Best Friend

ReportLab bills PyRXP as "the fastest validating XML parser available for Python, and quite possibly anywhere..." David Mertz in an independent review also lauds PyRXP's speed but does not seem to have discovered its erroneous handling of characters. I think this is a good example of why benchmarking is a very slippery exercise. It's really inappropriate to even compare PyRXP to any other XML parser: it's not a conformant XML parser and thus not an XML parser at all. As many implementors tell you, it is often the odd corners of conformance that are behind the most significant performance losses. Standardization means we sacrifice some local optimization in order to gain flexibility and interoperability. By refusing to accept a very large class of quite valid XML instances, PyRXP rather does a disservice to the entire idea of XML. I have produced tools that do not fully conform to a target standard, but in such cases I follow the usual convention that such deviations are treated as bugs. I take a rather dim view of the situation in PyRXP given that

the developers have publicly refused to remedy the non-conformance; and
the developers trumpet the speed and low memory footprint of PyRXP, even though these advantages are only made possible by scorning conformance

I found threads discussing the development of the PyRXPU variant, which actually does seem to be XML conformant. As I expected, it is roughly half as efficient as PyRXP in both speed and memory footprint. The only difference is in proper treatment of Unicode, and this demonstrates my point about the cost of conformance. I have a lot of respect for the developers of PyRXP, and I hate to be so sharp about this matter, but I think it's quite serious and merits a very unambiguous statement.

I'd also like to suggest that anyone working on benchmarks of XML processing, which are useful if well done, run the tests on a variety of hardware and operating systems, and examine a variety of XML files rather than focusing on a single one. Numerous characteristics of XML files can affect parsing and processing speed, including:

The preponderance of elements versus attributes versus text (and even comments and processing instructions)
Any repetition of element or attribute names, values and text content
The distribution of white space
The character encoding
The use of character and general entities
The input source (in-memory, string, file, URL, etc.)
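As a rough illustration of varying the input, here is a minimal sketch of a benchmark harness. It assumes Python 3 and uses the standard library's timeit and xml.etree.ElementTree modules, not PyRXP. The synthetic documents and the function names make_docs and time_parser are hypothetical, and the numbers it prints depend entirely on the machine it runs on.

```python
import timeit
import xml.etree.ElementTree as ET


def make_docs(n=200):
    """Build a few synthetic documents with different shapes."""
    element_heavy = "<r>" + "".join("<e><f>x</f></e>" for _ in range(n)) + "</r>"
    attribute_heavy = "<r>" + "".join('<e a="1" b="2" c="3"/>' for _ in range(n)) + "</r>"
    text_heavy = "<r>" + "lorem ipsum " * (n * 4) + "</r>"
    char_refs = "<r>" + "&#8230;" * (n * 4) + "</r>"
    return {
        "element-heavy": element_heavy,
        "attribute-heavy": attribute_heavy,
        "text-heavy": text_heavy,
        "character-references": char_refs,
    }


def time_parser(parse, docs, repeat=5, number=20):
    """Return the best per-parse time in seconds for each document."""
    timings = {}
    for label, source in docs.items():
        best = min(timeit.repeat(lambda: parse(source), repeat=repeat, number=number))
        timings[label] = best / number
    return timings


timings = time_parser(ET.fromstring, make_docs())
for label, seconds in sorted(timings.items()):
    print(f"{label:22s} {seconds * 1e6:8.1f} microseconds per parse")
```

Passing a different parse callable to time_parser lets the same set of documents be run against several parsers. A serious benchmark would also vary encodings, input sources, and platforms, as the list above suggests.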

I do want to point out that I'm one of the developers of cDomlette, which one might consider a competing package. This might seem a temptation to take an especially hard line with competing tools, but then again in this column I have covered the likes of ElementTree, gnosis.xml.objectify, and libxml and have never before had such a fundamental problem with any package.

Conclusion

My recommendation is to consider PyRXPU, but to avoid plain PyRXP. I hope that the former version becomes the default so that this confusing situation can be resolved. PyRXPU produces a simple and highly Pythonic data structure, though one that might be a bit tricky to navigate correctly in code. It operates quickly and offers a low memory footprint.

Development activity seems to be picking up again in the Python-XML world. Peter Yared announced Python XML Marshaller 0.2, a new Python data binding for XML available under the PSF Python license. It includes some WXS support and can generate WXS from Python data structures for round-trip support. It also has some features for customizing the binding. See the announcement.

Walter Dorwald announced XIST 2.4. Billed as an "object oriented XSLT", XIST uses an easily extensible, DOM-like view of source and target XML documents to generate HTML. This release features some API improvements, bug fixes, and a new find function for searching attributes. See the announcement.

Magnus Lie Hetland announced Atox 0.1, which allows you to write custom scripts for converting plain text into XML. You define the text-to-XML binding using a simple XML language. It's meant to be used from the command line. See the full announcement.

Arnold deVos announced GraphPath, a little XPath-like language for analysing graph-structured data, especially RDF. The implementation is in Python and works with rdflib or the Python binding of Redland. It includes a query evaluator and a goal-driven inference engine. I found this announcement interesting because GraphPath is reminiscent of our early proposals while developing the Versa RDF query language at Fourthought. I think this is an important approach to RDF query and superior to the many SQL-like query languages. It's good to see more than one development along these lines.