XML.com: XML From the Inside Out

PyRXP is a DTD-validating XML parser developed by ReportLab. It is a Python wrapper around RXP, a C parser developed by Richard Tobin and Henry Thompson of the Edinburgh Language Technology Group as the core of LT XML, "an integrated set of XML tools and a developers' tool-kit, including a C-based API". ReportLab is a vendor of database reporting software and is very well known and respected in the Python community. PyRXP is a core component of many of ReportLab's open source and commercial components. PyRXP focuses on performance above all things by using a fast C parser and by strictly building a bare-bones Python structure of tuples and string buffers from XML source. RXP and PyRXP are both distributed under the GNU General Public License.

I downloaded the full tar/gzip distribution of PyRXP 0.9 for running on Python 2.3.2. Note: the archive does not create its own directory when unpacked, so you'll want to do so by hand:

$ mkdir pyRXP-0-9
$ cd pyRXP-0-9
$ tar zxvf ../pyRXP-0-9.tgz
[SNIP]
$ python setup.py install
[SNIP] 

Source XML for the documentation comes in the distribution, but I didn't see an obvious way to build it so I just downloaded the PDF documentation.

Character trouble in tag land

PyRXP builds a bare bones tuple-based Python structure from an XML instance. To get a flavor of this structure, I tried to parse the same document I've been using in recent explorations of Python-XML tools (Listing 1).

Listing 1: Sample XML file (labels.xml) containing address labels
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label added="2003-06-20">
    <quote>
      <!-- Mixed content -->
      <emph>Midwinter Spring</emph> is its own season&#8230;
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>  

My attempt was the code in Listing 2:

Listing 2: Simple parse of XML in a file
import pyRXP
parser = pyRXP.Parser()
fobj = open('labels.xml').read()
#Introspection doesn't reveal any "parseFile"-like method
doc = parser.parse(fobj)  

The result of this attempt was rather hair raising:

$ python listing2.py
Traceback (most recent call last):
  File "listing2.py", line 4, in ?
    doc = parser.parse(fobj)
pyRXP.Error: Error: 0x2026 is not a valid 8-bit XML character
 in unnamed entity at line 6 char 61 of [unknown]
error return=1
0x2026 is not a valid 8-bit XML character
Parse Failed!  

The problem, besides the fact that the parser seemed to fail parsing a perfectly well-formed XML document, is that the error message is unhelpful. The phrase "valid 8-bit XML character" is meaningless. The XML character set is Unicode, with the restriction that some characters are not allowed. But there is no concept of "bits" in the idea of an XML character. Each character is merely an abstract code point. A character can be encoded into a storage format associated with a standard bit length, such as UTF-8 (8-bit), but this really has nothing to do with the XML character model. To be fair, this and other concepts relating to Unicode can be rather arcane; but there are excellent resources to help clear things up, including Mike Brown's article "XML Tutorial--A reintroduction to XML with an emphasis on character encoding". For a very friendly discussion of Unicode focusing on the Python implementation there is "Unicode Support in Python (PDF)" by Marc-Andre Lemburg. I gather a lot of relevant notes on these matters in my Akara article "XML Character issues in Python".
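
To make the distinction between a character and its encoded form concrete, here is a minimal sketch. It assumes Python 3 and its built-in str type, which is a choice made for illustration only; the listings in this article target Python 2.3. It takes the horizontal ellipsis from Listing 1, U+2026, and shows that the code point stays fixed while the bytes depend on the encoding chosen.

```python
# One abstract character: HORIZONTAL ELLIPSIS, code point U+2026.
ellipsis = "\u2026"
code_point = ord(ellipsis)          # the same number under every encoding

# The same character serialized under different encodings.
utf8_bytes = ellipsis.encode("utf-8")        # three 8-bit code units
utf16_bytes = ellipsis.encode("utf-16-be")   # one 16-bit code unit

# ISO-8859-1 has no slot for this character, so direct encoding fails;
# an XML document in that encoding has to use a character reference.
try:
    ellipsis.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# The xmlcharrefreplace error handler produces the character reference form.
as_reference = ellipsis.encode("iso-8859-1", errors="xmlcharrefreplace")

print(hex(code_point), utf8_bytes, utf16_bytes, latin1_ok, as_reference)
```

None of these byte sequences is "the" character; each is just one way of storing it, which is why a phrase like "8-bit XML character" does not describe anything in the XML character model.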

At any rate, I pored over the PyRXP documentation expecting to find something I must have missed. I found a few properties that can be set on the parser and the closest I found was ExpandCharacterEntities. In effect it returns a character entity such as &#8230;, the one in the sample document, as the literal sequence of seven separate characters, starting with the ampersand and ending with the semicolon. This is a serious violation of the basic principles of XML, in which &#8230; is strictly one character rather than seven; further, it doesn't help me parse the sample file properly. I then checked the ReportLab mailing lists and found others who had run into the same problem. The responses from the developers were, more or less, that PyRXP raises a fatal error when presented with XML characters with Unicode ordinal greater than 255 (U+00FF), regardless of how they are represented. The unfortunate upshot of this is that PyRXP 0.9 is not an XML parser.
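
The claim that a character reference stands for exactly one character can be checked against any conforming parser. The sketch below assumes Python 3 and uses xml.etree.ElementTree from the standard library, which is a substitute chosen only for illustration and is not PyRXP. The one-element document is a cut-down stand-in for Listing 1, not the real file.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for Listing 1: Latin-1 declared, ellipsis as a reference.
source = (
    b'<?xml version="1.0" encoding="iso-8859-1"?>'
    b"<quote>own season&#8230;</quote>"
)

quote = ET.fromstring(source)
text = quote.text

# "own season" is ten characters; the reference adds one more, giving eleven.
# Treating the reference as seven literal characters would give seventeen.
print(len(text), hex(ord(text[-1])))
```

The declared encoding only governs how the bytes of the file are decoded. The reference is resolved to U+2026 regardless of it, which is exactly the case PyRXP 0.9 rejects.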

I only cover XML processing tools in this column; and, frankly, such a fundamental case of non-conformance would have been to my mind more than enough to disqualify PyRXP from discussion. Nevertheless, there was no way I was going to throw up my hands at this point. I have heard a lot of good things about PyRXP, and I'd like to be sure there is fair coverage of as broad a selection of Python-XML tools as possible. I pored through the docs again and found a bit that I'd overlooked the first time. Earlier on, in searching on whether users of the core C RXP parser also had this problem, I came across Norm Walsh's simple instruction to one such user: "I think you need to rebuild or reconfigure RXP with Unicode support. XML isn't 8-bit."

It turns out that the PyRXP developers have provided a start toward this. From the manual, "PyRXPU is the 16-bit Unicode aware version of pyRXP. It is currently only available the source distribution of pyRXP, since it is still 'alpha' quality. Please report any bugs you find with it."

It's still odd to tie the idea of bit width of a character encoding to the foundation of an XML parser (the phrase "16-bit Unicode" is almost as meaningless as "8-bit XML character") but PyRXPU seems well worth a try.

A Conformant Version of PyRXP?

It appears that, contrary to the note in the manual, PyRXPU is only available in CVS. I grabbed and built the CVS version like so:

$ cvs -d :pserver:anonymous@cvs.reportlab.sourceforge.net:/cvsroot/reportlab login
[SNIP]
$ cvs -d :pserver:anonymous@cvs.reportlab.sourceforge.net:/cvsroot/reportlab co rl_addons/pyRXP
[SNIP]
$ cd rl_addons/pyRXP
$ python setup.py install
[SNIP] 

I just hit "Enter" at the "CVS password" prompt.

Listing 3: Simple parse of XML in a file, reprise
import pyRXPU
parser = pyRXPU.Parser()
fobj = open('labels.xml').read()
#Introspection doesn't reveal any "parseFile"-like method
doc = parser.parse(fobj)  

This time the parse is successful, and I was able to start digging into the resulting data structure as illustrated by jumping into the interpreter after running the script:

>>> import pprint
>>> pprint.pprint(doc)
(u'labels',
 None,
 [u'\n  ',
  (u'label',
   {u'added': u'2003-06-20'},
   [u'\n    ',
    (u'quote',
     None,
     [u'\n      \n      ',
      (u'emph', None, [u'Midwinter Spring'], None),
      u' is its own season\u2026\n    '],
     None),
    u'\n    ',
    (u'name', None, [u'Thomas Eliot'], None),
    u'\n    ',
    (u'address',
     None,
     [u'\n      ',
      (u'street', None, [u'3 Prufrock Lane'], None),
      u'\n      ',
      (u'city', None, [u'Stamford'], None),
      u'\n      ',
      (u'state', None, [u'CT'], None),
      u'\n    '],
     None),
    u'\n  '],
   None),
  u'\n  ',
  (u'label',
   {u'added': u'2003-06-10'},
   [u'\n    ',
    (u'name', None, [u'Ezra Pound'], None),
    u'\n    ',
    (u'address',
     None,
     [u'\n      ',
      (u'street', None, [u'45 Usura Place'], None),
      u'\n      ',
      (u'city', None, [u'Hailey'], None),
      u'\n      ',
      (u'state', None, [u'ID'], None),
      u'\n    '],
     None),
    u'\n  '],
   None),
  u'\n'],
 None) 

I knew that the result would be a structure of Python primitives; thus, as in the last article, I used the pprint module to produce a representation I could follow easily. It's easy to see the basic pattern: elements become tuples with the node name as the first (Unicode) item, a dictionary of attributes or None as the second, and a list of contents or None as the third. The fourth is reserved for customized use. This data structure is quite simple, which is one of the attractions of PyRXPU; but it might be a bit cumbersome to navigate in order to extract patterns of data, especially in comparison to data binding tools.
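
As a sketch of what navigating this structure involves, the helpers below walk the (name, attributes, children, reserved) tuples shown in the pprint output. The tree literal is abbreviated and typed by hand from that output so that the example does not require pyRXPU. The function names iter_elements and text_of are hypothetical and are not part of any PyRXP API. The code assumes Python 3.

```python
# Abbreviated, hand-typed copy of the structure printed above:
# each element is (name, attributes-or-None, children-or-None, reserved).
doc = (
    "labels", None, [
        "\n  ",
        ("label", {"added": "2003-06-20"}, [
            ("name", None, ["Thomas Eliot"], None),
            ("address", None, [
                ("city", None, ["Stamford"], None),
                ("state", None, ["CT"], None),
            ], None),
        ], None),
        "\n  ",
        ("label", {"added": "2003-06-10"}, [
            ("name", None, ["Ezra Pound"], None),
            ("address", None, [
                ("city", None, ["Hailey"], None),
                ("state", None, ["ID"], None),
            ], None),
        ], None),
    ], None,
)


def iter_elements(node, name):
    """Yield every element tuple called `name`, in document order."""
    tag, _attrs, children, _reserved = node
    if tag == name:
        yield node
    for child in children or []:
        if isinstance(child, tuple):      # strings are text nodes; skip them
            yield from iter_elements(child, name)


def text_of(node):
    """Concatenate all text found beneath an element tuple."""
    parts = []
    for child in node[2] or []:
        parts.append(text_of(child) if isinstance(child, tuple) else child)
    return "".join(parts)


names = [text_of(n) for n in iter_elements(doc, "name")]
added = [label[1]["added"] for label in iter_elements(doc, "label")]
print(names, added)
```

Every access has to allow for None in the attribute and child slots and has to tell text strings apart from element tuples, which is the sort of bookkeeping a data binding tool would otherwise hide.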

As you can see, all strings are Unicode objects, which is very good. From my understanding, using the production version of PyRXP you only get "classic" string objects, which I do not recommend mixing into XML processing. You can see the character that was giving the production version such fits, that \u2026. Here it is properly treated. Nevertheless, the strange bit about "16-bit Unicode" made me wonder whether there were also any such conformance problems in PyRXPU. Certainly XML allows numerous characters above code point 65535. The following is the relevant production from the XML 1.0 spec:

Character Range
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
         | [#x10000-#x10FFFF]  

The accompanying comment is "any Unicode character, excluding the surrogate blocks, FFFE, and FFFF." Note that this permissiveness will open up even more now that XML 1.1 has just become a full W3C recommendation. Some formerly forbidden characters including the range from #x1 through #x8 have been allowed, strictly in the form of character references.
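
The XML 1.0 production above translates directly into a range check. This is a minimal sketch in Python 3; the function name is_xml10_char is hypothetical, and it covers only XML 1.0, not the relaxed rules of XML 1.1.

```python
def is_xml10_char(code_point):
    """Return True if code_point matches production [2] Char of XML 1.0."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


samples = {
    "ellipsis U+2026": 0x2026,
    "first supplementary U+10000": 0x10000,
    "surrogate U+D800": 0xD800,
    "noncharacter U+FFFE": 0xFFFE,
    "control U+0001": 0x1,
}
for label, cp in samples.items():
    print(label, is_xml10_char(cp))
```

Nothing in the production stops at 255 or at 65535, so a parser that rejects everything above either limit is refusing characters the specification explicitly allows.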

I tested the treatment of very high Unicode characters in PyRXPU, and it does seem to handle them well enough. If you're an archaeologist with an interest in the Mycenaean culture you might have an interest in Unicode character U+10000, "LINEAR B SYLLABLE B008 A", which is used in the XML document parsed in the following snippet:

>>> import pyRXPU
>>> p = pyRXPU.Parser()
>>> p.parse("<spam>Very high Unicode char: &#x10000;</spam>")
(u'spam', None, [u'Very high Unicode char: \U00010000'], None)  

As you can see the character value becomes \U00010000 in Python. Python gets most Unicode matters right and deals with such high characters with aplomb whether you compile Python to store Unicode in 16 bits or 32 bits (again the bit width is not relevant to the Unicode character whatsoever but is merely a property of the chosen storage or encoding). It's good to have this confidence that PyRXPU is a conforming XML parser.
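
The parenthetical about storage width can be illustrated with a short sketch. It assumes Python 3.3 or later, whereas the interpreter session above is Python 2.3. The character is a single code point no matter how many code units a given encoding needs for it.

```python
import unicodedata

linear_b = "\U00010000"

# One character, one code point.
name = unicodedata.name(linear_b)
count = len(linear_b)

# UTF-16 needs two 16-bit code units (a surrogate pair) for it,
# UTF-32 needs one 32-bit unit, and UTF-8 needs four 8-bit units.
utf16 = linear_b.encode("utf-16-be")
utf32 = linear_b.encode("utf-32-be")
utf8 = linear_b.encode("utf-8")

print(name, count, utf16.hex(), utf32.hex(), utf8.hex())
```

The surrogate pair is an artifact of UTF-16 storage only. The surrogate code points themselves are excluded from the Char production, so they never appear as characters in a well-formed document.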

Benchmarks: A Lawyer's Best Friend

ReportLab bills PyRXP as "the fastest validating XML parser available for Python, and quite possibly anywhere..." David Mertz in an independent review also lauds PyRXP's speed but does not seem to have discovered its erroneous handling of characters. I think this is a good example of why benchmarking is a very slippery exercise. It's really inappropriate to even compare PyRXP to any other XML parser: it's not a conformant XML parser and thus not an XML parser at all. As many implementors tell you, it is often the odd corners of conformance that are behind the most significant performance losses. Standardization means we sacrifice some local optimization in order to gain flexibility and interoperability. By refusing to accept a very large class of quite valid XML instances, PyRXP rather does a disservice to the entire idea of XML. I have produced tools that do not fully conform to a target standard, but in such cases I follow the usual convention that such deviations are treated as bugs. I take a rather dim view of the situation in PyRXP given that

  1. the developers have publicly refused to remedy the non-conformance; and
  2. the developers trumpet the speed and low memory footprint of PyRXP, even though these advantages are only made possible by scorning conformance

I found threads discussing the development of the PyRXPU variant, which actually does seem to be XML conformant. As I expected, it is about half as efficient as PyRXP in both speed and memory footprint. The only difference is in proper treatment of Unicode, and this demonstrates my point about the cost of conformance. I have a lot of respect for the developers of PyRXP, and I hate to be so sharp about this matter, but I think it's quite serious and merits a very unambiguous statement.

I'd also like to ask anyone working on benchmarks of XML processing, which are useful if well done, to run the tests on a variety of hardware and operating systems, and to examine a variety of XML files rather than focusing on a single one. Numerous characteristics of XML files can affect parsing and processing speed, including:

  • The preponderance of elements versus attributes versus text (and even comments and processing instructions)
  • Any repetition of element or attribute names, values and text content
  • The distribution of white space
  • The character encoding
  • The use of character and general entities
  • The input source (in-memory, string, file, URL, etc.)
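
A benchmark that varies these characteristics might be sketched as follows. The document shapes, sizes, and the names make_documents and benchmark are all hypothetical, and the parser timed here is the standard library's xml.etree.ElementTree under Python 3, used only as a stand-in. The harness accepts any callable that takes a byte string, so another parser could be passed in its place.

```python
import timeit
import xml.etree.ElementTree as ET


def make_documents(n=2000):
    """Build small synthetic documents with different shapes (hypothetical mix)."""
    return {
        "element-heavy": "<r>" + "<a><b><c/></b></a>" * n + "</r>",
        "attribute-heavy": "<r>" + '<a x="1" y="2" z="3"/>' * n + "</r>",
        "text-heavy": "<r>" + "lorem ipsum dolor sit amet " * (n * 4) + "</r>",
        "reference-heavy": "<r>" + "&#8230;&amp;&lt;" * n + "</r>",
        "whitespace-heavy": "<r>" + "\n    <a/>\n" * n + "</r>",
    }


def benchmark(parse, documents, repeat=3, number=5):
    """Return the best time in seconds per parse for each document."""
    results = {}
    for label, text in documents.items():
        data = text.encode("utf-8")
        timings = timeit.repeat(lambda: parse(data), repeat=repeat, number=number)
        results[label] = min(timings) / number
    return results


documents = make_documents()
results = benchmark(ET.fromstring, documents)
for label, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{label:18s} {seconds * 1000:8.3f} ms")
```

The sketch covers only in-memory input in a single encoding on one machine. A fuller comparison would also vary the encoding, the input source, and the hardware, as the list above suggests.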

I do want to point out that I'm one of the developers of cDomlette, which one might consider a competing package. This might seem a temptation to take an especially hard line with competing tools, but then again in this column I have covered the likes of ElementTree, gnosis.xml.objectify, and libxml and have never before had such a fundamental problem with any package.

Conclusion

My recommendation is to consider PyRXPU, but to avoid plain PyRXP. I hope that the former version becomes the default so that this confusing situation can be resolved. PyRXPU produces a simple and highly Pythonic data structure, though one that might be a bit tricky to navigate correctly in code. It operates quickly and offers a low memory footprint.

Development activity seems to be picking up again in the Python-XML world. Peter Yared announced Python XML Marshaller 0.2, a new Python data binding for XML available under the PSF Python license. It includes some WXS support and can generate WXS from Python data structures for round-trip support. It also has some features for customizing the binding. See the announcement.

    

Walter Dorwald announced XIST 2.4. Billed as an "object oriented XSLT", XIST uses an easily extensible, DOM-like view of source and target XML documents to generate HTML. This release features some API improvements, bug fixes, and a new find function for searching attributes. See the announcement.

Magnus Lie Hetland announced Atox 0.1 which allows you to write custom scripts for converting plain text into XML. You define the text to XML binding using a simple XML language. It's meant to be used from the command line. See the full announcement.

Arnold deVos announced GraphPath, a little XPath-like language for analysing graph-structured data, especially RDF. The implementation is in Python and works with rdflib or the Python binding of Redland. It includes a query evaluator and a goal-driven inference engine. I found this announcement interesting because GraphPath is reminiscent of our early proposals while developing the Versa RDF query language at Fourthought. I think this is an important approach to RDF query and superior to the many SQL-like query languages. It's good to see more than one development along these lines.




  1. 2004-05-25 14:03:09 
    First of all, the formatting for this article looks totally awful on Mozilla browsers - I have to scroll 3x the window width to read it. What a headache. Had to cut 'n' paste it into an editor.


    But thanks for the very informative post. I'll keep this in mind as I choose the applications I use with pyrxp.


    I agree, they should be more honest on the website about their performance, but it's an open-source project so it's not like there's a profit motive. It's just a marketing ploy to get more people to use the thing.


    If you have control over all possible inputs to your program, then pyrxp is the way to go. If you plan to run a web service and communicate via XML messages to other machines all over the world, you can't use pyrxp to parse the incoming messages. You have to use pyrxpU. Good to know!


    And nice that they have the same interface, so you can use the fast one until you run into problems, and then switch to pyrxpU when you need it.

  2. Discussion of this matter on Usenet
    2004-02-29 20:18:35 Uche Ogbuji
    I wanted to point out that there has been a good deal of discussion of this PyRXP conformance matter recently on Usenet.  See this thread.



    --Uche

  3. PyRXP unicode conformance
    2004-02-29 00:19:46 Arno Paehler
    I find the comments on "non-conformance"
    do not do justice to PyRXP.
    I use PyRXP extensively. My problem set
    does not contain ANY unicode and "conformance"
    hence is a non-issue.
    Memory footprint and speed in this case DO
    however matter, as I am dealing with
    processing roughly 50,000 files with on-disk
    sizes of up to 450 MB for the largest file
    containing roughly 900,000 tags.
    Processing time of said file with an Athlon
    XP 1800+ and 1.5 GB memory was about 20 minutes
    and I do not dare to imagine what time any
    DO(o)Med implementation might take.
    Summary: you need conformance because you
    are an archaeologist, try something else.
    BUT PyRXP is an outstanding tool, disqualified
    and trashed in a very unqualified way.
    • PyRXP XML non-conformance
      2004-02-29 07:38:04 Uche Ogbuji
      PyRXP does not conform to the XML 1.0 standard, although it claims to.  This is inimical to the very idea of standards.  Standards were not designed to satisfy your specific requirements on your specific machine. They are designed with tradeoffs for everyone in mind.  In XML one of the most important tradeoffs is that Unicode is the fundamental basis for XML, even though clearly processing Unicode is more expensive than processing more limited character sets.


      If you need performance greater than what XML can accommodate, you should not be using XML. Plain text parsing options are a couple of orders of magnitude more efficient than PyRXP, so why do you put up with even PyRXP's relative slowness and bloat?


      Your sentence talking about "archaeologists" seems to indicate that you didn't even read the article. I ran into PyRXP's non-conformance while parsing a file that used an ellipsis, which is a character, you'd probably have to admit, used by far more people than archaeologists. The specific example in which I selected a character that happens to be of interest to archaeologists was when I was actually proving a positive of PyRXPU, the true XML parser in the Python/RXP family, against a specific corner case.


      I never indicated that expressing Linear B is likely to be a common need. But are you trying to minimize the need in the real world for expressing Arabic, Chinese, Japanese, Korean and even the many high unicode characters used in European language documents such as smart quotes, em and en dashes and ellipses? PyRXP cannot handle any of these.


      If you don't like the specifications that others make, then go invent your own (I myself have done this before and expect to do so again), but don't then try to confuse people about what you've done. There is already a standard called XML and PyRXP does not conform to it, so no one should cause confusion by calling PyRXP an XML parser. I've done my bit to reduce the confusion by explaining the facts in detail. What you do with that information is your choice.


      --Uche

      • PyRXP XML non-conformance
        2004-06-24 15:30:16 
        I use PyRXP extensively so my problem set
        does not contain ANY unicode and "conformance"
        hence is a non-issue. Memory footprint and speed in this case DO however matter, as I am dealing with processing roughly 50,000 files with on-disk
        sizes of up to 450 MB for the largest file
        containing roughly 900,000 tags.
        • PyRXP XML non-conformance
          2004-06-29 14:47:16 Uche Ogbuji
          As I pointed out in the article, if you are not using XML, then you don't need to deal with all the mess of angle brackets in the first place.


          I can easily process 450MB and even 900MB files 10 times faster than PyRXP can by writing usually a dozen lines of Python. When I want to actually use XML (which is defined by the W3C spec whether you like it or not) things become vastly more complex, and thus much slower.


          If all you're using is plain text, then just use plain text. If, however, you're using XML, which *is* Unicode (it is meaningless to say "there is no Unicode in my XML data"), then use an XML tool. My main point in this article is that PyRXP is not an XML tool. Full stop.


          This is really not that subtle a point.


          Why people insist on mucking around with pointy brackets and attributes when all they need is a plain text CSV or INI-like format is beyond me, nor is it a question that interests me.


          --Uche
