
Gems from the Mines: 2002 to 2003

March 2, 2005

Uche Ogbuji

In previous articles, I have worked through the archives of the XML-SIG mailing list to gather some of the most useful bits of code and threads of discussion that are still relevant. The previous articles are "Gems From the Archives" (published Apr. 9, 2003) and "More Gems From the Mines" (published Nov. 12, 2003). In this article I continue where the last one left off, mining the XML-SIG archives for 2002 and 2003. As always, I have updated code where necessary to use current APIs, style, and conventions in order to make it more immediately useful to readers. All code listings are tested using Python 2.3.4 and PyXML 0.8.4.

A note on code style. Throughout this column I try to stick to the dictates of Python Enhancement Proposal (PEP) 8: Style Guide for Python Code, which is based on an older essay by Guido van Rossum, and formalized to instruct contributors to core Python code. I have also updated code by others in this article to meet these guidelines, which I call "PEP 8 style".

Tracking general entity references

In May 2003 Randall Nortman asked a common question:

So essentially what I'm asking is how do I get PyXML to preserve "&eacute;" as-is and output it in the same manner when I PrettyPrint() it? (Or, equivalently, convert it to its Unicode representation on input and back to an entity reference on output.)

"Mark E." provided a very uncommon response. The key message is in impenetrable HTML, but contains a very interesting code fragment which I've updated for this article. It demonstrates how one can use the Expat driver for Python SAX to track general entity references as they occur in the parse. It uses a trick that takes advantage of Expat's flexibility with the mind-bending XML 1.x rules about when a non-validating parser can choose to report entity references without resolving them. This is often called "skipping" the entities, and indeed Python SAX provides a means of capturing such skipped entities. Listing 1 is pyexpat-entity-preserve.py, a significantly updated version of Mark's demo code.

Listing 1 (pyexpat-entity-preserve.py): Use SAX and Expat to track entity references in XML

import cStringIO

from xml.sax.expatreader import ExpatParser
from xml.sax.handler import ContentHandler, EntityResolver
from xml.sax.handler import property_lexical_handler


class fake_dtd_parser(ExpatParser):
    # An Expat parser class that fakes a DTD external subset
    # for XML documents that do not provide one
    def reset(self):
        ExpatParser.reset(self)
        parser = self._parser
        parser.UseForeignDTD(True)
        return


class skipped_entity_handler(ContentHandler, EntityResolver):
    def __init__(self):
        self.skipped_entities = []
        return

    def resolveEntity(self, publicId, systemId):
        # Called for all entity resolution
        if publicId or systemId:
            return EntityResolver.resolveEntity(self, publicId, systemId)
        # Because the UseForeignDTD flag is set, if there is no
        # external subset, Expat will call resolveEntity
        # with (publicId, systemId) = (None, None)
        # In this case, create a dummy, empty external
        # subset for Expat
        dummy = cStringIO.StringIO()
        return dummy

    def skippedEntity(self, name):
        # Called for all entities whose definitions
        # are unknown
        self.skipped_entities.append(name)
        return


def get_skipped_entities(xml_string):
    stream = cStringIO.StringIO(xml_string)
    parser = fake_dtd_parser()
    handler = skipped_entity_handler()
    parser.setContentHandler(handler)
    parser.setEntityResolver(handler)
    parser.parse(stream)
    stream.close()
    return handler.skipped_entities


if __name__ == "__main__":
    XML = "<spam>Entity one: &eacute; Entity two: &eggs;</spam>"
    entities = get_skipped_entities(XML)
    print "Skipped entity names", entities

First of all, the code uses SAX, but forces PyExpat as the back-end parser. The class fake_dtd_parser is a specialization of this parser that sets the UseForeignDTD flag, which instructs Expat to always try to load an external DTD subset, even if the source XML doesn't specify one. Expat tries to load this DTD by raising an event that gets turned into a SAX resolveEntity event where both the publicId and systemId parameters have been set to None. To do the right thing in this case, you have to use a SAX handler, skipped_entity_handler here, that knows how to deal with such special external DTD entity requests. skipped_entity_handler merely returns empty content as the supposed external DTD. As a side effect, since Expat now behaves as if an external DTD subset has been specified, it treats entity references it doesn't know about as skipped rather than as well-formedness errors.

Mark mentions a problem setting UseForeignDTD to True without turning off the SAX feature feature_external_ges. It's worth clarifying that the real issue is handling the None parameters to resolveEntity, triggered by UseForeignDTD. In my updated code, I handle this properly by overriding resolveEntity in skipped_entity_handler.

See the sidebar for the range of documentation links that cover all the SAX paraphernalia used in this code. I tested with Python 2.3.4, with and without PyXML 0.8.4. The UseForeignDTD flag was introduced in PyXML 0.8.0 and Python 2.3. The output is as follows:

Skipped entity names [u'eacute', u'eggs']
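
Collecting the names answers only half of Randall's question. For the other half, turning characters back into entity references on output, here is a minimal sketch. It is not taken from Mark's message. It assumes the wanted names are the HTML 4 ones in the standard library's codepoint2name table, it is written for a current Python 3 interpreter rather than the Python 2.3.4 used for Listing 1, and the function name to_entity_refs is hypothetical.

```python
from html.entities import codepoint2name


def to_entity_refs(text):
    # Hypothetical helper: replace each non-ASCII character that has an
    # HTML 4 entity name with a reference to that entity
    pieces = []
    for char in text:
        name = codepoint2name.get(ord(char))
        if ord(char) > 127 and name is not None:
            pieces.append("&%s;" % name)
        else:
            # ASCII, or no known name: pass the character through
            pieces.append(char)
    return "".join(pieces)


if __name__ == "__main__":
    print(to_entity_refs(u"Entity one: \u00e9"))
```

Keep in mind that the result is only well-formed XML if the document also declares those entities, for example through a DTD.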

Documentation links for the zoo of SAX quirks

The various quirks relevant to the entity tracker recipe in Listing 1 are covered in documentation that is unfortunately quite scattered throughout the Python standard library reference. Here are the most important links.

In addition, I try to maintain updated information on Python/SAX in my posting Basic SAX processing.

Also of Interest

Soon after the RELAX NG schema language made its debut, the indefatigable Andrew Kuchling announced (January 2002) a straightforward port of James Clark's validation algorithm to Python. See other posts by Andrew later that month for more on his progress. This effort unfortunately never became a full-blown RELAX NG validation engine, but it is still very useful code that should not be lost to posterity. It can be accessed in the PyXML CVS sandbox. Meanwhile, the cause of RELAX NG in Python was later taken up by Eric van der Vlist, author of the O'Reilly Media book RELAX NG. Eric released XVIF, a validation workflow tool that includes a mostly complete implementation of RELAX NG.

An important feature of RELAX NG is its support for data type libraries that can be easily plugged in (as opposed to, say, the monolithic data types of W3C XML Schema). Andrew Kuchling's RELAX NG efforts also led him to propose a data type library interface for Python (February 2002). Eric van der Vlist resurrected this discussion in September 2002, this time with more follow-on discussion. See also the thread entitled Some thoughts about types libraries.

Andrew Dalke offered an adapter he wrote for SAX drivers that do not support namespaces (January 2002). It translates plain element event (e.g. startElement) calls to the namespace-based (e.g. startElementNS) events. If you have need for it, you should probably rewrite it as a SAX filter (ask on the XML-SIG for more details), but I expect that given the current state of tools, most users are unlikely to need such a module, so I did not put any effort into improving the interface in this article.
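
To give an idea of what such a SAX filter might look like, here is a minimal sketch. It is not Andrew's code. The class names namespace_adapter and name_recorder and the function expanded_names are made up for illustration, the sketch targets a current Python 3 interpreter, and it cuts corners: it does not forward prefix mapping events, and an undeclared prefix simply maps to no namespace.

```python
from io import BytesIO
from xml.sax import make_parser
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase
from xml.sax.xmlreader import AttributesNSImpl


class namespace_adapter(XMLFilterBase):
    # Hypothetical filter: receives plain startElement/endElement events
    # and passes them downstream as startElementNS/endElementNS
    def __init__(self, parent=None):
        XMLFilterBase.__init__(self, parent)
        # One prefix-to-URI dict per open element; the key None stands
        # for the default namespace
        self._scopes = [{"xml": "http://www.w3.org/XML/1998/namespace"}]

    def _expand(self, qname, is_attribute=False):
        if ":" in qname:
            prefix, local = qname.split(":", 1)
        elif is_attribute:
            # Unprefixed attributes are never in a namespace
            return (None, qname)
        else:
            prefix, local = None, qname
        # An undeclared prefix simply maps to None in this sketch
        return (self._scopes[-1].get(prefix), local)

    def startElement(self, name, attrs):
        scope = dict(self._scopes[-1])
        plain = []
        for qname in attrs.getNames():
            if qname == "xmlns":
                scope[None] = attrs.getValue(qname) or None
            elif qname.startswith("xmlns:"):
                scope[qname[6:]] = attrs.getValue(qname)
            else:
                plain.append(qname)
        self._scopes.append(scope)
        values, qnames = {}, {}
        for qname in plain:
            key = self._expand(qname, is_attribute=True)
            values[key] = attrs.getValue(qname)
            qnames[key] = qname
        self._cont_handler.startElementNS(
            self._expand(name), name, AttributesNSImpl(values, qnames))

    def endElement(self, name):
        self._cont_handler.endElementNS(self._expand(name), name)
        self._scopes.pop()


class name_recorder(ContentHandler):
    # Downstream handler that only understands namespace events
    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []

    def startElementNS(self, name, qname, attrs):
        self.names.append(name)


def expanded_names(xml_bytes):
    # The default parser leaves namespace processing off, so the filter
    # sees plain element events with xmlns attributes intact
    reader = namespace_adapter(make_parser())
    recorder = name_recorder()
    reader.setContentHandler(recorder)
    reader.parse(BytesIO(xml_bytes))
    return recorder.names
```

Calling expanded_names on a small document returns the (namespace URI, local name) pairs that the downstream handler received.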

In March 2002 Dinu Gherman needed a simple way to identify the text of an XML declaration. After hearing a lot of advice on how to do so using proper XML parser architecture, he decided on a hack to guess the information. It's always better to use proper XML facilities when you can, but you never know when a hack is the only way to go.
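
For a flavor of what such a hack can look like, here is a minimal sketch using a regular expression. It is not Dinu's code. The names XML_DECL_PATTERN and guess_xml_declaration are hypothetical, the sketch targets a current Python 3 interpreter, and it assumes the text has already been decoded to a string with any byte order mark removed.

```python
import re

# Hypothetical pattern covering the three pseudo-attributes in order
XML_DECL_PATTERN = re.compile(
    r'^<\?xml\s+version\s*=\s*(["\'])(?P<version>[^"\']+)\1'
    r'(?:\s+encoding\s*=\s*(["\'])(?P<encoding>[A-Za-z][A-Za-z0-9._-]*)\3)?'
    r'(?:\s+standalone\s*=\s*(["\'])(?P<standalone>yes|no)\5)?'
    r'\s*\?>')


def guess_xml_declaration(text):
    # Returns (declaration text, encoding name or None), or None when the
    # text does not start with something that looks like a declaration
    match = XML_DECL_PATTERN.match(text)
    if match is None:
        return None
    return (match.group(0), match.group("encoding"))
```

A guess like this does no well-formedness checking at all, which is exactly why it counts as a hack.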

Martijn Faassen kicked off a long discussion on the future of PyXML by complaining about the "_xmlplus hack" that PyXML uses to serve as a drop-in replacement for the Python built-in XML libraries. After he reiterated the complaint the discussion turned to a very serious one that underscored the odd in-between status of PyXML, and in what ways the PyXML package continued to be relevant as a stand-alone. Most of these issues are still not resolved, so this thread is an important airing of considerations that affect many Python/XML users to this day.

In May 2002 Andrew Clover graced the list with a very thorough analysis documenting DOM bugs. He presented a table of versions of the various DOM-like libraries in Python's standard library and PyXML, and the quirks he ran into with each with regard to DOM API. The document is now published by the XML-SIG as DOM Standards compliance. Andrew Clover also put together some notes for updating the Python DOM API to cover DOM Level 3.

In October 2003 Roman Kennke posted some functions implementing utility methods from DOM Level 3. This module is most readily available in the archived attachments of XML-SIG, domhelper.py.

Marginalia

XML-SIG, as is natural for any active mailing list, often spills into somewhat off-topic discussion. Some of these topics are of great interest and worth preserving from the archives, even though they may not be immediately useful in solving day-to-day Python-XML development problems. One topic that kept on coming up was mainstream Web services versus REST. It was certainly not all idle banter. Important outputs of the discussion included dueling REST API proposals by Paul Prescod and Andrew Kuchling. Overall, these notes were part of February 2002's long and wide-ranging discussion on the merits and demerits of SOAP, WSDL and other Web services technologies. Eugene Eric Kim put together a marvelous summary of this discussion in the form of a "dialog map".

In Issues with Unicode type (September 2002) Eric van der Vlist points out some problems with Python's Unicode handling methods when dealing with code points represented in UTF-16 with surrogate pairs. The discussion gets pretty technical, as all Unicode discussions do, but it touches on important issues that obtain across Unicode-aware applications. It's worth pointing out that at present most Python distributors are shipping Python with the UCS-4 option, which provides for clean (if less efficient) Unicode storage. As a result, many of the problems discussed are becoming rarer in practice. Marc-Andre Lemburg graced the discussion with a draft PEP: Unicode Indexing Helper Module.
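
A little arithmetic shows where the trouble comes from. The sketch below is mine, not code from the thread, and the function names are hypothetical. It targets a current Python 3 interpreter. It splits a code point above U+FFFF into its UTF-16 surrogate pair and counts the 16-bit units a string needs. A UCS-2 build stores and indexes those units, so a single such character has length 2 there, while a UCS-4 build stores it as one item.

```python
def surrogate_pair(code_point):
    # Split a code point above U+FFFF into UTF-16 high and low surrogates
    offset = code_point - 0x10000
    return (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF))


def utf16_length(text):
    # Number of 16-bit code units needed to store the text as UTF-16
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    char = u"\U00010400"  # DESERET CAPITAL LETTER LONG I
    print([hex(unit) for unit in surrogate_pair(ord(char))])
    print(utf16_length(char))
```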

The topic of XML shorthand notations came up several times on XML-SIG. People wanted a way to efficiently author XML by hand. One of the most involved threads on this topic was launched with Scott Sweeney's announcement of SLiP and SLIDE - a quick XML shorthand syntax and tool for editing. Later on, Bud Bruegger followed up on the effort, eventually leading to ezex, his own XML shorthand prototype. For more on such XML shorthand formats, see Paul Tchistopolskii's catalog Alternatives to XML. Look for the heading "XML shorthands".

Wrap Up

As I have chronicled developments in the XML-SIG, I've noticed a clear pattern of events. The early days, from 1998 through 2000, were notable for a fevered exchange of ideas and code snippets. There was a lot of useful code I found to present, but much of it needed updating. The 2000-2001 period was characterized by the process of establishing XML standard libraries for Python. There was less frantic creativity and more focus on pedestrian details. The 2002-2003 period was dominated by bug reports and fixes, and advanced techniques. There were fewer fresh code gems to preserve, but a lot of important threads to highlight.

Back to 2005: Fredrik Lundh has released a C implementation of ElementTree, which I covered in an earlier article. This package takes advantage of C to offer high performance and low memory footprint. cElementTree is now in version 1.0.1. See the announcement.

Adam Souzis announced Rx4RDF and Rhizome 0.4.3. Rx4RDF can be used for querying, transforming and updating RDF by specifying a "deterministic mapping" of the RDF model to the XML data model defined by XPath. Rhizome is a Wiki-like content management and delivery system built on Rx4RDF. The main update is limited RDF schema support, and there are many other minor improvements and fixes. See the announcement.

    

It has been a big month for Python tools bridging EDI and XML. Jeremy Jones published an interesting article, Processing EDI Documents into XML with Python. In the article he demonstrates how to use Python text processing techniques (inspired by Dr. David Mertz's excellent book) to parse EDI and generate a Python DOM representing the data. I also found John Holland's pyx12. Pyx12 is a HIPAA X12 document validator and converter. It parses an ANSI X12N data file and validates it against the Implementation Guidelines for a HIPAA transaction. By default, it creates a 997 response. It can create an HTML representation of the X12 document or translate to an XML representation of the data file. See the announcement of version 1.1.0. If you know of other Python/EDI resources I haven't covered, please leave a comment.

I updated the Amara XML Toolkit to version 0.9.4. Amara is a toolkit for XML processing using natural Python idioms. It was the subject of my previous article. Besides improvements to the richness of the model, and assorted fixes, the main change is support for type inference, an optional feature which translates XML nodes to Python data types such as int, float and datetime, where it makes sense to do so. See the announcement.
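
To illustrate the general idea of type inference on text content, here is a minimal sketch. It is emphatically not Amara's API or implementation. The function names are hypothetical, the sketch targets a current Python 3 interpreter, and it recognizes only a single date-time layout.

```python
from datetime import datetime


def _parse_datetime(text):
    # Only one ISO 8601 layout is recognized in this sketch
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S")


def infer_value(text):
    # Hypothetical helper: try each conversion in turn and fall back to
    # the original string when none applies
    stripped = text.strip()
    for convert in (int, float, _parse_datetime):
        try:
            return convert(stripped)
        except ValueError:
            continue
    return text
```

The order of the attempts matters: int comes before float so that "42" becomes an integer rather than 42.0.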

J. David Ibáñez released itools 0.5.0, a collection of utilities. It includes the module itools.xml, which the announcement tags with the mysterious description: "XML infrastructure, includes resource handlers for XML, XHTML and HTML documents. Plus the Simple Template Language." The Web page doesn't elaborate much further. See the announcement.