XML.com: XML From the Inside Out


Using SAX for Proper XML Output


March 12, 2003

In an earlier Python and XML column I discussed ways to achieve proper XML output from Python programs. That discussion included basic considerations and techniques in generating XML output in Python code. I also introduced a couple of useful functions for helping with correct output: xml.sax.saxutils.escape from core Python 2.x and Ft.Xml.Lib.String.TranslateCdata from 4Suite. There are other tools for helping with XML generation. In this article I introduce an important one that comes with Python itself. Generating XML from Python is one of the most common XML-related tasks the average Python user will face; thus, having more than one way to complete such a common task is especially helpful.

Pushing SAX

Probably the most effective general approach to creating safe XML output is to use SAX more fully than just cherry-picking xml.sax.saxutils.escape. Most users think of SAX as an XML input system, which is generally correct. Thanks to some goodies in Python's SAX implementation, however, you can also use it as an XML output tool. First of all, Python's SAX is implemented with objects which have methods representing each XML event, so any code that calls these methods on a SAX handler can masquerade as an XML parser. Your code can thus pretend to be a parser sending events read from serialized XML, while actually computing the events in whatever manner you require. On the other end of things, xml.sax.saxutils.XMLGenerator, documented in the official Python library reference, is a utility SAX handler that comes with Python. It takes a stream of SAX events and serializes them to an XML document, observing all the necessary rules in the process.
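To make the "masquerading parser" idea concrete, here is a minimal sketch. It assumes Python 3 rather than the Python 2 used in this article's listings, and the names emit_note and EventRecorder are purely illustrative. The same driver function sends events to an ordinary handler and to XMLGenerator, and neither handler can tell that no parser is involved.

```python
import io
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def emit_note(handler, text):
    """Pretend to be a parser: fire SAX events at any content handler."""
    handler.startDocument()
    handler.startElement("note", AttributesImpl({"lang": "en"}))
    handler.characters(text)
    handler.endElement("note")
    handler.endDocument()


class EventRecorder(ContentHandler):
    """Hypothetical handler that only records which elements were opened."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.opened = []

    def startElement(self, name, attrs):
        self.opened.append(name)


# The same driver feeds an ordinary handler...
recorder = EventRecorder()
emit_note(recorder, "Tom & Jerry")

# ...or XMLGenerator, which turns the events into escaped XML text.
out = io.StringIO()
emit_note(XMLGenerator(out, "utf-8"), "Tom & Jerry")
xml_text = out.getvalue()
print(xml_text)
```

Note that the ampersand is passed in raw; the escaping happens inside XMLGenerator, not in the calling code.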

You might have gathered from this description how this tandem of facilities leads to an elegant method for emitting XML. If not, listing 1 illustrates just how this technique may be used to implement the code pattern from the earlier XML output article (that is, creating an XML-encoded log file).

Listing 1 (listing1.py): Generating an XML log file using Python's SAX utilities
import time
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl

#Names for the numeric log levels.  Index 2 is ERROR, as in the
#sample output in this article; the other names are illustrative
LOG_LEVELS = ['DEBUG', 'WARNING', 'ERROR']


class xml_logger:
    def __init__(self, output, encoding):
        """
        Set up a logger object, which takes SAX events and outputs
        an XML log file
        """
        logger = XMLGenerator(output, encoding)
        #Emit the XML declaration before anything else
        logger.startDocument()
        attrs = AttributesNSImpl({}, {})
        logger.startElementNS((None, u'log'), u'log', attrs)
        self._logger = logger
        self._output = output
        self._encoding = encoding

    def write_entry(self, level, msg):
        """
        Write a log entry to the logger
        level - the level of the entry
        msg   - the text of the entry.  Must be a Unicode object
        """
        #Note: in a real application, I would use ISO 8601 for the date
        #asctime used here for simplicity
        now = time.asctime(time.localtime())
        attr_vals = {
            (None, u'date'): now,
            (None, u'level'): LOG_LEVELS[level],
            }
        attr_qnames = {
            (None, u'date'): u'date',
            (None, u'level'): u'level',
            }
        attrs = AttributesNSImpl(attr_vals, attr_qnames)
        self._logger.startElementNS((None, u'entry'), u'entry', attrs)
        #The message becomes the text content of the entry element
        self._logger.characters(msg)
        self._logger.endElementNS((None, u'entry'), u'entry')

    def close(self):
        """
        Clean up the logger object
        """
        self._logger.endElementNS((None, u'log'), u'log')
        self._logger.endDocument()

if __name__ == "__main__":
    #Test it out
    import sys
    xl = xml_logger(sys.stdout, 'utf-8')
    xl.write_entry(2, u"Vanilla log entry")
    #Close the top-level element so the output is well-formed
    xl.close()

I've arranged the logic in a class that encapsulates the SAX machinery. xml_logger is initialized with an output file object and an encoding to use. First I set up an XMLGenerator instance which will accept SAX events and emit XML text. I immediately start using it by sending SAX events to initialize the document and create a wrapper element for the overall log. You should not forget to send startDocument. In opening the top-level element, log, I use the namespace-aware SAX API, even though the log XML documents do not use namespaces. This is just to make the example a bit richer, since the namespace-aware APIs are more complex than the plain ones.
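For comparison, here is a rough sketch of the same log written with the plain, non-namespace API (startElement and AttributesImpl). It assumes Python 3, and the helper name write_plain_entry and the fixed timestamp are illustrative only. It also uses an ISO 8601 date, as the comment in the listing suggests a real application would.

```python
import datetime
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def write_plain_entry(gen, level_name, msg, when):
    """Emit one <entry> element using the non-namespace SAX calls."""
    # With the plain API, attributes are a single dict of name -> value.
    attrs = AttributesImpl({
        "date": when.isoformat(),  # ISO 8601 instead of asctime
        "level": level_name,
    })
    gen.startElement("entry", attrs)
    gen.characters(msg)
    gen.endElement("entry")


out = io.StringIO()
gen = XMLGenerator(out, "utf-8")
gen.startDocument()
gen.startElement("log", AttributesImpl({}))
# A fixed timestamp keeps the sketch reproducible.
write_plain_entry(gen, "ERROR", "Vanilla log entry",
                  datetime.datetime(2003, 3, 8, 8, 55, 11))
gen.endElement("log")
gen.endDocument()
plain_log = out.getvalue()
print(plain_log)
```

The plain calls take a bare element name and one dictionary, where the namespace-aware calls need a (namespace, local-name) tuple, a qname, and two dictionaries.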

You ordinarily don't have to worry about how the instances of attribute information are created, unless you're writing a driver, filter, or any other SAX event emitter such as this one. Unfortunately for such users, the creation APIs for the AttributesImpl and AttributesNSImpl classes are not as well documented as the read APIs. It's not even clear whether they are at all standardized. The system used in the listing does work with all recent Python/SAX and PyXML SAX versions. In the case of the namespace-aware attribute information class, you have to pass in two dictionaries. One maps a tuple of namespace and local name to values, and the other maps the same to the qnames used in the serialization. This may seem a rather elaborate protocol, but it is designed to closely correspond to the standard read API for these objects. In the initializer in the listing I create an empty AttributesNSImpl object by initializing it with two empty dictionaries. You can see how this works when there are actually attributes by looking in the write_entry method.
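The two-dictionary protocol is easier to see with a real namespace in play. The following is a small sketch, assuming Python 3; the namespace URI and the lg prefix are hypothetical. It builds both kinds of attribute objects and then reads them back through the standard read API.

```python
from xml.sax.xmlreader import AttributesImpl, AttributesNSImpl

# Plain attributes: one dictionary mapping names to values.
plain = AttributesImpl({"level": "ERROR"})
plain_level = plain.getValue("level")

# Namespace-aware attributes: two dictionaries keyed by the same
# (namespace, local-name) tuples.
LOG_NS = "urn:example:log"  # hypothetical namespace URI
values = {(LOG_NS, "level"): "ERROR"}
qnames = {(LOG_NS, "level"): "lg:level"}
ns_attrs = AttributesNSImpl(values, qnames)

# The read API can go from either kind of name to the value,
# and translate between the two kinds of name.
by_name = ns_attrs.getValue((LOG_NS, "level"))
by_qname = ns_attrs.getValueByQName("lg:level")
qname = ns_attrs.getQNameByName((LOG_NS, "level"))
name = ns_attrs.getNameByQName("lg:level")

print(plain_level, by_name, by_qname, qname, name)
```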

Once the AttributesNSImpl object is ready, creating an element is a simple matter of calling the startElementNS method on the SAX handler using the (namespace, local-name), qname convention and attribute info object. And don't forget to call the endElementNS method to close the element. In the initializer of xml_logger, closing the top-level element and document itself is left for later. The caller must call the close method to wrap things up and have well-formed output. The rest of the xml_logger class should be easy enough to follow.
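The log documents here do not use namespaces, but for documents that do, the same calls apply with a non-None namespace in the tuple. The sketch below assumes Python 3 and a hypothetical namespace URI and prefix. As far as I can tell from current CPython, XMLGenerator takes the serialized prefix from the mapping declared with startPrefixMapping rather than from the qname argument, so the mapping has to be declared before the element is opened.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl

LOG_NS = "urn:example:log"  # hypothetical namespace URI

out = io.StringIO()
gen = XMLGenerator(out, "utf-8")
gen.startDocument()

# Declare the prefix first; the declaration is written on the
# next element that is opened.
gen.startPrefixMapping("lg", LOG_NS)
gen.startElementNS((LOG_NS, "log"), "lg:log", AttributesNSImpl({}, {}))
gen.characters("ok")

# Close the element before ending the prefix mapping, mirroring
# the order in which they were opened.
gen.endElementNS((LOG_NS, "log"), "lg:log")
gen.endPrefixMapping("lg")
gen.endDocument()

namespaced_xml = out.getvalue()
print(namespaced_xml)
```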

The character of SAX

In the last article on XML output I walked through all the gotchas of proper character encoding. This SAX method largely frees you from the worry of all that. The most important thing to remember is to use Unicode objects rather than strings in your API calls. This follows the principle I recommended in the last article: In all public APIs for XML processing, character data should be passed in strictly as Python Unicode objects.

There are in fact a few areas where simple, ASCII-only strings are safe: for example, output encodings passed to the initializer of XMLGenerator and similar cases. But these areas are unusual. Listing 2 demonstrates a use of the xml_logger class to output a more interesting log entry.

Listing 2: Using xml_logger to emit non-ASCII and escaped characters
from listing1 import xml_logger

import cStringIO

stream = cStringIO.StringIO()

xl = xml_logger(stream, 'utf-8')
xl.write_entry(2, u"In any triangle, each interior angle < 90\u00B0")
#Close the log so that the captured document is well-formed
xl.close()
print repr(stream.getvalue())

I use cStringIO to capture the output as a string. I then display the Python representation of the output in order to be clear about what is produced. The resulting string is basically (rearranged to display nicely here):

<?xml version="1.0" encoding="utf-8"?>
<log><entry level="ERROR" date="Sat Mar  8 08:55:11 2003">
In any triangle, each interior angle &lt; 90\xc2\xb0
</entry></log>

You can see that the character passed in as "<" has been escaped to "&lt;" and that the character given using the Unicode character escape "\u00B0" (the degree symbol) is rendered as the UTF-8 sequence "\xc2\xb0". If I specify a different encoding for output, as in listing 3, the library will again handle things.

Listing 3: Using xml_logger to emit non-ASCII and escaped characters with ISO-8859-1 encoding
from listing1 import xml_logger

import cStringIO

stream = cStringIO.StringIO()

xl = xml_logger(stream, 'iso-8859-1')
xl.write_entry(2, u"In any triangle, each interior angle < 90\u00B0")
#Close the log so that the captured document is well-formed
xl.close()
print repr(stream.getvalue())

Which results in:

<?xml version="1.0" encoding="iso-8859-1"?>
<log><entry level="ERROR" date="Sat Mar  8 09:35:56 2003">
In any triangle, each interior angle &lt; 90\xb0
</entry></log>
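The same comparison can be made side by side. This is a sketch only, assuming Python 3, where XMLGenerator can write to a binary stream such as io.BytesIO and encodes the text itself; the helper name serialize_text is illustrative. On the CPython 3 releases I have looked at, a character the target encoding cannot represent (as with us-ascii below) comes out as a numeric character reference rather than raising an error.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

SAMPLE = "In any triangle, each interior angle < 90\u00B0"


def serialize_text(text, encoding):
    """Return the bytes written for one <entry> element holding text."""
    raw = io.BytesIO()
    gen = XMLGenerator(raw, encoding)
    gen.startDocument()
    gen.startElement("entry", AttributesImpl({}))
    gen.characters(text)
    gen.endElement("entry")
    gen.endDocument()
    return raw.getvalue()


utf8_bytes = serialize_text(SAMPLE, "utf-8")
latin1_bytes = serialize_text(SAMPLE, "iso-8859-1")
ascii_bytes = serialize_text(SAMPLE, "us-ascii")

# Show the raw bytes so the encoded degree sign is visible.
for data in (utf8_bytes, latin1_bytes, ascii_bytes):
    print(repr(data))
```

In each case the "<" is escaped in the same way; only the bytes used for the degree sign and the encoding named in the XML declaration differ.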

If you use encodings which aren't in the Unicode or ISO-8859 family, or which are not available in the "encodings" directory of the Python library, you may have to download third-party codecs in order to use them in your XML processing. This includes the popular JIS, Big-5, GB, KS, and EUC variants in Asia. You may already have these installed for general processing; if not, it requires a significant amount of sleuthing right now to find them. Eventually they may be available all together in the Python Codecs project. For now you can download particular codecs from projects such as Python Korean Codecs and Tamito Kajiyama's Japanese codecs (page in Japanese).

Other Developments

The built-in SAX library is but one of the available tools for dealing with all the complexities of XML output. It has the advantage of coming with Python, but in future columns I will cover other options available separately. Another useful but less common SAX usage pattern is chaining SAX filters. Soon after this article is published, I'll have an article out with more information on using SAX filters with Python's SAX. Watch my publications list to see when it appears.

The past month or so has been another busy period for Python-XML development. There has been a lot of discussion of the future direction of the PyXML project. Martijn Faassen made "a modest proposal" for changing the fact that PyXML overwrites the xml module in a Python installation. This led to the "Finding _xmlplus in Python 2.3a2" thread, in which I proposed that parts of PyXML, pysax, and the dom package (excepting 4DOM) should simply be moved into the Python core. Discussion of these matters is still proceeding, but if you are interested in the road map for PyXML, you might wish to join the discussion.



Francesco Garelli announced Satine, an interesting package which converts XML documents to Python lists of objects which have Python attributes mirroring the XML element attributes, a data structure he calls an "xlist". The package is designed for speed, with key parts coded in C. It also has a web services module which supports plain XML and SOAP over HTTP. Garelli would be grateful for contributors of binary packages on various platforms.

David Mertz announced the 1.0.6 release of gnosis XML tools. Most of the changes have to do with the gnosis.magic module, which isn't directly related to XML, but there are some XML bug fixes.

Mark Bucciarelli was having problems handling WSDL, which eventually led to his contributing a patch to wsdllib that makes it work with the most recent 4Suite. I'll release an updated version of wsdllib that incorporates this patch.
