
Using SAX for Proper XML Output
In an earlier Python and XML
column I discussed ways to achieve proper XML output from Python
programs. That discussion included basic considerations and techniques in
generating XML output in Python code. I also introduced a couple of
useful functions for helping with correct output:
xml.sax.saxutils.escape from core Python 2.x and
Ft.Xml.Lib.String.TranslateCdata from 4Suite. There
are other tools for helping with XML generation. In this article I
introduce an important one that comes with Python itself. Generating XML
from Python is one of the most common XML-related tasks the average Python
user will face; thus, having more than one way to complete such a common
task is especially helpful.
Pushing SAX
Probably the most effective general approach to creating safe XML
output is to use SAX more fully than just cherry-picking
xml.sax.saxutils.escape. Most users think of SAX as an XML
input system, which is generally correct. However, thanks to some
goodies in Python's SAX implementation, you can also use it as an XML
output tool. First of all, Python's SAX is implemented with objects which
have methods representing each XML event. So any code that calls these
methods on a SAX handler can masquerade as an XML parser. Thus, your code
can pretend to be an XML parser, sending events as if from serialized XML,
while actually computing the events in whatever manner you require. On
the other end of things, xml.sax.saxutils.XMLGenerator, documented in
the official Python library reference, is a utility SAX handler that
comes with Python. It takes a stream of SAX events and serializes them to
an XML document, observing all the necessary rules in the process.
You might have gathered from this description how this tandem of facilities leads to an elegant method for emitting XML. If not, Listing 1 illustrates how the technique can be used to implement the code pattern from the earlier XML output article (that is, creating an XML-encoded log file).
Listing 1 (listing1.py): Generating an XML log file using Python's SAX utilities

import time
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl

LOG_LEVELS = ['DEBUG', 'WARNING', 'ERROR']

class xml_logger:
    def __init__(self, output, encoding):
        """
        Set up a logger object, which takes SAX events and outputs
        an XML log file
        """
        logger = XMLGenerator(output, encoding)
        logger.startDocument()
        attrs = AttributesNSImpl({}, {})
        logger.startElementNS((None, u'log'), u'log', attrs)
        self._logger = logger
        self._output = output
        self._encoding = encoding
        return

    def write_entry(self, level, msg):
        """
        Write a log entry to the logger
        level - the level of the entry
        msg   - the text of the entry.  Must be a Unicode object
        """
        #Note: in a real application, I would use ISO 8601 for the date
        #asctime used here for simplicity
        now = time.asctime(time.localtime())
        attr_vals = {
            (None, u'date'): now,
            (None, u'level'): LOG_LEVELS[level],
            }
        attr_qnames = {
            (None, u'date'): u'date',
            (None, u'level'): u'level',
            }
        attrs = AttributesNSImpl(attr_vals, attr_qnames)
        self._logger.startElementNS((None, u'entry'), u'entry', attrs)
        self._logger.characters(msg)
        self._logger.endElementNS((None, u'entry'), u'entry')
        return

    def close(self):
        """
        Clean up the logger object
        """
        self._logger.endElementNS((None, u'log'), u'log')
        self._logger.endDocument()
        return

if __name__ == "__main__":
    #Test it out
    import sys
    xl = xml_logger(sys.stdout, 'utf-8')
    xl.write_entry(2, u"Vanilla log entry")
    xl.close()

I've arranged the logic in a class that encapsulates the SAX machinery.
xml_logger is initialized with an output file object and an
encoding to use. First I set up an XMLGenerator instance
which will accept SAX events and emit XML text. I immediately start using
it by sending SAX events to initialize the document and create a wrapper
element for the overall log. You should not forget to send
startDocument. In opening the top-level element,
log, I use the namespace-aware SAX API, even though the log
XML documents do not use namespaces. This is just to make the example a
bit richer, since the namespace-aware APIs are more complex than the plain
ones.
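For comparison, here is a minimal sketch of the same pattern using the plain (non-namespace) API: startElement and endElement take a simple name string, and AttributesImpl takes a single dictionary. It is written for Python 3, where every str is already Unicode, and write_plain_log is a hypothetical helper name introduced only for this illustration, not part of the listing.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def write_plain_log(entries):
    """Serialize (level, message) pairs with the plain SAX API.

    Hypothetical helper for illustration; returns the XML as text.
    """
    out = io.StringIO()
    gen = XMLGenerator(out, 'utf-8')
    gen.startDocument()
    # Plain API: element names are simple strings, and attributes
    # are a single mapping of name to value.
    gen.startElement('log', AttributesImpl({}))
    for level, msg in entries:
        gen.startElement('entry', AttributesImpl({'level': level}))
        gen.characters(msg)
        gen.endElement('entry')
    gen.endElement('log')
    gen.endDocument()
    return out.getvalue()


result = write_plain_log([('ERROR', 'a < b')])
print(result)
```

The "<" in the message comes out as "&lt;" here too, since the escaping is done by XMLGenerator regardless of which of the two APIs feeds it.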
You ordinarily don't have to worry about how the instances of attribute
information are created, unless you're writing a driver, filter, or any
other SAX event emitter such as this one. Unfortunately for such users,
the creation APIs for the AttributesImpl and
AttributesNSImpl classes are not as well documented as the read
APIs. It's not even clear whether they are at all standardized. The
system used in the listing does work with all recent Python/SAX and PyXML
SAX versions. In the case of the namespace-aware attribute information
class, you have to pass in two dictionaries. One maps a tuple of
namespace and local name to values, and the other maps the same to the
qnames used in the serialization. This may seem a rather elaborate
protocol, but it is designed to closely correspond to the standard read
API for these objects. In the initializer in the listing I create an empty
AttributesNSImpl object by initializing it with two empty
dictionaries. You can see how this works when there are actually
attributes by looking in the write_entry method.
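The correspondence between the two dictionaries and the read API can be seen in a short sketch such as the following (Python 3 syntax). The namespace URI and the "ex" prefix are made up for the illustration.

```python
from xml.sax.xmlreader import AttributesNSImpl

EX_NS = 'http://example.com/ns'  # hypothetical namespace, for illustration

# First dictionary: (namespace, local name) -> attribute value
attr_vals = {
    (None, 'level'): 'ERROR',
    (EX_NS, 'id'): '42',
}
# Second dictionary: (namespace, local name) -> qname used in serialization
attr_qnames = {
    (None, 'level'): 'level',
    (EX_NS, 'id'): 'ex:id',
}
attrs = AttributesNSImpl(attr_vals, attr_qnames)

# The read API answers its queries from those same two mappings.
print(attrs.getValue((None, 'level')))       # ERROR
print(attrs.getQNameByName((EX_NS, 'id')))   # ex:id
print(attrs.getValueByQName('ex:id'))        # 42
print(attrs.getNameByQName('ex:id'))         # ('http://example.com/ns', 'id')
```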
Once the AttributesNSImpl object is ready, creating an
element is a simple matter of calling the startElementNS
method on the SAX handler using the (namespace, local-name),
qname convention and attribute info object. And don't forget to
call the endElementNS method to close the element. In
the initializer of xml_logger, closing the top-level element
and document itself is left for later. The caller must call the
close method to wrap things up and have well-formed output.
The rest of the xml_logger class should be easy enough to
follow.
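Because forgetting the closing calls leaves the output ill-formed, one option is to let a context manager send them. The following is only a sketch, in Python 3 syntax; open_root is a hypothetical helper, not something the listing or the standard library provides.

```python
import io
from contextlib import contextmanager
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl


@contextmanager
def open_root(output, encoding, local_name):
    """Open a document and root element; close both on exit.

    Hypothetical helper for illustration only.
    """
    gen = XMLGenerator(output, encoding)
    gen.startDocument()
    gen.startElementNS((None, local_name), local_name,
                       AttributesNSImpl({}, {}))
    try:
        yield gen
    finally:
        # Always send the matching end events for the root element.
        gen.endElementNS((None, local_name), local_name)
        gen.endDocument()


buf = io.StringIO()
with open_root(buf, 'utf-8', 'log') as gen:
    gen.startElementNS((None, 'entry'), 'entry', AttributesNSImpl({}, {}))
    gen.characters('Vanilla log entry')
    gen.endElementNS((None, 'entry'), 'entry')
print(buf.getvalue())
```

Note that this only guarantees the root element is closed. If an exception interrupts the body between a child's start and end events, the output will still not be well-formed.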
The character of SAX
In the last article on XML output I walked through all the gotchas of proper character encoding. This SAX method largely frees you from the worry of all that. The most important thing to remember is to use Unicode objects rather than strings in your API calls. This follows the principle I recommended in the last article: In all public APIs for XML processing, character data should be passed in strictly as Python Unicode objects.
There are in fact a few areas where simple, ASCII-only strings are
safe: for example, output encodings passed to the initializer of
XMLGenerator and similar cases. But these areas are unusual.
Listing 2 demonstrates a use of the xml_logger class to
output a more interesting log entry.
from listing1 import xml_logger
import cStringIO
stream = cStringIO.StringIO()
xl = xml_logger(stream, 'utf-8')
xl.write_entry(2, u"In any triangle, each interior angle < 90\u00B0")
xl.close()
print repr(stream.getvalue())
I use cStringIO to capture the output as a string. I then
display the Python representation of the output in order to be clear about
what is produced. The resulting string is basically (rearranged to
display nicely here):
<?xml version="1.0" encoding="utf-8"?>
<log><entry level="ERROR" date="Sat Mar 8 08:55:11 2003">
In any triangle, each interior angle &lt; 90\xc2\xb0
</entry></log>
You can see that the character passed in as "<" has been escaped to "&lt;" and that the character given using the Unicode character escape "\u00B0" (the degree symbol) is rendered as the UTF-8 sequence "\xc2\xb0". If I specify a different encoding for output, as in Listing 3, the library will again handle things.
Listing 3: Using xml_logger to emit non-ASCII and escaped characters with ISO-8859-1 encoding

from listing1 import xml_logger
import cStringIO
stream = cStringIO.StringIO()
xl = xml_logger(stream, 'iso-8859-1')
xl.write_entry(2, u"In any triangle, each interior angle < 90\u00B0")
xl.close()
print repr(stream.getvalue())
Which results in
<?xml version="1.0" encoding="iso-8859-1"?>
<log><entry level="ERROR" date="Sat Mar 8 09:35:56 2003">
In any triangle, each interior angle &lt; 90\xb0
</entry></log>
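For readers on Python 3, where cStringIO no longer exists and every str is Unicode, the same comparison can be sketched with io.BytesIO, which captures the encoded bytes directly. The serialize_entry helper below is a hypothetical name used only for this illustration, and it drives XMLGenerator directly so that the sketch does not depend on listing1.py.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl


def serialize_entry(msg, encoding):
    """Return the bytes XMLGenerator emits for one entry element.

    Hypothetical helper for illustration only.
    """
    stream = io.BytesIO()
    gen = XMLGenerator(stream, encoding)
    gen.startDocument()
    gen.startElementNS((None, 'entry'), 'entry', AttributesNSImpl({}, {}))
    gen.characters(msg)
    gen.endElementNS((None, 'entry'), 'entry')
    gen.endDocument()
    return stream.getvalue()


msg = 'In any triangle, each interior angle < 90\u00B0'
as_utf8 = serialize_entry(msg, 'utf-8')
as_latin1 = serialize_entry(msg, 'iso-8859-1')
print(repr(as_utf8))
print(repr(as_latin1))
```

In both results the "<" is escaped to "&lt;". The degree sign is the two bytes \xc2\xb0 in the UTF-8 output and the single byte \xb0 in the ISO-8859-1 output.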
If you use encodings which aren't in the Unicode or ISO-8859 family, or which are not available in the "encodings" directory of the Python library, you may have to download third-party codecs in order to use them in your XML processing. This includes the popular JIS, Big-5, GB, KS, and EUC variants in Asia. You may already have these installed for general processing; if not, finding them currently takes a significant amount of sleuthing. Eventually they may be available all together in the Python Codecs project. For now you can download particular codecs from projects such as Python Korean Codecs and Tamito Kajiyama's Japanese codecs (page in Japanese).
Other Developments
The built-in SAX library is but one of the available tools for dealing with all the complexities of XML output. It has the advantage of coming with Python, but in future columns I will cover other options available separately. Another useful but less common SAX usage pattern is chaining SAX filters. Soon after this article is published, I'll have an article out with more information on using SAX filters with Python's SAX. Watch my publications list to see when it appears.
The past month or so has been another busy period for Python-XML
development. There has been a lot of discussion of the future direction
of the PyXML project. Martijn Faassen made "a modest proposal" for
changing the fact that PyXML overwrites the xml module in a Python
installation. This led to the "Finding _xmlplus in Python 2.3a2" thread,
in which I proposed that parts of PyXML, pysax, and the dom package
(excepting 4DOM) should simply be moved into the Python core. Discussion
of these matters is still proceeding, but if you are interested in the
road map for PyXML, you might wish to join the discussion.
Francesco Garelli announced Satine, an interesting package which converts XML documents to Python lists of objects which have Python attributes mirroring the XML element attributes, a data structure he calls an "xlist". The package is designed for speed, with key parts coded in C. It also has a web services module which supports plain XML and SOAP over HTTP. Garelli would be grateful for contributors of binary packages on various platforms.
David Mertz announced the 1.0.6 release of gnosis XML tools. Most of the changes have to do with the gnosis.magic module, which isn't directly related to XML, but there are some XML bug fixes.
Mark Bucciarelli was having problems handling WSDL, which eventually led to his contributing a patch to wsdllib that makes it work with the most recent 4Suite. I'll release an updated version of wsdllib that incorporates this patch.