Three More For XML Output

October 15, 2003

This column has touched on some advanced XML processing topics, but I keep coming back to basics. The reason is that the two most common XML processing tasks for Python users are extracting particular data fields from XML files and generating XML to feed another program. I first laid out the basics and the pitfalls. Then I covered how to use the built-in SAX tools for XML output. Because of bugs and inconsistencies between versions, I don't advise using the SAX XMLGenerator for XML output, although XMLFilter, a new package I mention at the end of this article, might help make it a safer option. In my article "Gems From the Archives" I dusted off XMLWriter, a very nice XML output class developed by Lars Marius Garshol. I really like this last option, and I think that someone should work it into the Python standard library.

In this article I continue the hunt for XML output tools, introducing three more. First I'll demonstrate that you don't need any interest whatsoever in XSLT to take advantage of the fact that the W3C XSL working group thought long and hard about XML output considerations.

4Suite's XSLT Writers

4XSLT, part of 4Suite, implements all of the XSLT specification's output requirements, which cover pretty much all the most common output scenarios for XML and even HTML. The API which the XSLT processor uses for output was designed to follow the texture of the XSLT specification with respect to result tree processing. Oddly enough, I never thought to use this API for more general output generation until recently, but the more I use it, the more I think that building indirectly on the Working Group's efforts in this way has resulted in a very straightforward, yet flexible output tool.

All you need to do is download 4Suite and run python setup.py install to get going, though you might skim the detailed install instructions on the project page. Listing 1 is an example of the API and generates a simple XML Software Autoupdate (XSA) file. XSA, a format for listing and describing software packages, is the same XML data format used as the example in the "Gems from the Archives" article.

Listing 1: Using the 4XSLT writer to generate XSA

import sys
from Ft.Xml.Xslt.XmlWriter import XmlWriter
from Ft.Xml.Xslt.OutputParameters import OutputParameters

oparams = OutputParameters()
oparams.doctypeSystem = u'http://www.garshol.priv.no/download/xsa/xsa.dtd'
oparams.doctypePublic = u'-//LM Garshol//DTD XML Software Autoupdate 1.0//EN//XML'
oparams.indent = 'yes'

#Use the output parameters set above, and write to the console
writer = XmlWriter(oparams, sys.stdout)
writer.startDocument()
writer.startElement(u'xsa')
writer.startElement(u'vendor')
#Element with simple text (#PCDATA) content
writer.startElement(u'name')
writer.text(u'Centigrade systems')
writer.endElement(u'name')
writer.startElement(u'email')
writer.text(u"info@centigrade.bogus")
writer.endElement(u'email')
writer.endElement(u'vendor')
#Element with an attribute
writer.startElement(u'product')
writer.attribute(u'id', u"100\u00B0")
writer.startElement(u'name')
writer.text(u"100\u00B0 Server")
writer.endElement(u'name')
writer.startElement(u'version')
writer.text(u"1.0")
writer.endElement(u'version')
writer.startElement(u'last-release')
writer.text(u"20030401")
writer.endElement(u'last-release')
#Empty element
writer.startElement(u'changes')
writer.endElement(u'changes')
writer.endElement(u'product')
writer.endElement(u'xsa')
writer.endDocument()

The API is rather meticulous and verbose. You spell out every start-element event, every end-element event, and every attribute separately. I have considered adding some shortcuts to this API for use in direct output, but this would be only a minor convenience and is probably a low-priority task.
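To give a feel for what such a shortcut might look like, here is a minimal sketch. The simple_element helper and the RecordingWriter class are hypothetical names of my own, not part of 4Suite. RecordingWriter merely mimics the method names used in Listing 1 and records the calls, so the sketch runs without 4Suite installed.

```python
class RecordingWriter(object):
    """Hypothetical stand-in that records calls instead of writing XML."""
    def __init__(self):
        self.events = []

    def startElement(self, name):
        self.events.append(('start', name))

    def attribute(self, name, value):
        self.events.append(('attribute', name, value))

    def text(self, data):
        self.events.append(('text', data))

    def endElement(self, name):
        self.events.append(('end', name))


def simple_element(writer, name, content=None, attributes=None):
    """Emit one element with optional attributes and text content."""
    writer.startElement(name)
    # Sort the attributes so that the order of calls is predictable
    for attr_name, attr_value in sorted((attributes or {}).items()):
        writer.attribute(attr_name, attr_value)
    if content is not None:
        writer.text(content)
    writer.endElement(name)


writer = RecordingWriter()
simple_element(writer, u'name', u'Centigrade systems')
simple_element(writer, u'changes')
```

Each call to simple_element replaces three or more lines of Listing 1, and because the helper only relies on the four method names shown, it should work with any object that offers them.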

Non-ASCII output is cleanly and correctly handled, as are all other escaping and output conformance tasks. A significant amount of the expensive work for this conformance is written in C for maximum performance. Use OutputParameters to set the document type declaration and to request that the output be indented. There are other parameters you can set, as established by the XSLT spec:

method -- "xml" (the default), "html" or "text". Sets the output format. For example, if "html", certain elements will be treated in a browser-friendly (but not well-formed XML) way. Rather than setting the output method using this variable, you should just pick the appropriate writer class (see below).
encoding -- the character encoding to use (default UTF-8). The writer will automatically use numeric character references where necessary.
omitXmlDeclaration -- "yes" to suppress output of the XML declaration. Default "no".
standalone -- "yes" to set standalone in the XML declaration.
mediaType -- sets the media type of the output. You'll probably never need this.
cdataSectionElements -- a list of element names whose output will be wrapped in a CDATA section. This can provide for friendlier output in some cases.
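The interplay between the encoding parameter and character references can be illustrated with nothing but the Python standard library. This is not 4Suite code, just a sketch of the general idea: a character that the chosen encoding can represent is written directly, and one that it cannot is written as a numeric character reference.

```python
# Standard-library illustration only; this is not how 4Suite is implemented
text = u"100\u00B0 Server"

# UTF-8 can represent the degree sign directly (as two bytes)
as_utf8 = text.encode('utf-8')

# US-ASCII cannot, so the degree sign becomes a numeric character reference
as_ascii = text.encode('ascii', 'xmlcharrefreplace')
```

Here as_utf8 holds the bytes 100\xc2\xb0 Server, while as_ascii holds 100&#176; Server. An XML parser reads both back as the same text.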

Ft.Xml.Xslt.XmlWriter is just one of the writer objects provided by 4XSLT. According to your needs you can substitute one of the following:

HtmlWriter -- Use HTML output rules
PlainTextWriter -- Use plain text output rules (e.g. no elements)
SaxWriter -- Write to a SAX handler
DomWriter -- Create a DOM instance from the output. Use the getResult() method on the writer to retrieve the resulting node.

All of these descend from NullWriter, which does nothing with output requests, but which could be subclassed to provide other specialized writers using the same API. NullWriter is also the best reference for the writer API, which includes methods in addition to those used in Listing 1, such as processingInstruction, comment and namespace (for full power in emitting namespace declarations).
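The subclassing pattern might look roughly like the following sketch. To keep it runnable without 4Suite I use a hypothetical stand-in base class, NullWriterStandIn, with simplified signatures. The real NullWriter has more methods and its signatures may differ, so check its source before subclassing it. ElementCounter is likewise my own illustrative name.

```python
class NullWriterStandIn(object):
    """Hypothetical stand-in for a base writer that ignores every request."""
    def startDocument(self):
        pass

    def endDocument(self):
        pass

    def startElement(self, name):
        pass

    def endElement(self, name):
        pass

    def attribute(self, name, value):
        pass

    def text(self, data):
        pass


class ElementCounter(NullWriterStandIn):
    """Specialized writer: tally elements by name and ignore everything else."""
    def __init__(self):
        self.counts = {}

    def startElement(self, name):
        self.counts[name] = self.counts.get(name, 0) + 1


counter = ElementCounter()
counter.startDocument()
counter.startElement(u'xsa')
for label in (u'Centigrade systems', u'100\u00B0 Server'):
    counter.startElement(u'name')
    counter.text(label)        # inherited no-op
    counter.endElement(u'name')
counter.endElement(u'xsa')
counter.endDocument()
```

Only the method of interest is overridden. Every other output request falls through to the do-nothing base class, which is what makes a null writer a convenient starting point.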

Python xmlprinter

The project home page says it as crisply as you please: "xmlprinter is a simple, lightweight module to help write out XML documents, helping ensure proper quoting and balanced tags. The idea is grabbed from Perl's XML::Writer module." Download the package (available in Bzip2 format only), unpack and install with the usual "python setup.py install". xmlprinter is open source under the GPL.

Unfortunately, out of the box xmlprinter has some serious bugs with regard to output conformance. For one thing, it hardcodes the character data output encoding as "UTF-8", even if you choose a different encoding in the XML declaration. In general, it does not accommodate Unicode passed into the API even though this is probably the only sane way to emit XML (see my first article on the topic for a fuller discussion of why). The package requires Python 2.2 or better, so the fixes for the immediate problems I found are easy enough. There are aspects of the package that I like, so rather than giving up I worked up a patch for it, in Listing 2:
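To see why the encoding mismatch matters, consider this small standard-library sketch. It does not use xmlprinter itself. It only imitates the situation in which the XML declaration promises ISO-8859-1 but the character data is written as UTF-8.

```python
declared_encoding = 'iso-8859-1'   # what the XML declaration would claim
text = u'100\u00B0 Server'

# The bug in miniature: the data is encoded as UTF-8 regardless of the declaration
written = text.encode('utf-8')

# A consumer that trusts the declaration decodes the bytes as ISO-8859-1
seen_by_reader = written.decode(declared_encoding)
```

The reader ends up with a stray character in front of the degree sign (u'100\u00c2\u00b0 Server') rather than the text that was written.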

Listing 2: Patch to support Unicode and fix encoding bug in xmlprinter




--- xmlprinter.py.old   2003-10-14 07:51:26.000000000 -0600
+++ xmlprinter.py       2003-10-14 07:57:35.000000000 -0600
@@ -55,6 +55,7 @@
 __version__  = "0.1.0"
 __revision__ = "$Id: xmlprinter.py,v 1.6 2002/08/31 08:27:25 ftobin Exp $"
 
+import codecs
 
 class WellFormedError(Exception):
     pass
@@ -90,6 +91,8 @@
         if self._past_decl:
             raise WellFormedError, "past allowed point for XML declaration"
 
+        wrapper = codecs.lookup(encoding)[3]
+        self.fp = wrapper(self.fp)
         self.fp.write('<?xml version=%s encoding=%s?>\n'
                       % (quoteattr(self.xml_version),
                          quoteattr(encoding)))
@@ -142,7 +145,7 @@
         if not self._inroot:
             raise WellFormedError, "attempt to add data outside of root"
 
-        self.fp.write(escape(data).encode('UTF-8'))
+        self.fp.write(escape(data))
 
 
     def emptyElement(self, name, attrs={}):
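The heart of the patch is wrapping the output stream in a codec stream writer, so that Unicode passed to write() is encoded on the way out. The idea can be tried in isolation with the standard library. In this sketch io.BytesIO stands in for the output file, and I use codecs.getwriter, which returns the same stream writer that the patch fetches with codecs.lookup(encoding)[3].

```python
import codecs
import io

encoding = 'iso-8859-1'
raw = io.BytesIO()                      # stands in for the real output file

# Wrap the byte stream; text written to fp is encoded before it reaches raw
wrapper = codecs.getwriter(encoding)
fp = wrapper(raw)
fp.write(u'<name>100\u00B0 Server</name>')

result = raw.getvalue()
```

With ISO-8859-1 selected, the degree sign reaches the underlying stream as the single byte \xb0, which matches what the XML declaration would promise.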

With this patch in place I was able to get the XSA output test to work with xmlprinter, presented in Listing 3.

Listing 3: Using xmlprinter to generate XSA

import sys, codecs
import xmlprinter

xp = xmlprinter.xmlprinter(sys.stdout)
xp.startDocument()
xp.notationDecl(
    'xsa',
    u'-//LM Garshol//DTD XML Software Autoupdate 1.0//EN//XML',
    u'http://www.garshol.priv.no/download/xsa/xsa.dtd'
    )
#Notice: there is no error checking to ensure that the root element
#specified in the doctype decl matches the top-level element generated
xp.startElement(u'xsa')
#Another element with child elements
xp.startElement(u'vendor')
#Element with simple text (#PCDATA) content
xp.startElement(u'name')
xp.data(u'Centigrade systems')
#Close currently open element ('name')
xp.endElement()
xp.startElement(u'email')
xp.data(u'info@centigrade.bogus')
xp.endElement()
#Close the 'vendor' element so that 'product' becomes its sibling
xp.endElement()
#Element with an attribute
xp.startElement(u'product', {u'id': u'100\u00B0'})
xp.startElement(u'name')
xp.data(u'100\u00B0 Server')
xp.endElement()
xp.startElement(u'version')
xp.data(u'1.0')
xp.endElement()
xp.startElement(u'last-release')
xp.data(u'20030401')
xp.endElement()
#Empty element
xp.emptyElement(u'changes')
#Close 'product', then 'xsa'
xp.endElement()
xp.endElement()

This API is very close to that of Garshol's XMLWriter. Unfortunately, although it could use more work, it hasn't seen any development since September 2002.
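The reason endElement() needs no argument is that a writer of this kind can keep a stack of the elements currently open. The following is a minimal sketch of that technique using only the standard library. TinyPrinter is a hypothetical class of my own, not xmlprinter's actual code, and it omits the error checking a real writer would need.

```python
import io
from xml.sax.saxutils import escape, quoteattr


class TinyPrinter(object):
    """Hypothetical miniature writer showing stack-based tag balancing."""
    def __init__(self, fp):
        self.fp = fp
        self._open = []    # names of the elements currently open

    def startElement(self, name, attrs=None):
        self.fp.write(u'<' + name)
        for key, value in sorted((attrs or {}).items()):
            # quoteattr escapes the value and adds the surrounding quotes
            self.fp.write(u' %s=%s' % (key, quoteattr(value)))
        self.fp.write(u'>')
        self._open.append(name)

    def data(self, text):
        # escape takes care of &, < and > in character data
        self.fp.write(escape(text))

    def endElement(self):
        # Pop the innermost open element, so tags always close in order
        self.fp.write(u'</%s>' % self._open.pop())


out = io.StringIO()
tp = TinyPrinter(out)
tp.startElement(u'product', {u'id': u'a<b'})
tp.startElement(u'name')
tp.data(u'R&D Server')
tp.endElement()
tp.endElement()
result = out.getvalue()
```

The result is <product id="a&lt;b"><name>R&amp;D Server</name></product>. Because the closing tag always comes from the stack, the caller cannot accidentally close the wrong element, although it can still close too few, as I found when counting the endElement() calls for Listing 3.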

JAXML

JAXML is a GPL module for generation of XML, XHTML or HTML documents. To install it, download the tarball (tar/gzip is the only format I found), unpack and install with the usual "python setup.py install".

In the "Gems from the Archives" article I pointed to a delightfully twisted concept by Greg Stein for bending Python syntax into XML output commands. His example is the following snippet to generate a bit of XHTML:


f = Factory()
body = f.body(bgcolor='#ffffff').p.a(href='l.html').img(src='l.gif')
html = f.html[f.head.title('title'), body]

JAXML is pretty close to a realization of this idea. As an example, listing 4 is the equivalent of Stein's example:

Listing 4: Using JAXML to generate XHTML

import sys, codecs
import jaxml

doc = jaxml.XML_document()
html = doc.html()
doc._push()
html.head().title('title')
doc._pop()
doc.body(bgcolor='#ffffff').p().a(href='l.html').img(src='l.gif')
print doc

Method invocation on the special JAXML "tag" objects creates child objects. Text parameters become child text and keyword parameters become attributes. JAXML generates elements in a nested fashion as the special methods are invoked. This means that you need a way to wind back up the element stack if you want sequential elements. The _push() method allows for this, saving a current "location" for adding elements, which you then jump back to using the _pop() method. The result is:




<?xml version="1.0" encoding="iso-8859-1"?>
<html>
    <head>
        <title>title</title>
    </head>
    <body bgcolor="#ffffff">
        <p>
            <a href="l.html">
                <img src="l.gif" />
            </a>
        </p>
    </body>
</html>

JAXML manages to work support for XML namespaces into this general idea, which is impressive. Unfortunately, after a lot of tinkering I was not able to use JAXML to emit the XSA sample I've been using. I ran into problems trying to create the "last-release" element because that is not a legal identifier name in Python. And I ran into a lot of trouble trying to get it to handle the non-ASCII text. There seem to be some experiments with non-ASCII in the test.py file that comes with JAXML, and I was able to tinker until I got the degree character into the element text, but I wasn't able to figure out how to get it into an attribute. These matters might come down to documentation, and I might have missed something, but my first impression is that this really cool idea might need a bit more incubation before it's ready for industrial use.
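For readers curious about how this style of API can work at all, here is a minimal sketch of the general technique, built on Python's __getattr__ hook. It is not JAXML's implementation, and the Tag class and serialize function are hypothetical names of my own. It also shows that, in a design like this one, the built-in getattr() is a way around names such as "last-release" that are not legal identifiers. Whether the same trick works with JAXML is something I have not verified.

```python
from xml.sax.saxutils import escape, quoteattr


class Tag(object):
    """Hypothetical element object; unknown attribute names create children."""
    def __init__(self, name):
        self._name = name
        self._attrs = {}
        self._children = []    # child Tag objects and text strings

    def __getattr__(self, name):
        # Only reached when normal lookup fails; leave private names alone
        if name.startswith('_'):
            raise AttributeError(name)

        def make_child(text=None, **attrs):
            child = Tag(name)
            child._attrs = attrs
            if text is not None:
                child._children.append(text)
            self._children.append(child)
            return child
        return make_child


def serialize(tag):
    """Return the element and its descendants as a string, without indentation."""
    attrs = u''.join(u' %s=%s' % (key, quoteattr(value))
                     for key, value in sorted(tag._attrs.items()))
    body = u''.join(serialize(child) if isinstance(child, Tag) else escape(child)
                    for child in tag._children)
    return u'<%s%s>%s</%s>' % (tag._name, attrs, body, tag._name)


# Chained calls nest, and keyword parameters become attributes
body = Tag(u'body')
body.p().a(href=u'l.html')
nested = serialize(body)

# "last-release" cannot be spelled as product.last-release, but getattr()
# accepts any string as the attribute name
product = Tag(u'product')
product.version(u'1.0')
getattr(product, 'last-release')(u'20030401')
flat = serialize(product)
```

In this sketch nested is <body><p><a href="l.html"></a></p></body> and flat is <product><version>1.0</version><last-release>20030401</last-release></product>. Note that each call on product adds a sibling, because the children are attached to whichever object the method is invoked on.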

Choice is a good thing. 4XSLT has more sophisticated output rules than Garshol's XMLWriter or the other options I looked at in this article. But you may not want to install such a big package as 4Suite just to write out some XML. If you're using XML lightly, the smaller packages will probably suit your needs just fine, and I suspect that JAXML would suit the tastes of some Python developers very well. Also, you may be limited in platforms if you use 4XSLT (tested under Windows and most Unix variants, but all bets are off for Pippy, mainframe ports and the like). But if you're using XML heavily, then you're probably best off considering a comprehensive package such as 4Suite or the Python bindings to libxml/libxslt anyway. Luckily, experimenting with all these packages is very easy, so you should be able to quickly determine which one fits your groove.

Python-XML Roundup


It was a light month in Python-XML activity. XMLFilter is one of those great examples of an unglamorous but extremely valuable program. Based on its description (and I expect to try it out and report on it in this column soon), it is a must-have for anyone building SAX programs. It provides a fallback SAX parser/driver to avoid SAXReaderNotAvailable errors that users encounter on some platforms. It also offers a safety net against the XMLGenerator bug that bit me earlier in this series. Its main feature, however, is a framework for SAX filters. See Andrew Shearer's announcement.

Xmldiff 0.6.4 was released. "Xmldiff is a Python tool that figures out the differences between two similar XML files, in the same way the diff utility does for text files. The output can use a home brewed format or XUpdate". This is primarily a bug-fix release. See Alexandre Fayolle's announcement and the follow-up with corrected download URL.

I released Anobind 0.6.0. Anobind was the topic of the last article in this column. There is some internal restructuring in this release as well as new namespace support and whitespace stripping. See my announcement.