Writing and Reading XML with XIST
XIST is a very interesting project I've been meaning to dig into for some time. If you've been following the news section at the end of each of these columns, you'll have noticed the steady work that Walter Dörwald, the project leader, has put into this toolkit. It started out as a framework for generating HTML and incidentally XML, but the XML facilities have steadily grown and matured, until it is now a sophisticated system for not only generating, but also processing, XML. As the legend on the project page says: "XIST is also a DOM parser (built on top of SAX2) with a very simple and Python-esque tree API. Every XML element type corresponds to a Python class and these Python classes provide a conversion method to transform the XML tree (e.g. into HTML). XIST can be considered 'object-oriented XSL'". XIST isn't one of those projects you hear loudly advocated and debated when Python/XML processing options come up, but it probably should be.
I'm using my own build of Python 2.4 on Fedora Core 3. I grabbed the latest XIST download (version 2.8). Turns out it requires a host of other packages as well. I installed the apparent minimum requirements: PyXML 0.8.4, ll-url 0.15 and ll-ansistyle 0.6. In all these cases the usual python setup.py install worked, and so it was for the ll-xist package itself. I installed everything in this particular order, and yet I immediately noticed something amiss:
    $ python
    Python 2.4 (#1, Dec 6 2004, 09:55:00)
    [GCC 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import ll
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    ImportError: No module named ll
    >>>
The ll module is an umbrella over ll.xist. I confirmed that there was indeed an "ll" directory in my Python "site-packages", but I noticed there was no "__init__.py" in it, which explains the problems finding the package. Looking back over the output from installing the various ll module components, I found some suspicious warnings:
    [ll-url-0.15]$ python setup.py install
    [SNIP]
    running build_py
    package init file '__init__.py' not found (or not a regular file)
    creating build
    creating build/lib.linux-i686-2.4
    creating build/lib.linux-i686-2.4/ll
    copying url.py -> build/lib.linux-i686-2.4/ll
    package init file '__init__.py' not found (or not a regular file)
    running build_ext
    [SNIP]
    [ll-ansistyle-0.6]$ python2.4 setup.py install
    [SNIP]
    running build_py
    package init file '__init__.py' not found (or not a regular file)
    creating build
    creating build/lib.linux-i686-2.4
    creating build/lib.linux-i686-2.4/ll
    copying ansistyle.py -> build/lib.linux-i686-2.4/ll
    package init file '__init__.py' not found (or not a regular file)
    running build_ext
    [SNIP]
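The "package init file '__init__.py' not found" warning in the output above is the telltale sign. In the Python 2 series a directory has to contain an "__init__.py" before it can be imported as a package. As a rough way to check for the same condition after the fact, here is a minimal sketch that inspects the directories leading to a module. The helper name find_missing_inits is my own invention for illustration; it is not part of XIST or distutils, and it only looks at the file system without importing anything.

```python
import os


def find_missing_inits(site_packages, dotted_name):
    """Return package directories on the way to dotted_name that lack __init__.py.

    Hypothetical helper for illustration only. dotted_name is a module path
    such as "ll.url"; every component except the last is treated as a package.
    """
    missing = []
    current = site_packages
    for part in dotted_name.split(".")[:-1]:
        current = os.path.join(current, part)
        init_file = os.path.join(current, "__init__.py")
        # A directory that exists but has no __init__.py is the suspect.
        if os.path.isdir(current) and not os.path.isfile(init_file):
            missing.append(current)
    return missing
```

Called with a site-packages directory and "ll.url", a check like this should return the "ll" directory when the init file is absent, and an empty list once it is in place.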
I checked the INSTALL document again to see if I might have missed a step, but it didn't seem that way. It seemed like either an installer bug, or perhaps a missing package that needed to be installed in order to get the umbrella ll module properly set up. Things seemed to work fine after I hacked in a "__init__.py" by hand, but soon it became apparent that something was still missing. I browsed the project Web site, and guessed that perhaps I also needed the ll-core 0.2.1 package. This turned out to do the trick. I think the entire sequence of XIST prerequisites should be better documented in the README. In order to save other readers any confusion, here is the order of prerequisite installation I recommend, including minimum versions:
Building and Writing XML
XIST started out as an HTML or XML generator, so generating XML isn't a bad place to start with XIST. But it turns out that XIST's output mechanism isn't really stream-like; it's more DOM-like (though much richer than W3C DOM). It's a matter of building up the tree you have in mind, and then serializing the tree. For this reason it makes sense to first examine the XML tree building API.
XIST has an interesting approach to XML trees. It's sort of a hybrid between a DOM and a data binding (see "XML Data Bindings in Python" for more on this distinction). But it's a different sort of hybrid than ElementTree. XIST's tree API is what I'd call "vocabulary-based", where each information item for each vocabulary is represented as a distinct Python class. You assemble instances of these classes to get the desired tree. Vocabularies in XIST are organized according to XML namespaces, such that ll.xist.ns.docbook contains Python classes representing all the elements defined in DocBook. Yes, that's almost 600 classes. Some other common information items also have specialized classes, for example ll.xist.ns.html.DocTypeXHTML10transitional, which represents the XHTML 1.0 transitional document type declaration (like the DocumentType interface in standard DOM) and ll.xist.ns.xml.XML10, which represents the standard XML declaration.
To explore XIST's XML output support I'll write code to generate a simple XML Software Autoupdate (XSA) file. XSA is an XML format for listing and describing software packages. This is the example I normally use to illustrate XML output, as in the article "Three More For XML Output". In XIST, you first have to define classes for the elements you're creating. Then you assemble them into a tree. Finally, you serialize the tree. Listing 1 is code to generate an XSA file.
Listing 1: Using XIST to Generate XSA
    #Part One: Set up the classes for the elements
    from ll.xist import xsc
    #The XML "namespace" represents the basics of XML Infoset
    from ll.xist.ns import xml

    class xsa(xsc.Element): pass
    class vendor(xsc.Element): pass
    class name(xsc.Element): pass
    class email(xsc.Element): pass
    class product(xsc.Element): pass
    class version(xsc.Element): pass
    class last_release(xsc.Element):
        #The proper XML name is not a valid Python ID so you
        #have to explicitly map to the XML name from the Python
        #class name
        xmlname = "last-release"
    class changes(xsc.Element): pass

    #Nested classes are used to represent attributes
    class product(xsc.Element):
        class Attrs(xsc.Element.Attrs):
            class id(xsc.TextAttr): pass

    #Part Two: Create the document instance tree
    xsa_root = xsa(
        vendor(
            name(u"Centigrade systems"),
            email(u"email@example.com"),
        ),
        product(
            name(u"100\u00B0 Server"),
            version(u"1.0"),
            last_release(u"20030401"),
            changes(),
            id = u"100\u00B0"
        )
    )

    #Part Three: Serialize the tree
    #utf-8 encoding is actually the default
    print xsa_root.asBytes(encoding="utf-8")
I broke the listing into three parts. In part one, I set up the element types and other information items for XSA. Each XML element corresponds to a Python class deriving from xsc.Element. The initializers of these classes allow for a simple and clever idiom for creating content and elements: positional arguments to the initializer become child nodes, and keyword arguments become attributes. By default, the class name matches the XML element name, but the naming rules are different between Python and XML. Listing 1 illustrates how to get around such mismatches.
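To make the idiom concrete without depending on XIST itself, here is a toy sketch of the same idea using only the standard library. The Element base class and its serialize method are my own stand-ins, not XIST's xsc.Element, and unlike XIST this toy does not require attributes to be declared in a nested Attrs class. It only shows how positional arguments can become children, keyword arguments can become attributes, and an xmlname override can cover names that are not valid Python identifiers.

```python
from xml.sax.saxutils import escape, quoteattr


class Element(object):
    """Toy stand-in for an element base class; not XIST's xsc.Element."""

    # Subclasses override this when the XML name is not a Python identifier.
    xmlname = None

    def __init__(self, *content, **attrs):
        self.content = list(content)  # positional arguments become child nodes
        self.attrs = dict(attrs)      # keyword arguments become attributes

    def serialize(self):
        name = self.xmlname or self.__class__.__name__
        attrs = "".join(
            " %s=%s" % (key, quoteattr(value))
            for key, value in sorted(self.attrs.items())
        )
        children = "".join(
            child.serialize() if isinstance(child, Element) else escape(child)
            for child in self.content
        )
        return "<%s%s>%s</%s>" % (name, attrs, children, name)


class product(Element):
    pass


class version(Element):
    pass


class last_release(Element):
    xmlname = "last-release"


tree = product(version(u"1.0"), last_release(u"20030401"), id=u"p1")
print(tree.serialize())
```

This prints <product id="p1"><version>1.0</version><last-release>20030401</last-release></product>. The nesting of the constructor calls mirrors the nesting of the resulting document, which is what makes the convention feel so natural.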
The extra work in part one sets up a very natural convention for creating trees, demonstrated in part two. All I have to do to build the tree is create instances of the XSA element classes, all nested within the initializer calls. Part three is when I serialize the tree. The asBytes method returns a string serialization of the tree. It properly encodes characters as needed, and deals with the non-ASCII degree symbol without any problems. Listing 2 shows the resulting output. The actual output is all on one line, but I have inserted line feeds for formatting reasons.
Listing 2: Output from Listing 1
    <xsa><vendor><name>Centigrade systems</name>
    <email>email@example.com</email></vendor>
    <product id="100°">
    <name>100° Server</name>
    <version>1.0</version>
    <last-release>20030401</last-release><changes></changes>
    </product></xsa>
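For readers wondering what "encodes characters as needed" means at the byte level for the degree symbol used in Listing 1, the following sketch uses only Python's built-in codecs. It is not a claim about how asBytes works internally. It simply shows what the same text looks like as UTF-8, and one fallback a serializer could use when the target encoding cannot represent a character.

```python
# The product name from Listing 1, with U+00B0 DEGREE SIGN.
text = u"100\u00B0 Server"

# In UTF-8 the degree sign takes two bytes: 0xC2 0xB0.
utf8_bytes = text.encode("utf-8")

# ASCII cannot represent the degree sign, so the standard
# "xmlcharrefreplace" error handler substitutes a numeric
# character reference instead.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

print(repr(utf8_bytes))
print(repr(ascii_bytes))
```

The ASCII version carries the degree sign as the character reference &#176;, which any XML parser turns back into the original character.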