Writing and Reading XML with XIST

March 16, 2005

XIST is a very interesting project I've been meaning to dig into for some time. If you've been following the news section at the end of each of these columns, you'll have noticed the steady work that Walter Dörwald, the project leader, has put into this toolkit. It started out as a framework for generating HTML and incidentally XML, but the XML facilities have steadily grown and matured, until it is now a sophisticated system for not only generating, but also processing, XML. As the legend on the project page says: "XIST is also a DOM parser (built on top of SAX2) with a very simple and Python-esque tree API. Every XML element type corresponds to a Python class and these Python classes provide a conversion method to transform the XML tree (e.g. into HTML). XIST can be considered 'object-oriented XSL'". XIST isn't one of those projects you hear loudly advocated and debated when Python/XML processing options come up, but it probably should be.

Installation

I'm using my own build of Python 2.4 on Fedora Core 3. I grabbed the latest XIST download (version 2.8). Turns out it requires a host of other packages as well. I installed the apparent minimum requirements: PyXML 0.8.4, ll-url 0.15 and ll-ansistyle 0.6. In all these cases the usual python setup.py install worked, and so it was for the ll-xist package itself. I installed everything in this particular order, and yet I immediately noticed something amiss:

$ python

Python 2.4 (#1, Dec  6 2004, 09:55:00)

[GCC 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)] on linux2

Type "help", "copyright", "credits" or "license" for more information.

>>> import ll

Traceback (most recent call last):

  File "<stdin>", line 1, in ?

ImportError: No module named ll

>>>

The ll module is an umbrella over ll.url, ll.ansistyle and ll.xist. I confirmed that there was indeed an "ll" directory in my Python "site-packages", but I noticed there was no "__init__.py" in it, which explains the problems finding the package. Looking back over the output from installing the various ll module components, I found some suspicious warnings:

[ll-url-0.15]$ python setup.py install

[SNIP]

running build_py

package init file '__init__.py' not found (or not a regular file)

creating build

creating build/lib.linux-i686-2.4

creating build/lib.linux-i686-2.4/ll

copying url.py -> build/lib.linux-i686-2.4/ll

package init file '__init__.py' not found (or not a regular file)

running build_ext

[SNIP]

[ll-ansistyle-0.6]$ python2.4 setup.py install

[SNIP]

running build_py

package init file '__init__.py' not found (or not a regular file)

creating build

creating build/lib.linux-i686-2.4

creating build/lib.linux-i686-2.4/ll

copying ansistyle.py -> build/lib.linux-i686-2.4/ll

package init file '__init__.py' not found (or not a regular file)

running build_ext

[SNIP]
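The check I did by hand, looking for an "ll" directory and then for an "__init__.py" inside it, can be sketched in a few lines. This is only an illustration: the helper name package_dirs is my own invention, and it assumes the classic rule (the one Python 2.4 follows) that a directory is importable as a package only if it contains an "__init__.py". Current Python 3 relaxes this with namespace packages, so the same symptom would not necessarily appear there.

```python
import os
import sys


def package_dirs(name, search_path=None):
    """Return (directory, has_init) pairs for each `name` directory on the path.

    `package_dirs` is a hypothetical helper, not part of XIST or distutils.
    """
    if search_path is None:
        search_path = sys.path
    found = []
    for entry in search_path:
        # An empty sys.path entry means the current directory
        candidate = os.path.join(entry or ".", name)
        if os.path.isdir(candidate):
            has_init = os.path.isfile(os.path.join(candidate, "__init__.py"))
            found.append((candidate, has_init))
    return found


# Report every "ll" directory on the import path and whether it has an init file
for directory, has_init in package_dirs("ll"):
    status = "ok" if has_init else "missing __init__.py"
    print("%s: %s" % (directory, status))
```

Run against a broken install like mine, a loop of this kind would flag the "ll" directory in "site-packages" as missing its init file.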

I checked the INSTALL document again to see if I might have missed a step, but it didn't seem that way. It seemed like either an installer bug, or perhaps a missing package that needed to be installed in order to get the umbrella ll module properly set up. Things seemed to work fine after I hacked in a "__init__.py" by hand, but soon it became apparent that something was still missing. I browsed the project Web site, and guessed that perhaps I also needed the ll-core 0.2.1 package. This turned out to do the trick. I think the entire sequence of XIST prerequisites should be better documented in the README. To save other readers any confusion: the full set of packages, with the versions I used, is PyXML 0.8.4, ll-core 0.2.1, ll-url 0.15, ll-ansistyle 0.6 and ll-xist 2.8, and ll-core has to be in place before the umbrella ll package will import.

Building and Writing XML

XIST started out as an HTML or XML generator, so generating XML isn't a bad place to start with XIST. But it turns out that XIST's output mechanism isn't really stream-like; it's more DOM-like (though much richer than W3C DOM). It's a matter of building up the tree you have in mind, and then serializing the tree. For this reason it makes sense to first examine the XML tree building API.

XIST has an interesting approach to XML trees. It's sort of a hybrid between a DOM and a data binding (see "XML Data Bindings in Python" for more on this distinction). But it's a different sort of hybrid than ElementTree. XIST's tree API is what I'd call "vocabulary-based", where each information item for each vocabulary is represented as a distinct Python class. You assemble instances of these classes to get the desired tree. Vocabularies in XIST are organized according to XML namespaces, such that ll.xist.ns.docbook contains Python classes representing all the elements defined in DocBook. Yes, that's almost 600 classes. Some other common information items also have specialized classes, for example ll.xist.ns.html.DocTypeXHTML10transitional, which represents the XHTML 1.0 transitional document type declaration (like the DocumentType node in standard DOM) and ll.xist.ns.xml.XML10, which represents the standard XML declaration.

To explore XIST's XML output support I'll write code to generate a simple XML Software Autoupdate (XSA) file. XSA is an XML format for listing and describing software packages. This is the example I normally use to illustrate XML output, as in the article "Three More For XML Output". In XIST, you first have to define classes for the elements you're creating. Then you assemble them into a tree. Finally, you serialize the tree. Listing 1 is code to generate an XSA file.

Listing 1: Using XIST to Generate XSA

#Part One: Set up the classes for the elements



from ll.xist import xsc

#The XML "namespace" represents the basics of XML Infoset

from ll.xist.ns import xml



class xsa(xsc.Element): pass



class vendor(xsc.Element): pass



class name(xsc.Element): pass



class email(xsc.Element): pass



#product takes an id attribute, so it is defined further down



class version(xsc.Element): pass



class last_release(xsc.Element):

    #The proper XML name is not a valid Python ID so you

    #have to explicitly map to the XML name from the Python

    #class name

    xmlname = "last-release"



class changes(xsc.Element): pass



#Nested classes are used to represent attributes

class product(xsc.Element):

    class Attrs(xsc.Element.Attrs):

        class id(xsc.TextAttr): pass





#Part Two: Create the document instance tree



xsa_root = xsa(

    vendor(

        name(u"Centigrade systems"),

        email(u"info@centigrade.bogus"),

    ),

    product(

        name(u"100\u00B0 Server"),

        version(u"1.0"),

        last_release(u"20030401"),

        changes(),

        id = u"100\u00B0"

    )

)





#Part Three: Serialize the tree



#utf-8 encoding is actually the default

print xsa_root.asBytes(encoding="utf-8")

I broke the listing into three parts. In part one, I set up the element types and other information items for XSA. Each XML element corresponds to a Python class deriving from xsc.Element. The initializers of these classes allow for a simple and clever idiom for creating content and elements: positional arguments to the initializer become child nodes, and keyword arguments become attributes. By default, the class name matches the XML element name, but the naming rules are different between Python and XML. Listing 1 illustrates how to get around such mismatches.
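To make the idiom concrete, here is a minimal sketch of how such an initializer could work. It is a toy of my own, not XIST's implementation: the Element class, its tag_name and serialize methods, and the attribute handling are all assumptions made for illustration, and only the xmlname convention is borrowed from Listing 1. It is written for a current Python rather than the 2.4 used elsewhere in this article.

```python
from xml.sax.saxutils import escape, quoteattr


class Element(object):
    """Toy element: positional arguments are children, keywords are attributes."""

    # Set this when the XML name is not a valid Python identifier
    xmlname = None

    def __init__(self, *children, **attrs):
        self.children = list(children)
        self.attrs = attrs

    def tag_name(self):
        # Fall back to the Python class name when no explicit XML name is given
        return self.xmlname or self.__class__.__name__

    def serialize(self):
        attrs = "".join(
            " %s=%s" % (key, quoteattr(value))
            for key, value in sorted(self.attrs.items())
        )
        body = "".join(
            child.serialize() if isinstance(child, Element) else escape(child)
            for child in self.children
        )
        return "<%s%s>%s</%s>" % (self.tag_name(), attrs, body, self.tag_name())


class product(Element):
    pass


class last_release(Element):
    xmlname = "last-release"


tree = product(last_release(u"20030401"), id=u"server-100")
print(tree.serialize())
```

The point of the sketch is only the shape of the constructor call: the nesting of the Python expression mirrors the nesting of the XML it produces.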

The extra work in part one sets up a very natural convention for creating trees, demonstrated in part two. All I have to do to build the tree is create instances of the XSA element classes, all nested within the initializer calls. Part three is when I serialize the tree. The asBytes method returns a string serialization of the tree. It properly encodes characters as needed, and deals with the non-ASCII degree symbol without any problems. Listing 2 shows the resulting output. The actual output is all on one line, but I have inserted line feeds for formatting reasons.

Listing 2: Output from Listing 1

<xsa><vendor><name>Centigrade systems</name>

<email>info@centigrade.bogus</email></vendor>

<product id="100°">

  <name>100° Server</name>

  <version>1.0</version>

  <last-release>20030401</last-release><changes></changes>

</product></xsa>
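The degree sign (U+00B0) in the product name is a useful test of what "encodes characters as needed" means at the byte level. The following sketch uses only Python's built-in codecs, not XIST, and I am assuming rather than asserting that XIST's serializer behaves similarly: in UTF-8 the character is written as two bytes, while a serializer targeting an encoding that cannot represent it has to fall back to a numeric character reference.

```python
name = u"100\u00B0 Server"

# UTF-8 can represent the degree sign directly, as a two-byte sequence
utf8_bytes = name.encode("utf-8")

# ASCII cannot, so the codec's xmlcharrefreplace handler substitutes &#176;
ascii_bytes = name.encode("ascii", "xmlcharrefreplace")

print(repr(utf8_bytes))
print(repr(ascii_bytes))
```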

Completing the Document

If you look carefully at Listing 1, you'll notice that what I've created is really just the top-level XSA element, and not the entire XML document. There is no XML declaration, and no XSA document type declaration (which is required for it to be a valid XSA document). XIST does allow for all this added detail. To create a full XML document you use an ll.xist.xsc.Frag object, which can gather together all the needed nodes, including declarations. Listing 3 illustrates this. To save space I have not repeated part one of Listing 1; paste it in at the top of Listing 3 to run the example.

Listing 3: Using XIST to Generate a Proper XSA Document

XSA_PUBLIC = "-//LM Garshol//DTD XML Software Autoupdate 1.0//EN//XML"

XSA_SYSTEM = "http://www.garshol.priv.no/download/xsa/xsa.dtd"



class xsa_doctype(xsc.DocType):

    """

    Document type for XSA

    """

    def __init__(self):

        xsc.DocType.__init__(

            self, 'xsa PUBLIC "%s" "%s"' % (XSA_PUBLIC, XSA_SYSTEM)

        )



doc = xsc.Frag(

    xml.XML10(),

    xsa_doctype(),

    xsa(

        vendor(

            name(u"Centigrade systems"),

            email(u"info@centigrade.bogus"),

        ),

        product(

            name(u"100\u00B0 Server"),

            version(u"1.0"),

            last_release(u"20030401"),

            changes(),

            id = u"100\u00B0"

        )

    )

)



print doc.asBytes(encoding="utf-8")

This time I create an explicit document type declaration class and bundle this into a document fragment along with an instance of ll.xist.ns.xml.XML10, which represents the XML declaration. Listing 4 shows the resulting output. Again the actual output is all on one line, but I have inserted line feeds for formatting reasons.

Listing 4: Output from the Variation in Listing 3

<?xml version='1.0' encoding='utf-8'?>

<!DOCTYPE xsa

PUBLIC 

 "-//LM Garshol//DTD XML Software Autoupdate 1.0//EN//XML"

"http://www.garshol.priv.no/download/xsa/xsa.dtd">

<xsa><vendor><name>Centigrade systems</name>

<email>info@centigrade.bogus</email></vendor>

<product id="100°">

  <name>100° Server</name>

  <version>1.0</version>

  <last-release>20030401</last-release><changes></changes>

</product></xsa>
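For comparison, here is a rough sketch of producing the same kind of prolog (XML declaration plus document type declaration) with the standard library's minidom instead of XIST. It is not equivalent to Listing 3: it builds only the vendor part of the tree, the exact quoting and whitespace of the DOCTYPE differ from XIST's output, and it is written for a current Python.

```python
from xml.dom.minidom import getDOMImplementation

XSA_PUBLIC = "-//LM Garshol//DTD XML Software Autoupdate 1.0//EN//XML"
XSA_SYSTEM = "http://www.garshol.priv.no/download/xsa/xsa.dtd"

impl = getDOMImplementation()

# The document type node plays the role of xsa_doctype in Listing 3
doctype = impl.createDocumentType("xsa", XSA_PUBLIC, XSA_SYSTEM)
doc = impl.createDocument(None, "xsa", doctype)

# Build a small part of the XSA tree, one node at a time
vendor = doc.createElement("vendor")
name = doc.createElement("name")
name.appendChild(doc.createTextNode(u"Centigrade systems"))
vendor.appendChild(name)
doc.documentElement.appendChild(vendor)

# Passing an encoding makes toxml() emit the XML declaration with that encoding
output = doc.toxml(encoding="utf-8")
print(output)
```

The contrast in tree construction is the interesting part: DOM needs a factory call and an appendChild for every node, where XIST nests constructor calls.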

Reading XML

XIST provides parsers that you can use to read XML into the sorts of XIST data structures I describe above. It's really quite simple, so I'll get right to it. Listing 5 is a simple example using XIST to parse a Docbook instance.

Listing 5: Using XIST to Parse an XML Document

from ll.xist import xsc

from ll.xist import parsers

#You must import this XIST namespace module, otherwise you

#get a validation error because the parser does not know the

#vocabulary

from ll.xist.ns import docbook



DOC = """\

<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook V4.1//EN">

<article>

  <articleinfo>

  <title>DocBook article example</title>

  <author>

    <firstname>Uche</firstname>

    <surname>Ogbuji</surname>

  </author>

  </articleinfo>

  <section label="main">

    <title>Quote from "I Try"</title>

    <blockquote>

     <attribution>Talib Kweli</attribution>

     <para>

     Life is a beautiful struggle

     People search through the rubble for a suitable hustle

     Some people using the noodle

     Some people using the muscle

     Some people put it all together,

     make it fit like a puzzle

      </para>

    </blockquote>

  </section>

</article>

"""



doc = parsers.parseString(DOC)

I'll work interactively from this listing to show some of the tree navigation facilities for XIST trees. First I'll show how to use XIST iterators to search for the blockquote element.

$ python -i listing5.py

>>> blockquotes = doc.walk(xsc.FindTypeAll(docbook.blockquote))

>>> bq = blockquotes.next()

>>> print bq



      Talib Kweli



        Life is a beautiful struggle

        People search through the rubble

        for a suitable hustle

        Some people using the noodle

        Some people using the muscle

        Some people put it all together,

        make it fit like a puzzle





>>> print bq.asBytes()

<blockquote>

      <attribution>Talib Kweli</attribution>

      <para>

        Life is a beautiful struggle

        People search through the rubble

        for a suitable hustle

        Some people using the noodle

        Some people using the muscle

        Some people put it all together,

        make it fit like a puzzle

      </para>

    </blockquote>

>>>

The walk method creates an iterator over the nodes in document order. xsc.FindTypeAll creates a filter that restricts the iterator to instances of the given element types anywhere within the subtree. There is also xsc.FindType, which searches only the immediate children of the node. So, to find the attribution of the quote:

>>> attribs = bq.content.walk(xsc.FindTypeAll(docbook.attribution))

>>> attrib = attribs.next()

>>> print attrib

Talib Kweli

>>>
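The difference between searching a whole subtree and searching only immediate children is easy to see with a loose analogy in the standard library's ElementTree. This is not XIST code, and the mapping to FindTypeAll and FindType is only approximate: iter() visits all matching descendants in document order, while findall() with a bare tag name looks only at direct children. The sample document is a cut-down stand-in for the one in Listing 5.

```python
import xml.etree.ElementTree as ET

DOC = """\
<article>
  <title>Outer</title>
  <section>
    <title>Inner</title>
    <blockquote><attribution>Talib Kweli</attribution></blockquote>
  </section>
</article>
"""

root = ET.fromstring(DOC)

# Whole subtree, in document order (roughly the FindTypeAll behavior)
all_titles = [elem.text for elem in root.iter("title")]

# Immediate children only (roughly the FindType behavior)
child_titles = [elem.text for elem in root.findall("title")]

print(all_titles)
print(child_titles)
```

The first list holds both titles and the second holds only the outer one, since the inner title is a grandchild of the root.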

Once you find an element of interest, it's trivial to access one of its attributes. Attributes are available as if they were items in a dictionary, keyed by attribute name.

>>> sections = doc.walk(xsc.FindTypeAll(docbook.section))

>>> sect = sections.next()

>>> print sect[u"label"]

main

>>>

XIST also takes advantage of Python's operator overloading to support a language in some ways like XPath, but given as Python expressions rather than strings (Unicode objects, to be precise). This language is called XFind. The examples in the documentation look very interesting, but I had some trouble getting the expected results from XFind expressions. I couldn't be sure whether it was something I was doing wrong or quirks in the library, so I'll leave exploring XFind more deeply for another time. I encourage you to experiment with XFind, though. Many people have called for such a pure Python take on XPath, and it looks as if XIST is well on its way down this road.
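To give a feel for the general technique, as opposed to XFind itself, here is a small sketch of how operator overloading can turn path steps into ordinary Python expressions. Everything in it is hypothetical: the Path class, the choice of / for children and // for descendants, and the use of ElementTree underneath are my own assumptions, and none of it reflects XFind's actual operators or semantics.

```python
import xml.etree.ElementTree as ET


class Path(object):
    """Hypothetical path object: `/` selects children, `//` selects descendants."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def __truediv__(self, tag):
        # path / "tag": immediate children with that tag
        return Path(
            child for node in self.nodes for child in node if child.tag == tag
        )

    __div__ = __truediv__  # Python 2 spelling of the same operator

    def __floordiv__(self, tag):
        # path // "tag": descendants with that tag, in document order
        return Path(
            found
            for node in self.nodes
            for found in node.iter(tag)
            if found is not node
        )

    def __iter__(self):
        return iter(self.nodes)


doc = ET.fromstring(
    "<article><section label='main'><blockquote>"
    "<attribution>Talib Kweli</attribution>"
    "</blockquote></section></article>"
)

# Reads much like the XPath expression //blockquote/attribution
attributions = Path([doc]) // "blockquote" / "attribution"
print([node.text for node in attributions])
```

Because the steps are Python objects rather than a string, a typo in an operator is caught by the interpreter, which is part of the appeal of this style.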

Wrap Up

It's surprising that XIST is such a dark horse. It has been around for a long time. It has a lot of very original and interesting capabilities. It's pretty well documented, and has a mature feel about it. Yet I had never tried it before working on this article, and I don't think I know of anyone else who has. Based on my experimentation, it is definitely worth serious consideration when you're looking for a Python-esque XML processing toolkit. The extremely object-oriented framework can feel a bit heavy, but I can appreciate some of the resulting benefits, and it would certainly suit some users' tastes very well. I should also mention that there is a lot more to XIST than I was able to cover in this article. I didn't touch on its support for different HTML and XHTML vocabularies, XML namespaces, XML entities, validation and content models, tree modification, pretty printing, image manipulation, and more.

I could only find one new development to report on regarding XML in the Python space. It's the interesting news that Fred Drake, Pythonista extraordinaire, appears to have started chipping in on the ZSI project for Python Web services. He made the announcement of ZSI 1.7. For those who are still interested in mainstream Web services, this is surely great news.