Using libxml in Python
The GNOME project, an open source umbrella project like Apache and KDE, has spawned several useful subprojects. A few years ago, growing interest in XML processing within GNOME led to the development of a base XML processing library and, subsequently, an XSLT library, both written in C, the foundational language of GNOME. These libraries, libxml and libxslt, are popular not only with C users but also with users of the many other languages for which wrappers have been written, and with language-agnostic users who want good command-line tools.
libxml and libxslt are popular because of their speed, active development, and coverage of many XML specifications with close attention to conformance. They are also available on many platforms. Daniel Veillard is the lead developer of these libraries as well as their Python bindings. He participates on the XML-SIG and has pledged perpetual support for the Python bindings; however, as the documentation says, "the Python interface [has] not yet reached the maturity of the C API."
In this article I'll introduce the Python libxml bindings, which I refer to as Python/libxml. In particular I introduce libxml2. I am using Red Hat 9.0 so installation was a simple matter of installing RPMs from the distribution disk or elsewhere. The two pertinent RPMs in my case are libxml2-2.5.4-1 and libxml2-python-2.5.4-1. The libxml web page offers installation instructions for users of other distributions or platforms, including Windows and Mac OS X.
libxml exposes a Python interface similar to its C interface. It's unrelated to DOM or any of the other Python interfaces and is fairly complex. To get a flavor of it, see the demonstration in listing 1.
Listing 1: A simple example of the basic libxml2 API
import libxml2

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""

doc = libxml2.parseDoc(DOC)
root = doc.children
print root

#iterate over children of verse
child = root.children
while child is not None:
    print child
    if child.type == "element":
        print "\tAn element with ", child.lsCountNode(), "child(ren)"
        print "\tAnd content", repr(child.content)
    child = child.next

doc.freeDoc()
The entire Python API wrapper is in the module libxml2, which largely delegates to a C/Python extension in the file libxml2mod.so on my machine, which in turn uses the core libxml2 C library. parseDoc is one of a family of functions for parsing XML documents, DTDs, and more. There are also parseFile for reading instances and parseDTD to read an external DTD subset.
In this listing I use the most literal of the several approaches for walking through nodes, the one closest to the core C API. The children attribute gets the first child node of the instance node in document order. This makes the name a bit misleading, but you can get to the rest of the children using what is in effect a doubly-linked list, where the next and prev attributes link the list together, and last can be used to shuttle to the end. parent.children in effect moves "up" and then back to the start of the list; it can be used in place of a "first sibling" attribute, which does not exist. The next links eventually run off the end of the list, returning None, which terminates my while loop. Within each iteration I print each node, including some special information for elements. To determine which nodes are elements, I use the type attribute, which returns a string indicating the node type. lsCountNode() gives the count of child nodes and content gives a string consisting of the content of all descendant text nodes. Finally, in order to deallocate the low-level C constructs throughout the document, I call doc.freeDoc(). freeNode() is also available for more fine-grained memory management, usually when using libxml to modify documents.
The following is the output from listing 1.
<xmlNode (verse) object at 0x8136dac>
<xmlNode (text) object at 0x8134c94>
<xmlNode (attribution) object at 0x8135f04>
    An element with 1 child(ren)
    And content 'Christopher Okibgo'
<xmlNode (text) object at 0x8134c94>
<xmlNode (line) object at 0x8135f04>
    An element with 1 child(ren)
    And content 'For he was a shrub among the poplars,'
<xmlNode (text) object at 0x8134c94>
<xmlNode (line) object at 0x8135f04>
    An element with 1 child(ren)
    And content 'Needing more roots'
<xmlNode (text) object at 0x8134c94>
<xmlNode (line) object at 0x8135f04>
    An element with 1 child(ren)
    And content 'More sap to grow to sunlight,'
<xmlNode (text) object at 0x8134c94>
<xmlNode (line) object at 0x8135f04>
    An element with 1 child(ren)
    And content 'Thirsting for sunlight'
<xmlNode (text) object at 0x8134c94>
Iterators. There is also an iterator interface for users of Python 2.2 and later, which is a little more Pythonic. As an example, the following snippet is the functional equivalent of the loop in listing 1.
for child in root:
    print child
    if child.type == "element":
        print "\tAn element with ", child.lsCountNode(), "child(ren)"
        print "\tAnd content", repr(child.content)
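To show how a for loop can replace the explicit while loop, here is one way an iterator could be layered over next links using a generator. It is a sketch under assumptions, not the binding's actual implementation: the Stub class and the iter_siblings() function are hypothetical names, and all the code relies on is a next attribute that is None at the end of the chain.

```python
class Stub(object):
    """Hypothetical node with just a name and a next link."""

    def __init__(self, name):
        self.name = name
        self.next = None


def iter_siblings(node):
    """Yield node and each following sibling, chasing next until None."""
    while node is not None:
        yield node
        node = node.next


# Build a three-node chain by hand: attribution -> line -> line.
a, b, c = Stub("attribution"), Stub("line"), Stub("line")
a.next = b
b.next = c

names = [n.name for n in iter_siblings(a)]
print(names)
```

The generator hides the bookkeeping of advancing the cursor and testing for None, which is what makes the loop body read more naturally.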
Beyond ASCII. As is the GNOME convention, libxml represents Unicode objects as simple strings encoded as UTF-8. This extends to Python/libxml, where rather than using Python Unicode objects, simple Python strings in UTF-8 encoding are returned. Listing 2 gives an example of the behavior of Python/libxml when processing non-ASCII characters.
Listing 2: Simple libxml example handling non-ASCII characters
# -*- coding: utf-8 -*-
# (the line above tells newer Python versions that this file is UTF-8)
import libxml2

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<rule>In any triangle, each interior angle &lt; 90°</rule>
"""

doc = libxml2.parseDoc(DOC)
root = doc.children
print "Content:", repr(root.content)
print "As Unicode:", repr(unicode(root.content, "utf-8"))
doc.freeDoc()
I still strongly advocate using Python Unicode objects rather than encoded strings when processing XML. I suggest that Python/libxml users convert to and from Unicode when passing data between the library and application code. But I admit that this might be awkward in some cases and might incur a small performance hit. I do think it would be best for Python/libxml to switch to Python Unicode objects as the basic string type.
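Converting at the boundary can be confined to a pair of small helpers. The following is a minimal sketch using only built-in string methods; the names from_libxml and to_libxml are hypothetical, and the byte string literal merely stands in for a value such as root.content.

```python
def from_libxml(raw):
    """Decode a UTF-8 byte string, as handed back by the library, to Unicode."""
    return raw.decode("utf-8")


def to_libxml(text):
    """Encode Unicode text as the UTF-8 byte string the library expects."""
    return text.encode("utf-8")


# Stand-in for library output: the degree sign takes two bytes in UTF-8.
raw = b"In any triangle, each interior angle < 90\xc2\xb0"

text = from_libxml(raw)
round_trip = to_libxml(text)

# The decoded text is one unit shorter than the byte string, because the
# two-byte sequence collapses into a single character.
print(len(raw) - len(text))
print(round_trip == raw)
```

Keeping the two helpers at the edge of the application means the rest of the code deals only in Unicode text, at the cost of one conversion each way.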
A word about documentation. Python-XML projects have been
notorious for poor documentation, which is one of the considerations that
inspires this column. It was especially difficult for me to get a handle
on libxml because of its remarkable richness and thus complexity, combined
with elusive documentation. I cobbled together enough understanding of
the API to put together the listings above only after combing the on-line
documentation on the libxml site (which mostly covers C), reading through
all the Python API example and test scripts, reading the Python source for
the libxml2 module and, in a couple of cases, the C source of
the extension module. The mailing list is very helpful, as I found while
skimming and searching the archives, but you may need some trial and error
to understand the nuances of using this very rich API. Luckily, I think
the path to understanding is a bit more clear using the most recent
addition to the libxml API family.
A Loaner from Redmond
libxml comes from one of the firmest bastions of the open-source software movement, which is often held up as the only current, real competition to Microsoft. Yet, as ever, the OSS camp is happy to borrow useful ideas from Microsoft here and there. One good example is libxml's XmlTextReader interface, inspired by the XmlTextReader and XmlReader classes of C# and .NET. These are basically a variation on pull DOM and thus a hybrid between SAX's approach -- stream through and process a particular window of markup -- and DOM's -- walk through the hierarchy and manipulate nodes in place. XmlTextReader is a new addition to the API and some developers find it simpler. Also, the tree-based API I introduced in the last section loads the entire document into memory. XmlTextReader loads nodes only on demand and so can use much less memory on large documents.
Listing 3 uses the XmlTextReader API to perform processing similar to that in listing 1.
Listing 3: An example of the XmlTextReader interface
import cStringIO
import libxml2

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""

XMLREADER_START_ELEMENT_NODE_TYPE = 1

stream = cStringIO.StringIO(DOC)
input_source = libxml2.inputBuffer(stream)
reader = input_source.newTextReader("urn:bogus")

while reader.Read():
    print "node name: ", reader.Name()
    if reader.NodeType() == XMLREADER_START_ELEMENT_NODE_TYPE:
        print "Start of an element"
I start by wrapping the source string DOC in a StringIO object, which can in turn be wrapped by libxml's inputBuffer class which, among other things, allows me to create an xmlTextReader object for the stream. If I were starting from an actual file or URI in the first place I could use a shortcut function instead. Since in listing 3 I am not working from a URI, I have to supply one when I create the xmlTextReader -- probably for the same reasons that the 4Suite APIs insist on a URI for XML sources (see my earlier article on 4Suite for a discussion of this). Here I use a bogus URI as a placeholder.
The reader object iterates over the low-level XML structure in much the same way as SAX, generating events for start and end elements, attributes (deviating from SAX, in which attributes are bundled with their elements), text, CDATA sections, the document node itself, and the rest of the menagerie. But rather than invoking callbacks, the Read() method advances to the next such event, which the reader object then exposes directly through its own methods. Each event carries basic information that is available from the node itself, without having to consider its children or any other related events. In the simple example all the node names are printed. I left out the code to count child elements and display the content subtree for simplicity, because it would involve either considering the interaction of several events using a state machine of some sort or using the Expand() method to walk through enough subsequent events to extract a regular libxml subtree from the reader.
In order to branch to special processing for start element events, I check NodeType(), which returns a node identifier based on the constants defined in DOM. You'll notice that I don't have to do anything special to clean up after this program, unlike with the plain tree interface.
If you run listing 3, the only unusual thing you're likely to notice is that the text nodes are given the node name #text, which is the DOM convention. Nodes other than element and attribute nodes all have these special node names.
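The pull style, and the idea of expanding one event into a full subtree, can be tried without libxml by using xml.dom.pulldom from the Python standard library. This is a stand-in for comparison, not the XmlTextReader API: the event constants and the expandNode() method belong to pulldom, the helper name pull_lines is hypothetical, and only the general approach carries over.

```python
from xml.dom import pulldom

DOC = """<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
</verse>"""


def pull_lines(xml_text):
    """Return the text of each <line>, pulling one event at a time."""
    lines = []
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "line":
            # Consume the events up to the matching end tag and attach
            # the resulting child nodes to this element.
            events.expandNode(node)
            # Join the text children in case the parser split the text.
            text = "".join(child.data for child in node.childNodes
                           if child.nodeType == child.TEXT_NODE)
            lines.append(text)
    return lines


print(pull_lines(DOC))
```

Only the elements that are explicitly expanded are ever built as subtrees; everything else streams past as individual events, which is the trade-off the reader interface offers as well.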
Wrap up and current events
libxml also offers a SAX API, both through the low-level API and through the bundled drv_libxml2.py, a libxml driver for the SAX implementation that comes with Python and PyXML. libxml supports W3C XML Schema, RELAX NG, OASIS catalogs, XInclude, XML Base, and more. There are also extensive features for manipulating XML documents. I hope to cover these other features of this rich library in subsequent articles.
Moving on to the usual coverage of interesting events and resources in the Python-XML community, Brian Quinlan announced the latest version (0.8.0) of Pyana, a Python interface to the Xalan C XSLT processor. New developments include support for node sets as XPath extension function arguments, Python wide Unicode and Mac OS X builds, and validation using external schemas.
Making a new appearance is Skyron, an interesting little Python module that transforms XML documents according to simple "recipes" which are expressed in XML. These recipes bind XML data to handler code in Python. Typical usage is to create a specialized Python data structure from particular XML data patterns.