XML Namespaces Support in Python Tools, Part Three
June 30, 2004
In the last two articles I've discussed namespace handling in Python 2.3's SAX and minidom libraries and in 4Suite. In this article I focus on ElementTree, libxml/Python and PyRXPU. I recommend reading or reviewing those articles first, as well as the earlier articles in this namespace series (part 1 and part 2).
I shall be using, where applicable, the same scenarios I did in the prior articles, based on the same namespace torture test document.
Listing 1: Sample document that uses many XML namespace features and oddities
<products>
  <product id="1144"
    xmlns="http://example.com/product-info"
    xmlns:html="http://www.w3.org/1999/xhtml"
  >
    <name xml:lang="en">Python Perfect IDE</name>
    <description>
      Uses mind-reading technology to anticipate and accommodate
      all user needs in Python development. Implements all
      <html:code>from __future__ import</html:code> features though
      the year 3000. Works well with <code>1166</code>.
    </description>
  </product>
  <p:product id="1166" xmlns:p="http://example.com/product-info">
    <p:name>XSLT Perfect IDE</p:name>
    <p:description
      xmlns:html="http://www.w3.org/1999/xhtml"
      xmlns:xl="http://www.w3.org/1999/xlink"
    >
      <p:code>red</p:code>
      <html:code>blue</html:code>
      <html:div>
        <ref xl:type="simple" xl:href="index.xml">A link</ref>
      </html:div>
    </p:description>
  </p:product>
</products>
Reading namespaces with ElementTree
I already covered aspects of the read namespace API of ElementTree in my earlier article. As I mentioned then, ElementTree supports XML namespaces using James Clark's notation directly for element and attribute names. This is a rather different mechanism from most XML processing APIs, and we'll find out how smoothly it works in comparison. Listing 2 displays the local name, namespace and prefix of each element and attribute in a document. I did update to version 1.2c1-20040615 of the software.
Listing 2: ElementTree code to display namespace information for elements and attributes
import sys
from elementtree.ElementTree import ElementTree, XMLTreeBuilder

class ns_tracker_tree_builder(XMLTreeBuilder):
    def __init__(self):
        XMLTreeBuilder.__init__(self)
        self._parser.StartNamespaceDeclHandler = self._start_ns
        self.namespaces = {u'http://www.w3.org/XML/1998/namespace': u'xml'}

    def _start_ns(self, prefix, ns):
        self.namespaces[ns] = prefix

def analyze_clark_name(name, nsdict):
    if name[0] == '{':
        ns, local = name[1:].split("}")
    else:
        return None, name, None
    prefix = nsdict[ns]
    if prefix is None:
        prefix = u"!Unknown"
    return prefix, local, ns

parser = ns_tracker_tree_builder()
etree = ElementTree()
root = etree.parse(sys.argv[1], parser)
#Create an iterator
iter = root.getiterator()
#Iterate
for elem in iter:
    prefix, local, ns = analyze_clark_name(elem.tag, parser.namespaces)
    print "Element namespace:", repr(ns)
    print "Element local name:", repr(local)
    print "Prefix used for element:", repr(prefix)
    for name, value in elem.items():
        prefix, local, ns = analyze_clark_name(name, parser.namespaces)
        print "Attribute namespace:", repr(ns)
        print "Attribute local name:", repr(local)
        print "Prefix used for attribute:", repr(prefix)
As I discussed in the earlier article, ElementTree does not maintain namespace prefix information. This made my task in this listing much trickier. I found out how to use a specialized class to build the element tree, defined as ns_tracker_tree_builder in listing 2. This class receives expat parse events, but I was only able to figure out how to capture information from the namespace events in a "flat" manner: by updating a single dictionary each time I encounter a namespace declaration event (_start_ns). The problem with this is that all namespace scoping information is lost. I expect this approach will cause oddities in any document where a given namespace is used with more than one prefix at different points. I generally do not recommend such confusing use of namespaces in the first place (see my article "Use XML namespaces with care" for more details); in listing 1 I break my own rules because I want to test how XML processing libraries handle even untidy use of namespaces.
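To make the scoping problem concrete, here is a minimal sketch of what scope-aware prefix tracking could look like when working with expat directly rather than through ElementTree. It is not the approach used in listing 2: the function name element_prefixes and the sample document are hypothetical, and the code is written for current Python 3 and its standard xml.parsers.expat module rather than for Python 2.3. It pairs each namespace declaration start event with the matching end event, so that a prefix is only reported while its declaration is in scope.

```python
import xml.parsers.expat

def element_prefixes(xml_text):
    """Return (namespace, local name, prefix) for each element, in document order.

    Hypothetical helper: keeps a stack of in-scope declarations instead of
    one flat dictionary.
    """
    bindings = []   # (prefix, uri) declarations currently in scope, outermost first
    results = []

    def start_ns(prefix, uri):
        bindings.append((prefix, uri))

    def end_ns(prefix):
        # Drop the innermost declaration of this prefix.
        for i in range(len(bindings) - 1, -1, -1):
            if bindings[i][0] == prefix:
                del bindings[i]
                break

    def start_element(name, attrs):
        # With namespace_separator=" ", expat reports "uri local" for
        # namespaced names and plain "local" for everything else.
        uri, sep, local = name.rpartition(" ")
        if not sep:
            results.append((None, local, None))
            return
        # What each prefix is bound to right now (inner declarations win).
        current = {}
        for prefix, bound_uri in bindings:
            current[prefix] = bound_uri
        # Prefer the innermost declaration that still maps to this namespace.
        for prefix, bound_uri in reversed(bindings):
            if bound_uri == uri and current[prefix] == uri:
                results.append((uri, local, prefix))
                return
        results.append((uri, local, None))

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartNamespaceDeclHandler = start_ns
    parser.EndNamespaceDeclHandler = end_ns
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)
    return results

SAMPLE = (
    '<products>'
    '<product xmlns="http://example.com/product-info"><name/></product>'
    '<p:product xmlns:p="http://example.com/product-info"><p:name/></p:product>'
    '</products>'
)
prefixes = [prefix for (ns, local, prefix) in element_prefixes(SAMPLE)]
```

Even this is only a best guess in one case: when two prefixes for the same namespace are in scope at once, expat does not say which one the element actually used, so the sketch simply picks the innermost declaration.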
I can get a partial solution that maintains prefix information by using my specialized builder. The next challenge is using the resulting dictionary to extract prefixes, namespaces, and local names from the full James Clark notation. I created the function analyze_clark_name for this purpose. The rest of the listing is straightforward ElementTree code that completes the task at hand. The result is given in listing 3.
Listing 3: Output from listing 2 run against listing 1
Element namespace: None
Element local name: 'products'
Prefix used for element: None
Element namespace: 'http://example.com/product-info'
Element local name: 'product'
Prefix used for element: u'p'
Attribute namespace: None
Attribute local name: 'id'
Prefix used for attribute: None
Element namespace: 'http://example.com/product-info'
Element local name: 'name'
Prefix used for element: u'p'
Attribute namespace: 'http://www.w3.org/XML/1998/namespace'
Attribute local name: 'lang'
Prefix used for attribute: u'xml'
Element namespace: 'http://example.com/product-info'
Element local name: 'description'
Prefix used for element: u'p'
Element namespace: 'http://www.w3.org/1999/xhtml'
Element local name: 'code'
Prefix used for element: u'html'
Element namespace: 'http://example.com/product-info'
Element local name: 'code'
Prefix used for element: u'p'
Element namespace: 'http://example.com/product-info'
Element local name: 'product'
Prefix used for element: u'p'
Attribute namespace: None
Attribute local name: 'id'
Prefix used for attribute: None
Element namespace: 'http://example.com/product-info'
Element local name: 'name'
Prefix used for element: u'p'
Element namespace: 'http://example.com/product-info'
Element local name: 'description'
Prefix used for element: u'p'
Element namespace: 'http://example.com/product-info'
Element local name: 'code'
Prefix used for element: u'p'
Element namespace: 'http://www.w3.org/1999/xhtml'
Element local name: 'code'
Prefix used for element: u'html'
Element namespace: 'http://www.w3.org/1999/xhtml'
Element local name: 'div'
Prefix used for element: u'html'
Element namespace: None
Element local name: 'ref'
Prefix used for element: None
Attribute namespace: 'http://www.w3.org/1999/xlink'
Attribute local name: 'href'
Prefix used for attribute: u'xl'
Attribute namespace: 'http://www.w3.org/1999/xlink'
Attribute local name: 'type'
Prefix used for attribute: u'xl'
Scrutinizing this output I found a few problems: the elements of the first product, which carry no prefix in listing 1 because they rely on a default namespace declaration, are reported with the prefix u'p'. As expected, these errors stem from the fact that my workaround for recording prefixes does not take into account the scope of namespace declarations and, in effect, always reports the last prefix seen for any given namespace. Notice also that plain strings are returned in most cases rather than Unicode objects. I find this problematic.
ElementTree namespace mutation
The stock list of mutation tasks I've been using to test namespace handling is as follows:
- Add a new element in the products namespace, but using no prefix.
- Add a new element with a prefix and in the products namespace.
- Add a new element that is not in any namespace.
- Add a new global attribute in the XHTML namespace.
- Add a new global attribute in the special XML namespace.
- Add a new attribute in no namespace.
- Remove only the code element in the XHTML namespace.
- Remove a global attribute.
- Remove an attribute that is not in any namespace.
Listing 4 includes code for the various tasks.
Listing 4: ElementTree code for the sample mutation tasks
import sys
from elementtree.ElementTree import ElementTree, SubElement

doc = ElementTree(file='products.xml')

PRODUCT_NS = u'http://example.com/product-info'
HTML_NS = u'http://www.w3.org/1999/xhtml'
XML_NS = u'http://www.w3.org/XML/1998/namespace'
XLINK_NS = u'http://www.w3.org/1999/xlink'

#Task 1 is not really possible

#Task 2
product = doc.getiterator(u'{%s}product'%PRODUCT_NS)[0]
new_element = SubElement(product, u'{%s}launch-date'%PRODUCT_NS)

#Task 3
product = doc.getiterator(u'{%s}product'%PRODUCT_NS)[0]
new_element = SubElement(product, u'island')

#Task 4
div = doc.getiterator(u'{%s}div'%HTML_NS)[0]
div.set(u'{%s}global'%HTML_NS, u'spam')

#Task 5
div.set(u'{%s}lang'%XML_NS, u'en')

#Task 6
div.set(u'class', u'eggs')

#Task 7
for desc in doc.getiterator(u'{%s}description'%PRODUCT_NS):
    code = desc.getiterator(u'{%s}code'%HTML_NS)[0]
    desc.remove(code)

#Task 8
ref = doc.getiterator(u'ref')[0]
del ref.attrib[u'{%s}href'%XLINK_NS]

#Task 9
product = doc.getiterator(u'{%s}product'%PRODUCT_NS)[0]
del product.attrib[u'id']

#write out the modified XML
doc.write(sys.stdout)
In general I navigate the tree by using the Clark notation name to create an iterator over all elements with the namespace and local name I want. I didn't bother to check the performance of this approach: it may be faster to use a path expression for this, although my experiments didn't yield a way to use the standard namespace conventions for XPath in ElementTree. Looking through the test routines I saw code along the lines of elem.findall("//{http://spam}egg"), but this is not at all valid XPath. Nevertheless, I tried doc.find(u'.//{%s}product'%PRODUCT_NS) and variations on this (including specifying the full path, and starting with doc.getroot()). No expression I tried returned any results, so I fell back to the tree-wide find/iterate approach. All the ElementTree mutation interfaces worked as expected with the use of Clark notation. Listing 5 is the output from this script. You'll notice immediately the lack of prefix preservation, but with the exception of task 1, which I was unable to accomplish in ElementTree, the results are correct. ElementTree even correctly handles creation of global attributes such as html:global which happen to share their parent's namespace. All other tools I've examined so far have incorrectly omitted a prefix in this case.
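For comparison, here is a minimal sketch of the same kind of path lookup using xml.etree.ElementTree from the current Python 3 standard library, a later descendant of the package tested here. It does not explain the empty results I got with version 1.2c1, and nothing in it should be read as describing that release. The prefix "pr" and the trimmed-down sample document are my own choices for illustration.

```python
import xml.etree.ElementTree as ET

PRODUCT_NS = "http://example.com/product-info"

# A trimmed-down stand-in for listing 1 (hypothetical sample).
DOC = """<products>
  <product id="1144" xmlns="http://example.com/product-info">
    <name>Python Perfect IDE</name>
  </product>
  <p:product id="1166" xmlns:p="http://example.com/product-info">
    <p:name>XSLT Perfect IDE</p:name>
  </p:product>
</products>"""

root = ET.fromstring(DOC)

# Clark notation embedded in a path expression.
by_clark = root.findall(".//{%s}product" % PRODUCT_NS)

# The same lookup with a prefix-to-namespace mapping. The prefix "pr" is
# local to this query and need not match any prefix used in the document.
by_prefix = root.findall(".//pr:product", namespaces={"pr": PRODUCT_NS})

ids = [elem.get("id") for elem in by_clark]
```

Both queries select the two product elements regardless of whether the document used the default namespace or the p prefix, which is the behavior one wants from a namespace-aware path lookup.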
Listing 5: Output from listing 4
<products>
  <ns0:product xmlns:ns0="http://example.com/product-info">
    <ns0:name xml:lang="en">Python Perfect IDE</ns0:name>
    <ns0:description>
      Uses mind-reading technology to anticipate and accommodate
      all user needs in Python development. Implements all
      <ns0:code>1166</ns0:code>.
    </ns0:description>
  <ns0:launch-date /><island /></ns0:product>
  <ns1:product id="1166" xmlns:ns1="http://example.com/product-info">
    <ns1:name>XSLT Perfect IDE</ns1:name>
    <ns1:description>
      <ns1:code>red</ns1:code>
      <html:div class="eggs" html:global="spam" xml:lang="en" xmlns:html="http://www.w3.org/1999/xhtml">
        <ref ns3:type="simple" xmlns:ns3="http://www.w3.org/1999/xlink">A link</ref>
      </html:div>
    </ns1:description>
  </ns1:product>
</products>
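The generated ns0 and ns1 prefixes above are harmless but not pretty. As a hedged aside, the xml.etree.ElementTree module in the current Python 3 standard library offers register_namespace, which suggests preferred prefixes to the serializer. The sketch below assumes that module, not the 1.2c1 release tested in this article, and builds a tiny hypothetical tree rather than reusing listing 1.

```python
import xml.etree.ElementTree as ET

PRODUCT_NS = "http://example.com/product-info"
HTML_NS = "http://www.w3.org/1999/xhtml"

# Module-wide hints, consulted only when a tree is serialized.
ET.register_namespace("p", PRODUCT_NS)
ET.register_namespace("html", HTML_NS)

# Element and attribute names are still given in Clark notation.
product = ET.Element("{%s}product" % PRODUCT_NS)
div = ET.SubElement(product, "{%s}div" % HTML_NS)
# A global attribute that shares its parent element's namespace.
div.set("{%s}global" % HTML_NS, "spam")

serialized = ET.tostring(product, encoding="unicode")
```

This only chooses prefixes on output. The tree itself still holds nothing but Clark notation names, so prefixes used in a parsed source document are not preserved by this mechanism.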
Reading namespaces with libxml/Python
As I discussed in my article about libxml there are several available mechanisms for processing XML, including SAX and DOM variations. I focused on the more unusual API, XmlTextReader, and in this discussion of namespace processing I shall continue to focus on this API, which means that I'll only worry about how to read namespace information. You should be able to perform mutation using similar DOM idioms to those I presented in the first namespace article. Listing 6 is the XmlTextReader equivalent of listing 2.
Listing 6: libxml code to display namespace information for elements and attributes
import sys
import cStringIO
import libxml2

XMLNS_NS = 'http://www.w3.org/2000/xmlns/'
XMLREADER_START_ELEMENT_NODE_TYPE = 1

input = open(sys.argv[1])
input_source = libxml2.inputBuffer(input)
reader = input_source.newTextReader("urn:bogus")

while reader.Read():
    if reader.NodeType() == XMLREADER_START_ELEMENT_NODE_TYPE:
        print "Element namespace:", repr(reader.NamespaceUri())
        print "Element local name:", repr(reader.LocalName())
        print "Prefix used for element:", repr(reader.Prefix())
        while reader.MoveToNextAttribute():
            #Ignore namespace declarations
            if reader.NamespaceUri() != XMLNS_NS:
                print "Attribute namespace:", repr(reader.NamespaceUri())
                print "Attribute local name:", repr(reader.LocalName())
                print "Prefix used for attribute:", repr(reader.Prefix())
Besides the fact that again the element names and prefixes are not returned as Unicode objects, the results are as expected.
PyRXPU and namespaces
PyRXPU is part of the PyRXP package, and the only part I recommend using, as I discussed at length in my recent article on PyRXP. PyRXP is "non-Unicode" by default, but this default configuration is not an XML parser at all. You do have to get a CVS version of the package in order to use PyRXPU. The latest release, 0.9, does not include it. I provided details for installing from CVS in my earlier article. I did try to update to more recent CVS code this time, but my attempts to use PyRXPU in the latest PyRXP code resulted in core dumps on my Dell Inspiron 8600 running Fedora Core 2, so I reverted to the CVS code I used back in February. I didn't see anything in the CVS logs since February indicating any significant changes in namespace handling, so I assumed this would still be a current test.
By default PyRXP doesn't do any special namespace processing and returns namespace declarations as regular attributes. There are several parser parameters regarding namespace processing. One, ReturnNamespaceAttributes, is described strangely in the documentation as not returning XML namespace declarations by default. This seems to be incorrect. The second, XMLNamespaces, is described in the documentation thus:
If this is on, the parser processes namespace declarations (see below). Namespace declarations are not returned as part of the list of attributes on an element.
I wasn't able to find whatever passage might have been referenced in the "see below" phrase: this sentence was pretty much the last one concerning namespaces in the document. I came to wish I could find more on namespaces once I tried out namespace processing in listing 7.
Listing 7: Code to parse a document in namespace processing mode
import sys
import pyRXPU

parser = pyRXPU.Parser()
parser.XMLNamespaces = 1
doc_source = open(sys.argv[1]).read()
doc = parser.parse(doc_source)

import pprint
pprint.pprint(doc)
The result of running this against listing 1 is very odd:
Listing 8: results of namespace-aware reading of listing 1 in PyRXPU
(u'products',
 None,
 [u'\n ',
  (u'product',
   {u'id': u'1144'},
   [u'\n ',
    (u'name', {u'xml:lang': u'en'}, [u'Python Perfect IDE'], None),
    u'\n ',
    (u'description',
     None,
     [u'\n Uses mind-reading technology to anticipate and '
      'accommodate\n all user needs in Python development. '
      'Implements all\n ',
      (u'html:code', None, [u'from __future__ import'], None),
      u' features though\n the year 3000. Works well with ',
      (u'code', None, [u'1166'], None),
      u'.\n '],
     None),
    u'\n '],
   None),
  u'\n ',
  (u'p:product',
   {u'id': u'1166'},
   [u'\n ',
    (u'p:name', None, [u'XSLT Perfect IDE'], None),
    u'\n ',
    (u'p:description',
     None,
     [u'\n ',
      (u'p:code', None, [u'red'], None),
      u'\n ',
      (u'html:code', None, [u'blue'], None),
      u'\n ',
      (u'html:div',
       None,
       [u'\n ',
        (u'ref',
         {u'xl:type': u'simple', u'xl:href': u'index.xml'},
         [u'A link'],
         None),
        u'\n '],
       None),
      u'\n '],
     None),
    u'\n '],
   None),
  u'\n'],
 None)
The important information (the namespaces) is omitted, while the unimportant details (the prefixes) are included as part of element names. This makes namespace processing very difficult. I tried a lot of tweaking and other options to try to get all the information needed for ready namespace processing without having to knit it all back together by hand after turning off the namespace option (the only difference upon omitting the parser.XMLNamespaces = 1 line is that namespace declarations are returned as attributes). In the end I was not really able to tackle any of the namespace reading or mutation tasks without processing namespaces entirely by hand (which you can do with any toolkit, namespace aware or not), and I conclude that PyRXPU does not really support namespace processing.
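To give a sense of what knitting the namespaces back together by hand involves, here is a rough sketch. It assumes the (name, attributes, children, spare) tuple form shown in listing 8, and it assumes the namespace option is left off so that declarations arrive as ordinary xmlns attributes. The function name resolve and the small sample tree are hypothetical; the code is plain Python 3, does not import pyRXPU, and rewrites names into Clark notation.

```python
XML_NS = "http://www.w3.org/XML/1998/namespace"

def resolve(node, scope=None):
    """Return a copy of a tuple tree with names rewritten in Clark notation.

    Hypothetical helper: `scope` maps prefixes (None for the default
    namespace) to namespace names for the declarations currently in scope.
    """
    name, attrs, children, spare = node
    attrs = attrs or {}
    scope = dict(scope or {"xml": XML_NS})

    # Pick up the declarations made on this element.
    for key, value in attrs.items():
        if key == "xmlns":
            scope[None] = value or None      # xmlns="" undeclares the default
        elif key.startswith("xmlns:"):
            scope[key[len("xmlns:"):]] = value

    def clark(qname, use_default):
        prefix, sep, local = qname.rpartition(":")
        if sep:
            uri = scope[prefix]              # KeyError for an undeclared prefix
        else:
            # Unprefixed attributes never take the default namespace.
            uri = scope.get(None) if use_default else None
        return "{%s}%s" % (uri, local) if uri else local

    new_attrs = {clark(key, False): value for key, value in attrs.items()
                 if key != "xmlns" and not key.startswith("xmlns:")}
    if children is None:
        new_children = None
    else:
        new_children = [resolve(child, scope) if isinstance(child, tuple) else child
                        for child in children]
    return (clark(name, True), new_attrs or None, new_children, spare)

# A small hand-built tree in the same shape as listing 8.
tree = ("products", None, [
    ("product", {"id": "1144", "xmlns": "http://example.com/product-info"},
     [("name", {"xml:lang": "en"}, ["Python Perfect IDE"], None)], None),
    ("p:product", {"xmlns:p": "http://example.com/product-info"}, [], None),
], None)
resolved = resolve(tree)
```

This is exactly the sort of bookkeeping a namespace-aware parser is supposed to do for you, which is why having to write it yourself amounts to a lack of namespace support.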
Wrap up
In this batch of namespace tests the results have been a mixed bag. ElementTree supports namespaces properly, but makes it very hard to work with prefixes, which is acceptable given that prefixes are a mere syntactic convenience. I would hesitate to use ElementTree where I needed the convenience of preserved prefixes. PyRXPU seems to either report namespace declarations literally, without any API benefits, or discard the namespace information altogether, which is as much as to say it doesn't support namespace processing. libxml, as one expects from such a comprehensive library, handles namespaces effortlessly. I barely scratched the surface in this article of how to process namespaces in libxml, but I do show the SAX and DOM approaches in my earlier article. I expect to wrap up this series on namespace processing next by looking at how some data binding tools handle namespaces.
It has been a busy month for my colleagues in the Python-XML community, including work by Brett Hartshorn on yet another small Python DOM implementation, xmlapi 0.2.1. It's billed as an "even smaller XML DOM implementation than Python's standard xml.dom.minidom" and claims performance and feature improvements over minidom.
Philippe Normand debuted XMLObject 0.0.2, a data binding tool which allows you to map from customized Python classes to XML and vice versa. See the announcement.
Fredrik Lundh announced ElementTree 1.2. I'm somewhat confused at this point as to whether the package is supposed to be called "ElementTree" or "elementtree", but I think current clues suggest the former. This release just makes official the various experimental features such as XPath support which I have already discussed. 1.2 final appeared after I wrapped up this article and the namespace discussion is based on the most recent beta rather than the final 1.2 release, but I expect not much has changed in between the two. See the announcement.
Brian Quinlan, who has also been busy helping organize the Vancouver Python Workshop, announced Pyana 0.9.0, the latest release of his Python interface to the Xalan-C XSLT processor. Changes include updates for Xalan 1.8/Xerces 2.5, basic support for tracing, and removal of the transform-to-DOM support, with promises of a better replacement in future. See the announcement.
Magnus Lie Hetland has updated Atox to version 0.5. Atox allows you to write custom scripts for converting plain text into XML. You define the text to XML binding using a simple XML language. It's meant to be used from the command line. Changes since 0.1 include language improvements and added support for config files and XSLT fragments in Atox format files. See the full announcement.
Michael Twomey announced pygenx 0.5.2, a wrapper for Tim Bray's XML generation library Genx. Genx is a C library and its output is canonical XML. PyGenx wraps the full API. See the announcement.
Christof Hoeke announced pyXSLdoc 0.51, a Python tool for generating documentation of XSLT code in a similar approach to Javadoc. Version 0.60b is actually available, but the most recent announcement is for 0.52.
I recently discovered the Simple Objects from XML (SOX) module buried deep within the Python Enterprise Application Kit (PEAK). PEAK is a components toolkit for large-scale applications (the developers claim it is as powerful as J2EE but not as complex). SOX is another XML data binding toolkit which uses SAX events to build an object the user can define based on classes set up for namespace aware or namespace oblivious usage.