XML Namespaces Support in Python Tools, Part Three

June 30, 2004

In the last two articles I've discussed namespace handling in Python 2.3's SAX and minidom libraries and in 4Suite. In this article I focus on ElementTree, libxml/Python and PyRXPU. I recommend reading or reviewing those articles first, as well as the earlier articles in this namespace series (part 1 and part 2).

I shall be using, where applicable, the same scenarios I did in the prior articles, based on the same namespace torture test document.

Listing 1: Sample document that uses many XML namespace features and oddities

<products>

  <product id="1144"

    xmlns="http://example.com/product-info"

    xmlns:html="http://www.w3.org/1999/xhtml"

  >

    <name xml:lang="en">Python Perfect IDE</name>

    <description>

      Uses mind-reading technology to anticipate and accommodate

      all user needs in Python development.  Implements all

      <html:code>from __future__ import</html:code> features though

      the year 3000.  Works well with <code>1166</code>.

    </description>

  </product>

  <p:product id="1166" xmlns:p="http://example.com/product-info">

    <p:name>XSLT Perfect IDE</p:name>

    <p:description

      xmlns:html="http://www.w3.org/1999/xhtml"

      xmlns:xl="http://www.w3.org/1999/xlink"

    >

      <p:code>red</p:code>

      <html:code>blue</html:code>

      <html:div>

        <ref xl:type="simple" xl:href="index.xml">A link</ref>

      </html:div>

    </p:description>

  </p:product>

</products>

Reading namespaces with ElementTree

I already covered aspects of ElementTree's API for reading namespaces in my earlier article. As I mentioned then, ElementTree supports XML namespaces using James Clark's notation, in which a name is written as {namespace-URI}local-name, directly for element and attribute names. This is a rather different mechanism from most XML processing APIs, and we'll find out how smoothly it works in comparison. Listing 2 displays the local name, namespace and prefix of each element and attribute in a document. For these tests I updated to version 1.2c1-20040615 of the software.

Listing 2: ElementTree code to display namespace information for elements and attributes

import sys

from elementtree.ElementTree import ElementTree, XMLTreeBuilder



class ns_tracker_tree_builder(XMLTreeBuilder):

    def __init__(self):

        XMLTreeBuilder.__init__(self)

        self._parser.StartNamespaceDeclHandler = self._start_ns

        self.namespaces = {u'http://www.w3.org/XML/1998/namespace':

                           u'xml'}

 

    def _start_ns(self, prefix, ns):

        self.namespaces[ns] = prefix



def analyze_clark_name(name, nsdict):

    if name[0] == '{':

        ns, local = name[1:].split("}")

    else:

        return None, name, None

    prefix = nsdict[ns]

    if prefix is None:

        prefix = u"!Unknown"

    return prefix, local, ns

        

parser = ns_tracker_tree_builder()

etree = ElementTree()

root = etree.parse(sys.argv[1], parser)



#Create an iterator

iter = root.getiterator()

#Iterate

for elem in iter:

    prefix, local, ns = analyze_clark_name(elem.tag, parser.namespaces)

    print "Element namespace:", repr(ns)

    print "Element local name:", repr(local)

    print "Prefix used for element:", repr(prefix)

    for name, value in elem.items():

        prefix, local, ns = analyze_clark_name(name, parser.namespaces)

        print "Attribute namespace:", repr(ns)

        print "Attribute local name:", repr(local)

        print "Prefix used for attribute:", repr(prefix)

As I discussed in the earlier article, ElementTree does not maintain namespace prefix information. This made my task in this listing much trickier. I found out how to use a specialized class to build the element tree, defined as ns_tracker_tree_builder in listing 2. This class receives expat parse events, but I was only able to figure out how to capture information from the namespace events in a "flat" manner: by updating a single dictionary each time I encounter a namespace declaration event (_start_ns). The problem with this is that all namespace scoping information is lost. I expect this approach will cause oddities in any document where a given namespace is used with more than one prefix at different points. I generally do not recommend such confusing use of namespaces in the first place (see my article "Use XML namespaces with care" for more details); in listing 1 I break my own rules because I want to test how XML processing libraries handle even untidy use of namespaces.
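To make the scoping problem concrete, here is a minimal sketch in plain Python 3 that does not touch ElementTree or expat at all. The `declarations` list is my own hand-written assumption of the order in which the namespace declarations of listing 1 would be reported, and `flat` is a hypothetical stand-in for the dictionary kept by the builder in listing 2.

```python
# Namespace declarations of listing 1 as (prefix, namespace URI) pairs,
# in assumed document order. None stands for the default namespace.
declarations = [
    (None, "http://example.com/product-info"),
    ("html", "http://www.w3.org/1999/xhtml"),
    ("p", "http://example.com/product-info"),
    ("html", "http://www.w3.org/1999/xhtml"),
    ("xl", "http://www.w3.org/1999/xlink"),
]

# One flat dictionary keyed by namespace URI, seeded with the xml prefix.
flat = {"http://www.w3.org/XML/1998/namespace": "xml"}
for prefix, uri in declarations:
    # A later declaration silently replaces an earlier one for the same URI.
    flat[uri] = prefix

# By the time the tree is walked, only the last prefix per namespace is left.
print(flat["http://example.com/product-info"])
```

Because the whole document is parsed before the tree is walked, every element in the product namespace is looked up against the final state of the dictionary, whichever prefix (or default declaration) it was actually written with.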

I can get a partial solution that maintains prefix information by using my specialized builder. The next challenge is using the resulting dictionary to extract prefixes, namespaces, and local names from the full James Clark notation. I created the function analyze_clark_name for this purpose. The rest of the listing is straightforward ElementTree code that completes the task at hand. The result is given in listing 3.

Listing 3: Output from listing 2 run against listing 1

Element namespace: None

Element local name: 'products'

Prefix used for element: None

Element namespace: 'http://example.com/product-info'

Element local name: 'product'

Prefix used for element: u'p'

Attribute namespace: None

Attribute local name: 'id'

Prefix used for attribute: None

Element namespace: 'http://example.com/product-info'

Element local name: 'name'

Prefix used for element: u'p'

Attribute namespace: 'http://www.w3.org/XML/1998/namespace'

Attribute local name: 'lang'

Prefix used for attribute: u'xml'

Element namespace: 'http://example.com/product-info'

Element local name: 'description'

Prefix used for element: u'p'

Element namespace: 'http://www.w3.org/1999/xhtml'

Element local name: 'code'

Prefix used for element: u'html'

Element namespace: 'http://example.com/product-info'

Element local name: 'code'

Prefix used for element: u'p'

Element namespace: 'http://example.com/product-info'

Element local name: 'product'

Prefix used for element: u'p'

Attribute namespace: None

Attribute local name: 'id'

Prefix used for attribute: None

Element namespace: 'http://example.com/product-info'

Element local name: 'name'

Prefix used for element: u'p'

Element namespace: 'http://example.com/product-info'

Element local name: 'description'

Prefix used for element: u'p'

Element namespace: 'http://example.com/product-info'

Element local name: 'code'

Prefix used for element: u'p'

Element namespace: 'http://www.w3.org/1999/xhtml'

Element local name: 'code'

Prefix used for element: u'html'

Element namespace: 'http://www.w3.org/1999/xhtml'

Element local name: 'div'

Prefix used for element: u'html'

Element namespace: None

Element local name: 'ref'

Prefix used for element: None

Attribute namespace: 'http://www.w3.org/1999/xlink'

Attribute local name: 'href'

Prefix used for attribute: u'xl'

Attribute namespace: 'http://www.w3.org/1999/xlink'

Attribute local name: 'type'

Prefix used for attribute: u'xl'

Scrutinizing this output I found a few problems: the elements of the first product, which use the default namespace in listing 1, are all reported with the prefix u'p'. As expected, this comes from the fact that my workaround for recording prefixes does not take into account the scope of namespace declarations and, in effect, always reports the last prefix seen for any given namespace. Notice also that plain strings are returned in most cases rather than Unicode objects. I find this problematic.
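For comparison, here is a sketch of scope-aware prefix tracking. It assumes the xml.etree.ElementTree module shipped with current Python 3, not the 1.2 beta of the elementtree package tested in this article, and relies on that module's iterparse function reporting "start-ns" and "end-ns" events. SAMPLE is a cut-down stand-in for listing 1, and lookup_prefix and scoped_prefixes are hypothetical helper names of my own.

```python
import io
import xml.etree.ElementTree as ET

# Cut-down stand-in for listing 1: one namespace, two different prefixes.
SAMPLE = b"""<products>
  <product id="1144" xmlns="http://example.com/product-info">
    <name xml:lang="en">Python Perfect IDE</name>
  </product>
  <p:product id="1166" xmlns:p="http://example.com/product-info">
    <p:name>XSLT Perfect IDE</p:name>
  </p:product>
</products>"""


def lookup_prefix(scopes, uri):
    """Return the innermost prefix still bound to uri, or None."""
    shadowed = set()
    for prefix, bound_uri in reversed(scopes):
        if prefix in shadowed:
            continue  # an inner declaration re-bound this prefix
        shadowed.add(prefix)
        if bound_uri == uri:
            return prefix
    return None


def scoped_prefixes(source):
    """Collect (local name, prefix) pairs using the declarations in scope."""
    scopes = [("xml", "http://www.w3.org/XML/1998/namespace")]
    found = []
    events = ("start-ns", "end-ns", "start")
    for event, payload in ET.iterparse(source, events=events):
        if event == "start-ns":
            scopes.append(payload)  # payload is a (prefix, uri) pair
        elif event == "end-ns":
            scopes.pop()  # declarations close in reverse order
        elif payload.tag.startswith("{"):
            uri, local = payload.tag[1:].split("}", 1)
            found.append((local, lookup_prefix(scopes, uri)))
        else:
            found.append((payload.tag, None))
    return found


for local, prefix in scoped_prefixes(io.BytesIO(SAMPLE)):
    print(local, repr(prefix))
```

The stack of declarations grows on each "start-ns" event and shrinks on each "end-ns" event, so the first product should be reported with the default (empty) prefix and the second with p, even though both share one namespace.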

ElementTree namespace mutation

The stock list of mutation tasks I've been using to test namespace handling is as follows:

1. Add a new element in the products namespace, but using no prefix.
2. Add a new element with a prefix and in the products namespace.
3. Add a new element that is not in any namespace.
4. Add a new global attribute in the XHTML namespace.
5. Add a new global attribute in the special XML namespace.
6. Add a new attribute in no namespace.
7. Remove only the code element in the XHTML namespace.
8. Remove a global attribute.
9. Remove an attribute that is not in any namespace.

Listing 4 includes code for the various tasks.

Listing 4: ElementTree code for the sample mutation tasks

import sys

from elementtree.ElementTree import ElementTree, SubElement

doc = ElementTree(file='products.xml')



PRODUCT_NS = u'http://example.com/product-info'

HTML_NS = u'http://www.w3.org/1999/xhtml'

XML_NS = u'http://www.w3.org/XML/1998/namespace'

XLINK_NS = u'http://www.w3.org/1999/xlink'



#Task 1 is not really possible



#Task 2

product = doc.getiterator(u'{%s}product'%PRODUCT_NS)[0]

new_element = SubElement(product, u'{%s}launch-date'%PRODUCT_NS)



#Task 3

product = doc.getiterator(u'{%s}product'%PRODUCT_NS)[0]

new_element = SubElement(product, u'island')



#Task 4

div = doc.getiterator(u'{%s}div'%HTML_NS)[0]

div.set(u'{%s}global'%HTML_NS, u'spam')



#Task 5

div.set(u'{%s}lang'%XML_NS, u'en')



#Task 6

div.set(u'class', u'eggs')



#Task 7

for desc in doc.getiterator(u'{%s}description'%PRODUCT_NS):

    code = desc.getiterator(u'{%s}code'%HTML_NS)[0]

    desc.remove(code)



#Task 8

ref = doc.getiterator(u'ref')[0]

del ref.attrib[u'{%s}href'%XLINK_NS]



#Task 9

product = doc.getiterator(u'{%s}product'%PRODUCT_NS)[0]

del product.attrib[u'id']



#write out the modified XML

doc.write(sys.stdout)

In general I navigate the tree by using the Clark notation name to create an iterator over all elements with the namespace and local name I want. I didn't bother to check the performance of this approach: it may be faster to use a path expression for this, although my experiments didn't yield a way to use the standard namespace conventions for XPath in ElementTree. Looking through the test routines I saw code along the lines of elem.findall("//{http://spam}egg"), but this is not at all valid XPath. Nevertheless, I tried doc.find(u'.//{%s}product'%PRODUCT_NS) and variations on this (including specifying the full path, and starting with doc.getroot()). No expression I tried returned any results, so I fell back to the tree-wide find/iterate approach.

All the ElementTree mutation interfaces worked as expected with the use of Clark notation. Listing 5 is the output from this script. You'll notice immediately the lack of prefix preservation, but with the exception of task 1, which I was unable to accomplish in ElementTree, the results are correct. ElementTree even correctly handles creation of global attributes such as html:global which happen to share their parent's namespace. All other tools I've examined so far have incorrectly omitted a prefix in this case.
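As a point of comparison for the path expressions discussed above, here is a small sketch against the xml.etree.ElementTree module in current Python 3. That is an assumption on my part: it is a later descendant of the package tested here, and its behavior says nothing about the 1.2 beta. The inline document and the pi prefix are made up for illustration.

```python
import xml.etree.ElementTree as ET

PRODUCT_NS = "http://example.com/product-info"

# Two products in the same namespace, written with different prefixes.
root = ET.fromstring(
    "<products>"
    '<product id="1144" xmlns="http://example.com/product-info"/>'
    '<p:product id="1166" xmlns:p="http://example.com/product-info"/>'
    "</products>"
)

# Clark notation inside the limited path syntax.
by_clark = root.findall(".//{%s}product" % PRODUCT_NS)

# Prefixed names resolved through a caller-supplied mapping. The prefix
# "pi" is arbitrary and need not match any prefix used in the document.
by_prefix = root.findall(".//pi:product", {"pi": PRODUCT_NS})

print([elem.get("id") for elem in by_clark])
print([elem.get("id") for elem in by_prefix])
```

Both queries match on namespace and local name only, so the prefixes in the source document play no part in the result.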

Listing 5: Output from listing 4

<products>

  <ns0:product xmlns:ns0="http://example.com/product-info">

    <ns0:name xml:lang="en">Python Perfect IDE</ns0:name>

    <ns0:description>

      Uses mind-reading technology to anticipate and accommodate

      all user needs in Python development.  Implements all

      <ns0:code>1166</ns0:code>.

    </ns0:description>

  <ns0:launch-date /><island /></ns0:product>

  <ns1:product id="1166" xmlns:ns1="http://example.com/product-info">

    <ns1:name>XSLT Perfect IDE</ns1:name>

    <ns1:description>

      <ns1:code>red</ns1:code>

      <html:div class="eggs" html:global="spam" xml:lang="en"

                xmlns:html="http://www.w3.org/1999/xhtml">

        <ref ns3:type="simple"

                xmlns:ns3="http://www.w3.org/1999/xlink">A link</ref>

      </html:div>

    </ns1:description>

  </ns1:product>

</products>
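The generated ns0 and ns1 prefixes above are what you get when the serializer has to invent a prefix. As a hedged sketch, the xml.etree.ElementTree module in current Python 3 (again an assumption, not the 1.2 beta tested here) offers register_namespace to choose the prefix used on output. The small tree built below is illustrative only and is not the output of listing 4.

```python
import xml.etree.ElementTree as ET

PRODUCT_NS = "http://example.com/product-info"
HTML_NS = "http://www.w3.org/1999/xhtml"

# Registration is global: it affects every later serialization in the process.
ET.register_namespace("p", PRODUCT_NS)
ET.register_namespace("html", HTML_NS)

root = ET.Element("products")
product = ET.SubElement(root, "{%s}product" % PRODUCT_NS, {"id": "1166"})
ET.SubElement(product, "{%s}launch-date" % PRODUCT_NS)

# A global attribute that shares its parent element's namespace.
div = ET.SubElement(product, "{%s}div" % HTML_NS)
div.set("{%s}global" % HTML_NS, "spam")

text = ET.tostring(root, encoding="unicode")
print(text)
```

This only controls which prefix is written for a namespace. It does not restore the prefixes of the source document, so a namespace that appeared under two prefixes on input still comes out under one.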

Reading namespaces with libxml/Python

As I discussed in my article about libxml there are several available mechanisms for processing XML, including SAX and DOM variations. I focused on the more unusual API, XmlTextReader, and in this discussion of namespace processing I shall continue to focus on this API, which means that I'll only worry about how to read namespace information. You should be able to perform mutation using similar DOM idioms to those I presented in the first namespace article. Listing 6 is the XmlTextReader equivalent of listing 2.

Listing 6: libxml code to display namespace information for elements and attributes

import sys

import cStringIO

import libxml2



XMLNS_NS = 'http://www.w3.org/2000/xmlns/'

XMLREADER_START_ELEMENT_NODE_TYPE = 1



input = open(sys.argv[1])

input_source = libxml2.inputBuffer(input)

reader = input_source.newTextReader("urn:bogus")



while reader.Read():

    if reader.NodeType() == XMLREADER_START_ELEMENT_NODE_TYPE:

        print "Element namespace:", repr(reader.NamespaceUri())

        print "Element local name:", repr(reader.LocalName())

        print "Prefix used for element:", repr(reader.Prefix())

        while reader.MoveToNextAttribute():

            #Ignore namespace declarations

            if reader.NamespaceUri() != XMLNS_NS:

                print "Attribute namespace:", repr(reader.NamespaceUri())

                print "Attribute local name:", repr(reader.LocalName())

                print "Prefix used for attribute:", repr(reader.Prefix())

Besides the fact that again the element names and prefixes are not returned as Unicode objects, the results are as expected.

PyRXPU and namespaces

PyRXPU is part of the PyRXP package, and the only part I recommend using, as I discussed at length in my recent article on PyRXP. PyRXP is "non-Unicode" by default, but this default configuration is not an XML parser at all. You do have to get a CVS version of the package in order to use PyRXPU. The latest release, 0.9, does not include it. I provided details for installing from CVS in my earlier article. I did try to update to more recent CVS code this time, but my attempts to use PyRXPU in the latest PyRXP code resulted in core dumps on my Dell Inspiron 8600 running Fedora Core 2, so I reverted to the CVS code I used back in February. I didn't see anything in the CVS logs since February indicating any significant changes in namespace handling, so I assumed this would still be a current test.

By default PyRXP doesn't do any special namespace processing and returns namespace declarations as regular attributes. There are several parser parameters regarding namespace processing. One, ReturnNamespaceAttributes, is described strangely in the documentation as not returning XML namespace declarations by default. This seems to be incorrect. The second, XMLNamespaces, is described in the documentation thusly:

If this is on, the parser processes namespace declarations (see below). Namespace declarations are not returned as part of the list of attributes on an element.

I wasn't able to find whatever passage might have been referenced in the "see below" phrase: this sentence was pretty much the last one concerning namespaces in the document. I came to wish I could find more on namespaces once I tried out namespace processing in listing 7.

Listing 7: Code to parse a document in namespace processing mode

import sys

import pyRXPU



parser = pyRXPU.Parser()

parser.XMLNamespaces = 1

doc_source = open(sys.argv[1]).read()

doc = parser.parse(doc_source)



import pprint

pprint.pprint(doc)

The result of running this against listing 1 is very odd:

Listing 8: Results of namespace-aware reading of listing 1 in PyRXPU

(u'products',

 None,

 [u'\n  ',

  (u'product',

   {u'id': u'1144'},

   [u'\n    ',

    (u'name', {u'xml:lang': u'en'}, [u'Python Perfect IDE'], None),

    u'\n    ',

    (u'description',

     None,

     [u'\n      Uses mind-reading technology to anticipate and '

       'accommodate\n   all user needs in Python development.  '

       'Implements all\n      ',

      (u'html:code', None, [u'from __future__ import'], None),

      u' features though\n      the year 3000.  Works well with ',

      (u'code', None, [u'1166'], None),

      u'.\n    '],

     None),

    u'\n  '],

   None),

  u'\n  ',

  (u'p:product',

   {u'id': u'1166'},

   [u'\n    ',

    (u'p:name', None, [u'XSLT Perfect IDE'], None),

    u'\n    ',

    (u'p:description',

     None,

     [u'\n      ',

      (u'p:code', None, [u'red'], None),

      u'\n      ',

      (u'html:code', None, [u'blue'], None),

      u'\n      ',

      (u'html:div',

       None,

       [u'\n        ',

        (u'ref',

         {u'xl:type': u'simple', u'xl:href': u'index.xml'},

         [u'A link'],

         None),

        u'\n      '],

       None),

      u'\n    '],

     None),

    u'\n  '],

   None),

  u'\n'],

 None)

The important information -- the namespaces -- is omitted while the unimportant details -- the prefixes -- are included as part of element names. This makes namespace processing very difficult. I tried a lot of tweaking and other options to try to get all the information needed for ready namespace processing without having to knit it all back together by hand after turning off the namespace option (the only difference upon omitting the parser.XMLNamespaces = 1 line is that namespace declarations are returned as attributes). In the end I was not really able to tackle any of the namespace reading or mutation tasks without processing namespaces entirely by hand (which you can do with any toolkit, namespace aware or no), and I conclude that PyRXPU does not really support namespace processing.
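To give an idea of what processing namespaces entirely by hand involves, here is a minimal sketch in Python 3 that needs no PyRXP installation. It assumes the (name, attributes, children, spare) tuple layout shown in listing 8 and assumes namespace declarations arrive as ordinary xmlns attributes, as described above for the default mode. The resolve function and the small tree are hypothetical illustrations of mine, not part of PyRXP.

```python
XML_NS = "http://www.w3.org/XML/1998/namespace"


def resolve(node, scope=None):
    """Return a copy of a tuple tree with names rewritten in Clark notation."""
    scope = dict(scope or {"xml": XML_NS})  # copy, so siblings are unaffected
    name, attrs, children, spare = node
    attrs = attrs or {}

    # First pick up the declarations made on this element itself.
    for key, value in attrs.items():
        if key == "xmlns":
            scope[""] = value
        elif key.startswith("xmlns:"):
            scope[key[6:]] = value

    def clark(qname, is_attribute=False):
        prefix, sep, local = qname.partition(":")
        if sep:
            uri = scope[prefix]  # KeyError means an undeclared prefix
        else:
            local = qname
            # Unprefixed attributes never take the default namespace.
            uri = None if is_attribute else scope.get("")
        return "{%s}%s" % (uri, local) if uri else local

    new_attrs = {
        clark(key, True): value
        for key, value in attrs.items()
        if key != "xmlns" and not key.startswith("xmlns:")
    }
    if children is not None:
        children = [
            resolve(child, scope) if isinstance(child, tuple) else child
            for child in children
        ]
    return (clark(name), new_attrs or None, children, spare)


tree = (
    "products", None, [
        ("product", {"id": "1144", "xmlns": "http://example.com/product-info"},
         [("name", {"xml:lang": "en"}, ["Python Perfect IDE"], None)], None),
        ("p:product", {"id": "1166", "xmlns:p": "http://example.com/product-info"},
         [("p:name", None, ["XSLT Perfect IDE"], None)], None),
    ], None)

resolved = resolve(tree)
print(resolved[2][0][0])
print(resolved[2][1][0])
```

The scope dictionary is copied on the way down, so a declaration made on one element is visible to its descendants but never leaks to its siblings. That is exactly the scoping information a flat approach loses.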

Wrap up

In this batch of namespace tests the results have been a mixed bag. ElementTree supports namespaces properly, but makes it very hard to work with prefixes, which is acceptable given that prefixes are a mere syntactic convenience. I would hesitate to use ElementTree where I needed the convenience of preserved prefixes. PyRXPU seems to either report namespace declarations literally, without any API benefits, or discard the namespace information altogether, which is as much as to say it doesn't support namespace processing. libxml, as one expects from such a comprehensive library, handles namespaces effortlessly. I barely scratched the surface in this article of how to process namespaces in libxml, but I do show the SAX and DOM approaches in my earlier article. I expect to wrap up this series on namespace processing next by looking at how some data binding tools handle namespaces.

It has been a busy month for my colleagues in the Python-XML community, including work by Brett Hartshorn on yet another small Python DOM implementation, xmlapi 0.2.1. It's billed as an "even smaller XML DOM implementation than Python's standard xml.dom.minidom" and claims performance and feature improvements over minidom.

Philippe Normand debuted XMLObject 0.0.2, a data binding tool which allows you to map from customized Python classes to XML and vice versa. See the announcement.

Fredrik Lundh announced ElementTree 1.2. I'm somewhat confused at this point as to whether the package is supposed to be called "ElementTree" or "elementtree", but I think current clues suggest the former. This release just makes official the various experimental features such as XPath support which I have already discussed. 1.2 final appeared after I wrapped up this article and the namespace discussion is based on the most recent beta rather than the final 1.2 release, but I expect not much has changed in between the two. See the announcement.

Brian Quinlan, who has also been busy helping organize the Vancouver Python Workshop, announced Pyana 0.9.0, the latest release of his Python interface to the Xalan-C XSLT processor. Changes include updates for Xalan 1.8/Xerces 2.5, basic support for tracing, and removal of the transform-to-DOM support, with promises of a better replacement in the future. See the announcement.

Magnus Lie Hetland has updated Atox to version 0.5. Atox allows you to write custom scripts for converting plain text into XML. You define the text-to-XML binding using a simple XML language. It's meant to be used from the command line. Changes since 0.1 include language improvements and added support for config files and for XSLT fragments in Atox format files. See the full announcement.

Michael Twomey announced pygenx 0.5.2, a wrapper for Tim Bray's XML generation library Genx. Genx is a C library and its output is canonical XML. PyGenx wraps the full API. See the announcement.

Christof Hoeke announced pyXSLdoc 0.51, a Python tool for generating documentation of XSLT code in a similar approach to Javadoc. Version 0.60b is actually available, but the most recent announcement is for 0.52.

I recently discovered the Simple Objects from XML (SOX) module buried deep within the Python Enterprise Application Kit (PEAK). PEAK is a components toolkit for large-scale applications (the developers claim it is as powerful as J2EE but not as complex). SOX is another XML data binding toolkit which uses SAX events to build an object the user can define based on classes set up for namespace aware or namespace oblivious usage.