XML.com: XML From the Inside Out

XML Namespaces Support in Python Tools, Part Three

June 30, 2004

In the last two articles I've discussed namespace handling in Python 2.3's SAX and minidom libraries and in 4Suite. In this article I focus on ElementTree, libxml/Python and PyRXPU. I recommend reading or reviewing those articles first, as well as the earlier articles in this namespace series (part 1 and part 2).

I shall be using, where applicable, the same scenarios I did in the prior articles, based on the same namespace torture test document.

Listing 1: Sample document that uses many XML namespace features and oddities

<products>
  <product id="1144"
    xmlns="http://example.com/product-info"
    xmlns:html="http://www.w3.org/1999/xhtml"
  >
    <name xml:lang="en">Python Perfect IDE</name>
    <description>
      Uses mind-reading technology to anticipate and accommodate
      all user needs in Python development.  Implements all
      <html:code>from __future__ import</html:code> features though
      the year 3000.  Works well with <code>1166</code>.
    </description>
  </product>
  <p:product id="1166" xmlns:p="http://example.com/product-info">
    <p:name>XSLT Perfect IDE</p:name>
    <p:description
      xmlns:html="http://www.w3.org/1999/xhtml"
      xmlns:xl="http://www.w3.org/1999/xlink"
    >
      <p:code>red</p:code>
      <html:code>blue</html:code>
      <html:div>
        <ref xl:type="simple" xl:href="index.xml">A link</ref>
      </html:div>
    </p:description>
  </p:product>
</products>

Reading namespaces with ElementTree

I already covered aspects of ElementTree's API for reading namespaces in my earlier article. As I mentioned then, ElementTree supports XML namespaces by using James Clark's notation directly for element and attribute names. This is a rather different mechanism from most XML processing APIs, and we'll find out how smoothly it works in comparison. Listing 2 displays the local name, namespace and prefix of each element and attribute in a document. For this article I updated to version 1.2c1-20040615 of the software.
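As a quick illustration of Clark notation, here is a minimal sketch. It assumes the xml.etree.ElementTree module from the Python 3 standard library, a descendant of the elementtree package discussed here, rather than the 1.2c1 release I tested. The split_clark helper and the SAMPLE document are hypothetical names introduced only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-element sample, cut down from listing 1
SAMPLE = (
    '<p:product xmlns:p="http://example.com/product-info" id="1166">'
    '<p:name xml:lang="en">XSLT Perfect IDE</p:name>'
    '</p:product>'
)

def split_clark(name):
    """Split '{uri}local' into (uri, local); unqualified names give (None, name)."""
    if name.startswith('{'):
        uri, local = name[1:].split('}', 1)
        return uri, local
    return None, name

root = ET.fromstring(SAMPLE)
name_elem = root[0]

# Element names carry the namespace URI in braces; the prefix is gone
print(root.tag)
print(split_clark(root.tag))

# Namespaced attributes use the same notation; plain ones stay unqualified
print(list(name_elem.attrib))
print(list(root.attrib))
```

Note that the prefix p appears nowhere in the parsed tree, which is why the prefix has to be recovered by other means.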

Listing 2: ElementTree code to display namespace information for elements and attributes

import sys
from elementtree.ElementTree import ElementTree, XMLTreeBuilder

class ns_tracker_tree_builder(XMLTreeBuilder):
    def __init__(self):
        XMLTreeBuilder.__init__(self)
        self._parser.StartNamespaceDeclHandler = self._start_ns
        self.namespaces = {u'http://www.w3.org/XML/1998/namespace':
                           u'xml'}
 
    def _start_ns(self, prefix, ns):
        self.namespaces[ns] = prefix

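def analyze_clark_name(name, nsdict):
    #Split a name in Clark notation, "{namespace}local", into its parts
    #and return a (prefix, local name, namespace) tuple
    if name[0] == '{':
        ns, local = name[1:].split("}")
    else:
        #No leading brace: the name is not in any namespace
        return None, name, None
    #Look up the last prefix recorded for this namespace
    prefix = nsdict[ns]
    if prefix is None:
        #The namespace was last declared as a default namespace (no prefix)
        prefix = u"!Unknown"
    return prefix, local, ns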
        
parser = ns_tracker_tree_builder()
etree = ElementTree()
root = etree.parse(sys.argv[1], parser)

#Create an iterator
iter = root.getiterator()
#Iterate
for elem in iter:
    prefix, local, ns = analyze_clark_name(elem.tag, parser.namespaces)
    print "Element namespace:", repr(ns)
    print "Element local name:", repr(local)
    print "Prefix used for element:", repr(prefix)
    for name, value in elem.items():
        prefix, local, ns = analyze_clark_name(name, parser.namespaces)
        print "Attribute namespace:", repr(ns)
        print "Attribute local name:", repr(local)
        print "Prefix used for attribute:", repr(prefix)

As I discussed in the earlier article, ElementTree does not maintain namespace prefix information. This made my task in this listing much trickier. I found out how to use a specialized class to build the element tree, defined as ns_tracker_tree_builder in listing 2. This class receives expat parse events, but I was only able to figure out how to capture information from the namespace events in a "flat" manner: by updating a single dictionary each time I encounter a namespace declaration event (_start_ns). The problem with this is that all namespace scoping information is lost. I expect this approach to cause oddities in any document where a given namespace is used with more than one prefix at different points. I generally do not recommend such confusing use of namespaces in the first place (see my article "Use XML namespaces with care" for more details); in listing 1 I break my own rules because I want to test how XML processing libraries handle even untidy use of namespaces.
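For comparison, here is a minimal sketch of one way to keep scoping information, by maintaining a stack of prefix mappings while parsing. It assumes the iterparse function and its start-ns event from the Python 3 standard library's xml.etree.ElementTree, not the 1.2c1 release used in listing 2. The prefixes_in_scope generator and the DOC sample are hypothetical names for illustration. The sketch still maps each namespace to a single prefix, so it cannot tell which prefix an element really used when two prefixes for the same namespace are in scope at once, and it does not handle undeclaring a default namespace.

```python
import io
import xml.etree.ElementTree as ET

XML_NS = 'http://www.w3.org/XML/1998/namespace'

def prefixes_in_scope(source):
    """Yield (element, {uri: prefix}) with the bindings in scope at each start tag."""
    scopes = [{XML_NS: 'xml'}]  # stack of uri -> prefix maps, one per open element
    pending = {}                # declarations reported before the next start tag
    for event, data in ET.iterparse(source, events=('start-ns', 'start', 'end')):
        if event == 'start-ns':
            prefix, uri = data
            pending[uri] = prefix or None  # '' means a default namespace
        elif event == 'start':
            scope = dict(scopes[-1])       # inherit the parent's bindings
            scope.update(pending)          # add this element's declarations
            pending = {}
            scopes.append(scope)
            yield data, scope
        else:                              # 'end': leave the element's scope
            scopes.pop()

# Hypothetical sample: one namespace, first as default, then with prefix p
DOC = b'''<products>
  <product xmlns="http://example.com/product-info"><name/></product>
  <p:product xmlns:p="http://example.com/product-info"><p:name/></p:product>
</products>'''

report = []
for elem, scope in prefixes_in_scope(io.BytesIO(DOC)):
    if elem.tag.startswith('{'):
        uri, local = elem.tag[1:].split('}', 1)
        report.append((local, scope.get(uri)))
    else:
        report.append((elem.tag, None))
print(report)
```

Because each element gets its own copy of the mapping, a declaration on one product element no longer leaks into its sibling.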

I can get a partial solution that maintains prefix information by using my specialized builder. The next challenge is using the resulting dictionary to extract prefixes, namespaces, and local names from the full James Clark notation. I created the function analyze_clark_name for this purpose. The rest of the listing is straightforward ElementTree code that completes the task at hand. The result is given in listing 3.

Listing 3: Output from listing 2 run against listing 1

Element namespace: None
Element local name: 'products'
Prefix used for element: None
Element namespace: 'http://example.com/product-info'
Element local name: 'product'
Prefix used for element: u'p'
Attribute namespace: None
Attribute local name: 'id'
Prefix used for attribute: None
Element namespace: 'http://example.com/product-info'
Element local name: 'name'
Prefix used for element: u'p'
Attribute namespace: 'http://www.w3.org/XML/1998/namespace'
Attribute local name: 'lang'
Prefix used for attribute: u'xml'
Element namespace: 'http://example.com/product-info'
Element local name: 'description'
Prefix used for element: u'p'
Element namespace: 'http://www.w3.org/1999/xhtml'
Element local name: 'code'
Prefix used for element: u'html'
Element namespace: 'http://example.com/product-info'
Element local name: 'code'
Prefix used for element: u'p'
Element namespace: 'http://example.com/product-info'
Element local name: 'product'
Prefix used for element: u'p'
Attribute namespace: None
Attribute local name: 'id'
Prefix used for attribute: None
Element namespace: 'http://example.com/product-info'
Element local name: 'name'
Prefix used for element: u'p'
Element namespace: 'http://example.com/product-info'
Element local name: 'description'
Prefix used for element: u'p'
Element namespace: 'http://example.com/product-info'
Element local name: 'code'
Prefix used for element: u'p'
Element namespace: 'http://www.w3.org/1999/xhtml'
Element local name: 'code'
Prefix used for element: u'html'
Element namespace: 'http://www.w3.org/1999/xhtml'
Element local name: 'div'
Prefix used for element: u'html'
Element namespace: None
Element local name: 'ref'
Prefix used for element: None
Attribute namespace: 'http://www.w3.org/1999/xlink'
Attribute local name: 'href'
Prefix used for attribute: u'xl'
Attribute namespace: 'http://www.w3.org/1999/xlink'
Attribute local name: 'type'
Prefix used for attribute: u'xl'

Scrutinizing this output, I found a few problems: the first product element and its name, description and unprefixed code children use the default namespace in listing 1, yet each is reported with the prefix u'p'. As expected, this comes from the fact that my workaround for recording prefixes does not take into account the scope of namespace declarations and, in effect, always reports the last prefix seen for any given namespace. Notice also that plain strings are returned in most cases rather than Unicode objects, which I find problematic.

ElementTree namespace mutation

The stock list of mutation tasks I've been using to test namespace handling is as follows:

  1. Add a new element in the products namespace, but using no prefix.
  2. Add a new element with a prefix and in the products namespace.
  3. Add a new element that is not in any namespace.
  4. Add a new global attribute in the XHTML namespace.
  5. Add a new global attribute in the special XML namespace.
  6. Add a new attribute in no namespace.
  7. Remove only the code element in the XHTML namespace.
  8. Remove a global attribute.
  9. Remove an attribute that is not in any namespace.

Listing 4 includes code for the various tasks.

Listing 4: ElementTree code for the sample mutation tasks

import sys
from elementtree.ElementTree import ElementTree, SubElement
doc = ElementTree(file='products.xml')

PRODUCT_NS = u'http://example.com/product-info'
HTML_NS = u'http://www.w3.org/1999/xhtml'
XML_NS = u'http://www.w3.org/XML/1998/namespace'
XLINK_NS = u'http://www.w3.org/1999/xlink'

#Task 1 is not really possible

#Task 2
product = doc.getiterator(u'{%s}product'%PRODUCT_NS)[0]
new_element = SubElement(product, u'{%s}launch-date'%PRODUCT_NS)

#Task 3
product = doc.getiterator(u'{%s}product'%PRODUCT_NS)[0]
new_element = SubElement(product, u'island')

#Task 4
div = doc.getiterator(u'{%s}div'%HTML_NS)[0]
div.set(u'{%s}global'%HTML_NS, u'spam')

#Task 5
div.set(u'{%s}lang'%XML_NS, u'en')

#Task 6
div.set(u'class', u'eggs')

#Task 7
for desc in doc.getiterator(u'{%s}description'%PRODUCT_NS):
    code = desc.getiterator(u'{%s}code'%HTML_NS)[0]
    desc.remove(code)

#Task 8
ref = doc.getiterator(u'ref')[0]
del ref.attrib[u'{%s}href'%XLINK_NS]

#Task 9
product = doc.getiterator(u'{%s}product'%PRODUCT_NS)[0]
del product.attrib[u'id']

#write out the modified XML
doc.write(sys.stdout)

  

In general I navigate the tree by using the Clark notation name to create an iterator over all elements with the namespace and local name I want. I didn't bother to check the performance of this approach: it may be faster to use a path expression for this, although my experiments didn't yield a way to use the standard namespace conventions for XPath in ElementTree. Looking through the test routines I saw code along the lines of elem.findall("//{http://spam}egg"), but this is not at all valid XPath. Nevertheless, I tried doc.find(u'.//{%s}product'%PRODUCT_NS) and variations on this (including specifying the full path, and starting with doc.getroot()). No expression I tried returned any results, so I fell back to the tree-wide find/iterate approach.

All the ElementTree mutation interfaces worked as expected with the use of Clark notation. Listing 5 is the output from this script. You'll notice immediately the lack of prefix preservation, but with the exception of task 1, which I was unable to accomplish in ElementTree, the results are correct. ElementTree even correctly handles creation of global attributes such as html:global which happen to share their parent's namespace. All other tools I've examined so far have incorrectly omitted a prefix in this case.
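For readers on a current Python, here is a minimal sketch of how path lookups and output prefixes can be handled. It assumes the xml.etree.ElementTree module from the Python 3 standard library, so it says nothing about the 1.2c1 release tested above, where my path expressions returned no results. The DOC sample and the pi prefix are arbitrary choices for this example.

```python
import xml.etree.ElementTree as ET

PRODUCT_NS = 'http://example.com/product-info'

# Hypothetical sample: the same namespace as a default and with prefix p
DOC = (
    '<products>'
    '<product xmlns="http://example.com/product-info" id="1144"/>'
    '<p:product xmlns:p="http://example.com/product-info" id="1166"/>'
    '</products>'
)
root = ET.fromstring(DOC)

# Clark notation used directly inside a path expression
by_clark = root.findall('.//{%s}product' % PRODUCT_NS)

# The same lookup with a caller-supplied prefix mapping
by_prefix = root.findall('.//pi:product', namespaces={'pi': PRODUCT_NS})

print([e.get('id') for e in by_clark])
print([e.get('id') for e in by_prefix])

# Prefixes from the source are still not preserved, but the prefix used on
# output can be chosen per namespace (this registration is process-wide)
ET.register_namespace('p', PRODUCT_NS)
ET.SubElement(by_clark[0], '{%s}launch-date' % PRODUCT_NS)
serialized = ET.tostring(root, encoding='unicode')
print(serialized)
```

The prefix in the namespaces mapping is local to the query and need not match anything in the document, which is the usual XPath convention.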
