XML.com: XML From the Inside Out

I have covered a lot of tools for processing XML in Python. In general I have deferred discussion of each tool's handling of XML namespaces in order to stick to the basics in the individual treatments. In this article I start to examine the support for XML namespaces in these packages, with a look at SAX and DOM from the standard Python library.

But first, a warning. XML namespaces are largely a matter of shrugging acceptance among most XML users, but they are terribly controversial among XML experts. The controversy is for good reason. Namespaces solve a difficult problem, and there are many approaches to solving it, each of which has its pros and cons.

The W3C XML namespaces specification is a compromise and, as with all compromises, falls a bit short of addressing the needs of each faction. Even after all this time, namespaces have proven very difficult to incorporate smoothly into the information architecture of XML processing, which is why most namespace-processing APIs are clumsy and sprinkled with landmines for the unwary.

The lesson is not to use XML namespaces as a reflex. Think carefully about why and how you plan to use any namespaces you introduce. There are some useful design principles for namespaces that can help reduce problems. These are out of the scope of this article, but I shall be covering them in an upcoming IBM developerWorks article.

Sample Document

In order to exercise the various APIs I will use a rather contrived sample XML document. It exercises the following quirks and qualities:

  • Use of multiple namespaces (with different prefixes).
  • Local name clashes across namespaces.
  • Use of the default namespace.
  • Use of namespaces in mixed content.
  • Elements in no namespace.
  • The special namespace bound to the prefix "xml", which need not be declared.
  • What are sometimes called global attributes (i.e., attributes with prefixes and thus explicitly in a namespace).

Listing 1: Sample Document with Many XML Namespace Features and Oddities

<products>
  <product id="1144"
    xmlns="http://example.com/product-info"
    xmlns:html="http://www.w3.org/1999/xhtml"
  >
    <name xml:lang="en">Python Perfect IDE</name>
    <description>
      Uses mind-reading technology to anticipate and accommodate
      all user needs in Python development.  Implements all
      <html:code>from __future__ import</html:code> features through
      the year 3000.  Works well with <code>1166</code>.
    </description>
  </product>
  <p:product id="1166" xmlns:p="http://example.com/product-info">
    <p:name>XSLT Perfect IDE</p:name>
    <p:description
      xmlns:html="http://www.w3.org/1999/xhtml"
      xmlns:xl="http://www.w3.org/1999/xlink"
    >
      <p:code>red</p:code>
      <html:code>blue</html:code>
      <html:div>
        <ref xl:type="simple" xl:href="index.xml">A link</ref>
      </html:div>
    </p:description>
  </p:product>
</products>

  

I'll be looking most importantly at how the various tools report namespaces and, where document mutation is relevant, how to express namespaces in element and attribute creation and modification. Namespace prefixes are strictly syntactic conveniences, but as a matter of interest I shall have a look at how the tools handle prefixes.

SAX and Namespaces

The SAX library that comes with Python is based on SAX 2.0, and is fully namespace aware. Namespaces of elements and attributes are reported using a conventional data structure of the form (namespace, local-name), qname. One can extract the prefix from the qname value. The handling of namespaces in SAX is a little bit clumsy, in part from the awkwardness of namespaces themselves and in part from the awkwardness of the original SAX 2 interface in Java. Listing 2 is SAX code that displays the local name, namespace, and prefix of each element and attribute in a document.

Listing 2: SAX Code to Display Namespace Info for Elements and Attributes

import sys
from xml import sax

#Subclass from ContentHandler in order to gain default behaviors
class ns_test_handler(sax.ContentHandler):

    def startElementNS(self, name, qname, attributes):
        (namespace, localname) = name
        prefix = self._split_qname(qname)[0]
        print "Element namespace:", repr(namespace)
        print "Element local name:", repr(localname)
        print "Prefix used for element:", repr(prefix)
        for name, value in attributes.items():
            (namespace, localname) = name
            qname = attributes.getQNameByName(name)
            prefix = self._split_qname(qname)[0]
            print "Attribute namespace:", repr(namespace)
            print "Attribute local name:", repr(localname)
            print "Prefix used for attribute:", repr(prefix)
        return

    def _split_qname(self, qname):
        qname_split = qname.split(':')
        if len(qname_split) == 2:
            prefix, local = qname_split
        else:
            prefix = None
            local = qname_split[0]
        return prefix, local


if __name__ == "__main__":
    parser = sax.make_parser()
    parser.setContentHandler(ns_test_handler())
    parser.setFeature(sax.handler.feature_namespaces, 1)
    parser.setFeature(sax.handler.feature_namespace_prefixes, 1)
    parser.parse(sys.argv[1]) 

At the bottom of this listing I take care to enable a couple of SAX features relating to namespace processing. sax.handler.feature_namespaces instructs the parser to send namespace-aware events such as startElementNS, rather than plain events like startElement. sax.handler.feature_namespace_prefixes instructs the parser to preserve and report namespace prefixes. Without this feature a parser is free to report None for any QName, which means your SAX handler would not have access to the prefixes used in the document. In the handler method for the startElementNS event I show code for extracting all parts of the namespace-related information.
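When a parser does not report QNames (the expat-based parser bundled with current Python 3 releases appears to pass None as the element QName), one fallback is to track prefix declarations through the startPrefixMapping and endPrefixMapping events of the SAX ContentHandler interface. The following is a minimal Python 3 sketch and is not part of Listing 2. The names PrefixRecorder and candidate_prefixes and the small inline document are assumptions made for illustration. It reports the prefixes currently bound to each element's namespace, which is not necessarily the prefix actually written in the source.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# Hypothetical cut-down variant of the sample document
SAMPLE = b"""<products>
  <product xmlns="http://example.com/product-info"
           xmlns:html="http://www.w3.org/1999/xhtml">
    <html:code>x</html:code>
  </product>
  <p:product xmlns:p="http://example.com/product-info"/>
</products>"""


class PrefixRecorder(ContentHandler):
    """Tracks in-scope prefix bindings and records candidates per element."""

    def __init__(self):
        super().__init__()
        self.in_scope = {}  # prefix -> stack of URIs, innermost last
        self.report = []    # (local name, namespace, candidate prefixes)

    def startPrefixMapping(self, prefix, uri):
        # prefix is None for a default namespace declaration
        self.in_scope.setdefault(prefix, []).append(uri)

    def endPrefixMapping(self, prefix):
        self.in_scope[prefix].pop()

    def startElementNS(self, name, qname, attributes):
        namespace, localname = name
        # Every prefix whose innermost binding is this element's namespace
        prefixes = sorted(
            (p for p, uris in self.in_scope.items()
             if uris and uris[-1] == namespace),
            key=lambda p: p or "",
        )
        self.report.append((localname, namespace, prefixes))


def candidate_prefixes(data):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    handler = PrefixRecorder()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.report


for localname, namespace, prefixes in candidate_prefixes(SAMPLE):
    print(localname, repr(namespace), prefixes)
```

A None entry in the list of candidates stands for the default namespace. When two prefixes are bound to the same namespace at once, this approach cannot tell which one the author typed.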

The parameter attributes arrives as an instance of the class xml.sax.xmlreader.AttributesNS, which behaves like a dictionary where the keys are the (namespace, local-name) tuples and the values are the attribute values. There is also a set of special methods for this class that are documented in the Python Library Reference. I use one of these methods, getQNameByName, which takes one of the name tuples and returns the corresponding QName.

The output of Listing 2 run against our sample document is as follows:

$ python listing2.py products.xml
Element namespace: None
Element local name: u'products'
Prefix used for element: None
Element namespace: u'http://example.com/product-info'
Element local name: u'product'
Prefix used for element: None
Attribute namespace: None
Attribute local name: u'id'
Prefix used for attribute: None
Element namespace: u'http://example.com/product-info'
Element local name: u'name'
Prefix used for element: None
Attribute namespace: u'http://www.w3.org/XML/1998/namespace'
Attribute local name: u'lang'
Prefix used for attribute: u'xml'
Element namespace: u'http://example.com/product-info'
Element local name: u'description'
Prefix used for element: None
Element namespace: u'http://www.w3.org/1999/xhtml'
Element local name: u'code'
Prefix used for element: u'html'
Element namespace: u'http://example.com/product-info'
Element local name: u'code'
Prefix used for element: None
Element namespace: u'http://example.com/product-info'
Element local name: u'product'
Prefix used for element: u'p'
Attribute namespace: None
Attribute local name: u'id'
Prefix used for attribute: None
Element namespace: u'http://example.com/product-info'
Element local name: u'name'
Prefix used for element: u'p'
Element namespace: u'http://example.com/product-info'
Element local name: u'description'
Prefix used for element: u'p'
Element namespace: u'http://example.com/product-info'
Element local name: u'code'
Prefix used for element: u'p'
Element namespace: u'http://www.w3.org/1999/xhtml'
Element local name: u'code'
Prefix used for element: u'html'
Element namespace: u'http://www.w3.org/1999/xhtml'
Element local name: u'div'
Prefix used for element: u'html'
Element namespace: None
Element local name: u'ref'
Prefix used for element: None
Attribute namespace: u'http://www.w3.org/1999/xlink'
Attribute local name: u'type'
Prefix used for attribute: u'xl'
Attribute namespace: u'http://www.w3.org/1999/xlink'
Attribute local name: u'href'
Prefix used for attribute: u'xl'

As you can see, all the namespace-related values are given as Unicode objects. This is the right thing to do for prefix and name values because these use the Unicode basis for XML names. Namespaces, however, are URIs, and therefore must be represented using ASCII. This means that it is probably OK to use plain strings for namespaces, but I can't argue with the consistency of Unicode across the board.

An important thing to notice is that None is given as the namespace value for elements and attributes that are not in a namespace. Similarly None is given as the prefix for elements and attributes that are not represented with a prefix. These are standard Python conventions and you should never use the empty string to represent such cases. I recommend in general using the constants defined in the Python DOM core interface, both of which are set to None:

from xml.dom import EMPTY_NAMESPACE
from xml.dom import EMPTY_PREFIX

Also notice that the attribute xml:lang is shown as bound to the namespace http://www.w3.org/XML/1998/namespace even though no such namespace is declared. This is because this is a special namespace that is implicitly declared as bound to the prefix xml; it must be handled as such by namespace-compliant tools. There is also a convenience constant in the Python DOM interface for this special namespace, xml.dom.XML_NAMESPACE.
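The conventions described above can be checked with a short Python 3 sketch. It is an illustration under stated assumptions, not a port of Listing 2: the names NameCollector and collect_names and the cut-down inline document are hypothetical. It uses only the (namespace, local-name) tuples, compares them against the xml.dom constants, and ignores QNames. In Python 3 all of these values are ordinary str objects, so the plain-string versus Unicode distinction no longer arises.

```python
import io
import xml.sax
from xml.dom import EMPTY_NAMESPACE, XML_NAMESPACE
from xml.sax.handler import ContentHandler, feature_namespaces

# Hypothetical cut-down variant of the sample document
SAMPLE = b"""<products>
  <product id="1144" xmlns="http://example.com/product-info">
    <name xml:lang="en">Python Perfect IDE</name>
  </product>
</products>"""


class NameCollector(ContentHandler):
    """Collects (namespace, local-name) tuples for elements and attributes."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self.attributes = []

    def startElementNS(self, name, qname, attributes):
        self.elements.append(name)
        # The attributes object is keyed by (namespace, local-name) tuples
        for attr_name, value in attributes.items():
            self.attributes.append((attr_name, value))


def collect_names(data):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    handler = NameCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler


handler = collect_names(SAMPLE)
for namespace, localname in handler.elements:
    label = "no namespace" if namespace is EMPTY_NAMESPACE else namespace
    print(localname, "->", label)
for (namespace, localname), value in handler.attributes:
    if namespace == XML_NAMESPACE:
        print(localname, "is in the implicitly declared xml namespace")
```

Comparing against EMPTY_NAMESPACE and XML_NAMESPACE rather than literal values keeps the intent visible at each test.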

Minidom and Namespaces

Minidom implements a lot of DOM Level 2, and accordingly supports namespaces. The API is in some ways even clumsier than SAX's, again through legacy from other languages, but it does make all the information relating to namespaces available for reading and editing. Listing 3 is code similar to Listing 2; it displays all namespace information in the document.

Listing 3: Minidom Code to Display Namespace Info for Elements and Attributes

#Required in Python 2.2, and must be the first import
from __future__ import generators
import sys
from xml.dom import minidom
from xml.dom import Node

def doc_order_iter_filter(node, filter_func):
    """
    Iterates over each node in document order,
    applying the filter function to each in turn,
    starting with the given node, and yielding each node in
    cases where the filter function computes true
    node - the starting point
           (subtree rooted at node will be iterated over document order)
    filter_func - a callable object taking a node and returning
                  true or false
    """
    if filter_func(node):
        yield node
    for child in node.childNodes:
        for cn in doc_order_iter_filter(child, filter_func):
            yield cn
    return


def get_all_elements(node):
    """
    Returns an iterator (using document order) over all element nodes
    that are descendants of the given one
    """
    return doc_order_iter_filter(
        node, lambda n: n.nodeType == Node.ELEMENT_NODE
        )


doc = minidom.parse(sys.argv[1])
for elem in get_all_elements(doc):
    print "Element namespace:", repr(elem.namespaceURI)
    print "Element local name:", repr(elem.localName)
    print "Prefix used for element:", repr(elem.prefix)
    for attr in elem.attributes.values():
        print "Attribute namespace:", repr(attr.namespaceURI)
        print "Attribute local name:", repr(attr.localName)
        print "Prefix used for attribute:", repr(attr.prefix)

The first two functions in the listing are examples of Python generator-driven DOM processing of the sort I introduced and advocated in Generating DOM Magic. The main section uses an iterator over all elements in document order and prints the same namespace information. The method call elem.attributes.values() gets a list of all the attribute node objects for each element. Each attribute node carries all its namespace information as data members.

There are numerous alternative ways to write this loop because Minidom provides a variety of APIs for working with NamedNodeMap objects, which are the way attributes are stored. Some of these methods have special namespace-aware versions. The following snippet shows some examples:

>>> from xml.dom import minidom
>>> doc = minidom.parse('products.xml')
>>> products = doc.getElementsByTagNameNS(
...     u'http://example.com/product-info', u'product'
...     )
>>> perfect_python_ide = products[0]
>>> from pprint import pprint
>>> pprint(perfect_python_ide.attributes.keys())
['xmlns', u'xmlns:html', u'id']
>>> 
>>> pprint(perfect_python_ide.attributes.keysNS())
[('http://www.w3.org/2000/xmlns/', u'html'),
 ('http://www.w3.org/2000/xmlns/', 'xmlns'),
 (None, u'id')]
>>> 
>>> pprint(perfect_python_ide.attributes.items())
[('xmlns', u'http://example.com/product-info'),
 (u'xmlns:html', u'http://www.w3.org/1999/xhtml'),
 (u'id', u'1144')]
>>> 
>>> pprint(perfect_python_ide.attributes.itemsNS())
[(('http://www.w3.org/2000/xmlns/', 'xmlns'),
  u'http://example.com/product-info'),
 (('http://www.w3.org/2000/xmlns/', u'html'),
  u'http://www.w3.org/1999/xhtml'),
 ((None, u'id'), u'1144')]
>>>

See also methods such as setNamedItemNS, getNamedItemNS, and removeNamedItemNS (the latter two only in Python 2.3 or recent PyXML), which provide for namespace-aware retrieval, update and removal of actual attribute node objects.
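As a rough illustration of these node-level methods, here is a minimal Python 3 sketch. The inline document and the status attribute are assumptions made for the example, and the behavior shown is that of the minidom bundled with current Python rather than the PyXML versions discussed here.

```python
from xml.dom import minidom, XML_NAMESPACE, EMPTY_NAMESPACE

PRODUCT_NS = "http://example.com/product-info"

# Hypothetical cut-down variant of the sample document
SAMPLE = (
    '<product id="1144" xmlns="http://example.com/product-info">'
    '<name xml:lang="en">Python Perfect IDE</name>'
    '</product>'
)

doc = minidom.parseString(SAMPLE)
product = doc.documentElement
name = product.getElementsByTagNameNS(PRODUCT_NS, "name")[0]

# Namespace-aware retrieval of attribute node objects
id_node = product.attributes.getNamedItemNS(EMPTY_NAMESPACE, "id")
lang_node = name.attributes.getNamedItemNS(XML_NAMESPACE, "lang")
print(id_node.value, lang_node.value)

# Namespace-aware update: build an attribute node, then attach it
status = doc.createAttributeNS(EMPTY_NAMESPACE, "status")
status.value = "beta"
product.attributes.setNamedItemNS(status)

# Namespace-aware removal returns the detached attribute node
removed = name.attributes.removeNamedItemNS(XML_NAMESPACE, "lang")
print(removed.name, name.hasAttributeNS(XML_NAMESPACE, "lang"))
```

For simple string values the element-level methods getAttributeNS, setAttributeNS and removeAttributeNS are usually more convenient; the NamedNodeMap methods matter when you need the attribute node itself.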

You have probably noticed that the namespace declarations themselves appear as attributes. Certainly they are attributes in the XML source, because that is how namespace syntax is defined, but you might be surprised to see that DOM does not remove the namespace declarations from the list of attributes on each element. The surprise is that they carry redundant information, given that every node carries its own namespace details. SAX, by contrast, does not include namespace declarations in the attributes by default. This is one of the well-known surprises and sources of debate in DOM Level 2.

As an aside, I noticed that the special namespace declaration attribute local names xmlns are being returned as plain strings rather than Unicode objects, even in the most recent PyXML (and accordingly in all Python versions). This is a bug, though probably a harmless one.

Luckily there is one reliable way to tell namespace declarations from other attributes in DOM: they all use the special, reserved XML namespace http://www.w3.org/2000/xmlns/. This namespace is also available as a standard Python constant, xml.dom.XMLNS_NAMESPACE.

As an example, Listing 4 is a modification of Listing 3 that omits namespace declarations from the reported attributes. Its output is a true match to that of Listing 2.

Listing 4: Minidom Code to Display Namespace Info for Elements and Attributes, Excluding Namespace Declarations

#Required in Python 2.2, and must be the first import
from __future__ import generators
import sys
from xml.dom import minidom
from xml.dom import Node
from xml.dom import XMLNS_NAMESPACE

def doc_order_iter_filter(node, filter_func):
    """
    Iterates over each node in document order,
    applying the filter function to each in turn,
    starting with the given node, and yielding each node in
    cases where the filter function computes true
    node - the starting point
           (subtree rooted at node will be iterated over document order)
    filter_func - a callable object taking a node and returning
                  true or false
    """
    if filter_func(node):
        yield node
    for child in node.childNodes:
        for cn in doc_order_iter_filter(child, filter_func):
            yield cn
    return


def get_all_elements(node):
    """
    Returns an iterator (using document order) over all element nodes
    that are descendants of the given one
    """
    return doc_order_iter_filter(
        node, lambda n: n.nodeType == Node.ELEMENT_NODE
        )


doc = minidom.parse(sys.argv[1])
for elem in get_all_elements(doc):
    print "Element namespace:", repr(elem.namespaceURI)
    print "Element local name:", repr(elem.localName)
    print "Prefix used for element:", repr(elem.prefix)
    for attr in elem.attributes.values():
        if attr.namespaceURI != XMLNS_NAMESPACE:
            print "Attribute namespace:", repr(attr.namespaceURI)
            print "Attribute local name:", repr(attr.localName)
            print "Prefix used for attribute:", repr(attr.prefix)  
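For readers on Python 3, the same filtering idea can be sketched more compactly. This is an illustrative variant under stated assumptions, not a replacement for Listing 4: the function names iter_elements and non_declaration_attributes and the cut-down inline document are hypothetical.

```python
from xml.dom import minidom, Node, XMLNS_NAMESPACE

# Hypothetical cut-down variant of the sample document
SAMPLE = """<products>
  <product id="1144"
    xmlns="http://example.com/product-info"
    xmlns:html="http://www.w3.org/1999/xhtml">
    <name xml:lang="en">Python Perfect IDE</name>
  </product>
</products>"""


def iter_elements(node):
    """Yield element nodes in document order, starting at node."""
    if node.nodeType == Node.ELEMENT_NODE:
        yield node
    for child in node.childNodes:
        yield from iter_elements(child)


def non_declaration_attributes(elem):
    """Return (namespace, local name, value) for each attribute of elem
    that is not a namespace declaration."""
    return [
        (attr.namespaceURI, attr.localName, attr.value)
        for attr in elem.attributes.values()
        if attr.namespaceURI != XMLNS_NAMESPACE
    ]


doc = minidom.parseString(SAMPLE)
for elem in iter_elements(doc):
    print(elem.tagName, non_declaration_attributes(elem))
```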

Minidom Namespace Mutation

In order to show how to modify a DOM in a namespace-aware manner, I will perform the following tasks:

  1. Add a new element launch-date in the products namespace, but using no prefix.
  2. Add a new element launch-date with a prefix and in the products namespace.
  3. Add a new element that is not in any namespace.
  4. Add a new global attribute in the XHTML namespace.
  5. Add a new global attribute in the special XML namespace.
  6. Add a new attribute in no namespace.
  7. Remove only the code element in the XHTML namespace.
  8. Remove a global attribute.
  9. Remove an attribute that is not in any namespace.

I don't demonstrate modification in place because this can always be done equivalently with an addition and then a removal. Examples of these tasks are as follows:

>>> from xml.dom import minidom
>>> from xml.dom import XML_NAMESPACE
>>> from xml.dom import EMPTY_NAMESPACE
>>> from xml.dom import EMPTY_PREFIX
>>> 
>>> #Set up
...
>>> doc = minidom.parse('products.xml')
>>> products = doc.getElementsByTagNameNS(
...     u'http://example.com/product-info', u'product'
...     )
>>>
>>> #Task 1
...
>>> new_elem = doc.createElementNS(
...     u'http://example.com/product-info', u'launch-date'
...     )
>>> products[0].appendChild(new_elem)
<DOM Element: launch-date at 0x402ac08c>
>>>
>>> #Task 2
...
>>> new_elem = doc.createElementNS(
...     u'http://example.com/product-info', u'p:launch-date'
...     )
>>> products[1].appendChild(new_elem)
<DOM Element: p:launch-date at 0x402cd9ac>
>>>
>>> #Task 3
... 
>>> new_elem = doc.createElementNS(EMPTY_NAMESPACE, u'island')
>>> products[0].appendChild(new_elem)
<DOM Element: island at 0x4030988c>
>>>
>>> #Task 4
... 
>>> divs = doc.getElementsByTagNameNS(
...     u'http://www.w3.org/1999/xhtml', u'div'
...     )
>>> divs[0].setAttributeNS(
...     u'http://www.w3.org/1999/xhtml', u'global', u'spam'
...     )
>>>
>>> #Task 5
... 
>>> divs[0].setAttributeNS(XML_NAMESPACE, u'xml:lang', u'en')
>>>
>>> #Task 6
... 
>>> divs[0].setAttributeNS(EMPTY_NAMESPACE, u'class', u'eggs')
>>>
>>> #Task 7
... 
>>> html_codes = products[0].getElementsByTagNameNS(
...     u'http://www.w3.org/1999/xhtml', u'code'
...     )
>>> parent = html_codes[0].parentNode
>>> parent.removeChild(html_codes[0])
<DOM Element: html:code at 0x402d3f2c>
>>>
>>> #Task 8
... 
>>> refs = doc.getElementsByTagNameNS(EMPTY_NAMESPACE, u'ref')
>>> refs[0].removeAttributeNS(u'http://www.w3.org/1999/xlink', u'href')
>>>
>>> #Task 9
... 
>>> products[0].removeAttributeNS(EMPTY_NAMESPACE, u'id')
>>> 

After all this manipulation I re-serialized the updated document as XML by calling doc.toprettyxml(). I don't display the output of this for reasons of space, but when I examined it I did find a bug. The result of Tasks 4-6 is:

<html:div class="eggs" global="spam" xml:lang="en">

I explicitly asked for the http://www.w3.org/1999/xhtml namespace for the global attribute. By rule this should appear with the html prefix, or equivalent, even though its parent element is in that same namespace. To be fair this is one of the more obscure and confusing corners of XML namespaces, but it's a bug nevertheless.
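One way to sidestep the missing prefix, assuming a suitable prefix is already declared in scope, is to pass a qualified name to setAttributeNS rather than a bare local name. Minidom appears to serialize whatever qualified name it was given, without any namespace fix-up. The Python 3 sketch below is a hedged illustration: the inline document is hypothetical, and the behavior shown is that of the minidom bundled with current Python.

```python
from xml.dom import minidom, EMPTY_NAMESPACE

XHTML_NS = "http://www.w3.org/1999/xhtml"
PRODUCT_NS = "http://example.com/product-info"

# Hypothetical cut-down document; both prefixes are declared on the root
SAMPLE = (
    '<p:product id="1166" xmlns:p="http://example.com/product-info"'
    ' xmlns:html="http://www.w3.org/1999/xhtml">'
    '<html:div/>'
    '</p:product>'
)

doc = minidom.parseString(SAMPLE)
product = doc.documentElement
div = doc.getElementsByTagNameNS(XHTML_NS, "div")[0]

# Add a prefixed element in the products namespace
launch = doc.createElementNS(PRODUCT_NS, "p:launch-date")
product.appendChild(launch)

# Global attribute: the qualified name carries the prefix explicitly
div.setAttributeNS(XHTML_NS, "html:global", "spam")

# Attribute in no namespace
div.setAttributeNS(EMPTY_NAMESPACE, "class", "eggs")

# Remove an attribute that is not in any namespace
product.removeAttributeNS(EMPTY_NAMESPACE, "id")

print(product.toxml())
```

The catch is that the caller becomes responsible for choosing a prefix that is actually bound to the intended namespace at that point in the document.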

More to Come on Namespaces

In this article I covered the basic XML libraries that come with recent versions of Python (and with PyXML). In upcoming articles I will look at the handling of namespaces in third-party tools.

Meanwhile, in the Python-XML world...

Valéry Febvre released PyXMLSec, a set of Python bindings under the GPL for standard XML Security facilities based on the libxml2 implementation in C. It covers XML Signature, XML Encryption, Canonical XML, and Exclusive Canonical XML. A warning from the web site:

"The Python interface has not yet reached the completeness of the C API (currently ~ 300 functions are implemented). Bindings are very young, API can't be considered as mature and may be changed at any time."
    

Ned Batchelder released handyxml 1.1, a Python module that wraps XML parsers and parsed DOM implementations into objects with added Python features. It includes XPath support. PyXML or 4Suite are required.

I just discovered Dave Kuhlman's Python XML FAQ and How-to, which is really just a few recipes for SAX and DOM (including Python generator) usage. I also found Paul Boddie's Python and XML: An Introduction, which is really an introduction to Minidom. It's a nice introduction, driven by examples, and on reading through it, everything it discusses should still work with current Minidom versions.

I found Sean B. Palmer's pyrple, another RDF API in Python, this one based on earlier work by Palmer. It "parses RDF/XML, N3, and N-Triples. It has in-memory storage with API-level querying, experimental marshalling, many utilities, and is small and minimally interdependent." But Palmer admits that it's a bit more hackish than established Python RDF tools and appropriate "if you don't mind getting your hands dirty, and you want something that's small and handy."

Adam Souzis announced a new release (0.2.0) of Rx4RDF and Rhizome. Updates include performance improvements and support for Redland as well as 4Suite. See the announcement.



  1. Missing line in minidom mutation code
    2004-05-10 07:00:54 Uche Ogbuji
    In cutting and pasting from my Python command line, I missed a line.  Task 4 should look like:


    >>> #Task 4
    ...
    >>> divs = doc.getElementsByTagNameNS(
    ... u'http://www.w3.org/1999/xhtml', u'div'
    ... )
    >>> divs[0].setAttributeNS(
    ... u'http://www.w3.org/1999/xhtml', u'global', u'spam'
    ... )
    >>>


    --Uche
    http://uche.ogbuji.net
