XML Namespaces Support in Python Tools, Part Three
In the last two articles I've discussed namespace handling in Python 2.3's SAX and minidom libraries and in 4Suite. In this article I focus on ElementTree, libxml/Python and PyRXPU. I recommend reading or reviewing those articles first, as well as the earlier articles in this namespace series (part 1 and part 2).
I shall be using, where applicable, the same scenarios I did in the prior articles, based on the same namespace torture test document.
Listing 1: Sample document that uses many XML namespace features and oddities
<products>
  <product id="1144"
    xmlns="http://example.com/product-info"
    xmlns:html="http://www.w3.org/1999/xhtml"
  >
    <name xml:lang="en">Python Perfect IDE</name>
    <description>
      Uses mind-reading technology to anticipate and accommodate
      all user needs in Python development. Implements all
      <html:code>from __future__ import</html:code> features though
      the year 3000. Works well with <code>1166</code>.
    </description>
  </product>
  <p:product id="1166" xmlns:p="http://example.com/product-info">
    <p:name>XSLT Perfect IDE</p:name>
    <p:description
      xmlns:html="http://www.w3.org/1999/xhtml"
      xmlns:xl="http://www.w3.org/1999/xlink"
    >
      <p:code>red</p:code>
      <html:code>blue</html:code>
      <html:div>
        <ref xl:type="simple" xl:href="index.xml">A link</ref>
      </html:div>
    </p:description>
  </p:product>
</products>
I already covered aspects of the read namespace API of ElementTree in my earlier article. As I mentioned then, ElementTree supports XML namespaces using James Clark's notation directly for element and attribute names. This is a rather different mechanism from most XML processing APIs, and we'll find out how smoothly it works in comparison. Listing 2 displays the local name, namespace and prefix of each element and attribute in a document. I did update to version 1.2c1-20040615 of the software.
Listing 2: ElementTree code to display namespace information for elements and attributes
import sys
from elementtree.ElementTree import ElementTree, XMLTreeBuilder

class ns_tracker_tree_builder(XMLTreeBuilder):
    def __init__(self):
        XMLTreeBuilder.__init__(self)
        self._parser.StartNamespaceDeclHandler = self._start_ns
        self.namespaces = {u'http://www.w3.org/XML/1998/namespace':
                           u'xml'}

    def _start_ns(self, prefix, ns):
        self.namespaces[ns] = prefix

def analyze_clark_name(name, nsdict):
    if name[0] == '{':
        ns, local = name[1:].split("}")
    else:
        return None, name, None
    prefix = nsdict[ns]
    if prefix is None:
        prefix = u"!Unknown"
    return prefix, local, ns

parser = ns_tracker_tree_builder()
etree = ElementTree()
root = etree.parse(sys.argv[1], parser)
#Create an iterator
iter = root.getiterator()
#Iterate
for elem in iter:
    prefix, local, ns = analyze_clark_name(elem.tag, parser.namespaces)
    print "Element namespace:", repr(ns)
    print "Element local name:", repr(local)
    print "Prefix used for element:", repr(prefix)
    for name, value in elem.items():
        prefix, local, ns = analyze_clark_name(name, parser.namespaces)
        print "Attribute namespace:", repr(ns)
        print "Attribute local name:", repr(local)
        print "Prefix used for attribute:", repr(prefix)
As I discussed in the earlier article, ElementTree does not
maintain namespace prefix information. This made my task in this
listing much trickier. I found out how to use a specialized class to
build the element tree, defined
as ns_tracker_tree_builder in listing 2. This class
receives expat parse events, but I was only able to figure out how to
capture information from the namespace events in a "flat" manner: by
updating a single dictionary each time I encounter a namespace
declaration event (_start_ns). The problem with this is
that all namespace scoping information is lost. I expect this
approach will cause oddities in any document where a given namespace
is used with more than one prefix at different points. I generally do
not recommend such confusing use of namespaces in the first place (see
my article "Use
XML namespaces with care" for more details); in listing 1 I break
my own rules because I want to test how XML processing libraries
handle even untidy use of namespaces.
I can get a partial solution that maintains prefix information by
using my specialized builder. The next challenge is using the
resulting dictionary to extract prefixes, namespaces, and local names
from the full James Clark notation. I created the
function analyze_clark_name for this purpose. The rest
of the listing is straightforward ElementTree code that completes the
task at hand. The result is given in listing 3.
Listing 3: Output from listing 2 run against listing 1
Element namespace: None
Element local name: 'products'
Prefix used for element: None
Element namespace: 'http://example.com/product-info'
Element local name: 'product'
Prefix used for element: u'p'
Attribute namespace: None
Attribute local name: 'id'
Prefix used for attribute: None
Element namespace: 'http://example.com/product-info'
Element local name: 'name'
Prefix used for element: u'p'
Attribute namespace: 'http://www.w3.org/XML/1998/namespace'
Attribute local name: 'lang'
Prefix used for attribute: u'xml'
Element namespace: 'http://example.com/product-info'
Element local name: 'description'
Prefix used for element: u'p'
Element namespace: 'http://www.w3.org/1999/xhtml'
Element local name: 'code'
Prefix used for element: u'html'
Element namespace: 'http://example.com/product-info'
Element local name: 'code'
Prefix used for element: u'p'
Element namespace: 'http://example.com/product-info'
Element local name: 'product'
Prefix used for element: u'p'
Attribute namespace: None
Attribute local name: 'id'
Prefix used for attribute: None
Element namespace: 'http://example.com/product-info'
Element local name: 'name'
Prefix used for element: u'p'
Element namespace: 'http://example.com/product-info'
Element local name: 'description'
Prefix used for element: u'p'
Element namespace: 'http://example.com/product-info'
Element local name: 'code'
Prefix used for element: u'p'
Element namespace: 'http://www.w3.org/1999/xhtml'
Element local name: 'code'
Prefix used for element: u'html'
Element namespace: 'http://www.w3.org/1999/xhtml'
Element local name: 'div'
Prefix used for element: u'html'
Element namespace: None
Element local name: 'ref'
Prefix used for element: None
Attribute namespace: 'http://www.w3.org/1999/xlink'
Attribute local name: 'href'
Prefix used for attribute: u'xl'
Attribute namespace: 'http://www.w3.org/1999/xlink'
Attribute local name: 'type'
Prefix used for attribute: u'xl'
Scrutinizing this output I found a few problems: the unprefixed elements of the first product in listing 1 (product, name, description and the plain code element), which are in the default namespace, are all reported with the prefix u'p'. As expected, this comes from the fact that my workaround for recording prefixes does not take into account the scope of namespace declarations and, in effect, always reports the last prefix seen for any given namespace. Notice also that plain strings are returned in most cases rather than Unicode objects. I find this problematic.
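The scoping problem can be worked around by keeping a stack of declarations rather than a single flat dictionary. What follows is only a minimal sketch under stated assumptions: it targets the xml.etree.ElementTree module in a current Python 3 standard library (not the elementtree 1.2c1 package tested in this article) and relies on its iterparse function reporting "start-ns" and "end-ns" events. The names split_clark, scoped_prefixes and SAMPLE are hypothetical, introduced for illustration.

```python
import io
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

def split_clark(name):
    """Split '{uri}local' into (uri, local); uri is None for unqualified names."""
    if name.startswith("{"):
        uri, local = name[1:].split("}", 1)
        return uri, local
    return None, name

def scoped_prefixes(source):
    """Yield (prefix, local, uri) for each element, honouring declaration scope.

    Hypothetical helper: prefix is None both for names in no namespace and
    for names in the default namespace; the uri tells the two cases apart.
    """
    # In-scope (prefix, uri) bindings, innermost declaration last.
    stack = [("xml", XML_NS)]
    events = ("start-ns", "end-ns", "start")
    for event, payload in ET.iterparse(source, events=events):
        if event == "start-ns":
            stack.append(payload)      # payload is a (prefix, uri) tuple
        elif event == "end-ns":
            stack.pop()                # the declaration has gone out of scope
        else:
            uri, local = split_clark(payload.tag)
            prefix = None
            if uri is not None:
                effective = dict(stack)  # later bindings shadow earlier ones
                for pfx, bound in reversed(stack):
                    if bound == uri and effective[pfx] == uri:
                        prefix = pfx or None
                        break
            yield prefix, local, uri

# A cut-down version of listing 1: same namespace, two different spellings.
SAMPLE = (
    b'<products>'
    b'<product xmlns="http://example.com/product-info"><name/></product>'
    b'<p:product xmlns:p="http://example.com/product-info"><p:name/></p:product>'
    b'</products>'
)
results = list(scoped_prefixes(io.BytesIO(SAMPLE)))
for row in results:
    print(row)
```

When several in-scope prefixes are bound to the same namespace, this sketch reports the innermost one, which is still a guess: the parser does not say which prefix the document actually used.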
The stock list of mutation tasks I've been using to test namespace handling is as follows:
code element in the XHTML namespace
Listing 4 includes code for the various tasks.
Listing 4: ElementTree code for the sample mutation tasks
import sys
from elementtree.ElementTree import ElementTree, SubElement
doc = ElementTree(file='products.xml')
PRODUCT_NS = u'http://example.com/product-info'
HTML_NS = u'http://www.w3.org/1999/xhtml'
XML_NS = u'http://www.w3.org/XML/1998/namespace'
XLINK_NS = u'http://www.w3.org/1999/xlink'
#Task 1 is not really possible
#Task 2
product = doc.getiterator(u'{%s}product'%PRODUCT_NS)[0]
new_element = SubElement(product, u'{%s}launch-date'%PRODUCT_NS)
#Task 3
product = doc.getiterator(u'{%s}product'%PRODUCT_NS)[0]
new_element = SubElement(product, u'island')
#Task 4
div = doc.getiterator(u'{%s}div'%HTML_NS)[0]
div.set(u'{%s}global'%HTML_NS, u'spam')
#Task 5
div.set(u'{%s}lang'%XML_NS, u'en')
#Task 6
div.set(u'class', u'eggs')
#Task 7
for desc in doc.getiterator(u'{%s}description'%PRODUCT_NS):
    code = desc.getiterator(u'{%s}code'%HTML_NS)[0]
    desc.remove(code)
#Task 8
ref = doc.getiterator(u'ref')[0]
del ref.attrib[u'{%s}href'%XLINK_NS]
#Task 9
product = doc.getiterator(u'{%s}product'%PRODUCT_NS)[0]
del product.attrib[u'id']
#write out the modified XML
doc.write(sys.stdout)
In general I navigate the tree by using the Clark notation name to
create an iterator over all elements with the namespace and local name
I want. I didn't bother to check the performance of this approach: it
may be faster to use a path expression for this, although my
experiments didn't yield a way to use the standard namespace
conventions for XPath in ElementTree. Looking through the test
routines I saw code along the lines
of elem.findall("//{http://spam}egg"), but this is not at
all valid XPath. Nevertheless, I tried
doc.find(u'.//{%s}product'%PRODUCT_NS) and variations on
this (including specifying the full path, and starting with
doc.getroot()). No expression I tried returned any
results, so I fell back to the tree-wide find/iterate approach. All
the ElementTree mutation interfaces worked as expected with the use of
Clark notation. Listing 5 is the output from this script. You'll
notice immediately the lack of prefix preservation, but with the
exception of task 1, which I was unable to accomplish in ElementTree,
the results are correct. ElementTree even correctly handles creation
of global attributes such as html:global which happen to
share their parent's namespace. All other tools I've examined so far
have incorrectly omitted a prefix in this case.
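For comparison, here is how path-based lookup with Clark notation might look today. This is a hedged sketch, assuming the xml.etree.ElementTree module in a current Python 3 standard library rather than the 1.2c1 release tested above, so it says nothing about why the expressions failed in that release. The sample document and the "pi" prefix are made up for illustration.

```python
import xml.etree.ElementTree as ET

PRODUCT_NS = "http://example.com/product-info"

# A cut-down stand-in for listing 1 (assumption: not the full test document).
SAMPLE = (
    '<products>'
    '<product id="1144" xmlns="%s"><name>Python Perfect IDE</name></product>'
    '<p:product id="1166" xmlns:p="%s"><p:name>XSLT Perfect IDE</p:name></p:product>'
    '</products>'
) % (PRODUCT_NS, PRODUCT_NS)

root = ET.fromstring(SAMPLE)

# Clark notation embedded directly in the path expression.
by_clark = root.findall(".//{%s}product" % PRODUCT_NS)

# The same query using a prefix that is mapped only for this call;
# "pi" is arbitrary and need not match any prefix in the document.
by_prefix = root.findall(".//pi:product", {"pi": PRODUCT_NS})

ids = [elem.get("id") for elem in by_clark]
print(ids)
```

Both queries match on namespace name and local name, so the product in the default namespace and the one spelled with the p prefix are found alike.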
Listing 5: Output from listing 4
<products>
<ns0:product xmlns:ns0="http://example.com/product-info">
<ns0:name xml:lang="en">Python Perfect IDE</ns0:name>
<ns0:description>
Uses mind-reading technology to anticipate and accommodate
all user needs in Python development. Implements all
<ns0:code>1166</ns0:code>.
</ns0:description>
<ns0:launch-date /><island /></ns0:product>
<ns1:product id="1166" xmlns:ns1="http://example.com/product-info">
<ns1:name>XSLT Perfect IDE</ns1:name>
<ns1:description>
<ns1:code>red</ns1:code>
<html:div class="eggs" html:global="spam" xml:lang="en"
xmlns:html="http://www.w3.org/1999/xhtml">
<ref ns3:type="simple"
xmlns:ns3="http://www.w3.org/1999/xlink">A link</ref>
</html:div>
</ns1:description>
</ns1:product>
</products>
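The generated ns0 and ns1 prefixes in listing 5 can be avoided in later versions of the library. The sketch below assumes the register_namespace function of xml.etree.ElementTree in a current Python 3 standard library; it is not part of the 1.2c1 release used for listing 4, and the choice of "p" as the preferred prefix is only an example.

```python
import xml.etree.ElementTree as ET

PRODUCT_NS = "http://example.com/product-info"

# Ask the serializer to use "p" for this namespace. The registry is
# module-wide, so this affects every tree serialized afterwards.
ET.register_namespace("p", PRODUCT_NS)

# Build a small tree with Clark notation, as in tasks 2 and 3 of listing 4.
product = ET.Element("{%s}product" % PRODUCT_NS, {"id": "1166"})
ET.SubElement(product, "{%s}launch-date" % PRODUCT_NS)
ET.SubElement(product, "island")

serialized = ET.tostring(product, encoding="unicode")
print(serialized)
```

This chooses one prefix per namespace for output. It does not preserve the prefixes of the source document, so a namespace spelled two ways in the input still comes out with a single prefix.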
As I discussed in my article about libxml there are several available mechanisms for processing XML, including SAX and DOM variations. I focused on the more unusual API, XmlTextReader, and in this discussion of namespace processing I shall continue to focus on this API, which means that I'll only worry about how to read namespace information. You should be able to perform mutation using similar DOM idioms to those I presented in the first namespace article. Listing 6 is the XmlTextReader equivalent of listing 2.
Listing 6: libxml code to display namespace information for elements and attributes
import sys
import cStringIO
import libxml2

XMLNS_NS = 'http://www.w3.org/2000/xmlns/'
XMLREADER_START_ELEMENT_NODE_TYPE = 1

input = open(sys.argv[1])
input_source = libxml2.inputBuffer(input)
reader = input_source.newTextReader("urn:bogus")
while reader.Read():
    if reader.NodeType() == XMLREADER_START_ELEMENT_NODE_TYPE:
        print "Element namespace:", repr(reader.NamespaceUri())
        print "Element local name:", repr(reader.LocalName())
        print "Prefix used for element:", repr(reader.Prefix())
        while reader.MoveToNextAttribute():
            #Ignore namespace declarations
            if reader.NamespaceUri() != XMLNS_NS:
                print "Attribute namespace:", repr(reader.NamespaceUri())
                print "Attribute local name:", repr(reader.LocalName())
                print "Prefix used for attribute:", repr(reader.Prefix())
Besides the fact that again the element names and prefixes are not returned as Unicode objects, the results are as expected.
PyRXPU is part of the PyRXP package, and the only part I recommend using, as I discussed at length in my recent article on PyRXP. PyRXP is "non-Unicode" by default, but this default configuration is not an XML parser at all. You do have to get a CVS version of the package in order to use PyRXPU. The latest release, 0.9, does not include it. I provided details for installing from CVS in my earlier article. I did try to update to more recent CVS code this time, but my attempts to use PyRXPU in the latest PyRXP code resulted in core dumps on my Dell Inspiron 8600 running Fedora Core 2, so I reverted to the CVS code I used back in February. I didn't see anything in the CVS logs since February indicating any significant changes in namespace handling, so I assumed this would still be a current test.
By default PyRXP doesn't do any special namespace processing and
returns namespace declarations as regular attributes. There are
several parser parameters regarding namespace processing.
One, ReturnNamespaceAttributes, is described strangely in
the documentation as not returning XML namespace declarations by
default. This seems to be incorrect. The second,
XMLNamespaces, is described in the documentation thusly:
If this is on, the parser processes namespace declarations (see below). Namespace declarations are not returned as part of the list of attributes on an element.
I wasn't able to find whatever passage might have been referenced in the "see below" phrase: this sentence was pretty much the last one concerning namespaces in the document. I came to wish I could find more on namespaces once I tried out namespace processing in listing 7.
Listing 7: Code to parse a document in namespace processing mode
import sys
import pyRXPU
parser = pyRXPU.Parser()
parser.XMLNamespaces = 1
doc_source = open(sys.argv[1]).read()
doc = parser.parse(doc_source)
import pprint
pprint.pprint(doc)
The result of running this against listing 1 is very odd:
Listing 8: results of namespace-aware reading of listing 1 in PyRXPU
(u'products',
None,
[u'\n ',
(u'product',
{u'id': u'1144'},
[u'\n ',
(u'name', {u'xml:lang': u'en'}, [u'Python Perfect IDE'], None),
u'\n ',
(u'description',
None,
[u'\n Uses mind-reading technology to anticipate and '
'accommodate\n all user needs in Python development. '
'Implements all\n ',
(u'html:code', None, [u'from __future__ import'], None),
u' features though\n the year 3000. Works well with ',
(u'code', None, [u'1166'], None),
u'.\n '],
None),
u'\n '],
None),
u'\n ',
(u'p:product',
{u'id': u'1166'},
[u'\n ',
(u'p:name', None, [u'XSLT Perfect IDE'], None),
u'\n ',
(u'p:description',
None,
[u'\n ',
(u'p:code', None, [u'red'], None),
u'\n ',
(u'html:code', None, [u'blue'], None),
u'\n ',
(u'html:div',
None,
[u'\n ',
(u'ref',
{u'xl:type': u'simple', u'xl:href': u'index.xml'},
[u'A link'],
None),
u'\n '],
None),
u'\n '],
None),
u'\n '],
None),
u'\n'],
None)
The important information -- the namespaces -- is omitted while the
unimportant details -- the prefixes -- are included as part of element
names. This makes namespace processing very difficult. I tried a lot
of tweaking and other options to try to get all the information needed
for ready namespace processing without having to knit it all back
together by hand after turning off the namespace option (the only
difference upon omitting the parser.XMLNamespaces = 1
line is that namespace declarations are returned as attributes). In
the end I was not really able to tackle any of the namespace reading
or mutation tasks without processing namespaces entirely by hand
(which you can do with any toolkit, namespace aware or no), and I
conclude that PyRXPU does not really support namespace processing.
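To give an idea of what processing namespaces by hand involves, here is a minimal sketch that walks a tuple tree of the shape shown in listing 8 and resolves each element name against the declarations in scope. It does not call PyRXPU at all. It assumes the parser was run without the XMLNamespaces option, so that declarations arrive as ordinary attributes, and it assumes they appear under the keys "xmlns" and "xmlns:prefix". The function name resolve and the sample TREE are hypothetical, and the code is written for a current Python 3.

```python
XML_NS = "http://www.w3.org/XML/1998/namespace"

def resolve(node, scope=None):
    """Yield (uri, local, prefix) for every element in a PyRXP-style tree.

    Each node is a (name, attributes, children, spare) tuple; attributes
    and children may be None. Text children are plain strings.
    """
    name, attrs, children, _spare = node
    # Copy the inherited bindings so siblings do not see our declarations.
    scope = dict(scope or {"xml": XML_NS})
    for key, value in (attrs or {}).items():
        if key == "xmlns":
            scope[""] = value              # default namespace declaration
        elif key.startswith("xmlns:"):
            scope[key[6:]] = value         # prefixed namespace declaration
    prefix, _, local = name.rpartition(":")
    # An unprefixed element takes the default namespace, if any is in scope.
    uri = scope.get(prefix) or None
    yield uri, local, prefix or None
    for child in children or []:
        if isinstance(child, tuple):       # skip text nodes
            yield from resolve(child, scope)

# Hand-built sample in the tuple layout of listing 8 (assumed attribute keys).
TREE = (
    "products", None, [
        "\n",
        ("product",
         {"id": "1144", "xmlns": "http://example.com/product-info"},
         [("name", None, ["Python Perfect IDE"], None)],
         None),
        ("p:product",
         {"id": "1166", "xmlns:p": "http://example.com/product-info"},
         None,
         None),
    ],
    None,
)
resolved = list(resolve(TREE))
for row in resolved:
    print(row)
```

The sketch covers element names only and quietly reports no namespace for an undeclared prefix. Attribute names would need the same treatment, except that unprefixed attributes never take the default namespace.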
In this batch of namespace tests the results have been a mixed bag. ElementTree supports namespaces properly, but makes it very hard to work with prefixes, which is acceptable given that prefixes are a mere syntactic convenience. I would hesitate to use ElementTree where I needed the convenience of preserved prefixes. PyRXPU seems to either report namespace declarations literally, without any API benefits, or discard the namespace information altogether, which is as much as to say it doesn't support namespace processing. libxml, as one expects from such a comprehensive library, handles namespaces effortlessly. I barely scratched the surface in this article of how to process namespaces in libxml, but I do show the SAX and DOM approaches in my earlier article. I expect to wrap up this series on namespace processing next by looking at how some data binding tools handle namespaces.
It has been a busy month for my colleagues in the Python-XML community, including work by Brett Hartshorn on yet another small Python DOM implementation, xmlapi 0.2.1. It's billed as an "even smaller XML DOM implementation than Python's standard xml.dom.minidom" and claims performance and feature improvements over minidom.
Philippe Normand debuted XMLObject 0.0.2, a data binding tool which allows you to map from customized Python classes to XML and vice versa. See the announcement.
Fredrik Lundh announced ElementTree 1.2. I'm somewhat confused at this point as to whether the package is supposed to be called "ElementTree" or "elementtree", but I think current clues suggest the former. This release just makes official the various experimental features such as XPath support which I have already discussed. 1.2 final appeared after I wrapped up this article and the namespace discussion is based on the most recent beta rather than the final 1.2 release, but I expect not much has changed in between the two. See the announcement.
Brian Quinlan, who has also been busy helping organize the Vancouver Python Workshop, announced Pyana 0.9.0, the latest release of his Python interface to the Xalan-C XSLT processor. Changes include updates for Xalan 1.8/Xerces 2.5, basic support for tracing, and removal of the transform-to-DOM support, with promises of a better replacement in future. See the announcement.
Magnus Lie Hetland has updated Atox to version 0.5. Atox allows you to write custom scripts for converting plain text into XML. You define the text to XML binding using a simple XML language. It's meant to be used from the command line. Changes since 0.1 include language improvements, added support for config files and XSLT fragments in Atox format files. See the full announcement.
Michael Twomey announced pygenx 0.5.2, a wrapper for Tim Bray's XML generation library Genx. Genx is a C library and its output is canonical XML. PyGenx wraps the full API. See the announcement.
Christof Hoeke announced pyXSLdoc 0.51, a Python tool for generating documentation of XSLT code in a similar approach to Javadoc. Version 0.60b is actually available, but the most recent announcement is for 0.52.
I recently discovered the Simple Objects from XML (SOX) module buried deep within the Python Enterprise Application Kit (PEAK). PEAK is a components toolkit for large-scale applications (the developers claim it is as powerful as J2EE but not as complex). SOX is another XML data binding toolkit which uses SAX events to build an object the user can define based on classes set up for namespace aware or namespace oblivious usage.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.