
XML Namespaces Support in Python Tools, Part Three
In the last two articles I've discussed namespace handling in Python 2.3's SAX and minidom libraries and in 4Suite. In this article I focus on ElementTree, libxml/Python and PyRXPU. I recommend reading or reviewing those articles first, as well as the earlier articles in this namespace series (part 1 and part 2).
I shall be using, where applicable, the same scenarios I did in the prior articles, based on the same namespace torture test document.
Listing 1: Sample document that uses many XML namespace features and oddities
<products>
  <product id="1144"
    xmlns="http://example.com/product-info"
    xmlns:html="http://www.w3.org/1999/xhtml"
  >
    <name xml:lang="en">Python Perfect IDE</name>
    <description>
    Uses mind-reading technology to anticipate and accommodate
    all user needs in Python development. Implements all
    <html:code>from __future__ import</html:code> features through
    the year 3000. Works well with <code>1166</code>.
    </description>
  </product>
  <p:product id="1166" xmlns:p="http://example.com/product-info">
    <p:name>XSLT Perfect IDE</p:name>
    <p:description
      xmlns:html="http://www.w3.org/1999/xhtml"
      xmlns:xl="http://www.w3.org/1999/xlink"
    >
      <p:code>red</p:code>
      <html:code>blue</html:code>
      <html:div>
        <ref xl:type="simple" xl:href="index.xml">A link</ref>
      </html:div>
    </p:description>
  </p:product>
</products>
Reading namespaces with ElementTree
I already covered aspects of ElementTree's API for reading namespaces in my earlier article. As I mentioned then, ElementTree supports XML namespaces by using James Clark's notation directly for element and attribute names. This is a rather different mechanism from most XML processing APIs, and we'll find out how smoothly it works in comparison. Listing 2 displays the local name, namespace and prefix of each element and attribute in a document. For this article I updated to version 1.2c1-20040615 of the software.
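To make the notation concrete, here is a minimal sketch. It assumes the xml.etree.ElementTree module from the current Python standard library, a descendant of the elementtree package discussed here, rather than the 1.2c1 release used for the listings in this article. The one-element document is made up for illustration.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
PRODUCT_NS = "http://example.com/product-info"

# Hypothetical one-element document, loosely based on listing 1
elem = ET.fromstring(
    '<p:name xmlns:p="%s" xml:lang="en" id="n1">IDE</p:name>' % PRODUCT_NS
)

# The element name comes back as "{namespace}local"; the prefix p is gone
print(elem.tag)

# Namespaced attributes use the same form; plain attributes are bare names
for name in sorted(elem.attrib):
    print(name, "=", elem.attrib[name])
```

The point to notice is that the prefix never appears in the tree: only the namespace name and local name survive parsing.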
Listing 2: ElementTree code to display namespace information for elements and attributes
import sys
from elementtree.ElementTree import ElementTree, XMLTreeBuilder

class ns_tracker_tree_builder(XMLTreeBuilder):
    def __init__(self):
        XMLTreeBuilder.__init__(self)
        #Hook the expat namespace declaration event
        self._parser.StartNamespaceDeclHandler = self._start_ns
        #Flat mapping from namespace name to the last prefix seen for it
        self.namespaces = {u'http://www.w3.org/XML/1998/namespace':
                           u'xml'}

    def _start_ns(self, prefix, ns):
        self.namespaces[ns] = prefix

def analyze_clark_name(name, nsdict):
    #Split "{namespace}local" into prefix, local name and namespace
    if name[0] == '{':
        ns, local = name[1:].split("}")
    else:
        return None, name, None
    prefix = nsdict[ns]
    if prefix is None:
        prefix = u"!Unknown"
    return prefix, local, ns

parser = ns_tracker_tree_builder()
etree = ElementTree()
root = etree.parse(sys.argv[1], parser)
#Create an iterator
iter = root.getiterator()
#Iterate
for elem in iter:
    prefix, local, ns = analyze_clark_name(elem.tag, parser.namespaces)
    print "Element namespace:", repr(ns)
    print "Element local name:", repr(local)
    print "Prefix used for element:", repr(prefix)
    for name, value in elem.items():
        prefix, local, ns = analyze_clark_name(name, parser.namespaces)
        print "Attribute namespace:", repr(ns)
        print "Attribute local name:", repr(local)
        print "Prefix used for attribute:", repr(prefix)
As I discussed in the earlier article, ElementTree does not
maintain namespace prefix information. This made my task in this
listing much trickier. I found out how to use a specialized class to
build the element tree, defined
as ns_tracker_tree_builder in listing 2. This class
receives expat parse events, but I was only able to figure out how to
capture information from the namespace events in a "flat" manner: by
updating a single dictionary each time I encounter a namespace
declaration event (_start_ns). The problem with this is
that all namespace scoping information is lost. I expect this
approach will cause oddities in any document where a given namespace
is used with more than one prefix at different points. I generally do
not recommend such confusing use of namespaces in the first place (see
my article "Use XML namespaces with care" for more details); in
listing 1 I break
my own rules because I want to test how XML processing libraries
handle even untidy use of namespaces.
I can get a partial solution that maintains prefix information by
using my specialized builder. The next challenge is using the
resulting dictionary to extract prefixes, namespaces, and local names
from the full James Clark notation. I created the
function analyze_clark_name for this purpose. The rest
of the listing is straightforward ElementTree code that completes the
task at hand. The result is given in listing 3.
Listing 3: Output from listing 2 run against listing 1
Element namespace: None
Element local name: 'products'
Prefix used for element: None
Element namespace: 'http://example.com/product-info'
Element local name: 'product'
Prefix used for element: u'p'
Attribute namespace: None
Attribute local name: 'id'
Prefix used for attribute: None
Element namespace: 'http://example.com/product-info'
Element local name: 'name'
Prefix used for element: u'p'
Attribute namespace: 'http://www.w3.org/XML/1998/namespace'
Attribute local name: 'lang'
Prefix used for attribute: u'xml'
Element namespace: 'http://example.com/product-info'
Element local name: 'description'
Prefix used for element: u'p'
Element namespace: 'http://www.w3.org/1999/xhtml'
Element local name: 'code'
Prefix used for element: u'html'
Element namespace: 'http://example.com/product-info'
Element local name: 'code'
Prefix used for element: u'p'
Element namespace: 'http://example.com/product-info'
Element local name: 'product'
Prefix used for element: u'p'
Attribute namespace: None
Attribute local name: 'id'
Prefix used for attribute: None
Element namespace: 'http://example.com/product-info'
Element local name: 'name'
Prefix used for element: u'p'
Element namespace: 'http://example.com/product-info'
Element local name: 'description'
Prefix used for element: u'p'
Element namespace: 'http://example.com/product-info'
Element local name: 'code'
Prefix used for element: u'p'
Element namespace: 'http://www.w3.org/1999/xhtml'
Element local name: 'code'
Prefix used for element: u'html'
Element namespace: 'http://www.w3.org/1999/xhtml'
Element local name: 'div'
Prefix used for element: u'html'
Element namespace: None
Element local name: 'ref'
Prefix used for element: None
Attribute namespace: 'http://www.w3.org/1999/xlink'
Attribute local name: 'href'
Prefix used for attribute: u'xl'
Attribute namespace: 'http://www.w3.org/1999/xlink'
Attribute local name: 'type'
Prefix used for attribute: u'xl'
Scrutinizing this output I found a few problems: the elements of the first product, which are in the default namespace and so use no prefix in listing 1, are all reported with the prefix p. As expected, these problems stem from the fact that my workaround for recording prefixes does not take into account the scope of namespace declarations and, in effect, always reports the last prefix seen for any given namespace. Notice also that plain strings are returned in most cases rather than Unicode objects. I find this problematic.
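One way to keep the scoping information is to track declarations on a stack rather than in a single dictionary. The following is a minimal sketch, not the approach used in listing 2. It assumes the xml.etree.ElementTree module from the current Python standard library, whose iterparse function can report start-ns and end-ns events. The helper names (split_clark, lookup_prefix, element_prefixes) and the small test document are hypothetical. When two prefixes in scope are bound to the same namespace, the sketch simply picks the innermost declaration, which is a guess, because the tree itself does not record which prefix an element used.

```python
import io
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

def split_clark(name):
    """Split '{namespace}local' into (namespace, local)."""
    if name.startswith("{"):
        ns, local = name[1:].split("}", 1)
        return ns, local
    return None, name

def lookup_prefix(scope, ns):
    """Return the innermost in-scope prefix bound to ns, or None."""
    if ns is None:
        return None
    seen = set()
    for prefix, uri in reversed(scope):
        if prefix in seen:
            continue  # shadowed by a more deeply nested declaration
        seen.add(prefix)
        if uri == ns:
            return prefix
    return None

def element_prefixes(source):
    """Yield (prefix, local, namespace) for each element in document order.

    The default namespace is reported as the empty-string prefix.
    """
    scope = [("xml", XML_NS)]  # stack of (prefix, namespace) declarations
    events = ("start", "start-ns", "end-ns")
    for event, payload in ET.iterparse(source, events=events):
        if event == "start-ns":
            scope.append(payload)   # payload is a (prefix, namespace) tuple
        elif event == "end-ns":
            scope.pop()             # the declaration has gone out of scope
        else:
            ns, local = split_clark(payload.tag)
            yield lookup_prefix(scope, ns), local, ns

# Hypothetical document: one namespace, first as default, then as prefix p
doc = (b'<root><a xmlns="urn:x"><b/></a>'
       b'<p:a xmlns:p="urn:x"><p:b/></p:a></root>')
result = list(element_prefixes(io.BytesIO(doc)))
for row in result:
    print(row)
```

Because each declaration is popped when its element ends, the first a and b elements are reported with the default namespace rather than with the prefix p that is declared later in the document.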
ElementTree namespace mutation
The stock list of mutation tasks I've been using to test namespace handling is as follows:
- Add a new element in the products namespace, but using no prefix.
- Add a new element with a prefix and in the products namespace.
- Add a new element that is not in any namespace.
- Add a new global attribute in the XHTML namespace.
- Add a new global attribute in the special XML namespace.
- Add a new attribute in no namespace.
- Remove only the code element in the XHTML namespace.
- Remove a global attribute.
- Remove an attribute that is not in any namespace.
Listing 4 includes code for the various tasks.
Listing 4: ElementTree code for the sample mutation tasks
import sys
from elementtree.ElementTree import ElementTree, SubElement
doc = ElementTree(file='products.xml')
PRODUCT_NS = u'http://example.com/product-info'
HTML_NS = u'http://www.w3.org/1999/xhtml'
XML_NS = u'http://www.w3.org/XML/1998/namespace'
XLINK_NS = u'http://www.w3.org/1999/xlink'
#Task 1 is not really possible
#Task 2
product = doc.getiterator(u'{%s}product'%PRODUCT_NS)[0]
new_element = SubElement(product, u'{%s}launch-date'%PRODUCT_NS)
#Task 3
product = doc.getiterator(u'{%s}product'%PRODUCT_NS)[0]
new_element = SubElement(product, u'island')
#Task 4
div = doc.getiterator(u'{%s}div'%HTML_NS)[0]
div.set(u'{%s}global'%HTML_NS, u'spam')
#Task 5
div.set(u'{%s}lang'%XML_NS, u'en')
#Task 6
div.set(u'class', u'eggs')
#Task 7
for desc in doc.getiterator(u'{%s}description'%PRODUCT_NS):
    code = desc.getiterator(u'{%s}code'%HTML_NS)[0]
    desc.remove(code)
#Task 8
ref = doc.getiterator(u'ref')[0]
del ref.attrib[u'{%s}href'%XLINK_NS]
#Task 9
product = doc.getiterator(u'{%s}product'%PRODUCT_NS)[0]
del product.attrib[u'id']
#write out the modified XML
doc.write(sys.stdout)
In general I navigate the tree by using the Clark notation name to
create an iterator over all elements with the namespace and local name
I want. I didn't bother to check the performance of this approach: it
may be faster to use a path expression for this, although my
experiments didn't yield a way to use the standard namespace
conventions for XPath in ElementTree. Looking through the test
routines I saw code along the lines
of elem.findall("//{http://spam}egg"), but this is not at
all valid XPath. Nevertheless, I tried
doc.find(u'.//{%s}product'%PRODUCT_NS) and variations on
this (including specifying the full path, and starting with
doc.getroot()). No expression I tried returned any
results, so I fell back to the tree-wide find/iterate approach. All
the ElementTree mutation interfaces worked as expected with the use of
Clark notation. Listing 5 is the output from this script. You'll
notice immediately the lack of prefix preservation, but with the
exception of task 1, which I was unable to accomplish in ElementTree,
the results are correct. ElementTree even correctly handles creation
of global attributes such as html:global which happen to
share their parent's namespace. All other tools I've examined so far
have incorrectly omitted a prefix in this case.
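For comparison, here is a minimal sketch of the same kind of lookup and mutation. It assumes the xml.etree.ElementTree module in the current Python standard library, not the 1.2c1 release tested above. In that module, path expressions accept Clark notation, findall takes an optional prefix mapping, and register_namespace suggests a prefix for serialization. The trimmed-down document and the pr prefix are made up for illustration.

```python
import xml.etree.ElementTree as ET

PRODUCT_NS = "http://example.com/product-info"

# Hypothetical, trimmed-down version of listing 1
SAMPLE = (
    '<products>'
    '<product id="1144" xmlns="http://example.com/product-info">'
    '<name>Python Perfect IDE</name>'
    '</product>'
    '<p:product id="1166" xmlns:p="http://example.com/product-info">'
    '<p:name>XSLT Perfect IDE</p:name>'
    '</p:product>'
    '</products>'
)
root = ET.fromstring(SAMPLE)

# Path lookup with Clark notation embedded in the expression
by_clark = root.findall(".//{%s}product" % PRODUCT_NS)

# Path lookup with a caller-supplied prefix mapping; the prefix "pr"
# need not match any prefix used in the document itself
by_prefix = root.findall(".//pr:product", {"pr": PRODUCT_NS})

ids = [product.get("id") for product in by_clark]

# Add an element in the products namespace, in the style of task 2
launch = ET.SubElement(by_clark[0], "{%s}launch-date" % PRODUCT_NS)

# Add a global attribute that shares its parent's namespace
launch.set("{%s}global" % PRODUCT_NS, "spam")

# Suggest a prefix for output; note that this registry is module-wide state
ET.register_namespace("p", PRODUCT_NS)
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

The original prefixes are still not preserved. Every element in the products namespace is written with the one registered prefix, including those that used the default namespace in the source.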