
XML Namespaces Support in Python Tools, Part Two
In the last article I discussed namespace handling in Python 2.3's SAX and minidom libraries. As I pointed out, there are a lot of pitfalls and oddities involved in processing namespaces, and I will continue by giving the same treatment to the namespace support in third-party Python libraries. In this article I shall focus on the various libraries packaged in 4Suite. If you need background on 4Suite, see my earlier article "A Tour of 4Suite". I did briefly cover how to express namespaces for use in 4XPath in that article, but in this one I will explore different angles on the topic.
The Namespace Torture Sample, Revisited
Listing 1 is the same sample document I used in the last article. If you haven't read that article I recommend you at least review it for discussion of the aspects of namespaces I exercise in this rather contrived example.
Listing 1: Sample document that uses many XML namespace features and oddities
<products>
<product id="1144"
xmlns="http://example.com/product-info"
xmlns:html="http://www.w3.org/1999/xhtml"
>
<name xml:lang="en">Python Perfect IDE</name>
<description>
Uses mind-reading technology to anticipate and accommodate
all user needs in Python development. Implements all
<html:code>from __future__ import</html:code> features though
the year 3000. Works well with <code>1166</code>.
</description>
</product>
<p:product id="1166" xmlns:p="http://example.com/product-info">
<p:name>XSLT Perfect IDE</p:name>
<p:description
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:xl="http://www.w3.org/1999/xlink"
>
<p:code>red</p:code>
<html:code>blue</html:code>
<html:div>
<ref xl:type="simple" xl:href="index.xml">A link</ref>
</html:div>
</p:description>
</p:product>
</products>
4Suite's XPath and Namespaces (Reading)
4Suite implements the natural namespace support in specifications such as XPath and XUpdate, which can be used respectively to exercise the namespace reading and mutation tasks I set up in the last article. Listing 2 uses XPath to display the local name, namespace and prefix of each element and attribute in a document.
Listing 2: 4Suite/XPath code to display namespace information for elements and attributes
import sys
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Compile, Evaluate
from Ft.Xml.Xslt import PatternList
from Ft.Xml.Domlette import NonvalidatingReader
#Compile needed XPath expressions
NS_NODES_EXPR = Compile('//*|//@*')
NSURI_EXPR = Compile('namespace-uri()')
LNAME_EXPR = Compile('local-name()')
PREFIX_EXPR = Compile('substring-before(name(), ":")')
#XPattern is syntactically a subset of XPath
IS_ATTR_PAT = '@*'
#Second parameter is a dictionary of prefix to namespace mappings
plist = PatternList([IS_ATTR_PAT], {})
#Read in the file
doc = NonvalidatingReader.parseUri(sys.argv[1])
#Set up the XPath context with the document read in
context = Context(doc)
#Extract all the element and attribute nodes in the doc
nodes = NS_NODES_EXPR.evaluate(context)
for node in nodes:
    context = Context(node)
    #Use XPattern to determine the current node type
    if plist.lookup(node):
        node_type_str = 'attribute'
    else:
        node_type_str = 'element'
    #Output the namespace details for the current node
    nsuri = NSURI_EXPR.evaluate(context)
    print node_type_str, ' namespace:', repr(nsuri)
    lname = LNAME_EXPR.evaluate(context)
    print node_type_str, ' local name:', repr(lname)
    prefix = PREFIX_EXPR.evaluate(context)
    print 'Prefix used for', node_type_str, repr(prefix)
This code is also a bit contrived in order to illustrate how to
perform all the subtasks using XPath and XPattern, along the lines
of the usual division of labor where the former is used for
gathering nodes and processing the basic data model and the latter
is used for checking to see whether nodes conform to certain
rules. Using an XPath expression I gather up all elements and
attributes, and they are naturally returned to Python in document
order. I then iterate over the nodes checking each against an
XPattern to determine whether it is an attribute. XPath provides
functions to get the namespace and local name for a given node,
but not one for extracting the prefix. This is easily done,
though, by using the substring-before function and
the syntactic limitations on colons in QNames.
The output from this code run against our sample document is as follows:
$ python listing2.py products.xml
element namespace: u''
element local name: u'products'
Prefix used for element u''
element namespace: u'http://example.com/product-info'
element local name: u'product'
Prefix used for element u''
attribute namespace: u''
attribute local name: u'id'
Prefix used for attribute u''
element namespace: u'http://example.com/product-info'
element local name: u'name'
Prefix used for element u''
attribute namespace: u'http://www.w3.org/XML/1998/namespace'
attribute local name: u'lang'
Prefix used for attribute u'xml'
element namespace: u'http://example.com/product-info'
element local name: u'description'
Prefix used for element u''
element namespace: u'http://www.w3.org/1999/xhtml'
element local name: u'code'
Prefix used for element u'html'
element namespace: u'http://example.com/product-info'
element local name: u'code'
Prefix used for element u''
element namespace: u'http://example.com/product-info'
element local name: u'product'
Prefix used for element u'p'
attribute namespace: u''
attribute local name: u'id'
Prefix used for attribute u''
element namespace: u'http://example.com/product-info'
element local name: u'name'
Prefix used for element u'p'
element namespace: u'http://example.com/product-info'
element local name: u'description'
Prefix used for element u'p'
element namespace: u'http://example.com/product-info'
element local name: u'code'
Prefix used for element u'p'
element namespace: u'http://www.w3.org/1999/xhtml'
element local name: u'code'
Prefix used for element u'html'
element namespace: u'http://www.w3.org/1999/xhtml'
element local name: u'div'
Prefix used for element u'html'
element namespace: u''
element local name: u'ref'
Prefix used for element u''
attribute namespace: u'http://www.w3.org/1999/xlink'
attribute local name: u'type'
Prefix used for attribute u'xl'
attribute namespace: u'http://www.w3.org/1999/xlink'
attribute local name: u'href'
Prefix used for attribute u'xl'
The output is all as expected except that you'll notice that null
namespaces and prefixes are represented using
u'' rather than the Python convention
None. This is natural enough given that XPath is not
a Python specification, and it is usually not problematic because
you almost always know when an XPath could return
a u'' that would need to be fixed up to
None for further processing in Python.
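Both the prefix trick and the fix-up to None are easy to mirror on the Python side. The following is only a sketch in plain Python 3, independent of 4Suite; the helper names substring_before, none_if_empty and split_qname are hypothetical names of my own, not part of any library.

```python
def substring_before(text, sep):
    """Mimic XPath's substring-before(): return the part of text before
    the first sep, or an empty string if sep does not occur."""
    head, found, _tail = text.partition(sep)
    return head if found else ''


def none_if_empty(value):
    """Map the empty string XPath uses for "no value" to Python's None."""
    return value if value != '' else None


def split_qname(qname):
    """Split a QName into (prefix, local name), with prefix None if absent."""
    prefix = none_if_empty(substring_before(qname, ':'))
    local = qname.split(':', 1)[1] if prefix is not None else qname
    return prefix, local


# Try the helpers on a few names taken from the sample document
for qname in ('html:code', 'products', 'xl:href'):
    print(qname, '->', split_qname(qname))
```

Keeping the normalization in one small function means the rest of the Python code can test against None in the usual way, whatever the XPath layer returned.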
4Suite's XUpdate and Namespaces (Mutation)
XUpdate is a community specification for using an XML vocabulary to express modifications to XML documents. It is supported by many XML processing tools, especially in the open source category; and 4Suite provides an XUpdate library as well as a command line tool which applies XUpdate and can, for example, be used as a patching utility for XML. In order to show how to use XUpdate to make namespace-aware modifications, I shall perform the following tasks, which are the same as in the last article:
- Add a new element in the products namespace, but using no prefix.
- Add a new element with a prefix and in the products namespace.
- Add a new element that is not in any namespace.
- Add a new global attribute in the XHTML namespace.
- Add a new global attribute in the special XML namespace.
- Add a new attribute in no namespace.
- Remove only the code element in the XHTML namespace.
- Remove a global attribute.
- Remove an attribute that is not in any namespace.
I don't demonstrate modification in place because this can always be done equivalently with an addition and then a removal. Listing 3 shows how these tasks can be performed in XUpdate.
Listing 3: XUpdate script to make namespace-aware additions and removals of elements and attributes
<xup:modifications version="1.0"
xmlns:xup="http://www.xmldb.org/xupdate"
xmlns:p="http://example.com/product-info"
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:xl="http://www.w3.org/1999/xlink"
>
<!-- Task 1 -->
<xup:append select="/products/p:product[1]">
<xup:element
name="launch-date"
namespace="http://example.com/product-info"/>
</xup:append>
<!-- Task 2 -->
<xup:append select="/products/p:product[1]">
<xup:element
name="p:launch-date"
namespace="http://example.com/product-info"/>
</xup:append>
<!-- Can also be accomplished using literal result elements:
<xup:append select="/products/p:product[1]">
<p:launch-date/>
</xup:append>
-->
<!-- Task 3 -->
<xup:append select="/products/p:product[1]">
<xup:element name="island"/>
</xup:append>
<!-- Can also be accomplished using literal result elements:
<xup:append select="/products/p:product[1]">
<island/>
</xup:append>
-->
<!-- Task 4 -->
<xup:append select="/products/p:product/p:description/html:div">
<xup:attribute name="global"
namespace="http://www.w3.org/1999/xhtml">spam</xup:attribute>
</xup:append>
<!-- Task 5 -->
<xup:append select="/products/p:product/p:description/html:div">
<xup:attribute name="xml:lang">en</xup:attribute>
</xup:append>
<!-- Task 6 -->
<xup:append select="/products/p:product/p:description/html:div">
<xup:attribute name="class">eggs</xup:attribute>
</xup:append>
<!-- Task 7 -->
<xup:remove select="//html:code"/>
<!-- Task 8 -->
<xup:remove select="/products/p:product/p:description/html:div/ref/@xl:href"/>
<!-- Task 9 -->
<xup:remove select="/products/p:product[1]/@id"/>
</xup:modifications>
If you're familiar with XSLT, then you'll see XUpdate's resemblance
to it at first glance. The envelope element for modifications
expressed in XUpdate is xup:modifications, similar to
xsl:transform or xsl:stylesheet. The
namespace declarations on this element assign prefixes for use in
the XUpdate script and have no connection to the prefixes
used in the document being modified (the source document),
even though they happen to be the same. If you want to access
elements in a namespace declared as the default in the source
document, then just as in XSLT you must declare and use a prefix
for the namespace in the XUpdate script.
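The same independence of prefixes can be seen outside XUpdate. What follows is a rough sketch that uses the standard library's ElementTree (Python 3) rather than 4Suite, on a cut-down version of the sample document. The prefix pi is an arbitrary choice of mine, bound only in the query's own mapping.

```python
import xml.etree.ElementTree as ET

PRODUCT_NS = 'http://example.com/product-info'

# Cut-down sample: one product uses a default namespace, the other a prefix
DOC = """<products>
  <product id="1144" xmlns="http://example.com/product-info">
    <name>Python Perfect IDE</name>
  </product>
  <p:product id="1166" xmlns:p="http://example.com/product-info">
    <p:name>XSLT Perfect IDE</p:name>
  </p:product>
</products>"""

root = ET.fromstring(DOC)

# The query binds its own prefix; it need not match the document's prefixes
query_prefixes = {'pi': PRODUCT_NS}
matched_ids = [e.get('id') for e in root.findall('pi:product', query_prefixes)]

# An unprefixed step only matches elements that are in no namespace
unprefixed = root.findall('product')

print(matched_ids)
print(len(unprefixed))
```

Both product elements are matched through the pi prefix even though neither uses it in the document, while the unprefixed step matches nothing because it looks only for elements in no namespace.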
Each modification request is expressed as an XUpdate instruction.
This example demonstrates xup:append
and xup:remove. There are other instructions
providing types of modification such as
xup:insert-before and
xup:update, and there are also control constructs such as
xup:if, which is similar to
xsl:if. Instructions usually have
a select attribute containing an XPath expression
that specifies the node to be used as a reference for
modification. In the case of
xup:append, select specifies a node
to which some new XML will be appended. In the case of
xup:remove, select identifies nodes to
be removed. When an instruction needs to specify a chunk of XML
to be used in the modification, it is expressed as the content of
the instruction, in a similar fashion to XSLT templates. In the
case of xup:append this template expresses the chunk
of XML to be inserted into the document. In order to generate
elements and attributes, XUpdate provides output instructions such
as xup:element and xup:attribute, which are very similar to their XSLT
equivalents. In another idea borrowed from XSLT, XUpdate allows
you to create elements by placing literal result elements in the
templates. If you'd like to get a closer look at XUpdate, the
best way is by browsing the very clear examples in the XUpdate
Use Cases compiled by Kimbro Staken. See Listing 4 for Python
code that can be used to apply an XUpdate script. It's a
simplified version of the code for the 4xupdate command line.
Listing 4: Python code to apply an XUpdate script
import sys
from Ft.Xml import XUpdate
from Ft.Xml import Domlette, InputSource
from Ft.Lib import Uri
#Set up reader objects for parsing the XML files
reader = Domlette.NonvalidatingReader
xureader = XUpdate.Reader()
#Parse the source file
source_uri = Uri.OsPathToUri(sys.argv[1], attemptAbsolute=1)
source = reader.parseUri(source_uri)
#Parse the XUpdate file
xupdate_uri = Uri.OsPathToUri(sys.argv[2], attemptAbsolute=1)
isrc = InputSource.DefaultFactory.fromUri(xupdate_uri)
xupdate = xureader.fromSrc(isrc)
#Set up the XUpdate processor and run against the source file
#The Domlette for the source is modified in place
processor = XUpdate.Processor()
processor.execute(source, xupdate)
#Print the updated DOM node to standard output
Domlette.Print(source)
Notice the use of Uri.OsPathToUri to convert file
system paths to proper URIs for use in 4Suite. I strongly
recommend this convention as one way to help minimize confusion
between file specifications and URIs -- the basis of many
frequently asked questions. The XUpdate.Processor
class defines the environment for running XUpdate commands and
execute() is the method that actually kicks off the
processing. It operates on a Domlette instance, modifying it in
place (so be careful when using XUpdate in this way). I
print the updated document object to standard output using
Domlette.Print.
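To make the path-versus-URI distinction concrete, here is a small sketch of a comparable conversion. It assumes Python 3's pathlib as a stand-in for Uri.OsPathToUri and is not the 4Suite implementation; the helper name os_path_to_uri and the file name are hypothetical.

```python
from pathlib import Path


def os_path_to_uri(path):
    """Turn a (possibly relative) file system path into an absolute file: URI."""
    # resolve() makes the path absolute; as_uri() requires an absolute path
    return Path(path).resolve().as_uri()


# The file does not need to exist for the conversion to work
uri = os_path_to_uri('products.xml')
print(uri)
```

The result depends on the current working directory, which is exactly the kind of detail that stays hidden when bare file names are passed around as if they were URIs.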
This XUpdate worked fine with the latest CVS version of 4Suite, but the attribute additions did not work with the last packaged release, 1.0a3. It turns out that Mike Brown restored the ability to append attributes just last month. If you need this capability you'll need to use the CVS version until the next packaged release. The following snippet illustrates how to run the test script, and the output result.
$ python listing4.py products.xml listing3.xup
<?xml version="1.0" encoding="UTF-8"?>
<products xmlns:p="http://example.com/product-info"
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:xl="http://www.w3.org/1999/xlink"
>
<product xmlns="http://example.com/product-info">
<name xml:lang="en">Python Perfect IDE</name>
<description>
Uses mind-reading technology to anticipate and accommodate
all user needs in Python development. Implements all
features though
the year 3000. Works well with <code>1166</code>.
</description>
<launch-date/><p:launch-date/><island/></product>
<p:product id="1166">
<p:name>XSLT Perfect IDE</p:name>
<p:description>
<p:code>red</p:code>
<html:code>blue</html:code>
<html:div global="spam" class="eggs" xml:lang="en">
<ref xl:type="simple">A link</ref>
</html:div>
</p:description>
</p:product>
</products>
This output uncovers the same bug that I pointed out in minidom
in the last article. I explicitly asked for the global
attribute generated in task 4 to be in the XHTML namespace. Even
though I did not specify it as a QName, the processor should still
have used a prefix for the output because an attribute without a
prefix is in no namespace, regardless of the namespace of its
element. As I mentioned in the last article this is an obscure and
controversial corner of XML namespaces, so I'm not surprised the bug
appears to be widespread.
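The rule behind this bug is easy to check with any namespace-aware parser. Here is a minimal sketch using the standard library's ElementTree (Python 3), not 4Suite. It parses the div element with and without a prefix on the attribute and lists the attribute names in ElementTree's {namespace}local-name notation.

```python
import xml.etree.ElementTree as ET

XHTML_NS = 'http://www.w3.org/1999/xhtml'

# As serialized in the output above: no prefix on the attribute
unprefixed = ET.fromstring(
    '<html:div xmlns:html="http://www.w3.org/1999/xhtml" global="spam"/>')

# What task 4 asked for: the attribute carries a prefix bound to XHTML
prefixed = ET.fromstring(
    '<html:div xmlns:html="http://www.w3.org/1999/xhtml" html:global="spam"/>')

# ElementTree writes namespaced names as {namespace-uri}local-name
print(list(unprefixed.attrib))   # attribute is in no namespace
print(list(prefixed.attrib))     # attribute is in the XHTML namespace
```

Only the second form round-trips the attribute into the XHTML namespace, which is why a serializer has to emit a prefix whenever an attribute has a namespace.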
Wrap Up
I didn't cover PyXML because the most interesting libraries in it using namespaces are very similar to Python's SAX and minidom, which I did cover. PyXML also includes older versions of the XPath and XPattern libraries from 4Suite. The main idea behind 4Suite is to open up in-depth Python APIs to standard XML technologies, and this extends to all the relevant namespace facilities in various XML specifications. In the next article I shall continue this examination of namespace capabilities in Python tools.
Picking up on what my colleagues have been up to lately, I find Dave Kuhlman's update to generateDS, which includes the ability to interchange Python and XML literal text. For a more in-depth explanation see the announcement. I covered generateDS in an earlier article.
Manfred Stienstra wrote a couple of articles on the use of libxml2's Python bindings: "The Problem with the Libxml Python Bindings" and "More Problems with the Libxml Python Bindings".
Andrew Dalke has long been working on Martel, a tool for working the many flat, text-based file formats used in bioinformatics into XML. Recently his paper on the topic, "Martel: Bioinformatics file parsing made easy", came to my attention again. Martel is a very clever idea and applicable beyond the world of bioinformatics. It can be used in general to treat "legacy" formats (including the likes of CSV and simple record-per-line files) as if they were already in XML. One warning is that the link to Martel in the paper is out of date. See the first sentence in this paragraph for the current link.