XML.com: XML From the Inside Out

XML Namespaces Support in Python Tools, Part Two

May 13, 2004

In the last article I discussed namespace handling in Python 2.3's SAX and minidom libraries. As I pointed out, there are many pitfalls and oddities involved in processing namespaces, and I will now give the same treatment to the namespace support in third-party Python libraries. In this article I shall focus on the various libraries packaged in 4Suite. If you need background on 4Suite, see my earlier article "A Tour of 4Suite". I did briefly cover how to express namespaces for use in 4XPath in that article, but in this one I will explore different angles on the topic.

The Namespace Torture Sample, Revisited

Listing 1 is the same sample document I used in the last article. If you haven't read that article I recommend you at least review it for discussion of the aspects of namespaces I exercise in this rather contrived example.

Listing 1: Sample document that uses many XML namespace features and oddities
<products>
  <product id="1144"
    xmlns="http://example.com/product-info"
    xmlns:html="http://www.w3.org/1999/xhtml"
	>
    <name xml:lang="en">Python Perfect IDE</name>
    <description>
      Uses mind-reading technology to anticipate and accommodate
      all user needs in Python development.  Implements all
      <html:code>from __future__ import</html:code> features though
      the year 3000.  Works well with <code>1166</code>.
    </description>
  </product>
  <p:product id="1166" xmlns:p="http://example.com/product-info">
    <p:name>XSLT Perfect IDE</p:name>
    <p:description
      xmlns:html="http://www.w3.org/1999/xhtml"
      xmlns:xl="http://www.w3.org/1999/xlink"
	  >
      <p:code>red</p:code>
      <html:code>blue</html:code>
      <html:div>
        <ref xl:type="simple" xl:href="index.xml">A link</ref>
      </html:div>
    </p:description>
  </p:product>
</products>

4Suite's XPath and Namespaces (Reading)

4Suite implements the natural namespace support in specifications such as XPath and XUpdate, which can be used respectively to exercise the namespace reading and mutation tasks I set up in the last article. Listing 2 uses XPath to display the local name, namespace and prefix of each element and attribute in a document.

Listing 2: 4Suite/XPath code to display namespace information for elements and attributes
import sys
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Compile, Evaluate
from Ft.Xml.Xslt import PatternList
from Ft.Xml.Domlette import NonvalidatingReader

#Compile needed XPath expressions
NS_NODES_EXPR = Compile('//*|//@*')
NSURI_EXPR = Compile('namespace-uri()')
LNAME_EXPR = Compile('local-name()')
PREFIX_EXPR = Compile('substring-before(name(), ":")')

#XPattern is syntactically a subset of XPath
IS_ATTR_PAT = '@*'

#Second parameter is a dictionary of prefix to namespace mappings
plist = PatternList([IS_ATTR_PAT], {})

#Read in the file
doc = NonvalidatingReader.parseUri(sys.argv[1])

#Set up the XPath context with the document read in
context = Context(doc)

#Extract all the element and attribute nodes in the doc
nodes = NS_NODES_EXPR.evaluate(context)

for node in nodes:
    context = Context(node)
    #Use XPattern to determine the current node type
    if plist.lookup(node):
        node_type_str = 'attribute'
    else:
        node_type_str = 'element'
    #Output the namespace details for the current node
    nsuri = NSURI_EXPR.evaluate(context)
    print node_type_str, ' namespace:', repr(nsuri)
    lname = LNAME_EXPR.evaluate(context)
    print node_type_str, ' local name:', repr(lname)
    prefix = PREFIX_EXPR.evaluate(context)
    print 'Prefix used for', node_type_str, repr(prefix)

This code is also a bit contrived in order to illustrate how to perform all the subtasks using XPath and XPattern. It follows the usual division of labor, in which XPath is used for gathering nodes and processing the basic data model and XPattern is used for checking whether nodes conform to certain rules. Using an XPath expression I gather up all elements and attributes, and they are naturally returned to Python in document order. I then iterate over the nodes, checking each against an XPattern to determine whether it is an attribute. XPath provides functions to get the namespace and local name for a given node, but not one for extracting the prefix. This is easily done, though, by using the substring-before function and relying on the syntactic limitations on colons in QNames.
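The prefix trick can be checked outside XPath as well. What follows is a minimal sketch in plain Python, assuming a current Python 3 interpreter rather than the Python 2.3 used in the listings. It is not part of 4Suite, and the helper name split_qname is hypothetical. It mirrors substring-before(name(), ":") by yielding an empty prefix when the name contains no colon.

```python
def split_qname(qname):
    """Split a QName into (prefix, local name).

    Hypothetical helper, not a 4Suite API. Like
    substring-before(name(), ":"), a name with no colon yields an
    empty prefix; a QName can hold at most one colon.
    """
    prefix, sep, local = qname.partition(":")
    if not sep:
        # No colon at all: the whole string is the local name
        return "", qname
    return prefix, local


for name in ("html:code", "xl:href", "products"):
    print(name, "->", split_qname(name))
```

The approach only works because a QName allows a single colon separating prefix from local part, which is the syntactic limitation mentioned above.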

The output from this code run against our sample document is as follows:


$ python listing2.py products.xml
element  namespace: u''
element  local name: u'products'
Prefix used for element u''
element  namespace: u'http://example.com/product-info'
element  local name: u'product'
Prefix used for element u''
attribute  namespace: u''
attribute  local name: u'id'
Prefix used for attribute u''
element  namespace: u'http://example.com/product-info'
element  local name: u'name'
Prefix used for element u''
attribute  namespace: u'http://www.w3.org/XML/1998/namespace'
attribute  local name: u'lang'
Prefix used for attribute u'xml'
element  namespace: u'http://example.com/product-info'
element  local name: u'description'
Prefix used for element u''
element  namespace: u'http://www.w3.org/1999/xhtml'
element  local name: u'code'
Prefix used for element u'html'
element  namespace: u'http://example.com/product-info'
element  local name: u'code'
Prefix used for element u''
element  namespace: u'http://example.com/product-info'
element  local name: u'product'
Prefix used for element u'p'
attribute  namespace: u''
attribute  local name: u'id'
Prefix used for attribute u''
element  namespace: u'http://example.com/product-info'
element  local name: u'name'
Prefix used for element u'p'
element  namespace: u'http://example.com/product-info'
element  local name: u'description'
Prefix used for element u'p'
element  namespace: u'http://example.com/product-info'
element  local name: u'code'
Prefix used for element u'p'
element  namespace: u'http://www.w3.org/1999/xhtml'
element  local name: u'code'
Prefix used for element u'html'
element  namespace: u'http://www.w3.org/1999/xhtml'
element  local name: u'div'
Prefix used for element u'html'
element  namespace: u''
element  local name: u'ref'
Prefix used for element u''
attribute  namespace: u'http://www.w3.org/1999/xlink'
attribute  local name: u'type'
Prefix used for attribute u'xl'
attribute  namespace: u'http://www.w3.org/1999/xlink'
attribute  local name: u'href'
Prefix used for attribute u'xl'

The output is all as expected except that you'll notice that null namespaces and prefixes are represented using u'' rather than the Python convention None. This is natural enough given that XPath is not a Python specification, and it is usually not problematic because you almost always know when an XPath could return a u'' that would need to be fixed up to None for further processing in Python.
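If later processing expects the None convention, the fix-up is a one-liner. This is only a sketch, assuming a current Python 3 interpreter; the function name empty_to_none is hypothetical and not something 4Suite provides.

```python
def empty_to_none(value):
    """Map XPath's empty string onto Python's None.

    Hypothetical helper: any other value passes through unchanged.
    """
    return None if value == "" else value


# Stand-ins for results of namespace-uri() on two nodes
for nsuri in ("", "http://example.com/product-info"):
    print(repr(nsuri), "->", repr(empty_to_none(nsuri)))
```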

4Suite's XUpdate and Namespaces (Mutation)

XUpdate is a community specification for using an XML vocabulary to express modifications to XML documents. It is supported by many XML processing tools, especially in the open source category; and 4Suite provides an XUpdate library as well as a command line tool which applies XUpdate and can, for example, be used as a patching utility for XML. In order to show how to use XUpdate to make namespace-aware modifications, I shall perform the following tasks, which are the same as in the last article:

  1. Add a new element in the products namespace, but using no prefix.
  2. Add a new element with a prefix and in the products namespace.
  3. Add a new element that is not in any namespace.
  4. Add a new global attribute in the XHTML namespace.
  5. Add a new global attribute in the special XML namespace.
  6. Add a new attribute in no namespace.
  7. Remove only the code element in the XHTML namespace.
  8. Remove a global attribute.
  9. Remove an attribute that is not in any namespace.

I don't demonstrate modification in place because this can always be done equivalently with an addition and then a removal. Listing 3 shows how these tasks can be performed in XUpdate.

Listing 3: XUpdate script to make namespace-aware additions and removals of elements and attributes
<xup:modifications version="1.0"
  xmlns:xup="http://www.xmldb.org/xupdate"
  xmlns:p="http://example.com/product-info"
  xmlns:html="http://www.w3.org/1999/xhtml"
  xmlns:xl="http://www.w3.org/1999/xlink"
>

  <!-- Task 1 -->
  <xup:append select="/products/p:product[1]">
    <xup:element
      name="launch-date"
      namespace="http://example.com/product-info"/>
  </xup:append>

  <!-- Task 2 -->
  <xup:append select="/products/p:product[1]">
    <xup:element
      name="p:launch-date"
      namespace="http://example.com/product-info"/>
  </xup:append>

  <!-- Can also be accomplished using literal result elements:
  <xup:append select="/products/p:product[1]">
    <p:launch-date/>
  </xup:append>
  -->

  <!-- Task 3 -->
  <xup:append select="/products/p:product[1]">
    <xup:element name="island"/>
  </xup:append>

  <!-- Can also be accomplished using literal result elements:
  <xup:append select="/products/p:product[1]">
    <island/>
  </xup:append>
  -->

  <!-- Task 4 -->
  <xup:append select="/products/p:product/p:description/html:div">
    <xup:attribute name="global"
      namespace="http://www.w3.org/1999/xhtml">spam</xup:attribute>
  </xup:append>

  <!-- Task 5 -->
  <xup:append select="/products/p:product/p:description/html:div">
    <xup:attribute name="xml:lang">en</xup:attribute>
  </xup:append>

  <!-- Task 6 -->
  <xup:append select="/products/p:product/p:description/html:div">
    <xup:attribute name="class">eggs</xup:attribute>
  </xup:append>

  <!-- Task 7 -->
  <xup:remove select="//html:code"/>

  <!-- Task 8 -->
  <xup:remove select="/products/p:product/p:description/html:div/ref/@xl:href"/>

  <!-- Task 9 -->
  <xup:remove select="/products/p:product[1]/@id"/>

</xup:modifications>

If you're familiar with XSLT, then you'll see the resemblance at first glance. The envelope element for modifications expressed in XUpdate is xup:modifications, similar to xsl:transform or xsl:stylesheet. The namespace declarations on this element assign prefixes for use in the XUpdate script and have no connection to the prefixes used in the document being modified (the source document), even though they happen to be the same. If you want to access elements in a namespace declared as the default in the source document, then just as in XSLT you must declare and use a prefix for the namespace in the XUpdate script.
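The same rule about prefixes shows up in most XPath-style APIs, so it can be illustrated without 4Suite. The sketch below assumes a current Python 3 interpreter and uses the standard library's ElementTree, whose limited path syntax stands in for full XPath. The prefix prod and the trimmed-down sample document are chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the sample: one product uses a default
# namespace declaration, the other uses the prefix "p"
SOURCE = """<products>
  <product id="1144" xmlns="http://example.com/product-info">
    <name>Python Perfect IDE</name>
  </product>
  <p:product id="1166" xmlns:p="http://example.com/product-info">
    <p:name>XSLT Perfect IDE</p:name>
  </p:product>
</products>"""

root = ET.fromstring(SOURCE)

# The prefix is chosen by the query, not by the source document
NSMAP = {"prod": "http://example.com/product-info"}

# An unprefixed step only matches elements in no namespace
unprefixed = root.findall("product")

# A prefixed step matches by namespace name, whatever prefix
# (or default declaration) the source document happened to use
prefixed = root.findall("prod:product", NSMAP)

print(len(unprefixed), [el.get("id") for el in prefixed])
```

Both product elements are found through the prod prefix even though neither uses that prefix in the source, while the unprefixed query finds nothing.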

Each modification request is expressed as an XUpdate instruction. This example demonstrates xup:append and xup:remove. There are other instructions providing types of modification, such as xup:insert-before and xup:update, and there are also control constructs such as xup:if, which is similar to xsl:if. Instructions usually have a select attribute containing an XPath expression that specifies the node to be used as a reference for modification. In the case of xup:append, select specifies a node to which some new XML will be appended as child content. In the case of xup:remove, select identifies nodes to be removed.

When an instruction needs to specify a chunk of XML to be used in the modification, it is expressed as the content of the instruction, in a similar fashion to XSLT templates. In the case of xup:append this template expresses the chunk of XML to be inserted into the document. In order to generate elements and attributes, XUpdate provides output instructions such as xup:element and xup:attribute, which are very similar to their XSLT equivalents. In another idea borrowed from XSLT, XUpdate allows you to create elements by placing literal result elements in the templates.

If you'd like to get a closer look at XUpdate, the best way is by browsing the very clear examples in the XUpdate Use Cases compiled by Kimbro Staken. See Listing 4 for Python code that can be used to apply an XUpdate script. It's a simplified version of the code for the 4xupdate command line.
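For readers who think more readily in DOM terms, here is a rough analogue of what an xup:append with xup:element and an xup:remove of an attribute amount to. It is only a sketch, assuming a current Python 3 interpreter and the standard library's minidom rather than 4Suite's XUpdate processor, and it makes no claim about how 4Suite implements these instructions internally.

```python
from xml.dom import minidom

PRODUCT_NS = "http://example.com/product-info"

# Cut-down version of the first product in the sample document
SOURCE = (
    '<products>'
    '<product id="1144" xmlns="http://example.com/product-info">'
    '<name>Python Perfect IDE</name>'
    '</product>'
    '</products>'
)

doc = minidom.parseString(SOURCE)

# Plays the role of the select attribute: pick the reference node
product = doc.getElementsByTagNameNS(PRODUCT_NS, "product")[0]

# Analogue of xup:append with xup:element (task 1): the new
# element becomes the last child of the selected node
launch = doc.createElementNS(PRODUCT_NS, "launch-date")
product.appendChild(launch)

# Analogue of xup:remove on an attribute in no namespace (task 9)
product.removeAttribute("id")

print(doc.documentElement.toxml())
```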

Listing 4: Python code for executing an XUpdate script against a source document and printing the result
import sys
from Ft.Xml import XUpdate
from Ft.Xml import Domlette, InputSource
from Ft.Lib import Uri

#Set up reader objects for parsing the XML files
reader = Domlette.NonvalidatingReader
xureader = XUpdate.Reader()

#Parse the source file
source_uri = Uri.OsPathToUri(sys.argv[1], attemptAbsolute=1)
source = reader.parseUri(source_uri)

#Parse the XUpdate file
xupdate_uri = Uri.OsPathToUri(sys.argv[2], attemptAbsolute=1)
isrc = InputSource.DefaultFactory.fromUri(xupdate_uri)
xupdate = xureader.fromSrc(isrc)

#Set up the XUpdate processor and run against the source file
#The Domlette for the source is modified in place
processor = XUpdate.Processor()
processor.execute(source, xupdate)

#Print the updated DOM node to standard output
Domlette.Print(source)

Notice the use of Uri.OsPathToUri to convert file system paths to proper URIs for use in 4Suite. I strongly recommend this convention as one way to help minimize confusion between file specifications and URIs, which is the basis of many frequently asked questions. The XUpdate.Processor class defines the environment for running XUpdate commands and execute() is the method that actually kicks off the processing. It operates on a Domlette instance, modifying it in place (so be careful when using XUpdate in this way). I print the updated document object to standard output using Domlette.Print.
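The distinction between a file system path and a URI is easy to see with the standard library alone. The sketch below assumes a current Python 3 interpreter and uses pathlib as a stand-in for the idea behind Uri.OsPathToUri. The function name os_path_to_uri is hypothetical, and the exact URI that 4Suite produces for a given path may differ.

```python
from pathlib import Path


def os_path_to_uri(path):
    """Return an absolute file: URI for a file system path.

    Hypothetical stand-in for the idea behind 4Suite's
    Uri.OsPathToUri; the path is made absolute before conversion
    and does not need to exist.
    """
    return Path(path).resolve().as_uri()


# A relative path becomes an absolute file: URI
print(os_path_to_uri("products.xml"))
```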

This XUpdate worked fine with the latest CVS version of 4Suite, but the attribute additions did not work with the last packaged release, 1.0a3. It turns out that Mike Brown restored the ability to append attributes just last month. If you need this capability you'll need to use the CVS version until the next packaged release. The following snippet illustrates how to run the test script, and the output result.


$ python listing4.py products.xml listing3.xup
<?xml version="1.0" encoding="UTF-8"?>
<products xmlns:p="http://example.com/product-info"
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:xl="http://www.w3.org/1999/xlink" >
  <product xmlns="http://example.com/product-info">
    <name xml:lang="en">Python Perfect IDE</name>
    <description>
      Uses mind-reading technology to anticipate and accommodate
      all user needs in Python development. Implements all
      features though the year 3000. Works well with <code>1166</code>.
    </description>
  <launch-date/><p:launch-date/><island/></product>
  <p:product id="1166">
    <p:name>XSLT Perfect IDE</p:name>
    <p:description>
      <p:code>red</p:code>
      <html:code>blue</html:code>
      <html:div global="spam" class="eggs" xml:lang="en">
        <ref xl:type="simple">A link</ref>
      </html:div>
    </p:description>
  </p:product>
</products>

This output uncovers the same bug that I pointed out in minidom in the last article. I explicitly asked for the global attribute generated in task 4 to be in the XHTML namespace. Even though I did not specify it as a QName, the processor should still have used a prefix for the output, because an attribute without a prefix is in no namespace, regardless of the namespace of its element. As I mentioned in the last article, this is an obscure and controversial corner of XML namespaces, so I'm not surprised the bug appears to be widespread.
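The rule that an unprefixed attribute is in no namespace can be confirmed with any namespace-aware parser. The following sketch assumes a current Python 3 interpreter and the standard library's ElementTree, not 4Suite. The html:title attribute is added purely for contrast and does not appear in the sample document.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# An element in the XHTML namespace with one unprefixed attribute
# and one prefixed attribute (the latter invented for contrast)
div = ET.fromstring(
    '<html:div xmlns:html="http://www.w3.org/1999/xhtml" '
    'global="spam" html:title="eggs"/>'
)

# ElementTree spells namespaced names as {namespace-uri}local-name
global_in_no_namespace = "global" in div.attrib
global_in_xhtml = ("{%s}global" % XHTML_NS) in div.attrib
title_in_xhtml = ("{%s}title" % XHTML_NS) in div.attrib

print(global_in_no_namespace, global_in_xhtml, title_in_xhtml)
```

The unprefixed attribute does not inherit the namespace of its element, which is why serializing a namespaced attribute without a prefix silently changes its meaning.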

Wrap Up

    

I didn't cover PyXML because the most interesting libraries in it using namespaces are very similar to Python's SAX and minidom, which I did cover. PyXML also includes older versions of the XPath and XPattern libraries from 4Suite. The main idea behind 4Suite is to open up in-depth Python APIs to standard XML technologies, and this extends to all the relevant namespace facilities in various XML specifications. In the next article I shall continue this examination of namespace capabilities in Python tools.

Picking up on what my colleagues have been up to lately, I find Dave Kuhlman's update to generateDS, which includes the ability to interchange Python and XML literal text. For a more in-depth explanation see the announcement. I covered generateDS earlier in this column.

Manfred Stienstra wrote a couple of articles on the use of libxml2's Python bindings: "The Problem with the Libxml Python Bindings" and "More Problems with the Libxml Python Bindings".

Andrew Dalke has long been working on Martel, a tool for converting the many flat, text-based file formats used in bioinformatics into XML. Recently his paper on the topic, "Martel: Bioinformatics file parsing made easy", came to my attention again. Martel is a very clever idea and applicable beyond the world of bioinformatics. It can be used in general to treat "legacy" formats (including the likes of CSV and simple record-per-line files) as if they were already in XML. One warning is that the link to Martel in the paper is out of date. See the first sentence in this paragraph for the current link.