XML.com: XML From the Inside Out

More Gems From the Mines


November 12, 2003

In a recent article I started mining the riches of the XML-SIG mailing list, prospecting for some of its choicest bits of code. I found a couple of nice bits from 1998 and 1999. This time I cover 2000 and 2001, an exciting period where preparations for Python 2.0 meant that XML tools were finally gaining some long-desired capabilities in the core language. As in the last article, where necessary, I have updated code to use current APIs, style, and conventions in order to make it more immediately useful to readers. All code listings are tested using Python 2.2.2 and PyXML 0.8.3. See the last article for my note on the use of PEP 8: Style Guide for Python Code, which I use in updating listings.

Rich Salz's Simple DOM Node Builder

Yes, I have even more on generating XML output. This time the idiom is more DOM-like, building node objects rather than sending commands to a single output aggregator object, as with the output tools I have examined previously. Rich Salz posted simplenode.py (June 2001), a very small DOM module ("femto-DOM") with an even simpler API than minidom, but geared mostly towards node-oriented generation of XML output. You use the PyXML canonicalization (c14n) module with it to serialize the resulting nodes to get actual XML output. The output is thus in XML canonical form, which is fine for most output needs, although it may look odd compared to most output from "pretty printers". Not having to take care of the actual output simplifies the module nicely, but it does mean that PyXML is required. There have been fixes and improvements to the c14n module since PyXML 0.8.3, so 0.8.4 and beyond should produce even better output once available. I lightly refactored simplenode.py to use Unicode strings, for safer support of non-ASCII text. The result is in listing 1.

Listing 1: Rich Salz's Simple DOM node builder (save as simplenode.py)
from xml.dom import Node
from xml.ns import XMLNS

def _splitname(name):
    '''Split a name, returning prefix, localname, qualifiedname.
    '''
    i = name.find(u':')
    if i == -1: return u'', name, name
    return name[:i], name[i + 1:], name

def _initattrs(n, type):
    '''Set the initial node attributes.
    '''
    n.attributes = {}
    n.childNodes = []
    n.nodeType = type
    n.parentNode = None
    n.namespaceURI = u''

class SimpleContentNode:
    '''A CDATA, TEXT, or COMMENT node.
    '''
    def __init__(self, type, text):
        _initattrs(self, type)
        self.data = text

class SimplePINode:
    '''A PI node.
    '''
    def __init__(self, name, value):
        _initattrs(self, Node.PROCESSING_INSTRUCTION_NODE)
        self.name, self.value, self.nodeValue = name, value, value

class SimpleAttributeNode:
    '''An element attribute node.
    '''
    def __init__(self, name, value):
        _initattrs(self, Node.ATTRIBUTE_NODE)
        self.value, self.nodeValue = value, value
        self.prefix, self.localName, self.nodeName = _splitname(name)

class SimpleElementNode:
    '''An element.  Might have children, text, and attributes.
    '''
    def __init__(self, name, nsdict = None, newline = 1):
        if nsdict:
            self.nsdict = nsdict.copy()
        else:
            self.nsdict = {}
        _initattrs(self, Node.ELEMENT_NODE)
        self.prefix, self.localName, self.nodeName = _splitname(name)
        self.namespaceURI = self.nsdict.get(self.prefix, None)
        for k,v in self.nsdict.items():
            self.addNSAttr(k, v)
        if newline: self.addText(u'\n')

    def nslookup(self, key):
        n = self
        while n:
            if n.nsdict.has_key(key): return n.nsdict[key]
            n = n.parentNode
        raise KeyError, u'namespace prefix %s not found' % key

    def addAttr(self, name, value):
        n = SimpleAttributeNode(name, value)
        n.parentNode = self
        if not n.prefix:
            n.namespaceURI = None
        else:
            n.namespaceURI = self.nslookup(n.prefix)
        self.attributes[name] = n
        return self

    def addNSAttr(self, prefix, value):
        if prefix:
            n = SimpleAttributeNode(u'xmlns:' + prefix, value)
        else:
            n = SimpleAttributeNode(u'xmlns', value)
        #XMLNS.BASE is the special namespace W3C
        #circularly uses for namespace declaration attributes themselves
        n.parentNode, n.namespaceURI = self, XMLNS.BASE
        #Key on the full attribute name so that several declarations
        #on one element do not overwrite each other
        self.attributes[n.nodeName] = n
        self.nsdict[prefix] = value
        return self

    def addDefaultNSAttr(self, value):
        return self.addNSAttr(u'', value)

    def addChild(self, n):
        n.parentNode = self
        if n.namespaceURI is None: n.namespaceURI = self.namespaceURI
        self.childNodes.append(n)
        return self

    def addText(self, text):
        n = SimpleContentNode(Node.TEXT_NODE, text)
        return self.addChild(n)

    def addCDATA(self, text):
        n = SimpleContentNode(Node.CDATA_SECTION_NODE, text)
        return self.addChild(n)

    def addComment(self, text):
        n = SimpleContentNode(Node.COMMENT_NODE, text)
        return self.addChild(n)

    def addPI(self, name, value):
        n = SimplePINode(name, value)
        return self.addChild(n)

    def addElement(self, name, nsdict = None, newline = 1):
        n = SimpleElementNode(name, nsdict, newline)
        self.addChild(n)
        return n

if __name__ == '__main__':
    e = SimpleElementNode('z', {'': 'uri:example.com', 'q':'q-namespace'})

    n = e.addElement('zchild')
    n.addNSAttr('d', 'd-uri')
    n.addAttr('foo', 'foo-value')
    n.addText('some text for d\n')
    n.addElement('d:k2').addText('innermost txt\n')

    e.addElement('q:e-child-2').addComment('This is a comment')
    e.addText('\n')
    e.addElement('ll:foo', { 'll': 'example.org'})
    e.addAttr('eattr', '''eat at joe's''')

    from xml.dom.ext import Canonicalize
    print Canonicalize(e, comments=1)  

As you can see, the node implementations, SimpleContentNode, SimpleAttributeNode, SimplePINode, and SimpleElementNode, are dead simple. The last is the only one with any methods besides the initializer, and these are pretty much all factory methods. The namespace handling is rather suspect in this module, although I was able to generate a simple document with a namespace. For parity, however, I put it to work on the same XSA output task I have used to test other XML output tools (see the previous article, for example). There are no namespaces in this task. In the future I may revisit the output tools I have examined in this column to document their suitability for output using namespaces. Listing 2 is the code to generate the XSA file. Save listing 1 as simplenode.py before running it.

Listing 2: Code to generate XSA using simplenode.py
import sys
import codecs
from simplenode import SimpleElementNode

root = SimpleElementNode(u'xsa')
vendor = root.addElement(u'vendor')
vendor.addElement(u'name').addText(u'Centigrade systems')
vendor.addElement(u'email').addText(u'info@centigrade.bogus')

product = root.addElement(u'product').addAttr(u'id', u'100\u00B0')
product.addElement(u'name').addText(u'100\u00B0 Server')
product.addElement(u'version').addText(u'1.0')
product.addElement(u'last-release').addText(u'20030401')
product.addElement(u'changes')

#The wrapper automatically handles output encoding for the stream
wrapper = codecs.lookup('utf-8')[3]
from xml.dom.ext import Canonicalize
Canonicalize(root, output=wrapper(sys.stdout), comments=1)  

The c14n module claims to emit UTF-8, but it doesn't really seem to be smart about encodings because I got the dreaded UnicodeError: ASCII encoding error: ordinal not in range(128) when I tried to let the Canonicalizer take care of the output stream for me. I had to pass the old codecs output stream wrapper to get the Canonicalizer to output the degree symbol. The resulting output is


$ python listing2.py
<xsa>
<vendor>
<name>
Centigrade systems</name><email>
info@centigrade.bogus</email></vendor><product id="100°">
<name>
100° Server</name><version>
1.0</version><last-release>
20030401</last-release><changes>
</changes></product></xsa>  
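
The stream wrapper trick from listing 2 can be tried in isolation. The following is a minimal sketch, written for Python 3 rather than the Python 2.2 used for the listings, with an in-memory io.BytesIO standing in for sys.stdout (an assumption made so the encoded bytes can be inspected). The function name write_encoded is made up for illustration. codecs.getwriter('utf-8') returns the same stream writer that listing 2 fetches as codecs.lookup('utf-8')[3].

```python
import codecs
import io


def write_encoded(text, encoding='utf-8'):
    """Write text through a codecs stream writer and return the raw bytes.

    write_encoded is a hypothetical helper name, not part of any library.
    """
    raw = io.BytesIO()  # stand-in for a byte-oriented output stream
    writer_class = codecs.getwriter(encoding)
    stream = writer_class(raw)
    stream.write(text)  # the wrapper encodes the text as it is written
    return raw.getvalue()


# The degree sign (U+00B0) is outside ASCII, so it needs a real encoding
encoded = write_encoded('<product id="100\u00B0"></product>')
print(encoded)
```

The degree sign comes out as the two UTF-8 bytes C2 B0. Encoding the same text as ASCII instead raises UnicodeEncodeError in Python 3, the counterpart of the error quoted above.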

Other Bits

I didn't find much other code that was interesting enough and ready for use. There are some near gems in the archives that may also be of interest to some users, but perhaps not really worthy of the full treatment of update and detailed commentary in this column.

In "The Zen of DOM" (April 2000) Laurent Szyster attaches "myXML-0.2.8.tgz" (not to be confused with the current PHP project of the same name). It is basically a Python data binding: a tool that allows you to register special Python objects against XML element names and parse the XML files into data structures consisting of the special objects. The package is not as versatile as currently available data binding packages (see recent articles in this column), but it does include some interesting ideas. Brief inspection of the code indicates that it would probably work with minor modifications on more recent pyexpat builds.
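
To give a flavor of what registering objects against element names means, here is a minimal sketch in Python 3. It is not Szyster's code or API: BoundElement, Vendor, and parse_bound are hypothetical names, and the only library used is the standard xml.parsers.expat module.

```python
import xml.parsers.expat


class BoundElement:
    """Fallback binding for element names with no registered class."""

    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs
        self.children = []
        self.text = ''


class Vendor(BoundElement):
    """Hypothetical specialized class bound to <vendor> elements."""

    def contact(self):
        # Build a display string from the <name> and <email> children
        fields = {child.name: child.text for child in self.children}
        return '%s <%s>' % (fields['name'], fields['email'])


def parse_bound(xml_text, registry):
    """Parse xml_text, instantiating registry[name] for each element."""
    stack = []
    roots = []

    def start(name, attrs):
        node = registry.get(name, BoundElement)(name, attrs)
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)

    def end(name):
        stack.pop()

    def chars(data):
        # expat may deliver one text run in several pieces
        stack[-1].text += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return roots[0]


XSA = ('<xsa><vendor><name>Centigrade systems</name>'
       '<email>info@centigrade.bogus</email></vendor></xsa>')
root = parse_bound(XSA, {'vendor': Vendor})
vendor = root.children[0]
print(vendor.contact())
```

Elements without a registered class fall back to the generic BoundElement, so only the element names that need special behavior have to be registered.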

Benjamin Saller posted an API and module for simple access to data in XML files (June 2000). It allows you to parse XML and create specialized data structures, similar to those of ElementTree. You can then address the data using strings with a special syntax not unlike XPath. You can also specify data type conversions from XML character data. See the thread after the original module was posted for some problems, tweaks, and suggestions.
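
The combination of path strings and type conversion is easy to picture with ElementTree itself. The sketch below is written for Python 3 and uses xml.etree.ElementTree, not Saller's module; get_value is a hypothetical helper and the sample document is made up along the lines of the XSA example.

```python
import xml.etree.ElementTree as ET


def get_value(root, path, convert=str):
    """Return the text found at path, passed through convert.

    get_value is a hypothetical helper; path uses ElementTree's
    limited XPath-like syntax.
    """
    text = root.findtext(path)
    if text is None:
        raise KeyError(path)
    return convert(text.strip())


XSA = ('<xsa><product id="p1"><name>Server</name>'
       '<version>1.0</version>'
       '<last-release>20030401</last-release></product></xsa>')
root = ET.fromstring(XSA)

name = get_value(root, 'product/name')
version = get_value(root, 'product/version', float)
release = get_value(root, 'product/last-release', int)
print(name, version, release)
```

Passing the converter as a plain callable keeps the lookup code free of any knowledge about the target types.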

Tom Newman announced a text-based browser/editor for XML (July 2000), tan_xml_browser. The package requires PyNcurses and an old version of 4DOM. It would need some work to update the DOM bindings.

Clark Evans posted a "Namespace Stripper Filter" (March 2001), a SAX filter that strips elements and attributes in a given namespace (it only handles one namespace at a time). It is a good example of namespace filters, except that it accesses internal objects of the SAX API in a bid for performance. This is not an advisable practice, as Lars Marius Garshol warns in a follow-up.
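
A filter of this kind can be written against the public SAX API alone. What follows is a minimal sketch in Python 3, not Evans's code: NamespaceStripper and strip_namespace are hypothetical names, the sample document and its urn:x-debug namespace are invented, and the filter builds only on xml.sax.saxutils.XMLFilterBase and the documented handler methods.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import feature_namespaces
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl


class NamespaceStripper(XMLFilterBase):
    """Drop elements and attributes in one namespace (hypothetical name)."""

    def __init__(self, parent, strip_uri):
        XMLFilterBase.__init__(self, parent)
        self._strip_uri = strip_uri
        self._skip_depth = 0  # greater than 0 inside a stripped element
        self._mappings = {}   # prefix -> stack of "was forwarded" flags

    def startPrefixMapping(self, prefix, uri):
        forward = uri != self._strip_uri and not self._skip_depth
        self._mappings.setdefault(prefix, []).append(forward)
        if forward:
            XMLFilterBase.startPrefixMapping(self, prefix, uri)

    def endPrefixMapping(self, prefix):
        # Only close mappings that were passed downstream
        if self._mappings[prefix].pop():
            XMLFilterBase.endPrefixMapping(self, prefix)

    def startElementNS(self, name, qname, attrs):
        if self._skip_depth or name[0] == self._strip_uri:
            self._skip_depth += 1
            return
        kept, qnames = {}, {}
        for attr_name in attrs.getNames():
            if attr_name[0] != self._strip_uri:
                kept[attr_name] = attrs.getValue(attr_name)
                qnames[attr_name] = attrs.getQNameByName(attr_name)
        XMLFilterBase.startElementNS(
            self, name, qname, AttributesNSImpl(kept, qnames))

    def endElementNS(self, name, qname):
        if self._skip_depth:
            self._skip_depth -= 1
            return
        XMLFilterBase.endElementNS(self, name, qname)

    def characters(self, content):
        if not self._skip_depth:
            XMLFilterBase.characters(self, content)


def strip_namespace(xml_bytes, strip_uri):
    """Return xml_bytes reserialized without content in strip_uri."""
    parser = make_parser()
    parser.setFeature(feature_namespaces, True)
    out = io.StringIO()
    stripper = NamespaceStripper(parser, strip_uri)
    stripper.setContentHandler(XMLGenerator(out, encoding='utf-8'))
    stripper.parse(io.BytesIO(xml_bytes))
    return out.getvalue()


SAMPLE = (b'<root xmlns:dbg="urn:x-debug"><item dbg:seen="yes">kept</item>'
          b'<dbg:trace>dropped</dbg:trace></root>')
stripped = strip_namespace(SAMPLE, 'urn:x-debug')
print(stripped)
```

Like the original, this handles one namespace at a time. One known gap in the sketch: a declaration for some other namespace that appears on a stripped element is still forwarded, so the serializer may emit it on a later element.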

Sjoerd Mullender posted what he claimed to be a validating XML parser in Python (April 2001). The script can emit an XML output form suitable for testing and clean comparisons, similar in intent to the W3C standard Canonical XML format used in PyXML's c14n module. I've always argued that it is almost never a good idea for anyone to roll their own XML parser. This code does, however, have several useful passages, especially the various regular expressions defined toward the top.

It's worth noting one gem that is not a bundle of Python code, at least not directly. Martin von Löwis was working on a unified API for XPath libraries in Python in 2001. He decided to define the interface using an ISO Interface Definition Language (IDL) module for XPath libraries. It defines all the axes, operators, and node types in programmatic detail. Martin used this IDL as the basis for his Python XPath implementation.

And while I'm on variant gems, it is well-known wisdom in Python that it is much more efficient to create big strings by using one-time list concatenation or cStringIO rather than simple addition of strings. String processing is, of course, very important in XML processing so Tom Passin, an XML developer, took the initiative to do some clear analysis on the matter, compiling "New data on speed of string appending" for Python 1.5.2. Conclusion: don't forget to use cStringIO, the winner of the shootout. And if you decide on the runner up, recent Python releases provide string methods so that you would instead write:


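NO_SEPARATOR = ''
snippet_list = []
#strings_to_concat stands for whatever sequence of strings you have
for snippet in strings_to_concat:
    snippet_list.append(snippet)
#Join once at the end rather than adding strings repeatedly
created_str = NO_SEPARATOR.join(snippet_list)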
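
For comparison, here are both approaches side by side as a small sketch in Python 3. The function names are made up, and io.StringIO stands in for cStringIO, which exists only in Python 2. The sketch shows only that the two produce the same string; it makes no claim about their relative speed on current Python versions.

```python
import io


def concat_with_join(snippets):
    """Collect the pieces in a list and join them once at the end."""
    pieces = []
    for snippet in snippets:
        pieces.append(snippet)
    return ''.join(pieces)


def concat_with_buffer(snippets):
    """Write the pieces to an in-memory text buffer, then read it back."""
    buffer = io.StringIO()
    for snippet in snippets:
        buffer.write(snippet)
    return buffer.getvalue()


snippets = ['<item>%d</item>' % i for i in range(3)]
joined = concat_with_join(snippets)
buffered = concat_with_buffer(snippets)
print(joined == buffered)
```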

I'll also point to a posting that I haven't been able to run or analyze successfully enough to determine whether it is a gem, but which does look promising. Clark Evans posted a "Simple wxPython XSLT Testing Tool using MSXML" which uses a GUI interface for running Microsoft's MSXML XSLT processor. He points to TransformStage, a function where you could substitute invocation of some other XSLT processor; it's the only place where the pythoncom module, imported at the top, is used, and this module is only available on Windows.

Wrapping Up

    


In the future I plan to return to this thread to look at XML-SIG postings in 2002 and 2003. There is more useful code available in the XML-SIG archives, and I will present updates of it in those installments. If you think I may have missed any great postings since 1998, feel free to attach comments to this article or post them to the XML-SIG mailing list.

Meanwhile, it has been a slow month in new Python-XML development.

Pyana 0.8.1 was released. It has been updated for Xalan 1.6/Xerces 2.3. See Brian Quinlan's announcement.

Roman Kennke developed a module, domhelper.py, with functions to provide some common operations on DOM, including looking up namespace URIs and prefixes and non-recursively getting the text or child elements of a given node. Be warned that this module does raise errors in strange situations, but you should be able to comment out the extraneous error checking easily enough. The module is actually part of the Python-SOAP project. See Kennke's announcement.
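
The non-recursive operations mentioned above are short enough to sketch. The following is an illustration in Python 3 using the standard xml.dom.minidom, not Kennke's code: child_elements and direct_text are hypothetical names, and the sample document is made up.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def child_elements(node):
    """Return the element children of node, without descending further."""
    return [child for child in node.childNodes
            if child.nodeType == Node.ELEMENT_NODE]


def direct_text(node):
    """Concatenate the text directly inside node, ignoring descendants."""
    text_types = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    return ''.join(child.data for child in node.childNodes
                   if child.nodeType in text_types)


doc = parseString('<product>100 <name>Server</name> units</product>')
product = doc.documentElement
names = [element.tagName for element in child_elements(product)]
text = direct_text(product)
print(names, repr(text))
```

The text inside the name element is left out of direct_text because only the immediate children of the product element are examined.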