Making Old Things New Again

April 20, 2005

There have been recent releases of two of the Python-XML projects in which I'm involved: 4Suite and Amara XML Toolkit. One common theme in both releases was marked improvements to the XML document creation APIs. These improvements are significant enough to discuss and compare to the other systems for XML output I have presented in this column. The code uses 4Suite version 1.0b1 and Amara 1.0b2, running under Python 2.3.5. Installation is basically the same as in my earlier articles covering these packages: "Three More for XML Output" and "Introducing the Amara XML Toolkit".

4Suite's MarkupWriter

New in 4Suite 1.0b1 is the class Ft.Xml.MarkupWriter, which is specialized for creating XML documents from scratch. It offers at least one feature I haven't seen in any other output library. Listing 1 uses this class to generate a simple XML Software Autoupdate (XSA) file. XSA is the XML data format I've been using as a standard example for XML output. It is a format for listing and describing software packages.

Listing 1: Using 4Suite MarkupWriter to generate XSA

from Ft.Xml import MarkupWriter

#Set the output doc type details (required by XSA)
SYSID = u"http://www.garshol.priv.no/download/xsa/xsa.dtd"
PUBID = u"-//LM Garshol//DTD XML Software Autoupdate 1.0//EN//XML"

writer = MarkupWriter(indent=u"yes", doctypeSystem=SYSID,
                      doctypePublic=PUBID)
writer.startDocument()
writer.startElement(u'xsa')
writer.startElement(u'vendor')
#Element with simple text (#PCDATA) content
writer.simpleElement(u'name', content=u'Centigrade systems')
writer.simpleElement(u'email', content=u"info@centigrade.bogus")
writer.endElement(u'vendor')
#Element with an attribute
writer.startElement(u'product', attributes={u'id': u"100\u00B0"})
writer.simpleElement(u'name', content=u"100\u00B0 Server")
writer.simpleElement(u'version', content=u"1.0")
writer.simpleElement(u'last-release', content=u"20030401")
#Empty element
writer.simpleElement(u'changes')
writer.endElement(u'product')
writer.endElement(u'xsa')
writer.endDocument()

This illustrates the basics of the API, but some more advanced features are available. For example, attributes can be added individually rather than in dictionary form, so the line:

writer.startElement(u'product', attributes={u'id': u"100\u00B0"})

could instead be written:

writer.startElement(u'product')
writer.attribute(u'id', u"100\u00B0")

There is similar flexibility when emitting text. The line:

writer.simpleElement(u'name', content=u'Centigrade systems')

could instead be written:

writer.startElement(u'name')
writer.text(u'Centigrade systems')
writer.endElement(u'name')

simpleElement is basically a shortcut for the startElement/text/endElement combination. More interestingly, MarkupWriter allows you to insert well-formed XML fragments (external parsed entities, in XML terms) as complete chunks in the output. This is a very convenient way to emit boilerplate XML without breaking it down into all the separate element/attribute/content bits. As such the lines:

writer.simpleElement(u'name', content=u"100\u00B0 Server")
writer.simpleElement(u'version', content=u"1.0")
writer.simpleElement(u'last-release', content=u"20030401")

Could instead be written:

writer.xmlFragment("""    <name>100° Server</name>
    <version>1.0</version>
    <last-release>20030401</last-release>""")
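What "well-formed" means for such a chunk can be sketched with the Python 3 standard library. This is not how MarkupWriter checks its input (the article does not say); the helper name is_well_formed_fragment and the dummy wrapper element are assumptions made for illustration. The idea is that a fragment may hold several top-level elements, but every tag must be balanced, so it should parse once wrapped in a single root.

```python
import xml.etree.ElementTree as ET


def is_well_formed_fragment(fragment):
    """Return True if the fragment parses once wrapped in a dummy root.

    Hypothetical helper for illustration only; not part of 4Suite.
    """
    try:
        ET.fromstring("<wrapper>" + fragment + "</wrapper>")
    except ET.ParseError:
        return False
    return True


# Two sibling elements with no single root: fine as a fragment.
good = "<name>100\u00b0 Server</name><version>1.0</version>"

# <name> is never closed, so this is not well-formed.
bad = "<name>100\u00b0 Server<version>1.0</version>"

print(is_well_formed_fragment(good))
print(is_well_formed_fragment(bad))
```

A fragment that passes this kind of check can be dropped between a start tag and its end tag without unbalancing the surrounding document.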

Output Parameters

In Listing 1, you can see how parameters that control the output are passed into the MarkupWriter initializer, including document type info and whether to indent (pretty print). You can pass any of the usual controls for XSLT output into the initializer in this way. So for instance omitXmlDeclaration=u"yes" could be used to suppress output of the XML declaration. By default MarkupWriter sends its output to sys.stdout, but you can substitute any file-like object by passing in an initializer parameter. For example:


writer = MarkupWriter(output_file, indent=u"yes")

You can also set other parameters, based on those in the XSLT spec:

encoding - the character encoding to use (default UTF-8). The writer will automatically use numeric character references where the encoding cannot represent a character directly.
omitXmlDeclaration - "yes" to suppress output of the XML declaration. Default "no".
standalone - "yes" to set standalone in the XML declaration.
mediaType - sets the media type of the output. You'll probably never need this.
cdataSectionElements - a list of element names whose output will be wrapped in a CDATA section. This can provide for friendlier output in some cases.
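The effect of the encoding parameter on a character such as the degree sign can be illustrated with plain Python 3 string encoding. This is only a sketch of the general idea, not 4Suite's implementation: it uses the standard xmlcharrefreplace error handler, and a real XML writer also escapes markup characters such as < and &.

```python
text = "100\u00b0 Server"

# UTF-8 can represent the degree sign directly (as the two bytes C2 B0).
utf8_bytes = text.encode("utf-8")

# ASCII cannot, so the character falls back to a numeric character reference.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

print(utf8_bytes)
print(ascii_bytes)  # b'100&#176; Server'
```

Either byte sequence reads back as the same text once an XML parser has processed it, which is why the choice of output encoding does not change the document's content.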

The XSLT spec also defines a method parameter to choose between XML, HTML or plain text output rules, but for MarkupWriter at the moment you should stick to XML. The result of changing the method is undefined. We'll probably relax this restriction in later releases.

The output from Listing 1 is shown in Listing 2. I added line breaks to the document type declaration for formatting reasons. One important thing to keep in mind is that MarkupWriter produces output incrementally. There is no waiting for the entire document to be built before sending output.

Listing 2: XSA output from Listing 1


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsa PUBLIC
"-//LM Garshol//DTD XML Software Autoupdate 1.0//EN//XML"
"http://www.garshol.priv.no/download/xsa/xsa.dtd">
<xsa>
  <vendor>
    <name>Centigrade systems</name>
    <email>info@centigrade.bogus</email>
  </vendor>
  <product id="100°">
    <name>100° Server</name>
    <version>1.0</version>
    <last-release>20030401</last-release>
    <changes/>
  </product>
</xsa>
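To make the point about incremental output concrete, here is a minimal sketch that uses the standard library's xml.sax.saxutils.XMLGenerator under Python 3 instead of MarkupWriter, an assumption made so that it runs without 4Suite installed. It writes to an in-memory file-like object and inspects the buffer partway through, showing that markup reaches the stream as each call is made rather than when the document ends.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

out = io.StringIO()  # any file-like object with a write() method would do
gen = XMLGenerator(out, encoding="utf-8", short_empty_elements=True)
no_attrs = AttributesImpl({})

gen.startDocument()
gen.startElement("xsa", no_attrs)
gen.startElement("vendor", no_attrs)
gen.startElement("name", no_attrs)
gen.characters("Centigrade systems")
gen.endElement("name")

# The document is not finished, but what was written so far is already there.
partial = out.getvalue()

gen.endElement("vendor")
gen.startElement("product", AttributesImpl({"id": "100\u00b0"}))
gen.startElement("changes", no_attrs)
gen.endElement("changes")  # emitted as an empty-element tag
gen.endElement("product")
gen.endElement("xsa")
gen.endDocument()

complete = out.getvalue()
print(partial)
print(complete)
```

Because nothing is held back, a streaming writer of this kind needs very little memory, at the cost that earlier output cannot be revised once it has been sent.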

Creating documents with Amara

Amara's Bindery component now also allows you to create XML documents from scratch. The interface is not quite as rich as MarkupWriter, but it has some similarities. Amara's API is probably more suitable if you're writing programs that have a variety of document reading and update tasks, besides just creating output. If you really just want to write XML as directly as possible, MarkupWriter is probably a better bet. Listing 3 uses Amara to generate the same XSA text as Listing 1.

Listing 3: Using Amara Bindery to generate XSA

from amara import binderytools

#Set the output doc type details (required by XSA)
SYSID = u"http://www.garshol.priv.no/download/xsa/xsa.dtd"
PUBID = u"-//LM Garshol//DTD XML Software Autoupdate 1.0//EN//XML"

#Create a document object with basic details, to be updated with
#Additional elements/attributes/content in subsequent lines
doc = binderytools.create_document(u'xsa', sysid=SYSID, pubid=PUBID)
doc.xsa.xml_append(doc.xml_element(u'vendor'))
#Element with simple text (#PCDATA) content
doc.xsa.vendor.xml_append(
    doc.xml_element(u'name', content=u'Centigrade systems'))
doc.xsa.vendor.xml_append(
    doc.xml_element(u'email', content=u'info@centigrade.bogus'))
#Element with an attribute
doc.xsa.xml_append(
    doc.xml_element(u'product', attributes={u'id': u"100\u00B0"}))
doc.xsa.product.xml_append(
    doc.xml_element(u'name', content=u'100\u00B0 Server'))
doc.xsa.product.xml_append(
    doc.xml_element(u'version', content=u'1.0'))
doc.xsa.product.xml_append(
    doc.xml_element(u'last-release', content=u'20030401'))
doc.xsa.product.xml_append(doc.xml_element(u'changes'))

print doc.xml(indent=u"yes")  #Print it

The basic differences in style have to do with what's happening under the covers. Whereas MarkupWriter is just sending instructions to generate a stream of characters, in the Amara example I am building the document as objects in memory. This has several implications. For one thing, MarkupWriter will tend to be faster and much less memory-hungry, while Amara will provide more flexibility for building the document (you can think of it as a sort of cursor with random access to the entire document being built at any time). In the Amara example, you also wait until the entire document is built before triggering the actual output (in this case using print). The xml method used to generate XML output accepts the same XSLT-based output control parameters as described above for MarkupWriter.
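The "random access" side of this trade-off can be sketched with the standard library's xml.etree.ElementTree under Python 3. ElementTree is used here only as a stand-in for an in-memory tree; it is not Amara's API, and unlike Listing 3 this sketch does not emit a document type declaration. The point is that a node created earlier can still be extended later, because nothing is serialized until the end.

```python
import xml.etree.ElementTree as ET

# Build the tree in memory; no output is produced yet.
xsa = ET.Element("xsa")
vendor = ET.SubElement(xsa, "vendor")
ET.SubElement(vendor, "name").text = "Centigrade systems"

product = ET.SubElement(xsa, "product", {"id": "100\u00b0"})
ET.SubElement(product, "name").text = "100\u00b0 Server"
ET.SubElement(product, "version").text = "1.0"

# Random access: go back and extend <vendor> after <product> already exists.
ET.SubElement(vendor, "email").text = "info@centigrade.bogus"

# Only now is the whole document turned into text.
serialized = ET.tostring(xsa, encoding="unicode")
print(serialized)
```

With a streaming writer the email element would have had to be written before the vendor element was closed; with a tree, the order of construction and the order of output are independent.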

Conclusion

We really turned our jets on this week in the community. I myself kept very busy with releases of two of the main Python packages I maintain or help to maintain for XML processing.

First of all comes 4Suite 1.0b1, the next big step toward the long overdue 1.0 release. Performance is the main theme of this release, and there have been gains in all the core libraries. The biggest gains have come where they count the most: in the Domlette library that forms the basis of much of 4Suite. The Python implementation of Domlette has been removed, and 4Suite is no longer dependent on pyexpat, which helps with several platform-specific issues. The MarkupWriter class that I introduced in this article is also new in this release. There are now BerkeleyDB and MySQL drivers for RDF and the repository. As usual there are many bug fixes and minor enhancements. See the announcement.

I also announced Amara XML Toolkit 1.0b2. The main changes are improvements to the mutation API. The user can now perform common replacement and deletion actions using familiar Python idioms. See the announcement.

David Mertz announced Gnosis Utils 1.2.0, which seems to be the culmination of a good deal of work on the package I've featured a couple of times already in this column. The announcement says, "This release of the Gnosis Utilities contains several new modules, as well as fixes, enhancements, and speedups in existing subpackages." One of the new modules is XML-related: gnosis.xml.xmlmap, "Unicode->XML legality testing & Unicode helper functions." There are also many updates to gnosis.xml.objectify, and the fixes I suggested in "Full XML Indexes with Gnosis" have been incorporated in gnosis.xml.indexer. See the announcement.

Martijn Faassen announced lxml 0.5.1, "a Pythonic binding for the libxml2 and libxslt libraries." It largely follows the ElementTree API, extending this to "expose libxml2 and libxslt specific functionality, such as XPath, Relax NG, XSLT, and c14n". See the announcement.

Ivan Voras wrote a simple "XML parser" of his own, xmldict. I thought very hard about even mentioning this code, because it's a pretty dangerous idea. Ivan says "It's really quick and dirty. It doesn't even use standard parsers such as dom or sax, but improvises its own. It's also very likely to fail in mysterious ways when it encounters invalid XML, but for quick and dirty jobs, it's very nice and easy. See the bottom of the file for some quick examples." It's OK to use such a package as long as you are crystal clear to yourself and others that you are not using anything like an XML parser. Think of it like this: you might use grep to search a directory full of XML files, but you'd never fool yourself into saying you were doing any XML parsing.

I almost missed an interesting blog entry by Danny Ayers:

"I just got around to reading Uche Ogbuji’s interesting Wrestling HTML article at xml.com on parsing HTML with Python. Uche mentions HTML to XML, something the rough SAX2-styled HTML soup/ill-formed XML parser code I did a while ago can do pretty easily (not well, but easily). So I’ve added a quick handler to demo that (there’s already an RSS/Atom feed reading demo in there) and here it is: psoup_2004-09-13.zip."

"It’s not been tidied up like JSoup (Java version, still hacky!), but it still might make a useful starting point for someone."

And speaking of blogs, I have established a blog of my own, Copia, (with my brother and fellow XML.com author Chimezie Ogbuji). Starting this week, I'll be posting brief links to selected Python-XML software announcements there. Of course I'll continue to provide the monthly summary in this space.