Should Python and XML Coexist?

August 24, 2005

Recently there have been some discussions in the Python community about whether and where XML is useful. As I've mentioned before, the Python community tends to be rather hostile to XML. The recent round of discussions has mixed some of that raw scorn with a bit of nuance, and it seems a good time to examine some of the considerations that shape the intersection between these two popular and powerful technologies.

Avoiding XML Sit Ups

To pick a good place to start, let's look at the particular XML usage scenario that sparked the recent discussions. Phillip J. Eby is one of the core developers at the Open Source Applications Foundation, where the primary project is Chandler, an enterprise-grade groupware application written in Python. The project includes a component architecture built from units called parcels, which were originally expressed in XML. Recently, the decision was made to move from XML to Python code itself for expressing parcels. Eby discussed this decision in a weblog post with the provocative title "Chandler Begins Recovery from XML." The contents are somewhat less inflammatory.

Some of you may be thinking back to my Python Is Not Java rant, in which I said that using XML for core application functionality like this was, well, unwise. :) At the PyCon Chandler sprint, it was discovered that Chandler's homegrown XML schema definition language was a terrible hardship on developers, and so I proposed to replace it with a descriptor-based Python API. That migration was completed recently. With that done, only initialization of data items (such as Chandler's UI components) was done using XML. So, a few weeks ago, I implemented an experimental API for initializing data items, which quickly became quite popular, with some even pointing out the advantages of being able to factor out repetition.

More about that "Python Is Not Java" rant in a bit, but first more from the more recent article:

For a while, there was also a proposal to create a new XML format just for UI definition. But my counterproposal for using a simple template class and a classmethod instead was met with great rejoicing.

Many people misunderstood and/or misrepresented my previous position on XML; the case of Chandler should help to clarify it. Chandler still uses XML for WebDAV, for .xrc files, for sharing, and numerous other use cases where it makes at least some sense to do so. The parcel.xml format, however, was pure excise: a verbose additional language to do things that are more cleanly (and efficiently) done in Python code. It was developed to serve a vision of Chandler as a "data-driven" system, and it was supposed to ultimately support things like GUI editors.

Of course, the real sin here was not so much XML per se, as overengineering in advance of requirements. If you're not developing the feature now, it's best not to make a bunch of other design decisions based on what you think the feature will need. A little thing like choosing to put data in XML form can result in a wide variety of additional costs...

Java (and C++, etc.) certainly have a lot to do with this matter. Such languages might do the job for general applications programming, but the basic restrictions of most static languages make them a poor choice for little languages within host applications. If you need a configuration or script file of some sort for an application written in Java, you probably would not want to use Java as the language for that file. There are all sorts of little languages that are used for such cases, but XML has been dominating the scene lately. It has the advantage of being relatively straightforward to process, internationalized, portable, flexible, and extensible. Certainly Java developers these days see XML everywhere, from Apache Ant to J2EE server configuration.

When migrating from languages such as Java to dynamic languages such as Python, it's easy to forget to reassess the value of XML as a little language. There is much less need for separate little languages in the case of a dynamic language. If you define configuration and scripting for an application in a dynamic host language, and you use the same language to express the script, the instructions in the script files can be directly executed in the context of the host process, which provides a tremendous amount of flexibility. (It might also open up security issues if you allow script files from external sources, but I'll skirt that issue in this discussion.) This does not mean that there is never a need for a separate little language. After all, nobody is clamoring to replace regular expression syntax with plain Python code. Still, separate little languages are not often needed in Python programs. Using Python itself for scripting gives you the full power of Python, and the script author is not restricted to simple key-value style parameters.
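As a minimal sketch of the idea, assuming the configuration comes from a trusted source, a host program can execute configuration written in Python and read back the names it defines. The settings, their names, and the load_config helper are all hypothetical and exist only for illustration.

```python
# Hypothetical configuration text; in practice it would live in its own file.
CONFIG_SOURCE = '''
base_dir = "/var/app"
log_file = base_dir + "/app.log"        # values can be computed, not just listed
ports = [8000 + i for i in range(3)]    # repetition is factored out
'''

def load_config(source):
    """Execute configuration code and return the names it defines."""
    namespace = {}
    # exec runs arbitrary code, so this is only safe for trusted configuration.
    exec(source, namespace)
    # Drop the __builtins__ entry that exec adds to the namespace.
    return {k: v for k, v in namespace.items() if not k.startswith("__")}

config = load_config(CONFIG_SOURCE)
print(config["log_file"])
print(config["ports"])
```

The script author gets expressions and comprehensions for free, and the host needs no parser beyond Python itself. The cost is the security caveat already noted: the file is code, not data.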

If you pay close attention, Phillip's complaints are all about using little languages in general within Python programs and really have nothing special to say about XML. The point is that if you don't need to invent a new syntax that the Python developer needs to learn, you shouldn't do so, because doing so is erecting an unnecessary hurdle. This should be simple common sense rather than an exhibit in the case against XML.
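To make the hurdle concrete, here is a rough sketch of a small invented XML configuration format together with the code a Python program needs before it can use the values. The element names, the type attribute, and the load_xml_config helper are all made up for this illustration and do not come from Chandler or any real project.

```python
import xml.etree.ElementTree as ET

# A hypothetical little language: the author must learn that settings are
# <setting> elements and that types are declared in a "type" attribute.
CONFIG_XML = """
<config>
  <setting name="base_dir" type="str">/var/app</setting>
  <setting name="port" type="int">8000</setting>
  <setting name="port" type="int">8001</setting>
</config>
"""

# The host must also map the format's type names onto Python types.
CONVERTERS = {"str": str, "int": int}

def load_xml_config(text):
    """Parse the invented format into a dict mapping names to lists of values."""
    config = {}
    for setting in ET.fromstring(text).findall("setting"):
        convert = CONVERTERS[setting.get("type")]
        config.setdefault(setting.get("name"), []).append(convert(setting.text))
    return config

xml_config = load_xml_config(CONFIG_XML)
print(xml_config)
```

Every convention in this format (element names, type names, how repeated settings combine) is something both the developer and the processing code have to carry. That burden belongs to any custom little language, whatever its syntax.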

This basic reasoning applies to most dynamic languages, not just Python, and in many other dynamic languages XML is the easiest target among possible little language formats, because of its popularity. The cutest turn of phrase in this campaign comes from Ruby, where the popular Ruby on Rails framework uses the following blurb:

Rails is a full-stack, open source web framework in Ruby for writing real-world applications with joy and less code than most frameworks spend doing XML sit-ups.

The emphasis on "XML sit-ups" is mine. I guess this phrase is intended to speak to J2EE developers, who are used to working through layers upon layers of XML in order to set up configuration. I think you already tend to find much less of that in dynamic language projects, and in fact I'd suggest that you are just as likely to find ".ini"-style file formats as XML in dynamic language projects. Of course, ".ini" can be a poor choice for the same reasons as XML.

There Is Such a Thing as Overdoing It

It's one thing to say that XML is often not the best choice for configuration and scripting in Python applications, but one has to be careful not to overstate this fact. Phillip comes close to doing so in his post "Python Is Not Java."

This is a different situation than in Java, because compared to Java code, XML is agile and flexible. Compared to Python code, XML is a boat anchor, a ball and chain. In Python, XML is something you use for interoperability, not your core functionality, because you simply don't need it for that. In Java, XML can be your savior because it lets you implement domain-specific languages and increase the flexibility of your application "without coding." In Java, avoiding coding is an advantage because coding means recompiling. But in Python, more often than not, code is easier to write than XML. And Python can process code much, much faster than your code can process XML. (Not only that, but you have to write the XML processing code, whereas Python itself is already written for you.)

He goes on at length, but the rest is really just different ways of restating this core paragraph. The second sentence is where the overstatement comes in: if XML is used where it is best used, and ditto for Python code, it shouldn't even make sense to compare the two in the same terms. Unfortunately, one sees a lot of misguided overgeneralization starting from valid complaints about where XML is not suitable. The rest of the quoted paragraph is more careful, and it's clear Phillip is not claiming, for example, that people should use Python code rather than HTML (or XHTML) to express Web pages. XML is no more a good code format than Python is a good document format.

XML is the result of the meeting of two very distinct worlds: the database/data structure world and the document management world. As a result, XML is reasonably suitable for expressing data structures, and reasonably suitable for documents as well. I personally argue that XML is much more suited for documents than for data structures, but this is a long-standing debate in the XML community. I do, however, observe that dissatisfaction with XML seems to emerge much more loudly when XML is used to express data structures. The biggest complaints of the document crowd with XML, in my estimation, are its lack of minimization tricks (as in SGML) and its lack of support for overlapping markup. On the other hand, programmer types are much more prone to call into question the entire reason for XML's being, sometimes to the point of overreacting.

I personally consider this to be evidence that the trend toward injecting more and more of the character of programming languages and databases into XML is deeply misguided. W3C XML Schema and XQuery do even more to blur the line between applications and semistructured data. Developers in languages such as Java see this as a good thing because they already rely so heavily on XML that ever-closer union seems natural. Unfortunately, the message is not always clear that users of dynamic languages should consider less complex and rigid alternatives such as RELAX NG and XPath. I have long said that I would rather use Python and XPath to access XML documents and even XML data stores than XQuery, but being familiar with Java/XML APIs, I can understand why XQuery would be attractive in that case.
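To illustrate the Python-plus-XPath style of access, here is a small sketch using the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath. It stands in for fuller XPath implementations and is not the toolkit discussed in this column. The sample document and its element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample document for illustration.
DOC = """
<catalog>
  <book lang="en"><title>First</title><price>10</price></book>
  <book lang="fr"><title>Second</title><price>25</price></book>
  <book lang="en"><title>Third</title><price>40</price></book>
</catalog>
"""

root = ET.fromstring(DOC)

# The path expression selects the nodes of interest...
english = root.findall("./book[@lang='en']")

# ...and ordinary Python does the rest: projection, conversion, aggregation.
titles = [book.findtext("title") for book in english]
total = sum(int(book.findtext("price")) for book in english)

print(titles, total)
```

The division of labor is the point: the path expression locates nodes in the document, and the host language handles the computation that a query language such as XQuery would otherwise have to express.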

In an interesting twist on this whole matter, even in languages such as Java, there is some backlash emerging against overuse of XML. Some developers rue the need for complex XML in scripting scenarios where it might have been better to use a language such as Jython, which is already tightly integrated into the host language, and is far better suited to writing code than XML.

Conclusion

There is plenty of room for discussion about where XML can be useful to Python programmers, and where it can be a hindrance. There is also plenty of room to discuss which XML-related technologies are well suited to use with Python, and which might be best avoided. I'll cover such matters in coming articles. Meanwhile, it's great to see that the Python community has been doing a lot more than just complaining about XML.

Starting close to home, I pushed Amara XML Toolkit to version 1.0. I've covered Amara here in the past (Introducing the Amara XML Toolkit and Making Old Things New Again). Amara's centerpiece is Bindery, a very Pythonic XML API. The biggest change is a package option that incorporates the prerequisites (from 4Suite), in order to remove one installation step. You no longer need anything except for Python to install Amara from one package in one step. See the announcement.

Walter Dörwald announced XIST 2.11, which I covered in the recent article Writing and Reading XML with XIST. It's a very capable, open source package for XML and HTML processing and generation. The biggest change is a new script, xml2xsc.py, which parses a sample XML instance and generates subclasses for XIST. See the announcement for the full catalog of changes, including many more fixes and a few minor API updates.

Christof Hoeke just keeps on pushing out packages. This time it's cssutils 0.8a3, "a Python package to parse and build CSS Cascading Style Sheets." cssutils implements portions of DOM Level 2 Stylesheets CSS interfaces. See the announcement.

Alexander Schremmer announced MoinMoin 1.3.5. MoinMoin is a widely used Wiki server written in Python. It offers several XML features, including DocBook content and XSLT rendering, which saw some work in this release as the 4Suite compatibility was improved. See the announcement for many more details.

John Holland announced pyx12 1.2.0. "Pyx12 is a HIPAA X12 document validator and converter. It parses an ANSI X12N data file and validates it against the Implementation Guidelines for a HIPAA transaction. By default, it creates a 997 response. It can create an html representation of the X12 document or can translate to an XML representation of the data file." This release focuses on XML-output format adjustments and additions, with some bugs and performance tweaks. See the announcement.

J. David Ibáñez released itools 0.10.0, a collection of utilities. It includes several XML-related modules: itools.xml (a parser with some similarities to pulldom), itools.schemas, itools.rss (RSS 2.0), itools.xliff (XLIFF--XML Localization Interchange File Format), itools.xhtml, and itools.tmx (TMX--Translation Memory eXchange). It also includes Simple Template Language (STL), a language for embedding template-processing instructions in XHTML. See the announcement.

Julien Oster announced xmlrpcserver 0.99.1. "xmlrpcserver is a simple to use but fairly complete XML-RPC server module for Python, implemented on top of the standard module xmlrpclib. This module may, for example, be used in CGIs, inside application servers or within an application, or even standalone as an HTTP server waiting for XML-RPC requests." See the announcement, which includes a complete code example.

Following up on my article Wrestling HTML, I wrote a couple of articles detailing further experiences turning an HTML mess into clean XHTML. In Use Amara to Parse/Process (Almost) any HTML I showed how to use the HTML tidy command line to feed HTML to Amara. In Beyond HTML Tidy I gave a workout to John Cowan's TagSoup command line tool as well as BeautifulSoup.

Leslie Michael Orchard announced a module, xslfilter.py, for WSGI--Python Web Server Gateway Interface. It uses lxml to optionally run XSLT transforms against XML produced by server code and sends the result to the client.

Radovan Garabík's announcement of unicode 0.4.7 is a nice follow-up to my last article, which included discussion of the Python Unicode database module. Radovan's tool is a "simple python command line utility that displays properties for a given unicode character, or searches unicode database for a given name," building on that database.
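For readers who want to poke at that database directly, here is a minimal sketch using the standard unicodedata module. It is not Radovan's tool, whose interface is not shown here, and the describe helper is a hypothetical name used only for this example.

```python
import unicodedata

def describe(char):
    """Return a few properties of one character from the Unicode database."""
    return {
        "codepoint": "U+%04X" % ord(char),
        "name": unicodedata.name(char, "<unnamed>"),  # default for unnamed characters
        "category": unicodedata.category(char),
    }

# Look up properties for a character, then go the other way: name to character.
info = describe("\u00e9")
found = unicodedata.lookup("LATIN SMALL LETTER E WITH ACUTE")

print(info)
print(found == "\u00e9")
```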