An XML Hero Reconsiders?

March 19, 2003

Most, if not all, of the permanent topics of conversation on XML-DEV revolve around two camps of people: one which thinks aspect N of XML is a wart, the other which thinks N is an elegance. These threads never end because, in part, there is no final or absolute context within which XML is meant to be used. Whether you think of N as a wart or an elegance is context dependent and interest relative. It depends almost entirely on who you are and what you want and need XML to do. In other words, all opinions about XML are equal. Except that that's not really true. All opinions about XML are equal, except some are more equal than others. Among the more equal opinions are ones held by the people who drafted the XML specification.

Among that select group of people, as far as XML-DEV is concerned, Tim Bray stands out, if for no other reason than he has consistently contributed to the conversational life of the community. So when Bray, over the course of a few weeks, leads an effort to relocate XML-DEV and publishes a widely-read essay in which he seems to question XML itself, it's time to take a closer look.

Is XML Too Hard?

In a recent weblog entry, one which has been picked up by Slashdot, Bray asks whether XML has become too hard for programmers. Faced with writing code "to process arbitrary incoming XML", Bray confesses that the experience was "irritating, time-consuming, and error-prone" -- quite an admission from someone as instrumental in the creation of XML as Bray. The point here -- before someone accuses me of hero worship -- isn't that Tim Bray is always right. He isn't. The point is that when Tim Bray starts talking about XML's problems, it makes sense for the XML development community to pay some attention.

So what's Bray's beef? It isn't, he says, that XML parsers are so hard to write. If they were, Bray says, there wouldn't be so many of them (which doesn't strictly follow, logically speaking, but it may be a useful heuristic). The problem is that XML parsers are so hard to use. (I wrote about a similar issue -- the difficulty of SAX or DOM for some kinds of programmer -- in one of my first XML-Deviant columns, "DOM and SAX Are Dead, Long Live DOM and SAX".)
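
To make the "hard to use" complaint concrete, here is a minimal sketch in Python rather than Perl, using the standard xml.sax module. The class name, the function name, and the choice of a <title> element are illustrative assumptions, not anything taken from Bray's post. Even to collect the text of a single element, the callback style makes the programmer track parser state by hand:

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every <title> element via SAX callbacks."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False  # state the programmer has to maintain
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def collect_titles(xml_text):
    """Parse a string of XML and return the text of its <title> elements."""
    handler = TitleCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles
```

The logic is spread over three methods and two pieces of bookkeeping, which is roughly what "awkward" means in practice.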

Specifically, Bray offers the standard lament: DOM processing is inefficient, SAX processing is awkward.

If I use any of the perl+XML machinery, it wants me either to let it read the whole thing and build a structure in memory, or go to a callback interface.

Since we're typically reading very large datasets, and typically looking at the vast majority of it, preloading it into a data structure would be impractical not to say stupid. Thus we'd be forced to use parser callbacks of one kind or another, which is sufficiently non-idiomatic and awkward that I'd rather just live in regexp-land.

What? Bray "parses" XML with regular expressions? Apparently so, at least for some data munging tasks at Antarctica, his data-mapping company. He's even written some code to make regexing XML more reliable:

Now here's the dirty secret; most of it is machine-generated XML, and in most cases, I use the perl regexp engine to read and process it. I've even gone to the length of writing a prefilter to glue together tags that got split across multiple lines, just so I could do the regexp trick.

Bray's preferred way of parsing XML for some kinds of project, like his weblog software, would be to have a kind of regular expression syntax which "abstracts away all the XML syntax weirdness, ignoring line-breaks, attribute orders, choice of quotemarkers and so on". While his example is Perl, it's not Perl-specific. "I want to have my idiomatic regexp cake," Bray says, "and eat my well-formed XML goodness too".
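
The "syntax weirdness" in question is easy to demonstrate. The sketch below, in Python, uses made-up sample data and hypothetical function names: three serializations of the same element differ only in attribute order, quote characters, and line breaks, yet a naive regular expression matches just one of them, while a real parser treats all three alike.

```python
import re
import xml.etree.ElementTree as ET

# Three equivalent serializations of one element (hypothetical sample data).
VARIANTS = [
    '<item id="42" lang="en"/>',
    "<item lang='en' id='42'/>",
    '<item\n   lang="en"\n   id = "42" />',
]

# A naive pattern that assumes one particular spelling of the start tag.
NAIVE = re.compile(r'<item id="(\d+)"')


def naive_id(text):
    """Return the id via regex, or None if the spelling differs."""
    match = NAIVE.search(text)
    return match.group(1) if match else None


def parsed_id(text):
    """Return the id via an XML parser, which normalizes the syntax."""
    return ET.fromstring(text).get("id")
```

A prefilter like the one Bray describes is one way to narrow the gap; the regexp syntax he wishes for would close it without giving up the regex idiom.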

XML-DEV reacted to Bray's essay in a relatively muted way, perhaps because it was busy discussing his proposal to relocate XML-DEV. Simon St. Laurent suggested that Bray's lament reflects the different expectations and assumptions of the markup and programming worlds. Dare Obasanjo took Bray to task, suggesting that what Bray is asking for, namely a "pull-based XML parser", is already available in Java and C#. Daniel Veillard pointed out that it is also available in libxml2, and so in any language, like Python, which has libxml2 bindings; likewise, Sean McGrath reminded XML-DEV that this processing style is available in his Pyxie Python toolkit. Obasanjo went even further:

Tim's post indicates that he is quite disconnected from the world of modern XML programming practices, I especially like his "The notion that there is an 'XML data model' is silly and unsupported by real-world evidence" quote. I'm interested in what criteria he used to determine that the thousands of users of XPath, XSLT and the DOM don't count as "real-world evidence".
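
For readers who haven't met the pull style Obasanjo has in mind, here is a rough sketch using Python's standard xml.dom.pulldom module. It is not one of the Java, C#, libxml2, or Pyxie APIs mentioned above, and the function name and sample data are assumptions. The program asks for events in an ordinary loop instead of registering callbacks, and only the subtree it cares about is ever built in memory:

```python
import io
from xml.dom import pulldom


def iter_titles(stream):
    """Yield the text of each <title> element, pulling events on demand."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "title":
            # Build a DOM fragment for this element only, not the document.
            events.expandNode(node)
            yield "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )


# Hypothetical sample input.
SAMPLE = b"<feed><entry><title>A</title></entry><entry><title>B</title></entry></feed>"
titles = list(iter_titles(io.BytesIO(SAMPLE)))
```

The control flow stays in the caller's hands, which is much closer to the "read a record, do something, repeat" idiom that Bray misses.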

Micah Dubinko takes a different approach, noting that part of the difficulty of parsing some kinds of XML is handling ID values. One of the changes that might be made to XML to make it less painful to process is to fix the ID issue. As Dubinko puts it,

Anyone who's worked much with XML knows that IDs are painful, since they require DTD or schema processing. A recurring proposal has been circulating for a self-describing xml:id attribute that confers ID-ness without need of DTD or schema. With that in place, even XML delivered inside the DTD-free zone of a SOAP envelope could be handled with code not significantly more complex than Adam's example, and without the dependencies and hassle of a schema language.
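
A minimal sketch of what Dubinko describes, in Python and assuming the proposed xml:id attribute is simply taken at face value as an ID: no DTD or schema is consulted, and the lookup table is built from the attribute alone. The helper name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace, so ElementTree
# reports xml:id under the following expanded name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def index_by_xml_id(root):
    """Map each xml:id value to its element, with no DTD or schema."""
    return {
        element.get(XML_ID): element
        for element in root.iter()
        if element.get(XML_ID) is not None
    }


# Hypothetical DTD-free document.
SAMPLE = '<env><body><part xml:id="p1">x</part><part xml:id="p2">y</part></body></env>'
index = index_by_xml_id(ET.fromstring(SAMPLE))
```

Note that this sketch does not check that the ID values are unique; a later duplicate would silently replace an earlier entry.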

Two very substantial posts, by Robin Berjon and Barrie Slaymaker, responded specifically to the Perl-specific part of Bray's comments. Berjon makes the obvious point:

A couple years after you think you and your friends have convinced all the Perl community to not use regex based parsers on XML you get Tim Bray to hit you on the back of the head with one. I guess it's fair though, because he makes sense.

Which is especially telling (and rather funny) if you remember classics which asked whether "wily Perl hackers" could handle XML. Berjon's primary suggestion is to commend Barrie Slaymaker's XML::Filter::Dispatcher module:

It is basically a collection of utilities wrapped inside rules that match an XML stream using an XPath-like language (which is probably preferable to raw regexen, and has in fact similar functionality). They will do things such as maintain state for you or assist you doing so, in a way that is much simpler than SAX. You need to learn a new little language, but if you know XPath it's a five minute job. I don't think the docs do it full justice.

This sounds an awful lot like the kind of thing Bray is asking for.
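
To give a feel for the general idea of rules matched against a stream, here is a loose Python analogue. It is emphatically not the XML::Filter::Dispatcher API or its rule language: the dispatch function, the plain slash-separated paths, and the sample data are all assumptions made for illustration. Handlers are keyed by path, and the dispatcher maintains the "where am I?" state so the handlers don't have to.

```python
import io
import xml.etree.ElementTree as ET


def dispatch(stream, rules):
    """Call rules[path](element) as each element at that path is completed."""
    path = []  # stack of open element names, maintained for the caller
    for event, element in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            path.append(element.tag)
        else:
            handler = rules.get("/" + "/".join(path))
            if handler is not None:
                handler(element)
                # Drop the handled content to keep memory use down. An
                # enclosing rule would see this element emptied.
                element.clear()
            path.pop()


# Hypothetical sample input and a single rule.
SAMPLE = b"<feed><title>Blog</title><entry><title>A</title></entry><entry><title>B</title></entry></feed>"
seen = []
dispatch(io.BytesIO(SAMPLE), {"/feed/entry/title": lambda e: seen.append(e.text)})
```

The rule fires for the two entry titles but not for the feed's own title, because the path, not just the element name, decides the match.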

If you're a Perl-XML hacker, you could do far worse than to pay very close attention to Barrie Slaymaker's post about XML::Filter::Dispatcher and XML::Essex. I can only quote some of its high points here, but it definitely deserves a careful reading.

XML::Essex is "a prototype of a pull mode scripting environment ... that is also event driven so it allows while ( pull ) style processing without reading the entire document (via Perl's newish ithreads)". That is, it is meant to allow Perl programmers to deal with XML files in much the same way as they'd deal with other text files. It is, Slaymaker says, "an attempt to build a toolkit that allows processing XML files in the same way that you would process a text file, modulo the fact that text files are treated as flat sequences of records in Perl while XML is hierarchical". Again, if you're interested in Bray's lament specifically for its Perl-angle, you should take a careful look at Barrie Slaymaker's recent work.

Resources

Bray pointed to at least two other programmers who complain about the difficulty of handling XML, both of whom make interesting and non-overlapping points:

Adam Bosworth's "Speaking XML"
Joe Gregorio's "Regex-able XML"