XML Isn't Too Hard

April 2, 2003

After taking a well-earned break from the XML world last week, during which I attended the Python community conference, PyCon, I've returned rested and ready to tackle the latest issues in the XML development community.

Before setting off for PyCon, I finished an XML-Deviant column ("An XML Hero Reconsiders?") in which I examined the XML development community's reaction to some recent questions posed by Tim Bray about a perennial bugaboo, XML parsing strategies. To my surprise, the conversation about this issue hasn't faded away and warrants another look, especially since feedback from the XML development community and elsewhere has prompted Bray to compose another missive, "Why XML Doesn't Suck".

4 Out of 5 XML Programmers Agree: XML Isn't Too Hard

Uche Ogbuji -- a fellow XML.com columnist and Python devotee -- struck what turned out to be a popular note. Ogbuji suggested that XML has made him more productive, that the idea of using regular expressions to parse XML strikes him as weird, and that Python in particular (using the DOM and Python generators approach he discussed in "Generating DOM Magic") makes handling XML easier:

Tim's article just made me roll my eyes. XML has made my life as a developer much easier, and i can't think of a developer whom I've mentored in using XML that doesn't have the same experience.

I find the idea of processing XML using simple regexen pretty hair raising. Holy impedance mismatch! Holy error prone! That's supposed to be easier?

Python has pulldom, although I prefer DOM+Generators, myself. I find them clearer and much easier.
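For readers who haven't met either style, here is a minimal sketch of the two approaches Ogbuji names, using only the Python standard library. It is not Ogbuji's code: the sample document, the `iter_elements` helper, and the function names are all made up for illustration.

```python
from xml.dom import minidom, pulldom

# Hypothetical sample document, invented for this sketch.
DOC = "<order><item sku='a1'>Widget</item><item sku='b2'>Gadget</item></order>"


def item_skus_pulldom(xml_text):
    """pulldom style: walk a stream of (event, node) pairs."""
    skus = []
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT and node.tagName == "item":
            skus.append(node.getAttribute("sku"))
    return skus


def iter_elements(node):
    """Generator: yield every element below node, in document order."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            yield child
            yield from iter_elements(child)


def item_skus_dom(xml_text):
    """DOM + generator style: build the tree, then iterate lazily over it."""
    doc = minidom.parseString(xml_text)
    return [e.getAttribute("sku") for e in iter_elements(doc)
            if e.tagName == "item"]
```

Both functions return the same list of `sku` values for `DOC`. The difference is in how the programmer thinks about the input: as a stream of events in the first case, as a tree to be traversed in the second.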

Dimitre Novatchev went further, suggesting that looking at XML in certain lights, and then processing it according to those lights, inevitably leads to programmer pain: "Surely, any attempt to regard a tree of nodes", Novatchev said, "as a linear sequence of 'start tags' and 'end tags' and to process it as such is an inherently masochistic experience".

Whether XML is a "tree of nodes" or merely a "linear sequence" of start and end tags, the point seems to be that choices as to how one thinks about XML imply choices as to how one parses it, some of which may be more painful than others.

Jonathan Robie approached the question differently, asking not whether XML is hard, but whether it's harder or easier than other methods of data representation. "Data is hard for programmers", Robie said, but "XML has made data easier for programmers by giving it a standard, easily processed representation." As both Robie and Bray added, however, that doesn't mean that XML is easier to handle than the native data structures of any particular programming language. And that's precisely why the easiest, if not always the most efficient, way of handling XML in any particular language is the way which is most consonant with that language's way of looking at the world.

Among the recent contributors to this ongoing conversation, Liam Quin defended some of Bray's original claims, particularly the one about parsing XML with regular expressions. What Bray was suggesting, in Quin's view, was that Perl 6's expanded regular expression facilities might make some XML parsing jobs easier. As Quin explained,

Perl 6 "regular expressions" are actually full-blown grammars, with an new and massively clearer syntax. And that's what he referred to. It's more like having a more flexible and more powerful YACC interpreter.

But whether Perl 6 gains an even more powerful regular expression engine, and whether that engine spreads from Perl to other languages, is in some ways beside the point. As Sean McGrath pointed out, the discussion seems to oscillate between two poles, namely, "correctness or input fidelity".

If you process XML with regular expressions, you can know that such processing will not change things like "entity references, whitespace, attribute delimiters" and so on. But you can't know that your processing will work with every well-formed input. On the other hand, if you process XML with an XML parser, you can know that it will work with every well-formed input. But you can't know that this processing will not alter (or otherwise "negatively effect") "entity references, whitespace, attribute delimiters" and so on. This conundrum, as expressed by Sean McGrath, sets the programmer on a path past the "Scylla and Charybdis" of XML processing, namely, parsing or regular expressions.
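The trade-off is easy to reproduce. The following is a small sketch using Python's standard library; the `price` documents, the deliberately naive pattern, and the function names are assumptions invented for the example, not anything McGrath posted.

```python
import re
from xml.dom import minidom

# Hypothetical, deliberately naive pattern: it assumes double-quoted
# attributes and exactly one space after the element name.
PRICE_RE = re.compile(r'<price currency="([^"]*)">')

# Two well-formed documents that carry the same information.
DOUBLE = '<list><price currency="USD">5 &#38; up</price></list>'
SINGLE = "<list><price  currency='USD'>5 &#38; up</price></list>"


def currencies_regex(xml_text):
    """Regex route: never alters the text, but matches only one spelling."""
    return PRICE_RE.findall(xml_text)


def currencies_parser(xml_text):
    """Parser route: works for any well-formed spelling of the document."""
    doc = minidom.parseString(xml_text)
    return [e.getAttribute("currency")
            for e in doc.getElementsByTagName("price")]


def roundtrip(xml_text):
    """Parse and reserialize; the lexical details of the input are not kept."""
    return minidom.parseString(xml_text).documentElement.toxml()
```

Here `currencies_regex` finds the currency in `DOUBLE` but returns an empty list for `SINGLE`, while `currencies_parser` handles both. In the other direction, `roundtrip(SINGLE)` does not give back the original string: the quote style, the extra space, and the `&#38;` character reference are all normalized away by the serializer.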

The point of McGrath's tale is to wonder whether XML programmers can have both correctness and input fidelity. Several people answered that you could have both, including Simon St. Laurent:

Of course you can have both, if you haven't been lulled to sleep by chants of "Infoset, Infoset" or "XPath is the data model." Heck, you can even have both and deal with the PSVI, if you're that much of a masochist...

The Desperate Perl Hacker has been quite thoroughly betrayed, first by XML 1.0, then by namespaces, then by a variety of other devices that further separated the text from its supposed meaning.

There's nothing inherent in XML or in the languages used to process XML that requires this division. [Most languages are] plenty capable of providing text renditions to accompany events or objects, if anyone thinks it valuable...The problem isn't the code -- it's the will. It certainly takes extra effort.

XML Doesn't Suck?

So where does all this leave us? Clearly, as with any technology decision, there are advantages and disadvantages to parse-processing and regex-processing XML. It is part of the craft of programming to use the right processing tool for the job at hand, as well as to anticipate and adapt to the costs of using that tool, especially when it is custom-built for the job.

One reason that every opinion about XML is not, in the end, equal to every other, as I suggested in the previous column, is that some opinions carry a perceived weight, an authority derived from experience and contribution which must be accounted for. In explaining some ways in which he found XML processing harder than it might otherwise be, Tim Bray provided all those who reject XML, for whatever reason, with an opening to say, "See, one of the co-creators of XML hates it too!"

As a way of answering these inevitable, plainly false claims, Bray has written another weblog entry, "Why XML Doesn't Suck", which, by way of concluding this column, I summarize here. XML doesn't suck, in Bray's estimation, because:

It gets internationalization right. Until XML came along, getting fully internationalized data right was more an ideal than a reality. As Bray says,

In XML, there's no ambiguity - a document is a sequence of characters, and characters are numbers, and the numbers mean what Unicode/ISO10646 says they mean. There are lots of different ways to store those numbers as bytes in data files, but XML forces you to say which one you're using right up front.

In a world in which it would be nearly utopian for computer communications across linguistic boundaries to be the chief impediment to free-flowing exchange of ideas and commerce, XML is an important, scarily prescient, and indispensable first step.

It is representationally elastic. In other words, XML can be and is being used to represent an ever-expanding range of kinds of data.

It is syntactically interoperable. Bray makes the point in this way:

XML provides a nice set of syntax rules that you can stick in the face of a recalcitrant vendor and say "you claim to be interoperable? Well, ship me some XML then." And these days, they can't say no, and this is good for everyone.

This belief that bits-on-the-wire is more important than data structures or APIs is at the center of my world-view...

It breaks predictably and reliably. That is, it can be determined, objectively, whether something which someone claims to be XML actually is XML (whether of the well-formed or valid variety); this characteristic, Bray implies, cuts down on the ability of people and organizations to dodge the bullet when things go wrong.

It is or will be long-lived. Information, Bray says, often outlives its representation, which means that forms of representation which are or will be long-lived are inherently more valuable than ones which aren't. XML gives every sign of being long-lived.
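Two of these points, the up-front encoding declaration and the predictable breakage, can be seen in a few lines of Python. This is only a sketch with the standard library parser; the byte strings and the `parse_or_error` helper are hypothetical names chosen for the example, and it checks well-formedness only, not validity.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parse_or_error(data):
    """Return (document, None) for well-formed input, else (None, message)."""
    try:
        return minidom.parseString(data), None
    except ExpatError as err:
        return None, str(err)


# The byte 0xE9 is "e with acute accent" in ISO-8859-1. The declaration
# says so up front, so the parser knows how to turn bytes into characters.
DECLARED = b'<?xml version="1.0" encoding="ISO-8859-1"?><n>caf\xe9</n>'

# The same bytes with no declaration must be read as UTF-8, where a lone
# 0xE9 followed by "<" is not a legal sequence: the parser refuses it.
UNDECLARED = b'<n>caf\xe9</n>'

# A mismatched end tag: not well-formed, and objectively so.
MISMATCHED = b'<a><b></a>'
```

Calling `parse_or_error(DECLARED)` yields a document whose text is the four characters of "café", while `UNDECLARED` and `MISMATCHED` each yield an error message rather than a best guess at what the sender meant.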