Binary XML, Again

August 13, 2003

XML is Not Self-Describing

I dissent from several points of XML Orthodoxy because I am by nature, personal inclination, and experience, a dissident. But I also dissent reflexively, I suppose, because my intellectual training -- in religious studies and the philosophy of religion -- acquainted me well with the dynamics of orthodoxy, heterodoxy, and heresy, rather than, say, the dynamics of ADTs or just-in-time language compilation.

That's not to say that computer scientists are, on average, a conformist lot. Computer programmers, however, tend to be; but since that tendency arises as much from impersonal market forces as from individual personality quirks, I think we can let them, for the most part, slide. One specific locus of XML Orthodoxy that I have never professed is the idea that XML is "self-describing," which seems a rather grandiose and overly strong way of saying that one names XML containers. Well, I've read too much Wittgenstein (not to mention too much Aquinas, Meister Eckhart, and Julian of Norwich) to think that a name is necessarily a self-description.

However, I grant that there may be some marginal utility to be had from naming one's data containers. I am, all other things being equal, slightly better off if I have to do something with data -- in the absence of any other information -- that looks like this

      <equipmentItem>
        <type>centrifuge</type>
        <quantity>1</quantity>
        <manufacturer>Alfa Laval</manufacturer>
        <model>P3000</model>
        <subtype>decanter</subtype>
        <material>stainless</material>
        <drivehp>50</drivehp>
      </equipmentItem>

than I am if I have to do something with data -- again, in the absence of any other information -- that looks like this

centrifuge, 1, Alfa Laval, P3000, decanter, stainless, 50

In other words, a decently designed XML schema pretty much beats tab- or comma-separated value files every time (though in the simple case above, it's a wash, especially since TSV or CSV files often come with a field header in the first line).
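To make the "it's a wash" point concrete, here is a minimal Python sketch that reads the record above in both forms. It assumes the CSV file carries a header row, and the header names are my assumption, borrowed from the XML element names; the helper functions are hypothetical and use only the standard library.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_RECORD = """<equipmentItem>
  <type>centrifuge</type>
  <quantity>1</quantity>
  <manufacturer>Alfa Laval</manufacturer>
  <model>P3000</model>
  <subtype>decanter</subtype>
  <material>stainless</material>
  <drivehp>50</drivehp>
</equipmentItem>"""

# Assumption: the CSV variant ships with a header line naming its fields.
CSV_RECORD = (
    "type, quantity, manufacturer, model, subtype, material, drivehp\n"
    "centrifuge, 1, Alfa Laval, P3000, decanter, stainless, 50\n"
)


def record_from_xml(text):
    """Map each child element's name to its text content."""
    root = ET.fromstring(text)
    return {child.tag: (child.text or "").strip() for child in root}


def record_from_csv(text):
    """Map each header name to the matching value in the first data row."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return dict(next(reader))


xml_fields = record_from_xml(XML_RECORD)
csv_fields = record_from_csv(CSV_RECORD)

# With a header row, both forms carry the same names and values.
print(xml_fields == csv_fields)
```

Without the header line the CSV reader would have only positions to go on, which is the marginal advantage the named containers provide. In neither case do the names explain what a "drivehp" is.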

But that advantage does not amount to "self-describing," which, if it means anything coherent at all, means information that is simultaneously information and information about itself: data and metadata that describes its own structure, nature, and identity, all without relying on any additional information and without triggering an infinite regress of higher-orders of description, which in turn require description, which require description.... Well, XML doesn't do that, not by a long shot. But that's okay; I'm not sure that anything can be self-describing in that sense.

I say all of this to point out that, in discussions about binary variants of XML, one of the first claims made by the anti-binary camp is that XML, as it is, is self-describing and a binary variant wouldn't be. I think that claim is wrong on both counts: XML isn't self-describing in any robust or serious sense, and I can't think of any good reason why a binary variant wouldn't be as weakly self-describing as XML already is.

Binary isn't Necessarily Better

The other locus of orthodoxy I dissent from -- though this locus is more often and more fiercely disputed -- is the idea that a well-designed (and semantically equivalent, I suppose I should add) binary variant of XML will perform better than the canonical textual version. Or, to put it more accurately, that parsers and other consumers of the binary variant will perform better (whether this means time performance, space performance, or both is not always clear) than parsers and consumers of the canonical textual form of XML.

I dissent (provisionally) from this broad class of claims because I have a stubborn empiricist streak. I want to see the numbers which demonstrate the general performance advantages of a binary variant of XML. Whether through my lack of attentiveness or for other reasons, I've never seen numbers which convinced me in the general case. The numbers I have seen on a related topic -- whether textual or binary message forms are superior in application-level message passing and routing systems -- suggested that textual formats, while more verbose, are also more amenable to a wider range of semantic analysis of a sort which may allow for better routing algorithms, thus recouping any performance which might be lost to slimmer, binary variants.

Unlike the self-description boat, which has sailed permanently, in my opinion, I remain open to the possibility that a binary variant of XML could perform better across the board than the canonical textual variety. Even if one day we have solid empirical proof of that claim, we'd still need to decide whether a general, across the board performance advantage was sufficient.

The W3C Goes Binary?

So that's my set of background considerations about this issue. I confess surprise upon learning a few weeks ago that the W3C has escalated the degree to which it's willing to flirt with the idea of a binary XML variant. I refer, of course, to the recently announced (and, ironically, rather verbosely titled) W3C Workshop on Binary Interchange of XML Information Item Sets, to be held in Santa Clara, California, USA, at the end of September.

The workshop announcement is interesting in its own right and worth quoting. Under pressure, I assume, from a "steadily increasing demand," the W3C has decided to get in front of those of its vendor-members which want "to find ways to transmit pre-parsed XML documents and Schema-defined objects, in such a way that embedded, low-memory and/or low bandwidth devices can" get in on the XML game without... -- well, really getting in on the XML game. In other words, major vendors wishing for a binary XML want to have their XML cake without breaking any textual-parser eggs. And what major vendors want, they very often get.

The workshop announcement also mentions a few tantalizing details, including talk of "multiple separate implementers" having some success with an ASN.1 variant of XML. It also, prudently, in my view, mentions the ol' gzip-standby -- in truth, I confess to having a real bias in favor of gzip. If you absolutely must have some kind of binary variant, gzip seems hard to beat since it allows you to pick any three from "decent compression factor", "decent (de)compression performance", and "already implemented everywhere".

The other interesting thing of note here is that the W3C is talking about a binary variant of (parts of) the XML Infoset. What difference that could make remains to be seen, but it's interesting enough to pay some attention to it. There are at least two issues at this workshop -- binary variants and, as the workshop announcement says, "pre-parsed" artifacts -- and they seem orthogonal to each other, such that they really oughtn't to be run together. I can imagine proponents of binary variants of raw XML instances, and I can imagine other factions which support binary representations of Infosets.

I would prefer that these two issues be kept rigorously separated; if I had my way, there'd be two workshops or one workshop with two very distinct tracks. The easiest, cheapest, and quickest way to get a binary variant of XML deployed widely is to have the W3C bless some kind of gzip-raw-XML-instances standard. I doubt there is a similarly painless (well, about as painless as these things ever are...) way to do the Infoset thing.
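For what it's worth, the gzip-raw-XML-instances approach needs nothing beyond what most standard libraries already ship. The following is a minimal Python sketch, not a proposal for how any standard would work: the `equipment` wrapper element, the helper names, and the choice of 200 repeated records are all assumptions made for illustration. Identical repeated records compress far better than real documents would, so the sizes it prints say nothing about the general case.

```python
import gzip
import xml.etree.ElementTree as ET

RECORD = (
    "<equipmentItem>"
    "<type>centrifuge</type>"
    "<quantity>1</quantity>"
    "<manufacturer>Alfa Laval</manufacturer>"
    "<model>P3000</model>"
    "<subtype>decanter</subtype>"
    "<material>stainless</material>"
    "<drivehp>50</drivehp>"
    "</equipmentItem>"
)


def build_document(copies):
    """Return a UTF-8 encoded XML document holding `copies` identical records."""
    # Assumption: a hypothetical <equipment> root wraps the repeated records.
    return ("<equipment>" + RECORD * copies + "</equipment>").encode("utf-8")


def gzip_roundtrip(raw):
    """Compress raw bytes with gzip, then decompress them again."""
    compressed = gzip.compress(raw)
    restored = gzip.decompress(compressed)
    return compressed, restored


raw = build_document(200)
compressed, restored = gzip_roundtrip(raw)

# The consumer gets back the exact textual XML and parses it as usual.
root = ET.fromstring(restored)

print(len(raw), "bytes raw;", len(compressed), "bytes gzipped")
print(len(root), "records after the round trip")
```

The point of the sketch is that the receiving side still runs an ordinary textual parser; gzip only changes what travels over the wire, which is why it sidesteps the Infoset question entirely.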

What do XML Developers Say?

The question of a binary variant of XML is a textus classicus in the XML development community, commonly called a "permathread." Elliotte Rusty Harold makes the case against what he calls an oxymoron, "binary XML," pretty cogently: first, there is no technical advantage to be gained, generally, from a binary variant of XML; second, the only motive for pushing a binary variant is proprietary vendor lock-in ("Text XML is too simple to sell tools for, so they hope that by making it a binary format they can convince programmers to buy their wares," Harold said.) The real threat, to quote Harold, is that "[t]wo years down the line we'll be looking at yet another awful W3C recommendation that confuses user, pollutes the XML space, and makes XML much more complicated for everyone."

Harold also implies a useful way of thinking about this issue. Let's call it the Gzip Test. The only reason to standardize a binary variant of XML is widespread technical need. (That is, in my view, if there is limited need, specific to a subset of a subset of the market, a W3C standard isn't the right thing to go after.) If there were such need, it seems likely that we'd see widespread use of gzip to compress XML documents -- after all, in the absence of a standard, the rational thing to do is the easiest, simplest thing, since the subsequent appearance of a standard will likely dictate retooling, and having done the easiest, simplest thing in the meantime will mean less sunk, unrecoverable cost. But we don't see that at all. In fact, the XML developer community seems for the most part indifferent, where not outright hostile, to this issue.

And gzip also has the added advantage of blunting Harold's proprietary lock-in claims. That is, if vendors are interested in Binary XML for purely technical reasons, why not try a gzip-raw-XML-instances solution for a while? It certainly won't give them any tool or API traction, but if they don't care about that, what's the problem? I submit that we can reasonably draw an inference from the absence of widespread use of gzip'd-raw-XML-instances -- namely, that this isn't a live issue for XML developers, and that Harold is right about the desire of vendors to go proprietary in the XML space.

Liam Quin, who is chairing the workshop in question, suggests that, contrary to the W3C-naysayers, outcomes for these two issues are still unknown: "There's no intent to pre-judge whether W3C (or anyone else) should standardize on a binary interchange or compression format, but rather, an intent to explore whether it makes sense to do so."

I understand the complaints of small or embedded device designers and manufacturers about the overhead of processing XML. However, those concerns seem largely isolated to a subset of a subset of the market. If the W3C is going to issue standards for small market segments, it must do so in a way that doesn't degrade XML for everyone else. I think that these kinds of small-segment standards should be developed and maintained by an industry-specific, even ad hoc, standards group, not by the W3C. The Web per se is clearly trudging along just fine without a binary variant of XML.