
XML is Not Self-Describing

I dissent from several points of XML Orthodoxy because I am by nature, personal inclination, and experience, a dissident. But I also dissent reflexively, I suppose, because my intellectual training -- in religious studies and the philosophy of religion -- acquainted me well with the dynamics of orthodoxy, heterodoxy, and heresy, rather than, say, the dynamics of ADTs or just-in-time language compilation.

That's not to say that computer scientists are, on average, a conformist lot. Computer programmers, however, tend to be; but since that tendency arises as much from impersonal market forces as from individual personality quirks, I think we can let them, for the most part, slide. One specific locus of XML Orthodoxy that I have never professed is the idea that XML is "self-describing," which seems a rather grandiose and overly strong way of saying that one names XML containers. Well, I've read too much Wittgenstein (not to mention too much Aquinas, Meister Eckhart, and Julian of Norwich) to think that a name is necessarily a self-description.

However, I grant that there may be some marginal utility to be had from naming one's data containers. I am, all other things being equal, slightly better off if I have to do something with data -- in the absence of any other information -- that looks like this

      <equipmentItem>
        <type>centrifuge</type>
        <quantity>1</quantity>
        <manufacturer>Alfa Laval</manufacturer>
        <model>P3000</model>
        <subtype>decanter</subtype>
        <material>stainless</material>
        <drivehp>50</drivehp>
      </equipmentItem>

than I am if I have to do something with data -- again, in the absence of any other information -- that looks like this

centrifuge, 1, Alfa Laval, P3000, decanter, stainless, 50

In other words, a decently designed XML schema pretty much beats tab- or comma-separated value files every time (though in the simple case above, it's a wash, especially since TSV or CSV files often come with a field header in the first line).
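
To make the comparison concrete, here is a minimal Python sketch, assuming only the standard library's xml.etree.ElementTree and csv modules. It reads both forms of the centrifuge record above. The function names and the CSV header row are hypothetical, introduced purely for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_RECORD = """
<equipmentItem>
  <type>centrifuge</type>
  <quantity>1</quantity>
  <manufacturer>Alfa Laval</manufacturer>
  <model>P3000</model>
  <subtype>decanter</subtype>
  <material>stainless</material>
  <drivehp>50</drivehp>
</equipmentItem>
"""

CSV_RECORD = "centrifuge, 1, Alfa Laval, P3000, decanter, stainless, 50"
# Hypothetical header line; the bare record in the article carries none.
CSV_HEADER = "type, quantity, manufacturer, model, subtype, material, drivehp"


def record_from_xml(text):
    """Map each child element's name to its text content."""
    root = ET.fromstring(text.strip())
    return {child.tag: child.text for child in root}


def record_from_csv(line, header=None):
    """Map field names to values; without a header only positions are known."""
    values = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    if header is None:
        names = range(len(values))
    else:
        names = next(csv.reader(io.StringIO(header), skipinitialspace=True))
    return dict(zip(names, values))
```

Without a header, the CSV reader can only key the values by position, so the consumer has to know out of band that field 2 is the manufacturer. With the header line supplied, the two functions return the same dictionary, which is the "wash" mentioned above.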

But that advantage does not amount to "self-describing," which, if it means anything coherent at all, means information that is simultaneously information and information about itself: data and metadata that describes its own structure, nature, and identity, all without relying on any additional information and without triggering an infinite regress of higher-orders of description, which in turn require description, which require description.... Well, XML doesn't do that, not by a long shot. But that's okay; I'm not sure that anything can be self-describing in that sense.

I say all of this to point out that, in discussions about binary variants of XML, one of the first claims made by opponents of binary is that XML, as it is, is self-describing and a binary variant wouldn't be. I happen to think that's wrong on both counts: XML isn't self-describing in any robust or serious sense, and I can't think of any good reason why a binary variant wouldn't be as weakly self-describing as XML already is.

Binary isn't Necessarily Better

The other locus of orthodoxy I dissent from -- though this locus is more often, more fiercely disputed -- is the idea that a well-designed (and semantically equivalent, I suppose I should add) binary variant of XML will perform better than the canonical textual version. Or, to put it more accurately, that parsers and other consumers will perform better (whether this means time or space performance or both, is not always clear) with the binary variant than parsers or consumers of the canonical textual form of XML.

I dissent (provisionally) from this broad class of claims because I have a stubborn empiricist streak. I want to see the numbers which demonstrate the general performance advantages of a binary variant of XML. Whether through my lack of attentiveness or for other reasons, I've never seen numbers which convinced me in the general case. The numbers I have seen on a related topic -- whether textual or binary message forms are superior in application-level message passing and routing systems -- suggested that textual formats, while more verbose, are also more amenable to a wider range of semantic analysis of a sort which may allow for better routing algorithms, thus recouping any performance which might be lost to slimmer, binary variants.

Unlike the self-description boat, which has sailed permanently, in my opinion, I remain open to the possibility that a binary variant of XML could perform better across the board than the canonical textual variety. Even if one day we have solid empirical proof of that claim, we'd still need to decide whether a general, across the board performance advantage was sufficient.

The W3C Goes Binary?

So that's my set of background considerations about this issue. I confess surprise upon learning a few weeks ago that the W3C has escalated the degree to which it's willing to flirt with the idea of a binary XML variant. I refer, of course, to the recently announced (and, ironically, rather verbosely titled) W3C Workshop on Binary Interchange of XML Information Item Sets, to be held in Santa Clara, California, USA, at the end of September.

The workshop announcement is interesting in its own right and worth quoting. Under pressure, I assume, from a "steadily increasing demand," the W3C has decided to get in front of those of its vendor-members which want "to find ways to transmit pre-parsed XML documents and Schema-defined objects, in such a way that embedded, low-memory and/or low bandwidth devices can" get in on the XML game without... -- well, really getting in on the XML game. In other words, major vendors wishing for a binary XML want to have their XML cake without breaking any textual-parser eggs. And what major vendors want, they very often get.

The workshop announcement also mentions a few tantalizing details, including talk of "multiple separate implementers" having some success with an ASN.1 variant of XML. It also, prudently, in my view, mentions the ol' gzip-standby -- in truth, I confess to having a real bias in favor of gzip. If you absolutely must have some kind of binary variant, gzip seems hard to beat since it allows you to pick any three from "decent compression factor", "decent (de)compression performance", and "already implemented everywhere".
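
A minimal sketch of what the gzip standby amounts to in practice, assuming Python's standard gzip module. The helper names and the sample document are made up for illustration, and a document built by repeating one element flatters the compression ratio, so treat the printed sizes as indicative only.

```python
import gzip


def gzip_xml(xml_text):
    """Compress a textual XML instance into gzip-format bytes."""
    return gzip.compress(xml_text.encode("utf-8"))


def gunzip_xml(blob):
    """Recover the original textual XML instance from gzip-format bytes."""
    return gzip.decompress(blob).decode("utf-8")


# Hypothetical sample data: one small element repeated many times.
ITEM = (
    "<equipmentItem><type>centrifuge</type><quantity>1</quantity>"
    "<model>P3000</model></equipmentItem>"
)
DOCUMENT = "<equipment>" + ITEM * 200 + "</equipment>"

compressed = gzip_xml(DOCUMENT)
print(len(DOCUMENT.encode("utf-8")), "bytes as text")
print(len(compressed), "bytes gzipped")
```

The point of the round trip is that the consumer still hands ordinary textual XML to an ordinary parser after decompression; nothing about the XML itself changes.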

The other interesting thing of note here is that the W3C is talking about a binary variant of (parts of) the XML Infoset. What difference that could make remains to be seen, but it's interesting enough to pay some attention to it. There are at least two issues at this workshop -- binary variants and, as the workshop announcement says, "pre-parsed" artifacts -- and they seem orthogonal to each other, such that they really oughtn't to be run together. I can imagine proponents of binary variants of raw XML instances, and I can imagine other factions which support binary representations of Infosets.

I would prefer that these two issues be kept rigorously separated; if I had my way, there'd be two workshops or one workshop with two very distinct tracks. The easiest, cheapest, and quickest way to get a binary variant of XML deployed widely is to have the W3C bless some kind of gzip-raw-XML-instances standard. I doubt there is a similarly painless (well, about as painless as these things ever are...) way to do the Infoset thing.

What do XML Developers Say?

The question of a binary variant of XML is a textus classicus in the XML development community, commonly called a "permathread." Elliotte Rusty Harold makes the case against what he calls an oxymoron, "binary XML," pretty cogently: first, there is no technical advantage to be gained, generally, from a binary variant of XML; second, the only motive for pushing a binary variant is proprietary vendor lock-in ("Text XML is too simple to sell tools for, so they hope that by making it a binary format they can convince programmers to buy their wares," Harold said.) The real threat, to quote Harold, is that "[t]wo years down the line we'll be looking at yet another awful W3C recommendation that confuses user, pollutes the XML space, and makes XML much more complicated for everyone."

Harold also implies a useful way of thinking about this issue. Let's call it the Gzip Test. The only reason to standardize a binary variant of XML is widespread technical need. (That is, in my view, if there is limited need, specific to a subset of a subset of the market, a W3C standard isn't the right thing to go after.) If there were such need, it seems likely that we'd see widespread use of gzip to compress XML documents -- after all, in the absence of a standard, the rational thing to do is the easiest, simplest thing, since the subsequent appearance of a standard will likely dictate retooling, and having done the easiest, simplest thing in the meantime will mean less sunk, unrecoverable cost. But we don't see that at all. In fact, the XML developer community seems for the most part indifferent, where not outright hostile, to this issue.

And gzip also has the added advantage of blunting Harold's proprietary lock-in claims. That is, if vendors are interested in Binary XML for purely technical reasons, why not try a gzip-raw-XML-instances solution for a while? It certainly won't give them any tool or API traction, but if they don't care about that, what's the problem? I submit that we can reasonably draw an inference from the absence of widespread use of gzip'd-raw-XML-instances -- namely, that this isn't a live issue for XML developers, and that Harold is right about the desire of vendors to go proprietary in the XML space.

Liam Quin, who is chairing the workshop in question, suggests that, contrary to the W3C-naysayers, outcomes for these two issues are still unknown: "There's no intent to pre-judge whether W3C (or anyone else) should standardize on a binary interchange or compression format, but rather, an intent to explore whether it makes sense to do so".

I understand the complaints of small or embedded device designers and manufacturers about the overhead of processing XML. However, those concerns seem largely isolated to a subset of a subset of the market. If the W3C is going to issue standards for small market segments, it must do so in such a way that doesn't degrade XML for everyone else. I think that these kinds of small-segments standards should be developed and maintained by an industry-specific, even ad-hoc standards group, not by the W3C. The Web per se is clearly trudging along just fine without a binary variant of XML.


  • It has its uses
    2007-10-29 07:29:58 Haravikk [Reply]

    I've been working with a proprietary Binary XML format for a while now, and while I agree with a lot of the things mentioned already regarding there being no real need, I am finding a lot of valid applications for it.


    Further, I intend to produce an application that will run across multiple servers, using either SOAP messages or plain XML (probably the latter; I'm not a big fan of SOAP). That way, if a message becomes blocked or otherwise needs to be queued, or a connection drops, I can easily write my message as text XML somewhere it will be readable, using exactly the same code as for outputting it to my connection.
    However, I have limited network bandwidth (compared to the throughput of the distributed program), which makes plain-text XML less attractive, since it adds overhead and the potential for a bottleneck.


    In my case, a binary XML format is ideal for communicating between machines/application instances, as it reduces the memory overhead and hopefully reduces the processing of messages that are being actively sent. Also due to the nature of the connection (persistent, sending the XML messages from the same schema) I can get some pretty big savings just by using basic compression techniques.


    So IMO, a standard for streaming XML is fine, and being able to stream between languages without porting code would be AWESOME; but as a replacement for XML (i.e. for saving files) it is completely pointless, since these days the memory footprint is negligible and it would remove arguably XML's one greatest advantage (readability/ease-of-editing).

  • Why Johnny Can't Gzip...
    2007-01-10 09:27:35 Argent [Reply]

    "I submit that we can reasonably draw an inference from the absence of widespread use of gzip'd-raw-XML-instances -- namely, that this isn't a live issue for XML developers, and that Harold is right about the desire of vendors to go proprietary in the XML space."


    I submit that the reason for the absence of widespread use of gzipped raw XML is that the people who need a more compact serialization format than raw XML are using something other than XML. There's lots of ways to serialize a data structure... from ASN.1 down to hardcoded bit-level structures like IP headers. The Electronic Arts Interchange File Format and its derivatives are popular: MIDI files and PNG are both basically a streamable version of IFF. For data that's organized like a relation rather than a tree there's CSV and other columnar-file formats.


    If XML wants to play in this space it needs not just a binary format, but it needs to abandon the goal of making every chunk of data self-describing (and, of course, it's already missed that boat anyway).

  • It's Evolution
    2005-12-02 18:53:02 klidl [Reply]

    Computers and digital media communicate in binary. People communicate using a variety of analog schemes. We've been creating structured tagging schemes to bridge the gap since about an hour and a half after we discovered that binary could be used to manipulate information. I expect that we'll stop when the machines speak our language.


    XML's successor will no doubt sit on a stack that makes it both descriptive and an analog of some underlying binary representation.


    Bring on the binary !! Then let's quickly evolve to a solution that delivers on both the self describing and non-program-centric promise of XML.

  • Embedded Firmware XML and Binary
    2003-10-02 08:28:50 Ken LaBar [Reply]

    First, I am a newbie to XML, but I am very interested in binary aspects of XML because I write embedded firmware that needs to be very small.


    I'm looking for expert help to point me in the right direction.


    My problem:
    Produce self-diagnostic test results that can be uploaded and stored during volume manufacturing. (Computer Hard Drives, 15M drives a quarter). Do it all on a 16 bit processor with limited code space remaining. Push it all to an Oracle DB to track process issues and verify design changes, through a USB 1 port.


    Obviously, I need to save code space in the firmware. That means limiting the number of tags.


    I also need to be able to transfer and store all this data in the factory. Smaller is better.


    My proposed solution:
    Build a self describing XML header and put a standard C, C++ structure between some data tags. See example:



    <testResults>
      <header>
        <testName>BiasCal</testName>
        <codeRev>3.18</codeRev>
        <testTime>0x8EF1</testTime>
      </header>
      <testData length="0x3468" crc="0x9A5F">
        binary data structure goes here
      </testData>
    </testResults>


    This allows a generic structure describing what data it contains, if not describing the data itself. When I pull the data out of the database, I can use the correct data structure based on code revision and test name, set a pointer to the memory (No parsing needed).


    Even though this data is kept internal to the company, I'm trying to be a good XML citizen so more tools can be built around this data. Please share your ideas.


    -Ken

  • How Big Is Enough?
    2003-08-19 06:36:06 Robin Berjon [Reply]

    Ken, your comments on "self-describability" are right on. I've been writing about precisely this for the past few days, and it has kept bothering me. I too have studied philosophy and using such a term does make me cringe every time.


    Any suggestion of a name that describes the ability to retrieve the node names without recourse to external data, *and* is understood immediately by most, would be much welcome.


    But on to the meat of it. I wish to pick at your claims regarding the "subset of a subset" and the "Web per se".


    How many mobile units shipping with support for such technologies as SVG, XHTML, or SOAP does it take to make you consider it large enough for consideration? I've heard that the US was a bit behind in adopting those, but surely you wouldn't be that culture-centric? How many homes need to have interactive TV set-top boxes to make you happy? How many people need to be using SOAP before it counts?


    And what's that "Web per se" business? Is it only the Web if I'm browsing porn from a beefy desktop box? Do other devices not count?


    We're talking millions of users already. And their content is webbish, or being webized, when it isn't the Web already. And I won't get into the other uses, timed text, X3D, NewsML, GML...


    Should each of those technologies be inventing its own solution? Don't you think they've tried gzip? If they all go their own routes, how will I create content that works for multiple platforms? What are the chances that it'll be royalty-free? How does it deal with language updates? If OMA and 3GPP come up with their own standards one for XHTML and the other for SVG (as was very nearly avoided) how can I mix them?


    I agree that the workshop announcement has some confusing terminology. Well, that's life, it's not a document that needs to stay in the annals of history.


    So we've got a set of varied technologies, all of them using XML, all of them finding issues, working on and with the Web, and having millions of users, with every indication that there are many more to come. Hmmmm. To me, it smells like a good area to produce solutions that span the XML spectrum properly. Besides, for the pleasure of pushing it a little further... audio-video, SVG, X3D, mobile, P2P, nomadic Web Services, etc. that's a bunch of areas where interesting stuff is going on, probably more interesting than the quasi-dead Web-as-just-a-desktop-browser space. And then there are more specific needs such as those for instance of mapping or CAD. They still add their numbers to the lot.


    Creating solutions, whether ad hoc or not, has a cost. Do you think they'd all be asking for binary infosets if gzip worked for them? You touch only on speed and size, both of which are well-solved using gzip for good-bandwidth-fair-power situations, neither of which gzip addresses well enough for those people. And you don't mention things like dynamic update or random access, which solve important problems not addressed by gzip.


    Oh, and since you're the first one to ask for proofs, could you please point me to data that sustains the claims made by ERH that you repeat here? The fact that there is no technical advantage needs benchmarks to be sustained, just as does the opposite claim. "The only motive for pushing a binary variant is proprietary vendor lock-in"? That's a pretty strong claim to be relayed unqualified. Is there proof? If that's the case, what's the point of going to the W3C? In my book, that's called FUD. As for the quality of potential resulting specs, well, I tend to leave WGs with the benefit of the doubt, especially when they currently don't exist... Coming from a heavy Java advocate, I do find that statement somewhat ironic to be honest.


  • Binary XML, Again
    2003-08-18 14:08:34 Roopak Parikh [Reply]

    I agree with Erik that for a long time the data model of xml has been confused with what goes on wire and they are separate issues.
    As long as we adhere to a given SAX/DOM API kind of interface regardless of the actual format for transmission it will hardly matter what is the wire format.


    I do like the W3C's initiative, though I think it's kind of late; they should have started it long ago. Gzipping raw xml is not a good option as unzipping consumes both memory and time, which is not desirable when you are working with small devices like PDA/Smart Phones and working with bigger xml files > 2 MB. A binary protocol will definitely solve the problem (actually in my personal experience it has solved the problems reducing the processing time) and I personally support the ASN1/DER encoding for XML.


  • Two issues, really. Why mix them?
    2003-08-18 07:30:02 Erik Wilde [Reply]

    as kendall correctly points out, the upcoming w3c workshop mixes two orthogonal issues, the question of a binary format, and the question of what xml really is.


    there seems to be an increasing tendency to make the infoset the 'real xml'. i think that a proper information model would be a very smart thing to have, but i also think that the infoset is not the only way to go. and that maybe one should spend some time about making the infoset better (in particular, extensible, for example for being able to handle xml schema's psvi contributions)


    for a long time, when people were asking about xml's 'information model', they were told that the bits on the wire were much more important than the model behind them, and that specs like the infoset were for spec writers only. as it increasingly turns out, if each and every new spec of the w3c is based on the infoset, then why not call this (which in essence is a mildly pre-processed subset of xml) the 'real xml' and 'xml 1.0' just a character-based syntax for it? this would make life much easier for many developers, who often think they are using xml, but in reality (through tools such as xslt and xquery) are using the infoset ("why can't i search all cdata sections, dammit!").


    what i want to say (i got a bit carried away, i am afraid...) is that this workshop could be a good starting point to re-align some of the methods (and attitudes) of the past and get on with a proper and helpful separation of information model and representation.

  • You're 3/4 right
    2003-08-14 12:12:52 Tony Parisi [Reply]

    Kendall, your thoughtful rant is almost totally on the money. You clearly grasp the information science aspects of the XML self-description issue: maybe it's self-documenting, which is a goodness; but outside of providing structural clues, an XML document doesn't do anything to describe itself.


    Also, your insistence on seeing a clear business case for binary XML is fair enough. Else why bother undertaking such a huge enterprise?


    You're obviously a bright guy. How can you be so clueless as not to see the value of compressing rich and complex data sets? The world contains far more than text. Gzip compression is simply not adequate for reducing the size of, say, 3D data. Take a look at what we're doing with X3D and you'll see that gzip will never be a satisfactory solution. The key is that rich data such as 3D can be compressed far beyond simple LZ by leveraging domain-specific information with techniques such as quantization, to name just one: if you know all your numbers lie within a certain range you can greatly reduce the space requirements for storage; LZ just can't do anything like that. It only looks for repetitions.


    Oh, and you want numbers? Our preliminary tests in developing a binary format yield compression factors of upwards of 30 to 1. Try doing that with gzip.


    Tony Parisi


  • Too many words
    2003-08-14 11:59:15 jonnie savell [Reply]

    Kendall's demand for performance numbers is ridiculous. The ratio of words to ideas is enormous. This reduces the quality of the article.

  • Binary Vs GZIP
    2003-08-14 09:15:31 Len Bullard [Reply]

    The VRML community has a long history with GZIP. It even has file name extensions and types to denote that a VRML97 or X3D file is zipped. Because the sizes of the files some years ago were large, zipping became necessary. As bandwidth has improved, it is less necessary but still used. VRML or X3D, like most text formats, zips well; the bigger problem, as in other formats, is the images and other non-text media that are used in the multimedia text language.


    I have quoted some comments from Alan Hudson on why the X3D community considers a binary for X3D to be a must have. See


    http://www.xml.com/pub/a/2003/08/06/x3d.html


    Also, along the lines you suggest in your article, the Web3D Consortium has issued an RFP for submissions for binary components and has affirmed its commitment to work with the W3C on this as events warrant. I believe some members of the Web3DC will present on this topic at the upcoming workshop.


    This question seems to revolve around the utility of a generalized binary for XML. It can be shown that for some applications, a binary is useful not only for performance sake, but for a reason you do not touch on: some customers want opaque content and will not pay for complex content unless they have some reasonable protection against theft by view source. Yes, there are no theft-proof formats from simple binarization, but they still insist on it and contend it is good enough protection.