Binary XML, Again

by Kendall Grant Clark
August 13, 2003

XML is Not Self-Describing

I dissent from several points of XML Orthodoxy because I am by nature, personal inclination, and experience, a dissident. But I also dissent reflexively, I suppose, because my intellectual training -- in religious studies and the philosophy of religion -- acquainted me well with the dynamics of orthodoxy, heterodoxy, and heresy, rather than, say, the dynamics of ADTs or just-in-time language compilation.

That's not to say that computer scientists are, on average, a conformist lot. Computer programmers, however, tend to be; but since that tendency arises as much from impersonal market forces as from individual personality quirks, I think we can let them, for the most part, slide. One specific locus of XML Orthodoxy that I have never professed is the idea that XML is "self-describing," which seems a rather grandiose and overly strong way of saying that one names XML containers. Well, I've read too much Wittgenstein (not to mention too much Aquinas, Meister Eckhart, and Julian of Norwich) to think that a name is necessarily a self-description.

However, I grant that there may be some marginal utility to be had from naming one's data containers. I am, all other things being equal, slightly better off if I have to do something with data -- in the absence of any other information -- that looks like this

      <equipmentItem>
        <type>centrifuge</type>
        <quantity>1</quantity>
        <manufacturer>Alfa Laval</manufacturer>
        <model>P3000</model>
        <subtype>decanter</subtype>
        <material>stainless</material>
        <drivehp>50</drivehp>
      </equipmentItem>

than I am if I have to do something with data -- again, in the absence of any other information -- that looks like this

centrifuge, 1, Alfa Laval, P3000, decanter, stainless, 50

In other words, a decently designed XML schema pretty much beats tab- or comma-separated value files every time (though in the simple case above, it's a wash, especially since TSV or CSV files often come with a field header in the first line).
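
To make the comparison concrete, here is a minimal Python sketch that reads the same record in both forms. The variable and function names (`XML_RECORD`, `CSV_RECORD`, `xml_to_dict`, `csv_to_list`) are my own illustrative choices, not part of any schema; the point is only that the XML form hands back names along with values, while the CSV form hands back positions.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The two sample records from the text above (names are illustrative).
XML_RECORD = """<equipmentItem>
  <type>centrifuge</type>
  <quantity>1</quantity>
  <manufacturer>Alfa Laval</manufacturer>
  <model>P3000</model>
  <subtype>decanter</subtype>
  <material>stainless</material>
  <drivehp>50</drivehp>
</equipmentItem>"""

CSV_RECORD = "centrifuge, 1, Alfa Laval, P3000, decanter, stainless, 50"


def xml_to_dict(text):
    """Map each child element's name to its text content."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


def csv_to_list(text):
    """Return the fields of a single CSV line, in positional order."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return next(reader)


named = xml_to_dict(XML_RECORD)       # values addressable by container name
positional = csv_to_list(CSV_RECORD)  # values addressable only by index

print(named["model"])   # the name tells me which field I am reading
print(positional[3])    # the index 3 means nothing without a header line
```

Both calls yield the same values; only the XML form carries the labels, and a CSV header line would close most of that gap.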

But that advantage does not amount to "self-describing," which, if it means anything coherent at all, means information that is simultaneously information and information about itself: data and metadata that describes its own structure, nature, and identity, all without relying on any additional information and without triggering an infinite regress of higher-orders of description, which in turn require description, which require description.... Well, XML doesn't do that, not by a long shot. But that's okay; I'm not sure that anything can be self-describing in that sense.

I say all of this to point out that, in discussions about binary variants of XML, one of the first claims made by the anti-binary camp is that XML, as it is, is self-describing and a binary variant wouldn't be. I think that's wrong, since XML isn't self-describing in any robust or serious sense; moreover, I can't think of any good reason why a binary variant wouldn't be as weakly self-describing as XML already is.

Binary isn't Necessarily Better

The other locus of orthodoxy I dissent from -- though this locus is more often, more fiercely disputed -- is the idea that a well-designed (and semantically equivalent, I suppose I should add) binary variant of XML will perform better than the canonical textual version. Or, to put it more accurately, that parsers and other consumers will perform better (whether this means time or space performance or both, is not always clear) with the binary variant than parsers or consumers of the canonical textual form of XML.

I dissent (provisionally) from this broad class of claims because I have a stubborn empiricist streak. I want to see the numbers which demonstrate the general performance advantages of a binary variant of XML. Whether through my lack of attentiveness or for other reasons, I've never seen numbers which convinced me in the general case. The numbers I have seen on a related topic -- whether textual or binary message forms are superior in application-level message passing and routing systems -- suggested that textual formats, while more verbose, are also more amenable to a wider range of semantic analysis of a sort which may allow for better routing algorithms, thus recouping any performance which might be lost to slimmer, binary variants.

Unlike the self-description boat, which has sailed permanently, in my opinion, I remain open to the possibility that a binary variant of XML could perform better across the board than the canonical textual variety. Even if one day we have solid empirical proof of that claim, we'd still need to decide whether a general, across the board performance advantage was sufficient.

The W3C Goes Binary?

So that's my set of background considerations about this issue. I confess surprise upon learning a few weeks ago that the W3C has escalated the degree to which it's willing to flirt with the idea of a binary XML variant. I refer, of course, to the recently announced (and, ironically, rather verbosely titled) W3C Workshop on Binary Interchange of XML Information Item Sets, to be held in Santa Clara, California, USA, at the end of September.

The workshop announcement is interesting in its own right and worth quoting. Under pressure, I assume, from a "steadily increasing demand," the W3C has decided to get in front of those of its vendor-members which want "to find ways to transmit pre-parsed XML documents and Schema-defined objects, in such a way that embedded, low-memory and/or low bandwidth devices can" get in on the XML game without... -- well, really getting in on the XML game. In other words, major vendors wishing for a binary XML want to have their XML cake without breaking any textual-parser eggs. And what major vendors want, they very often get.

The workshop announcement also mentions a few tantalizing details, including talk of "multiple separate implementers" having some success with an ASN.1 variant of XML. It also, prudently, in my view, mentions the ol' gzip-standby -- in truth, I confess to having a real bias in favor of gzip. If you absolutely must have some kind of binary variant, gzip seems hard to beat since it allows you to pick any three from "decent compression factor", "decent (de)compression performance", and "already implemented everywhere".

The other interesting thing of note here is that the W3C is talking about a binary variant of (parts of) the XML Infoset. What difference that could make remains to be seen, but it's interesting enough to pay some attention to it. There are at least two issues at this workshop -- binary variants and, as the workshop announcement says, "pre-parsed" artifacts -- and they seem orthogonal to each other, such that they really oughtn't to be run together. I can imagine proponents of binary variants of raw XML instances, and I can imagine other factions which support binary representations of Infosets.

I would prefer that these two issues be kept rigorously separated; if I had my way, there'd be two workshops or one workshop with two very distinct tracks. The easiest, cheapest, and quickest way to get a binary variant of XML deployed widely is to have the W3C bless some kind of gzip-raw-XML-instances standard. I doubt there is a similarly painless (well, about as painless as these things ever are...) way to do the Infoset thing.
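
A gzip-raw-XML-instances approach needs very little machinery. The following is a minimal Python sketch using the standard `gzip` module; the helper names `pack` and `unpack` and the sample `document` are assumptions made for illustration, not anything the W3C has proposed.

```python
import gzip


def pack(xml_text: str) -> bytes:
    """Compress a raw XML instance for storage or transmission."""
    return gzip.compress(xml_text.encode("utf-8"))


def unpack(blob: bytes) -> str:
    """Recover the original textual XML, byte for byte."""
    return gzip.decompress(blob).decode("utf-8")


# A deliberately repetitive sample document (hypothetical data).
item = "<equipmentItem><type>centrifuge</type><quantity>1</quantity></equipmentItem>"
document = "<inventory>" + item * 500 + "</inventory>"

blob = pack(document)
print(len(document.encode("utf-8")), "bytes as text")
print(len(blob), "bytes gzipped")
```

The consumer still sees ordinary textual XML after `unpack`, so no parser, tool, or API has to change. How much space is saved depends entirely on the data; the sample here is far more repetitive than most real documents.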

What do XML Developers Say?

The question of a binary variant of XML is a textus classicus in the XML development community, commonly called a "permathread." Elliotte Rusty Harold makes the case against what he calls an oxymoron, "binary XML," pretty cogently: first, there is no technical advantage to be gained, generally, from a binary variant of XML; second, the only motive for pushing a binary variant is proprietary vendor lock-in ("Text XML is too simple to sell tools for, so they hope that by making it a binary format they can convince programmers to buy their wares," Harold said.) The real threat, to quote Harold, is that "[t]wo years down the line we'll be looking at yet another awful W3C recommendation that confuses user, pollutes the XML space, and makes XML much more complicated for everyone."

Harold also implies a useful way of thinking about this issue. Let's call it the Gzip Test. The only reason to standardize a binary variant of XML is widespread technical need. (That is, in my view, if there is limited need, specific to a subset of a subset of the market, a W3C standard isn't the right thing to go after.) If there were such need, it seems likely that we'd see widespread use of gzip to compress XML documents -- after all, in the absence of a standard, the rational thing to do is the easiest, simplest thing, since the subsequent appearance of a standard will likely dictate retooling, and having done the easiest, simplest thing in the meantime will mean less sunk, unrecoverable cost. But we don't see that at all. In fact, the XML developer community seems for the most part indifferent, where not outright hostile, to this issue.

And gzip also has the added advantage of blunting Harold's proprietary lock-in claims. That is, if vendors are interested in Binary XML for purely technical reasons, why not try a gzip-raw-XML-instances solution for a while? It certainly won't give them any tool or API traction, but if they don't care about that, what's the problem? I submit that we can reasonably draw an inference from the absence of widespread use of gzip'd-raw-XML-instances -- namely, that this isn't a live issue for XML developers, and that Harold is right about the desire of vendors to go proprietary in the XML space.

Liam Quin, who is chairing the workshop in question, suggests that, contrary to the W3C-naysayers, outcomes for these two issues are still unknown: "There's no intent to pre-judge whether W3C (or anyone else) should standardize on a binary interchange or compression format, but rather, an intent to explore whether it makes sense to do so".

I understand the complaints of small or embedded device designers and manufacturers about the overhead of processing XML. However, those concerns seem largely isolated to a subset of a subset of the market. If the W3C is going to issue standards for small market segments, it must do so in such a way that doesn't degrade XML for everyone else. I think that these kinds of small-segments standards should be developed and maintained by an industry-specific, even ad-hoc standards group, not by the W3C. The Web per se is clearly trudging along just fine without a binary variant of XML.




  • It has its uses
    2007-10-29 07:29:58 Haravikk

    I've been working with a proprietary Binary XML format for a while now, and while I agree with a lot of the things mentioned already regarding there being no real need, I am finding a lot of valid applications for it.


    Further, I intend to produce an application that will run across multiple servers, using either SOAP messages or plain XML (probably the latter; I'm not a big fan of SOAP). The reason is that if a message becomes blocked or otherwise needs to be queued, or a connection drops, I can easily write the message as text XML somewhere it will be readable, using exactly the same code that outputs it to my connection.
    However, I have limited network bandwidth (compared to the throughput of the distributed program), which makes plain-text XML less attractive, since it adds overhead and the potential for a bottleneck.


    In my case, a binary XML format is ideal for communicating between machines/application instances, as it reduces the memory overhead and hopefully reduces the processing of messages that are being actively sent. Also due to the nature of the connection (persistent, sending the XML messages from the same schema) I can get some pretty big savings just by using basic compression techniques.


    So IMO, a standard for streaming XML is fine, and being able to stream between languages without porting code would be AWESOME; but as a replacement for XML (i.e. for saving files) it is completely pointless, since these days the memory footprint is negligible and it would remove what is arguably XML's one greatest advantage (readability/ease of editing).

  • Why Johnny Can't Gzip...
    2007-01-10 09:27:35 Argent

    "I submit that we can reasonably draw an inference from the absence of widespread use of gzip'd-raw-XML-instances -- namely, that this isn't a live issue for XML developers, and that Harold is right about the desire of vendors to go proprietary in the XML space."


    I submit that the reason for the absence of widespread use of gzipped raw XML is that the people who need a more compact serialization format than raw XML are using something other than XML. There's lots of ways to serialize a data structure... from ASN.1 down to hardcoded bit-level structures like IP headers. The Electronic Arts Interchange File Format and its derivatives are popular: MIDI files and PNG are both basically a streamable version of IFF. For data that's organized like a relation rather than a tree there's CSV and other columnar-file formats.


    If XML wants to play in this space it needs not just a binary format, but it needs to abandon the goal of making every chunk of data self-describing (and, of course, it's already missed that boat anyway).

  • RFC 3252 (Binary Lexical Octet Ad-hoc Transport)
    2006-01-10 02:35:26 random_

    The ultimate WYSIWYCG (what you see is what your computer gets), BLOAT protocol.
    Seems someone's going to remake Ethereal based on XPath, XSLT and other standard technologies.

  • It's Evolution
    2005-12-02 18:53:02 klidl

    Computers and digital media communicate in binary. People communicate using a variety of analog schemes. We've been creating structured tagging schemes to bridge the gap since about an hour and a half after we discovered that binary could be used to manipulate information. I expect that we'll stop when the machines speak our language.


    XML's successor will no doubt sit on a stack that makes it both descriptive and an analog of some underlying binary representation.


    Bring on the binary !! Then let's quickly evolve to a solution that delivers on both the self describing and non-program-centric promise of XML.

  • Embedded Firmware XML and Binary
    2003-10-02 08:28:50 Ken LaBar

    First, I am a newbie to XML, but I am very interested in binary aspects of XML because I write embedded firmware that needs to be very small.


    I'm looking for expert help to point me in the right direction.


    My problem:
    Produce self-diagnostic test results that can be uploaded and stored during volume manufacturing. (Computer Hard Drives, 15M drives a quarter). Do it all on a 16 bit processor with limited code space remaining. Push it all to an Oracle DB to track process issues and verify design changes, through a USB 1 port.


    Obviously, I need to save code space in the firmware. That means limiting the number of tags.


    I also need to be able to transfer and store all this data in the factory. Smaller is better.


    My proposed solution:
    Build a self describing XML header and put a standard C, C++ structure between some data tags. See example:



    <testResults>
    <header>
    <testName>BiasCal</testName>
    <codeRev>3.18</codeRev>
    <testTime>0x8EF1</testTime>
    </header>
    <testData length="0x3468" crc="0x9A5F">
    binary data structure goes here
    </testData>
    </testResults>


    This allows a generic structure describing what data it contains, if not describing the data itself. When I pull the data out of the database, I can use the correct data structure based on code revision and test name, set a pointer to the memory (No parsing needed).
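
One way to picture the lookup described above is the following minimal Python sketch. Everything in it is hypothetical: the `LAYOUTS` registry, the three-field layout assigned to BiasCal 3.18, the `decode_payload` helper, and the choice of `binascii.crc_hqx` (CRC-CCITT) as the 16-bit checksum are illustrative assumptions, since the post does not say what the real structure or CRC algorithm is. It also assumes the binary payload has already been separated from the XML envelope as raw bytes.

```python
import binascii
import struct

# Hypothetical registry: (testName, codeRev) -> binary layout of the payload.
# "<HhI" means little-endian uint16, int16, uint32 with no padding.
LAYOUTS = {
    ("BiasCal", "3.18"): struct.Struct("<HhI"),
}


def decode_payload(test_name, code_rev, payload, expected_crc):
    """Check the payload's CRC, then unpack it with the matching layout."""
    # Assumed checksum: CRC-CCITT with initial value 0xFFFF.
    if binascii.crc_hqx(payload, 0xFFFF) != expected_crc:
        raise ValueError("CRC mismatch")
    layout = LAYOUTS[(test_name, code_rev)]
    if len(payload) != layout.size:
        raise ValueError("payload length does not match layout")
    return layout.unpack(payload)


# Build a sample payload the way firmware with this (made-up) layout might.
payload = struct.pack("<HhI", 0x8EF1, -12, 100000)
crc = binascii.crc_hqx(payload, 0xFFFF)

fields = decode_payload("BiasCal", "3.18", payload, crc)
print(fields)
```

Note that raw bytes placed directly between tags will generally not survive an XML parser, so a sketch like this only works if the envelope is split by offset and length rather than parsed, or if the payload is encoded as text (for example base64) first.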


    Even though this data is kept internal to the company, I'm trying to be a good XML citizen so more tools can be built around this data. Please share your ideas.


    -Ken

  • How Big Is Enough?
    2003-08-19 06:36:06 Robin Berjon

    Ken, your comments on "self-describability" are right on. I've been writing about precisely this for the past few days, and it has kept bothering me. I too have studied philosophy and using such a term does make me cringe every time.


    Any suggestion of a name that describes the ability to retrieve the node names without recourse to external data, *and* is understood immediately by most, would be much welcome.


    But on to the meat of it. I wish to pick at your claims regarding the "subset of a subset" and the "Web per se".


    How many mobile units shipping with support for such technologies as SVG, XHTML, or SOAP does it take to make you consider it large enough for consideration? I've heard that the US was a bit behind in adopting those, but surely you wouldn't be that culture-centric? How many homes need to have interactive TV set-top boxes to make you happy? How many people need to be using SOAP before it counts?


    And what's that "Web per se" business? Is it only the Web if I'm browsing porn from a beefy desktop box? Do other devices not count?


    We're talking millions of users already. And their content is webbish, or being webized, when it isn't the Web already. And I won't get into the other uses, timed text, X3D, NewsML, GML...


    Should each of those technologies be inventing its own solution? Don't you think they've tried gzip? If they all go their own routes, how will I create content that works for multiple platforms? What are the chances that it'll be royalty-free? How does it deal with language updates? If OMA and 3GPP come up with their own standards one for XHTML and the other for SVG (as was very nearly avoided) how can I mix them?


    I agree that the workshop announcement has some confusing terminology. Well, that's life, it's not a document that needs to stay in the annals of history.


    So we've got a set of varied technologies, all of them using XML, all of them finding issues, working on and with the Web, and having millions of users, with every indication that there are many more to come. Hmmmm. To me, it smells like a good area to produce solutions that span the XML spectrum properly. Besides, for the pleasure of pushing it a little further... audio-video, SVG, X3D, mobile, P2P, nomadic Web Services, etc. that's a bunch of areas where interesting stuff is going on, probably more interesting than the quasi-dead Web-as-just-a-desktop-browser space. And then there are more specific needs such as those for instance of mapping or CAD. They still add their numbers to the lot.


    Creating solutions, whether ad hoc or not, has a cost. Do you think they'd all be asking for binary infosets if gzip worked for them? You touch only on speed and size, both of which are well-solved using gzip for good-bandwidth-fair-power situations, neither of which gzip addresses well enough for those people. And you don't mention things like dynamic update or random access, which solve important problems not addressed by gzip.


    Oh, and since you're the first one to ask for proofs, could you please point me to data that sustains the claims made by ERH that you repeat here? The fact that there is no technical advantage needs benchmarks to be sustained, just as does the opposite claim. "The only motive for pushing a binary variant is proprietary vendor lock-in"? That's a pretty strong claim to be relayed unqualified. Is there proof? If that's the case, what's the point of going to the W3C? In my book, that's called FUD. As for the quality of potential resulting specs, well, I tend to leave WGs with the benefit of the doubt, especially when they currently don't exist... Coming from a heavy Java advocate, I do find that statement somewhat ironic to be honest.


    • How Big Is Enough?
      2003-08-19 07:14:10 Kendall Clark

      --Ken, your comments on "self-describability" are right on...Any suggestion of a name that describes the ability to retrieve the node names without recourse to external data, *and* is understood immediately by most, would be much welcome.--


      Hi Robin. First, if it's all the same to you, my name is "Kendall". Thanks. I think "self-documenting" is pretty good, though that's still not great; maybe "self-naming"? Ick. I personally just let this entire line of "XML advantage" drop, since I don't think it's worth that much, no matter what you call it. I mean, surely, that's not *the* major XML benefit?


      --How many mobile units shipping with support for such technologies as SVG, XHTML, or SOAP does it take to make you consider it large enough for consideration?--


      I don't have a number in mind; but these small devices keep getting more and more powerful. They do all sorts of processing tasks which seem to me way more intensive than processing XML. And they will continue to get more and more powerful, or so it seems safe to conclude.


      A binary variant of XML strikes me as a bad thing to have, all other things being equal.


      --And what's that "Web per se" business? Is it only the Web if I'm browsing porn from a beefy desktop box? Do other devices not count?--


      That's a rather tendentious way of making a point (what point are you making, actually?).


      --Should each of those technologies be inventing its own solution?--


      Yes, perhaps they should, actually. I mean, I keep being told by "people who know" that they need domain-specific compression schemes.


      --Don't you think they've tried gzip?--


      I don't know what they've tried. My point about gzip was that if we want one, general standard for compressing XML content -- given the profile of deployed XML in the world and other considerations -- gzip makes the most sense to me. One, general standard for compression obviously can't optimize for every domain specific data pattern.


      --If they all go their own routes, how will I create content that works for multiple platforms? What are the chances that it'll be royalty-free? How does it deal with language updates?--


      These are some of the concerns I have for *any* binary variant of XML, so I certainly share these concerns. My answer to all of them is "don't do that".


      --And you don't mention things like dynamic update or random access, which solve important problems not addressed by gzip.--


      Yes, the number of things I didn't mention in this column expands about as quickly as the universe itself -- doesn't that make you wonder, not about the quality (or lack thereof) of the column, but rather about the Pandora's Box which you're begging to have opened? At the very least, gimme a break! It's a 1000 word column, and I clearly couldn't have mentioned *everything*.


      --Oh, and since you're the first one to ask for proofs,--


      Bzzzt! Wrong. You can't be said to be doing science or research w/out empirical data to back up your claims. I'm so NOT the first person to come up with that idea.


      --could you please point me to data that sustains the claims made by ERH that you repeat here? The fact that there is no technical advantage needs benchmarks to be sustained, just as does the opposite claim. "The only motive for pushing a binary variant is proprietary vendor lock-in"?--


      I should supply proof of claims that ERH makes, because I repeat them? Is this the *first* XML-Deviant column you've ever read? Do you realize that what you're asking would make writing a column like XML-Deviant impossible?


      And his claim about motivations is probably a different kind of claim, not one which is amenable to empirical warrant anyway, so let's not mix apples and oranges.


      I'm less certain about vendor lock-in than ERH, but no less worried about it.


      --Coming from a heavy Java advocate, I do find that statement somewhat ironic to be honest.--


      Fair enough; why not take it up with him? If I had to defend every person's claim who I quote in an XML-Deviant column, there wouldn't be any XML-Deviant column.

      • How Big Is Enough?
        2003-08-19 08:01:51 Robin Berjon

        Kendall,


        I certainly have no issue with using your full name, sorry if it bothered you that I didn't do that at first.


        *self-naming*
        That's fine by me. I agree it's not the biggest benefit, but it is one benefit, and one people do not want to lose. Data outlives applications, and those little labels can be terribly useful in such cases.


        *mobile units*
        The number of mobile units in the world is several times the number of desktop computers. Yes, they do keep getting more powerful, but no, that is not sufficient to solve performance issues. Other factors, such as batteries for instance, don't follow Moore's law by any margin. The more CPU you use, the more battery you burn.


        We'd all like to see those problems go away, but wishing them gone doesn't do much.


        *the Web per se*
        The point I'm making is that you're dismissing a large number of terminals -- again, more than there are desktops -- with a wave of the hand as "a subset of a subset" and not being the real Web.


        Well, they're on the real Web, and there are lots of them. That is just simple factual inadequacy.


        *ad hoc approaches*
        They have been tested for several years now. They work. But they cause no end of interoperability problems, and they've already kept some technologies from being integrated with one another. It is well time that all those that have been doing that got together for a chat.


        I've been working on a way to make the need for domain-specific encoders (often done as codecs in a generic format) pretty much disappear. That's the sort of issue that can be solved in a single place, with all interested parties, much better than vertically where one knows it will fail to be reused by others.


        *gzip*
        Gzip does not solve the issues, full stop. Where it does, it's used. For instance, SVG mandates its support and people use it when it works. However, when you get mobile, mapping, broadcasting, elearning people coming to you saying that it isn't enough for SVG -- even though gzip'd SVG documents are on average smaller than SWF files implementing the same functionality -- well maybe at some point it's worth paying attention. They've tried gzip, it doesn't cut it for them.


        *lock-in concerns*
        Well, surely, in that case one should rejoice to see that sort of activity happen within the W3C rather than a variety of other places!


        *dynamic update and random access*
        What I point out there is that those are oft-cited requirements, which put together with size and speed makes a total of four. I believe that they've been mentioned a sufficient number of times on lists you subscribe to that I'd have hoped you'd have thought about it. It's not very pleasant to see someone take a subset of the requirements you have, point at another solution, and declare victory.


        I don't think four requirements qualifies as opening a Pandora's box. In my position paper, in order to encompass as much of the field as possible I've listed two or three others, but they're more marginal.


        *first one to ask for proof*
        In this discussion. On this page. You posted first :) I have vaguely heard of that "science" thing which you mention.


        *ERH's claims*
        There's a difference between just repeating someone's claims and making them almost the sole meat of a section called "What do XML Developers Say?" when those comments are from a single developer, half of which are unfounded, the other half blatant FUD. It's your fault if I've been used to more fairness and even-handedness from the Deviant before.


        I've taken those claims up with him, on xml-dev, two weeks ago. I have yet to receive an answer.


  • Binary XML, Again
    2003-08-18 14:08:34 Roopak Parikh

    I agree with Erik that for a long time the data model of xml has been confused with what goes on wire and they are separate issues.
    As long as we adhere to a given SAX/DOM API kind of interface regardless of the actual format for transmission it will hardly matter what is the wire format.


    I do like W3C's initiative, and I think it's kind of late; they should have started it long ago. Gzipping raw XML is not a good option, as unzipping consumes both memory and time, which is not desirable when you are working with small devices like PDAs/smart phones and with bigger XML files (> 2 MB). A binary protocol will definitely solve the problem (actually, in my personal experience it has solved the problems by reducing the processing time), and I personally support the ASN.1/DER encoding for XML.


  • Two issues, really. Why mix them?
    2003-08-18 07:30:02 Erik Wilde

    as kendall correctly points out, the upcoming w3c workshop mixes two orthogonal issues, the question of a binary format, and the question of what xml really is.


    there seems to be an increasing tendency to make the infoset the 'real xml'. i think that a proper information model would be a very smart thing to have, but i also think that the infoset is not the only way to go. and that maybe one should spend some time on making the infoset better (in particular, extensible, for example for being able to handle xml schema's psvi contributions)


    for a long time, when people were asking about xml's 'information model', they were told that the bits on the wire were much more important than the model behind them, and that specs like the infoset were for spec writers only. as it increasingly turns out, if each and every new spec of the w3c is based on the infoset, then why not call this (which in essence is a mildly pre-processed subset of xml) the 'real xml' and 'xml 1.0' just a character-based syntax for it? this would make life much easier for many developers, who often think they are using xml, but in reality (through tools such as xslt and xquery) are using the infoset ("why can't i search all cdata sections, dammit!").


    what i want to say (i got a bit carried away, i am afraid...) is that this workshop could be a good starting point to re-align some of the methods (and attitudes) of the past and get on with a proper and helpful separation of information model and representation.

  • You're 3/4 right
    2003-08-14 12:12:52 Tony Parisi

    Kendall, your thoughtful rant is almost totally on the money. You clearly grasp the information science aspects of the XML self-description issue: maybe it's self-documenting, which is a goodness; but outside of providing structural clues, an XML document doesn't do anything to describe itself.


    Also, your insistence on seeing a clear business case for binary XML is fair enough. Else why bother undertaking such a huge enterprise?


    You're obviously a bright guy. How can you be so clueless to not see the value of compressing rich and complex data sets? The world contains far more than text. Gzip compression is simply not adequate for reducing the size of, say, 3D data. Take a look at what we're doing with X3D and you'll see that gzip will never be a satisfactory solution. The key is that rich data such as 3D can be compressed far beyond simple LZ by leveraging domain-specific information with techniques such as quantization, to name just one: if you know all your numbers lie within a certain range you can greatly reduce the space requirements for storage; LZ just can't do anything like that. It only looks for repetitions.
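
The quantization idea mentioned above can be illustrated with a minimal Python sketch. This is not X3D's actual binary encoding; the function names `quantize` and `dequantize`, the 16-bit width, and the sample coordinates are assumptions chosen only to show the principle that a known value range lets each number be stored in a fixed, small number of bits.

```python
import struct


def quantize(values, lo, hi, bits=16):
    """Map floats in [lo, hi] onto integer codes 0 .. 2**bits - 1."""
    levels = (1 << bits) - 1
    scale = levels / (hi - lo)
    return [round((v - lo) * scale) for v in values]


def dequantize(codes, lo, hi, bits=16):
    """Map integer codes back to approximate floats in [lo, hi]."""
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + c * step for c in codes]


# Hypothetical coordinates known in advance to lie within [-1, 1].
coords = [-1.0, -0.25, 0.0, 0.333333, 1.0]

codes = quantize(coords, -1.0, 1.0)
packed = struct.pack(f"<{len(codes)}H", *codes)   # 2 bytes per coordinate
restored = dequantize(codes, -1.0, 1.0)

print(len(" ".join(str(c) for c in coords)), "bytes as text")
print(len(packed), "bytes quantized")
```

The trade-off is that quantization is lossy: each restored value can differ from the original by up to half a quantization step, which is acceptable for many kinds of geometry but not for arbitrary data. A general-purpose compressor cannot make that trade because it does not know the value range or the tolerable error.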


    Oh, and you want numbers? Our preliminary tests in developing a binary format yield compression factors of upwards of 30 to 1. Try doing that with gzip.


    Tony Parisi


    • You're 3/4 right
      2005-01-14 19:51:39 David_Mertz

      30:1 with gzip actually isn't a big deal. Not if the data is highly redundant and structured. But you can improve things a bit by massaging the XML slightly before feeding it to Gzip.


      I wrote a couple articles on this several years back (for Intel and IBM). Namely, rather than define a custom binary format that every tool needs to understand, simply perform a reversible transformation on XML (i.e. compression) for the storage and transmission steps. So the XML writing application and the XML reading application have no need to know anything about the compression. That's all pipelined, in a way invisible to the ends.


      This is generally the same idea as what XMill does. However, the stuff in my articles is public domain and not restricted by any patents. See the articles at:
      http://www-106.ibm.com/developerworks/xml/library/x-matters13.html
      http://www-106.ibm.com/developerworks/xml/library/x-matters19/


      Actually, you can find better formatted versions of the Intel versions at:


      http://gnosis.cx/publish/tech_index_ids.html


      One test file I used in my article was a weblog. Compression results (in part):


      3021921 weblog.xml
      66994 weblog.xml.bz2
      115524 weblog.xml.gz
      83152 weblog.restruct


      The last one is where I do a bit of streamable massaging of the XML (i.e. grouping like tags together with info on how to put them back in the original order) prior to running it through Gzip.
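
A rough sketch of the "group like tags together, remember how to put them back" idea, not the code from the articles linked above. It assumes a deliberately restricted document shape (a root holding records whose children are text-only leaves, with no attributes, nesting, or significant whitespace), and the names `restructure` and `restore`, the JSON container, and the sample weblog data are all made up for illustration.

```python
import gzip
import json
import xml.etree.ElementTree as ET

def restructure(xml_text):
    """Group leaf text by tag; keep each record's tag order so it can be undone."""
    root = ET.fromstring(xml_text)
    layout = []   # one entry per record: [record tag, [field tags in order]]
    columns = {}  # field tag -> text values in document order
    for record in root:
        layout.append([record.tag, [field.tag for field in record]])
        for field in record:
            columns.setdefault(field.tag, []).append(field.text or "")
    return {"root": root.tag, "layout": layout, "columns": columns}

def restore(data):
    """Rebuild the element tree from the grouped form."""
    root = ET.Element(data["root"])
    cursors = {tag: iter(values) for tag, values in data["columns"].items()}
    for record_tag, field_tags in data["layout"]:
        record = ET.SubElement(root, record_tag)
        for tag in field_tags:
            ET.SubElement(record, tag).text = next(cursors[tag])
    return root

# Hypothetical weblog-like input.
rows = "".join(
    f"<hit><host>10.0.0.{i % 7}</host><status>{200 if i % 5 else 404}</status>"
    f"<path>/articles/{i % 13}.html</path></hit>"
    for i in range(2000)
)
xml_text = f"<weblog>{rows}</weblog>"

grouped = restructure(xml_text)
plain_gz = gzip.compress(xml_text.encode("utf-8"))
grouped_gz = gzip.compress(json.dumps(grouped, separators=(",", ":")).encode("utf-8"))
round_trip = ET.tostring(restore(grouped), encoding="unicode")

print(len(xml_text), len(plain_gz), len(grouped_gz))
```

Whether the grouped form actually gzips smaller depends on the data; the point of the sketch is only that the transformation is reversible and that neither the writing nor the reading application needs to know it happened.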


      A moral I take from this is that while my technique is moderately clever, plain old Gzip ALREADY gets almost 30:1. Bzip2 is very slow, but does better still (my restructuring pass is fast though... fast enough to embed in routers to handle realtime traffic; see the above URLs).
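
For reference, the ratios implied by the byte counts quoted above can be worked out directly. The figures are the ones given in this comment; only the dictionary layout is an assumption.

```python
# Byte counts as quoted in the comment above.
sizes = {
    "weblog.xml": 3021921,
    "weblog.xml.bz2": 66994,
    "weblog.xml.gz": 115524,
    "weblog.restruct": 83152,
}

original = sizes["weblog.xml"]
ratios = {
    name: original / size
    for name, size in sizes.items()
    if name != "weblog.xml"
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}:1")
# weblog.xml.bz2: 45.1:1
# weblog.xml.gz: 26.2:1
# weblog.restruct: 36.3:1
```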


    • You're 3/4 right
      2003-08-14 12:47:34 Kendall Clark

      --You're obviously a bright guy. How can you be so clueless to not see the value of compressing rich and complex data sets? The world contains far more than text. Gzip compression is simply not adequate for reducing the size of, say, 3D data. Take a look at what we're doing with X3D and you'll see that gzip will never be a satisfactory solution.--


      Hehe; any time someone starts out by saying nice things about you, you know the hammer is about to fall; and Tony wields it deftly here.


      Let me say, Tony, that I do not take a position on the technical feasibility of domain-specific compression schemes. You shouldn't, I think, assume from that absence that I dispute that feasibility. My point was that for a *general case* compression scheme, for the bulk of the extant XML, gzip is hard to beat, especially when you take ubiquity factors into account.


      I only care about the general case in the context of a W3C standard because, as I said clearly, domain specific compression schemes ought not be the business of W3C standards, at least that's my view (one which probably isn't shared by the W3C staff and leadership insofar as fees-paying vendors want a special-case W3C-blessed binary variant of XML).


      Hey, this X3D stuff sounds neat. I just happen to think that if X3D vendors want a domain-specific compression scheme for their XML data, bully for them, but that shouldn't come under the auspices of the W3C (for a variety of reasons); or, if it does, it's got nothing to do with this announced workshop (which seems to be taking the general case as its target).


      Thanks for responding.

      • You're 3/4 right
        2003-08-14 16:02:55 Tony Parisi

        Kendall,


        Cool. I was a bit overeager there, and certainly didn't mean to wield any hammers. Glad you took it in the right spirit.


        I'm with you when it comes to general compression: I am not convinced it buys enough space or load time to be worth the hassle. But with the domain-specific stuff there are huge potential savings. The X3D team is going to tackle the problem from that perspective because we really have to; 20x or even 10x is worth the trouble. 20-30% ain't when there's gzip.


        Cheers
        Tony


  • Too many words
    2003-08-14 11:59:15 jonnie savell

    Kendall's demand for performance numbers is ridiculous. The ratio of words to ideas is enormous. This reduces the quality of the article.

    • Too many words
      2003-08-14 12:39:37 Kendall Clark

      I'm not sure what counts as an "idea" for you, but I'll try to do better next time. (And if you think this piece is wordy, you should have seen it before Edd cut it down a bit. So I'm probably guilty as charged there, too.)


      However, I fail to see what is "ridiculous" about asking for empirical data to back up performance claims about binary variants of XML. Frankly, I think that's just the bare minimum of doing good research and engineering.


      As for whether I "demanded" data, I'll let other, perhaps more charitable readers decide. (Which, of course, all begs the question: so what if I *did* "demand" data...Performance claims without some empirical grounds are pretty meaningless, as just about everyone outside the Marketing Department concedes.)





      • Too many words (Kendall has a hammer of his own)
        2003-08-15 13:01:23 jonnie savell

        Rather than defining idea, let us define fluff (the absence of idea): "I dissent from several points of XML Orthodoxy because I am by nature, personal inclination, and experience, a dissident ..." Several spots of fluff I think. Too many words I think (and I fear that I have added too many of my own).


        OK, I got busted for having made the charge that the demand was ridiculous. "Frankly, I think that's just the bare minimum of doing good research and engineering." Right you are. I was hoping that you would do some of it. I will back off because I have to do some good research of my own.


        I get this eerie feeling that this discussion risks missing a lot of territory if it continues to focus exclusively on document-wide compression schemes.


        I like the xsi:type attribute. I think that the schema folks ought to add some built-in primitive types to their collection. I might like to strip from my application any "to string" conversion because I just don't enjoy converting well defined data types to strings. I might not mind converting my language specific data type format to a language neutral data type format, however. Hey, look at me (and stop laughing):
        ...
        <balance xsi:type="xsd:big_int">... unreadable stuff ...</balance>
        ...


        I avoided converting my data types to strings. I would understand if you felt that I made too big a fuss about this. I too might like more. I would like to tell the validating parser "Hey, ignore the next 16K of data following my <big_data> element. Just make sure that there is a </big_data> tag following the 16K of stuff and that everything is valid after that."


        So, is this pure noise, or is there some signal?




        • Too many words (Kendall has a hammer of his own)
          2003-08-15 13:14:28 Kendall Clark

          --Rather than defining idea, let us define fluff (the absence of idea)...Several spots of fluff I think. Too many words I think...--


          Gee, since you're seriously pursuing the question here, publicly, I will respond in kind. There are, of course, some sentences of setup and teardown in every good column, and while you seem not to enjoy my way of doing that, it isn't fluff or misleading. If you were to try writing a weekly column, you might find that a certain kind of easy, personal rapport with one's audience is a good thing.


          --"Frankly, I think that's just the bare minimum of doing good research and engineering." Right you are. I was hoping that you would do some of it.--


          I'm sorry, but that's just an absurd requirement of a weekly column, of any kind, much less of the XML-Deviant. Perhaps you don't realize but the XML-Deviant column is supposed to survey and report on conversations of general interest in and to the XML developer community.


          Second, and more crucially, the obligation to back up performance claims with data rests with the person who's making the claims. I am *not* one of the people who's agitating for the W3C to define a general compression scheme or a domain-specific one. I don't have any obligation to do research and engineering around an approach I don't find interesting or useful. (Further, I might add, except for pretty rare exceptions, this kind of work is typically done in a funded environment (university, R&D lab, venture-cap backed startup...), *not* by technical press columnists.)


          --So, is this pure noise, or is there some signal?--


          So, you should attend the workshop, the announcement (and some of the issues) of which I reported on in my column. I don't find this to be a pressing issue; but then, as my respondents have informed me, I'm obviously not very bright! :>




  • Binary Vs GZIP
    2003-08-14 09:15:31 Len Bullard

    The VRML community has a long history with GZIP. It even has file name extensions and types to denote that a VRML97 or X3D file is zipped. Because the sizes of the files some years ago were large, zipping became necessary. As bandwidth has improved, it is less necessary but still used. VRML or X3D, like most text formats, zips well; the bigger problem, as in other formats, is the images and other non-text media that are used in the multimedia text language.


    I have quoted some comments from Alan Hudson on why the X3D community considers a binary for X3D to be a must have. See


    http://www.xml.com/pub/a/2003/08/06/x3d.html


    Also, along the lines you suggest in your article, the Web3D Consortium has issued an RFP for submissions for binary components and has affirmed its commitment to work with the W3C on this as events warrant. I believe some members of the Web3DC will present on this topic at the upcoming workshop.


    This question seems to revolve around the utility of a generalized binary for XML. It can be shown that for some applications, a binary is useful not only for performance's sake, but for a reason you do not touch on: some customers want opaque content and will not pay for complex content unless they have some reasonable protection against theft by view source. Yes, there are no theft-proof formats from simple binarization, but they still insist on it and contend it is good enough protection.

    • Binary Vs GZIP
      2003-08-14 23:15:29 Eric Rehm

      Kendall seems to have completely forgotten about streaming XML applications, e.g., bandwidth-sensitive streaming of XML metadata within an MPEG-2 transport stream.


      The requirements here have been carefully analyzed by MPEG and resulted in the MPEG-7 TeM (Text Mode) and BiM (Binary Mode). Both allow a decoder to implement a parser that can skip elements (e.g., for forward compatibility, SAX, etc.) without waiting for receipt of an entire "document" which can be gunzipped. Further, MPEG-7 TeM and BiM allow for incremental building and updating of an infoset.


      MPEG-7 BiM, of course, allows for compression that meets the broadcaster's requirements to use as little bandwidth as possible for metadata.


      Should you scoff at MPEG-7 BiM, note that there is nothing about BiM that is tied to MPEG-7 per se. BiM can be used with any XML Schema.


      See www.expway.fr for more info.


      /eric rehm
      Singingfish / Thomson
      Seattle, WA

    • Binary Vs GZIP
      2003-08-14 12:29:31 Kendall Clark

      --some customers want opaque content and will not pay for complex content unless they have some reasonable protection against theft by view source.--


      I didn't touch on this, in truth, because it strikes me as completely lame. I believe you, Len, when you say there are people (well, let's be clear, you mean *corporations*) who want such a thing. I for one simply have no interest in that usage of the Web. And, I suggest, the W3C shouldn't get into that business.


      As I said, subsets of subsets of various industries are free to define and abide by whatever 'standards' they care to create; what that has to do with the W3C's brief still escapes me.


      Thanks for responding, though -- gee, call the W3C a bunch of cultish nutters, as I did last time, and no one says a word. But suggest that the binary-fetishists should steer clear of the W3C (and that it should steer clear of them) and the sharp knives come out! :>

      • Binary Vs GZIP
        2003-08-16 14:39:23 toto toto

        i'm a beginner, so it may be stupid:


        when you want (raster) 2D graphics in a web page, you should use jpeg or gif


        so


        when you want 3D graphics in a web page, you SHOULD use a binary format (and why not, embedded in textual xml format)