Kendall, your thoughtful rant is almost totally on the money. You clearly grasp the information science aspects of the XML self-description issue: maybe it's self-documenting, which is a goodness; but outside of providing structural clues, an XML document doesn't do anything to describe itself.
Also, your insistence on seeing a clear business case for binary XML is fair enough. Else why bother undertaking such a huge enterprise?
You're obviously a bright guy. How can you be so clueless to not see the value of compressing rich and complex data sets? The world contains far more than text. Gzip compression is simply not adequate for reducing the size of, say, 3D data. Take a look at what we're doing with X3D and you'll see that gzip will never be a satisfactory solution. The key is that rich data such as 3D can be compressed far beyond simple LZ by leveraging domain-specific information with techniques such as quantization, to name just one: if you know all your numbers lie within a certain range you can greatly reduce the space requirements for storage; LZ just can't do anything like that. It only looks for repetitions.
Oh, and you want numbers? Our preliminary tests in developing a binary format yield compression factors of upwards of 30 to 1. Try doing that with gzip.
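To make the quantization point concrete, here is a minimal sketch. It is not the X3D team's actual format. It assumes made-up coordinate data known to lie in [-1, 1] and an arbitrary 16-bit code width, and the names `quantize` and `dequantize` are hypothetical. Note that quantization is lossy, unlike gzip, so the two sizes compare different trade-offs.

```python
import gzip
import random
import struct

def quantize(values, lo, hi, bits=16):
    """Map floats known to lie in [lo, hi] onto unsigned integer codes."""
    levels = (1 << bits) - 1
    scale = levels / (hi - lo)
    return [round((v - lo) * scale) for v in values]

def dequantize(codes, lo, hi, bits=16):
    """Map integer codes back to approximate floats in [lo, hi]."""
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + c * step for c in codes]

# Hypothetical data: 3000 coordinates, all within [-1, 1].
random.seed(0)
coords = [random.uniform(-1.0, 1.0) for _ in range(3000)]

# Baseline: the numbers written out as text, then gzipped.
as_text = " ".join(repr(v) for v in coords).encode("ascii")
gzipped_text = gzip.compress(as_text)

# Domain-specific: two bytes per number, because the range is known.
codes = quantize(coords, -1.0, 1.0)
packed = struct.pack(f"<{len(codes)}H", *codes)

# The price paid: a bounded rounding error per value.
restored = dequantize(codes, -1.0, 1.0)
max_error = max(abs(a - b) for a, b in zip(coords, restored))

print(len(as_text), len(gzipped_text), len(packed), max_error)
```

The packed form is a fixed two bytes per value regardless of how the digits look as text, which is the kind of saving a repetition finder cannot discover on its own.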
30:1 with gzip actually isn't a big deal. Not if the data is highly redundant and structured. But you can improve things a bit by massaging the XML slightly before feeding it to Gzip.
I wrote a couple articles on this several years back (for Intel and IBM). Namely, rather than define a custom binary format that every tool needs to understand, simply perform a reversible transformation on XML (i.e. compression) for the storage and transmission steps. So the XML writing application and the XML reading application have no need to know anything about the compression. That's all pipelined, in a way invisible to the ends.
This is generally the same idea as what XMill does. However, the stuff in my articles is public domain and not restricted by any patents. See the articles at:
http://www-106.ibm.com/developerworks/xml/library/x-matters13.html
http://www-106.ibm.com/developerworks/xml/library/x-matters19/
Actually, you can find better-formatted copies of the Intel articles at:
http://gnosis.cx/publish/tech_index_ids.html
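The pipelined arrangement described above, where neither end knows about the compression, might look roughly like this. It is only a sketch: `write_report` and `read_report` are hypothetical stand-ins for the writing and reading applications, plain gzip stands in for whatever reversible transformation is used, and an in-memory buffer plays the role of storage or the wire.

```python
import gzip
import io
import xml.etree.ElementTree as ET

def write_report(out):
    """Producer: emits plain XML to any binary file-like object."""
    out.write(b"<report>")
    for i in range(100):
        out.write(f"<item id='{i}'>value {i}</item>".encode("utf-8"))
    out.write(b"</report>")

def read_report(src):
    """Consumer: parses plain XML from any binary file-like object."""
    return ET.parse(src).getroot()

# The "pipe": compression is applied around the producer...
channel = io.BytesIO()
with gzip.GzipFile(fileobj=channel, mode="wb") as compressed_out:
    write_report(compressed_out)

wire_size = len(channel.getvalue())

# ...and undone around the consumer. Neither function mentions gzip.
channel.seek(0)
with gzip.GzipFile(fileobj=channel, mode="rb") as decompressed_in:
    root = read_report(decompressed_in)

print(wire_size, len(root))
```

Swapping in a different reversible transformation only changes the two wrapper lines, not the applications at either end.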
One test file I used in my article was a weblog. Compression results (in part):
The last one is where I do a bit of streamable massaging of the XML (i.e. grouping like tags together with info on how to put them back in the original order) prior to running it through Gzip.
A moral I take from this is that while my technique is moderately clever, plain old Gzip ALREADY gets almost 30:1. Bzip2 is very slow, but does better still (my restructuring pass is fast, though: fast enough to embed in routers to handle realtime traffic; see the above URLs).
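For readers who want a feel for the "grouping like tags together" idea, here is a rough sketch. It is not the code from the articles linked above. It assumes simple well-formed XML, uses a hypothetical JSON container for the skeleton and streams, and makes no claim about how much (or whether) it beats plain gzip on any given file.

```python
import gzip
import json
import re
from collections import defaultdict

TOKEN = re.compile(r"<[^>]+>|[^<]+")  # a tag, or a run of text

def restructure(xml):
    """Split XML into a tag skeleton plus one text stream per element name."""
    skeleton = []                 # tags verbatim; text replaced by its stream key
    streams = defaultdict(list)   # enclosing element name -> text chunks
    stack = []
    for tok in TOKEN.findall(xml):
        if tok.startswith("<"):
            skeleton.append(tok)
            if tok.startswith("</"):
                if stack:
                    stack.pop()
            elif not tok.endswith("/>") and not tok.startswith(("<?", "<!")):
                stack.append(tok[1:-1].split()[0])
        else:
            key = stack[-1] if stack else ""
            skeleton.append(key)  # placeholder: "next chunk from this stream"
            streams[key].append(tok)
    return skeleton, dict(streams)

def rebuild(skeleton, streams):
    """Invert restructure(): put each text chunk back in its original place."""
    cursors = {key: iter(chunks) for key, chunks in streams.items()}
    return "".join(
        e if e.startswith("<") else next(cursors[e]) for e in skeleton
    )

def pack(xml):
    skeleton, streams = restructure(xml)
    return gzip.compress(json.dumps([skeleton, streams]).encode("utf-8"))

def unpack(blob):
    skeleton, streams = json.loads(gzip.decompress(blob).decode("utf-8"))
    return rebuild(skeleton, streams)

# Hypothetical weblog-like sample, invented for this sketch.
entries = "".join(
    f"<entry><host>10.0.0.{i % 7}</host><status>200</status>"
    f"<bytes>{1000 + 37 * i}</bytes></entry>\n"
    for i in range(500)
)
sample = f"<log>\n{entries}</log>"

blob = pack(sample)
print(len(sample), len(gzip.compress(sample.encode("utf-8"))), len(blob))
```

The point of the design is that all the `host` values end up adjacent, all the `bytes` values end up adjacent, and so on, so the compressor sees similar data side by side while the skeleton records how to put everything back in order.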
--You're obviously a bright guy. How can you be so clueless to not see the value of compressing rich and complex data sets? The world contains far more than text. Gzip compression is simply not adequate for reducing the size of, say, 3D data. Take a look at what we're doing with X3D and you'll see that gzip will never be a satisfactory solution.--
Hehe; any time someone starts out by saying nice things about you, you know the hammer is about to fall; and Tony wields it deftly here.
Let me say, Tony, that I do not take a position on the technical feasibility of domain-specific compression schemes. You shouldn't, I think, assume from that absence that I dispute that feasibility. My point was that for a *general case* compression scheme, for the bulk of the extant XML, gzip is hard to beat, especially when you take ubiquity factors into account.
I only care about the general case in the context of a W3C standard because, as I said clearly, domain-specific compression schemes ought not be the business of W3C standards. At least that's my view (one which probably isn't shared by the W3C staff and leadership, insofar as fees-paying vendors want a special-case W3C-blessed binary variant of XML).
Hey, this X3D stuff sounds neat. I just happen to think that if X3D vendors want a domain-specific compression scheme for their XML data, bully for them, but that shouldn't come under the auspices of the W3C (for a variety of reasons); or, if it does, it's got nothing to do with this announced workshop (which seems to be taking the general case as its target).
Cool. I was a bit overeager there, and certainly didn't mean to wield any hammers. Glad you took it in the right spirit.
I'm with you when it comes to general compression: I am not convinced it buys enough space or load time to be worth the hassle. But with the domain-specific stuff there are huge potential savings. The X3D team is going to tackle the problem from that perspective because we really have to; 20x or even 10x is worth the trouble. 20-30% ain't when there's gzip.