Binary Infoset Workshop Report
There are at least two kinds of topics of permanent conversation in the XML development community: formally settled, and formally unsettled. In other words, members of the XML development community are perpetually discussing, on the one hand, issues which have been, more or less, formally settled by the relevant standards body and, on the other, issues not yet formally settled by the relevant standards body. As the canonical example of the first kind of permathread I tend to think of XML namespaces, which really are just here to stay, plain and simple. As the canonical example of the second kind, I tend to think of binary XML, which may or may not be blessed by the W3C, but which certainly engages the XML developer community in deep and fundamental ways.
In a previous article about this topic in August ("Binary XML, Again"), I concentrated on the degree to which binary XML variants strike directly at the heart of what many XML developers take to be XML's chief advantage, that is, human (really: programmer) readability. While XML is not strongly self-descriptive in the way that many of its proponents claim, it is weakly self-descriptive in a way that many XML developers think of as advantageous, especially over against opaque binary alternatives or equivalents.
The precipitating cause of that article was The W3C Workshop on Binary Interchange of XML Information Item Sets, the report and minutes -- as well as about 40 position papers -- of which have now been published. As Liam Quin reported on XML-DEV, the workshop concluded that further work -- "of an investigative nature", as the workshop report puts it -- is required before a W3C standard could be made; but the workshop also recommended the formation of a working group.
The workshop focused on pulling together some initial sense of the requirements for a binary variant of XML, as well as some sense of the dominant use cases for a variant. Neither the workshop nor the report consolidated or synthesized the requirements in any really useful way, instead simply presenting a list of 51 requirements. Some of the interesting requirements include: a generic, rather than domain-specific, solution; storage and transmission efficiency; priority for decompression over compression; a minimal performance metric of 10 times faster than the best current textual XML performance; packaging support (something like a binary MIME); versioned delta support; some kind of encoding negotiation; fast arbitrary access to infoset items; compatibility with existing parser APIs; arbitrary specifications of serialization order; oh, and most importantly, it "must be easy to implement". Sure, why not?
Frankly -- and this isn't just the gloomy weather in Washington, DC, today talking -- I find this requirements list to be one of the most depressing XML things I've ever encountered. This seems as much as anything, and especially for some of the biggest players, simply a way to revisit most, if not all, of the most fundamental XML design decisions. That possibility, backed by the kind of real world power that of necessity really matters in the W3C, is simply dreadful. The only redeeming note is that the requirements list contains multiple, mutually contradictory elements, which offers some hope that the antitextualists might go off into some corner, far from the rest of us, and tie themselves into knots for a few years. What a welcome and deserved respite that would be.
Binary SAX
One of the aforementioned requirements was that a binary XML variant should just work with existing APIs, as well as not create any turbulence at the application level. One clear implication of this requirement is that, for example, application code which uses SAX to parse XML should just work -- modulo changes made in the actual SAX libraries, of course -- with any binary variant. Whether or not such a thing is practical or worth the effort is a separate question, of course.
Bob Wyman recently started a detailed, interesting conversation on XML-DEV about this issue. Wyman points to existing efforts: Objective Systems has a "SAX-like interface for ASN.1 defined binary encodings" and "OSS Nokalva is working on a SAX interface for ASN.1 defined encodings".
One of the implementational problems of wedding SAX to binary XML variants, as Wyman puts it, is that SAX assumes all of its input is characters; but any sane binary encoding will, depending on data type information, encode things differently and in a way which is most appropriate. But, as Wyman points out, converting binary into characters so that SAX event handlers can convert some of them back into various kinds of binary is "wasteful silliness". Wyman suggests three possible solutions: first, convert all binary types to strings; second, develop a SAX superset which includes data typing information along with the data itself; third, provide a way to toggle between these two modes.
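The three options can be made concrete with a minimal sketch in Python, whose standard xml.sax module mirrors the SAX callback model. Everything named here is hypothetical and for illustration only: the typedValue callback, the emit_int16 decoder, and the Recorder handler are not part of SAX, of any ASN.1 toolkit, or of anything proposed on XML-DEV. The sketch assumes a binary payload holding one big-endian 16-bit integer.

```python
import struct
import xml.sax.handler


class TypedContentHandler(xml.sax.handler.ContentHandler):
    """Hypothetical SAX superset: adds one callback for typed values."""

    def typedValue(self, value):
        # Default behaviour falls back to plain SAX: stringify the value.
        self.characters(str(value))


def emit_int16(handler, name, payload, typed=False):
    """Decode a big-endian 16-bit integer and report it as SAX events.

    `typed` is the toggle between the two modes (the third option).
    """
    (value,) = struct.unpack(">h", payload)
    handler.startElement(name, {})
    if typed and hasattr(handler, "typedValue"):
        handler.typedValue(value)       # second option: pass the native value
    else:
        handler.characters(str(value))  # first option: convert to characters
    handler.endElement(name)


class Recorder(TypedContentHandler):
    """Collects whatever the decoder reports, for inspection."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def characters(self, content):
        self.seen.append(content)

    def typedValue(self, value):
        self.seen.append(value)
```

With typed=False an ordinary SAX handler sees only the string form and must parse it again if it wants a number; with typed=True the handler receives the integer directly, at the cost of an interface that stock SAX tools do not know about.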
Simon St.Laurent suggested that, despite the "wasteful silliness", implementing Binary SAX in such a way as to maximize interoperability over against absolute efficiency is the only way to go: "Sure, it's messy, but it's a transition strategy, gets ASN.1 consumers immediate access to a lot of XML toolkits, and helps bridge the cultural gap between ASN.1 and XML."
It's not only the binarists who might want additional type information in some version of SAX. Alaric Snell pointed out that
SAX with typed data would not just be handy to people using binary encodings...people who are transporting, say, dates in XML need to write their own code in the SAX handler that says "Oooh, it's the element <taxPoint> within a <purchase> element? Then pass the string content through the DateParser I've configured to handle the format of date we use in order to convert it to a java.util.Date object for processing".
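The hand-written conversion Snell describes looks roughly like the following sketch. It uses Python's xml.sax and datetime in place of Java SAX and java.util.Date. The date format is an assumption, since the quote does not name one, and taxPoint is treated as a direct child of purchase.

```python
import xml.sax
from datetime import datetime


class PurchaseHandler(xml.sax.ContentHandler):
    """Turns the text of <taxPoint> inside <purchase> into a date object."""

    DATE_FORMAT = "%Y-%m-%d"  # assumed format; the source does not name one

    def __init__(self):
        super().__init__()
        self.path = []        # stack of currently open element names
        self.buffer = []      # character chunks for the current taxPoint
        self.tax_points = []  # parsed dates, in document order

    def _in_tax_point(self):
        return self.path[-2:] == ["purchase", "taxPoint"]

    def startElement(self, name, attrs):
        self.path.append(name)
        if self._in_tax_point():
            self.buffer = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._in_tax_point():
            self.buffer.append(content)

    def endElement(self, name):
        if self._in_tax_point():
            text = "".join(self.buffer).strip()
            parsed = datetime.strptime(text, self.DATE_FORMAT).date()
            self.tax_points.append(parsed)
        self.path.pop()


def parse_tax_points(xml_bytes):
    """Parse a document and return every purchase/taxPoint as a date."""
    handler = PurchaseHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.tax_points
```

All of the type knowledge lives in application code: the handler has to know which element holds a date and which format it uses, and that is exactly the bookkeeping a typed SAX would push into the parser layer.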
But pushing schema information into that layer constitutes a serious mistake to some, including St.Laurent:
I'd be thrilled to see ASN.1 readers which produce SAX2 events and ASN.1 writers which consume SAX2 events. I'm not happy to hear notions of PSVI-like typing polluting the SAX2 space. If you want typing, find another API - and accept the costs of doing that. If the ASN.1 community wants to reach out to the XML community, it needs to create ASN.1 tools which talk to XML tools without imposing ASN.1's own and different perspective on how data should be presented.
Conclusion
Clearly the antitextualists raise deep technical concerns but also a kind of social concern for the rest of us, that is, those who think XML, as it is, is good enough; or, at least, good enough often enough that the binary variants are likely to be a waste of time, at best. The W3C's workshop report suggests that a possible outcome of a binary working group would be that the W3C chooses not to endorse a recommendation. That seems more possible than likely, however. The idea of a binary variant seems like a fairly radical proposal at what is a relatively late point in the game. It's not clear that the resulting pain and retooling efforts are really worth the gains.
Many XML proponents and users came out of various binary exchange and format camps, and they are very unwilling to return to what were for them, or so it would seem, dark days. In this case, however, given the real power of those who most seem to want a binary variant, they may have to adopt a carefully tactical plan to limit the damage, rather than preventing the fight completely.
What's your position on binary XML? A threat to the foundation of XML, or motherhood and apple pie?
- Why imbed images in XML documents?
2003-11-29 16:49:03 Michael Maron [Reply]
Very simple question: why imbed images and other binaries in XML documents? There are hyperlinks for this!
- Performance
2003-11-28 04:18:55 Tim Jansen [Reply]
Most of the discussion seems to be centered around the size of the XML representation. gzip is hard to beat at that.
But the more interesting point in a binary encoding is performance. If XML is to be used not only for communication between systems, but also for internal communication inside a system (e.g. for SOAP IPC), performance becomes much more important. And the performance difference between a good binary encoding and text XML is pretty large. CWXML, which is a pretty fast parser that supports both BXML and XML, is at least 3x faster when reading BXML compared to text XML, and it is 30x faster than reading gzip'd XML (http://www.cubewerx.com/main/cwxml/).
Besides that, I don't think it matters which serialization format is used for the XML data model. The data model is the important part, not the text. The text serialization is quite useful because text happens to be a widespread data format, so you can use existing tools to view and edit XML. I agree with the Microsoft position that having to support more than one serialization format would be a bad thing. Every XML processor needs to support the text format. But that doesn't mean that there can't be an alternative in environments where the producer and consumer are able to negotiate the format, like with SOAP over HTTP. For these cases it makes sense to have a single binary format, because otherwise each vendor will have its own format. This would create a situation where a product may be able to interoperate with those of other vendors, but it will be slower.
- Funny Article Title
2003-11-23 05:48:13 Terris Linenbach [Reply]
+1000 for that article title! It's nice to see some humor bordering/framing what is a divisive topic in the XML community.
It seems to me, based on personal experience, that SOAP and REST over HTTP are fundamentally flawed for large-data scenarios. The answer to some is "binary XML." But that doesn't have to be the case. The transport layer could deal with this issue and provide perhaps an 80% solution that everyone could agree with.
Many SOAP/REST toolkits do support zip compression at the HTTP layer. However, from my experience, naive compression just at the HTTP layer doesn't cut the mustard.
1. I want to transmit a lot of data (I'm in the data warehousing and analytics space)
2. I don't want the server to have to parse all the data to know how to route my messages
3. I don't want the client or server to deserialize the entire message into memory (this is a common problem with toolkits)
I've found SOAP attachments (DIME, soon to be replaced by PASWA) to be very useful, albeit somewhat non-standard and hugely inconvenient. I transmit the "real" message in a gzipped attachment and the "routing" stuff (method name, etc.) in clear text. You may be laughing, but it works, and nothing else does with today's XML technology.
If it's true that zip/gzip can't decompress into a stream without decompressing everything into a file, then clearly it's advantageous to replace zip/gzip with something else that supports streaming decompression. Surely there must be something off the shelf.
And now, on to the glorious religious warfare.
I was very surprised to see a post that stated that schema-based compression (e.g., Sybase's db-lib binary format) is superior to gzip compression of text. At least for his needs. I would like to see more data and research on this topic before I believe the same effect would apply to me.
I can understand why it's more efficient to read an integer as two bytes instead of via atoi(), but I don't necessarily agree that the compression is superior because compression tends to be a matter of the content rather than the format. If a message is mostly non-repetitive, good luck compressing it! In fact, I've heard from legitimate sources from both Microsoft and Java camps that this is a red herring. In other words, this is the Holy Grail long sought after by the relational database guys, the RPC guys, and, well, just about anyone who had anything to do with distributed systems.
I guess it's good to see the binary format folks agreeing that interoperability is important. But some have been barking up that tree for a very, very long time. Here is but a taste of one example:
http://lists.ibiblio.org/pipermail/freetds/2002q3/007960.html
I guess the stakeholders are hoping that somehow the W3C will have the power necessary to resolve all of the territorial arguments that "my format is faster and smaller than yours."
Again, if that was possible, XML would not exist.
Anyway, more power to them and good luck!
- Binary XML would make my working life easier
2003-11-22 03:47:48 Anthony Coates [Reply]
Let's be honest - there are a *lot* of applications which involve small XML messages sent infrequently. For these, I don't see much to gain from binary XML. On the other hand, there are some areas, such as digital video and finance, where bandwidth is still a major issue. These areas need a compressed method for sending XML, because cheap limitless bandwidth hasn't arrived as quickly as some people expected (particularly on the private circuits used by the financial community).
For my work in finance, the size of textual XML is the biggest barrier that I come across, and I come across it a lot. Where people are sensitive to what the bandwidth costs, the more compression the better. You can ZIP/GZIP your XML, and that helps, but I'm finding that Schema-sensitive compression (which is what binary XML comes down to in many cases) is typically 5x better, and that's a big saving.
You also need a binary format that supports streaming decompression. ZIP doesn't (the index is at the end of the file, as I remember). If you are sending XML files with millions of records (as I want to be able to), you don't want to have to decompress that into a multi-gigabyte file if you can avoid it. So a compressed format that streams into SAX is great. I'm also not opposed to the enhancement of SAX to support Schema datatypes. It makes no great sense in a data application to have a compressed format that knows the difference between an integer and a string, but which then decompresses everything into strings so that the application can turn some of them back into integers. So SAX+datatypes would be a great addition. It's not what the document world wants, but it would be a good thing indeed for the data world.
So, remember, there are some areas that really are disadvantaged by text-only transports for XML. Certainly we don't all need it all of the time. But some of us do need it some of the time.
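The streaming point can be illustrated with a short sketch, assuming plain gzip rather than any schema-aware binary format. Unlike a ZIP archive, a gzip stream can be decompressed incrementally, so Python's gzip module can feed a SAX parser chunk by chunk without first writing the expanded XML to disk. The element name record and the helper names are hypothetical.

```python
import gzip
import io
import xml.sax


class RecordCounter(xml.sax.ContentHandler):
    """Counts <record> elements without building a tree in memory."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "record":
            self.count += 1


def count_records(compressed_stream):
    """Feed a gzip-compressed byte stream to SAX as it decompresses."""
    handler = RecordCounter()
    with gzip.GzipFile(fileobj=compressed_stream, mode="rb") as plain:
        # The parser pulls decompressed bytes in chunks from `plain`.
        xml.sax.parse(plain, handler)
    return handler.count


# Small in-memory demonstration; a real feed would be a file or socket.
sample = gzip.compress(b"<feed><record/><record/><record/></feed>")
total = count_records(io.BytesIO(sample))
```

This covers streaming only. Every value still reaches the handler as characters, so it does not address the datatype round trip described above.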
- The issue is standardizing a "binary XML" for interoperability
2003-11-21 01:43:05 Michael Rys [Reply]
The problem is not necessarily "binary XML". The problem is the notion of making it an additional interoperability standard.
I gave the presentation of the Microsoft position at the W3C workshop above. And we certainly do not see a value in standardizing a "binary XML" for interoperability (hint: nice to have references to sources, but it may be good to also read them). Having more than one interoperability standards format (even if they claim to be "the same"), fragments the interop story and thus is counter-productive.
There is value for binary representations of Infosets, XQuery data models etc. for internal processing (database storage, close-coupled transport from storage to APIs and XML feeds). However, these formats will want to be highly optimized for the given architecture and performance scenarios; and these formats are not interested to sacrifice this for the sake of interop. Instead, the APIs and XML itself provide the interop layer.
- Who Wants Binary XML?
2003-11-20 13:42:52 Dare Obasanjo [Reply]
Kendall Clark wrote
"Many XML proponents and users came out of various binary exchange and format camps, and they are very unwilling to return to what were for them, or so it would seem, dark days. In this case, however, given the real power of those who most seem to want a binary variant -- including Sun, IBM, and Microsoft -- they may have to adopt a carefully tactical plan to limit the damage, rather than preventing the fight completely"
I'm curious as to where you came to the conclusion that Microsoft is one of the parties pushing the W3C to come up with a binary variant of XML.
My reply got too long so I've posted it on my weblog at
http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=a2065106-11b5-4239-824c-5dcc1b525415
- I'm not an "antitextualist..." I'm "encoding-neutral."
2003-11-20 09:21:22 Bob Wyman [Reply]
Grumble... I really don't like the implication that I'm an "anti-textualist." The reality is that I spend all day, every day, working on XML based systems and see great value in XML. On the other hand, my systems are "encoding independent" so they work just as well with binary encodings as they do with XML. Thus, I'd rather be characterized as "encoding neutral"... My pushing for binary is so that I have the choice available when it makes sense -- often, binary encodings don't make sense...
Kendall, thanks for taking the time to summarize the continuing discussions re SAX and ASN.1 defined binary encodings. You've done a good job; however, I do wish that you had been able to record some of the consensus on issues that appears to have been generated from the discussion. Most important is the fact that it has been established that ASN.1 encodings can interchange transparently with SAX2 based systems, without API extensions, in both "no-schema" and "shared schema" environments. Thus, from a programmer's point of view, as long as they access their data streams via SAX2 or a similar API, they have no need to care whether they are working with textual XML or the same content encoded using an ASN.1 defined binary encoding.
bob wyman
- Is it really so bad?
2003-11-20 02:58:46 Rinie Kervel [Reply]
If 2 requirements could be fulfilled:
- ziplike: encoding to and from binary form is lossless and reproduces exactly the text document
- API / conceptual no change in processing (you process a binary document, but can think of it as the textual XML document)
- So for machine processing you get speed and low memory usage.
- For human readability/ debugging you use the 'unzip' utility
