Binary Killed the XML Star?

by Kendall Grant Clark
November 19, 2003

Binary Infoset Workshop Report

There are at least two kinds of topics of permanent conversation in the XML development community: formally settled, and formally unsettled. In other words, members of the XML development community are perpetually discussing, on the one hand, issues which have been, more or less, formally settled by the relevant standards body and, on the other, issues not yet formally settled by the relevant standards body. As the canonical example of the first kind of permathread I tend to think of XML namespaces, which really are just here to stay, plain and simple. As the canonical example of the second kind, I tend to think of binary XML, which may or may not be blessed by the W3C, but which certainly engages the XML developer community in deep and fundamental ways.

In a previous article about this topic in August ("Binary XML, Again"), I concentrated on the degree to which binary XML variants strike directly at the heart of what many XML developers take to be XML's chief advantage, that is, human (really: programmer) readability. While XML is not strongly self-descriptive in the way that many of its proponents claim, it is weakly self-descriptive in a way that many XML developers think of as advantageous, especially over against opaque binary alternatives or equivalents.

The precipitating cause of that article was the W3C Workshop on Binary Interchange of XML Information Item Sets, the report and minutes -- as well as about 40 position papers -- of which have now been published. As Liam Quin reported on XML-DEV, the workshop concluded that further work -- "of an investigative nature", as the workshop report puts it -- is required before a W3C standard could be made; but the workshop also recommended the formation of a working group.

The workshop focused on pulling together some initial sense of the requirements for a binary variant of XML, as well as some sense of the dominant use cases for a variant. Neither the workshop nor the report consolidated or synthesized the requirements in any really useful way, instead simply presenting a list of 51 requirements. Some of the interesting requirements include: a generic, rather than domain-specific solution; storage and transmission efficiency; prioritize decompression over compression; a minimal performance metric of 10 times faster than the best current textual XML performance; packaging support (something like a binary MIME); versioned delta support; some kind of encoding negotiation; fast arbitrary access to infoset items; work with existing parser APIs; arbitrary specifications of serialization order; oh, and most importantly, it "must be easy to implement". Sure, why not?

Frankly -- and this isn't just the gloomy weather in Washington, DC, today talking -- I find this requirements list to be one of the most depressing XML things I've ever encountered. This seems as much as anything, and especially for some of the biggest players, simply a way to revisit most, if not all, of the most fundamental XML design decisions. That possibility, backed by the kind of real world power that of necessity really matters in the W3C, is simply dreadful. The only redeeming note is that the requirements list contains multiple, mutually contradictory elements, which offers some hope that the antitextualists might go off into some corner, far from the rest of us, and tie themselves into knots for a few years. What a welcome and deserved respite that would be.

Binary SAX

One of the aforementioned requirements was that a binary XML variant should just work with existing APIs, as well as not create any turbulence at the application level. One clear implication of this requirement is that, for example, application code which uses SAX to parse XML should just work -- modulo changes made in the actual SAX libraries, of course -- with any binary variant. Whether or not such a thing is practical or worth the effort is a separate question, of course.

Bob Wyman recently started a detailed, interesting conversation on XML-DEV about this issue. Wyman points to existing efforts: Objective Systems has a "SAX-like interface for ASN.1 defined binary encodings" and "OSS Nokalva is working on a SAX interface for ASN.1 defined encodings".

One of the implementational problems of wedding SAX to binary XML variants, as Wyman puts it, is that SAX assumes all of its input is characters; but any sane binary encoding will, depending on data type information, encode things differently and in a way which is most appropriate. But, as Wyman points out, converting binary into characters so that SAX event handlers can convert some of them back into various kinds of binary is "wasteful silliness". Wyman suggests three possible solutions: first, convert all binary types to strings; second, develop a SAX superset which includes data typing information along with the data itself; third, provide a way to toggle between these two modes.
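
Wyman's three options can be sketched in a few lines of Python. Everything named here is hypothetical: `TypedHandler`, `typed_value`, `deliver`, and the `typed` flag are illustrative only and are not part of SAX2 or of any ASN.1 toolkit. The sketch just shows the shape of a handler that can receive either strings or native values.

```python
from xml.sax.handler import ContentHandler


class TypedHandler(ContentHandler):
    """Hypothetical SAX superset: adds a typed_value() callback (not in SAX2)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def characters(self, content):
        # Standard SAX2 callback: everything arrives as text.
        self.events.append(("characters", content))

    def typed_value(self, value):
        # Hypothetical extension: the value keeps its native type.
        self.events.append(("typed_value", value))


def deliver(handler, value, typed=False):
    """Stand-in for a binary decoder handing one decoded value to a handler.

    typed=False is option 1 (convert everything to strings);
    typed=True is option 2 (pass native types through);
    the flag itself is option 3 (toggle between the two).
    """
    if typed and hasattr(handler, "typed_value"):
        handler.typed_value(value)
    else:
        handler.characters(str(value))


handler = TypedHandler()
deliver(handler, 42)              # string mode: the handler sees "42"
deliver(handler, 42, typed=True)  # typed mode: the handler sees the int 42
```

In string mode an existing SAX handler keeps working unchanged, which is the interoperability argument; in typed mode the round trip through text is avoided, which is the efficiency argument.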

Simon St.Laurent suggested that, despite the "wasteful silliness", implementing Binary SAX in such a way as to maximize interoperability over against absolute efficiency is the only way to go: "Sure, it's messy, but it's a transition strategy, gets ASN.1 consumers immediate access to a lot of XML toolkits, and helps bridge the cultural gap between ASN.1 and XML."

It's not only the binarists who might want additional type information in some version of SAX. Alaric Snell pointed out that

SAX with typed data would not just be handy to people using binary encodings...people who are transporting, say, dates in XML need to write their own code in the SAX handler that says "Oooh, it's the element <taxPoint> within a <purchase> element? Then pass the string content through the DateParser I've configured to handle the format of date we use in order to convert it to a java.util.Date object for processing".
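
The hand-written conversion Snell describes looks roughly like the following in Python's `xml.sax` (the quote itself has Java in mind). This is a minimal sketch: the `PurchaseHandler` class and the `%Y-%m-%d` date format are assumptions made for illustration, not anything taken from the XML-DEV thread.

```python
import xml.sax
from datetime import datetime
from xml.sax.handler import ContentHandler

DATE_FORMAT = "%Y-%m-%d"  # assumed; the application has to know this out of band


class PurchaseHandler(ContentHandler):
    """Converts the text of <purchase>/<taxPoint> into date objects by hand."""

    def __init__(self):
        super().__init__()
        self.path = []        # stack of currently open element names
        self.text = []        # character data collected for the current element
        self.tax_points = []  # converted dates

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []

    def characters(self, content):
        # SAX may deliver text in several pieces, so accumulate it.
        self.text.append(content)

    def endElement(self, name):
        # "Oooh, it's the element <taxPoint> within a <purchase> element?"
        if self.path[-2:] == ["purchase", "taxPoint"]:
            raw = "".join(self.text).strip()
            self.tax_points.append(datetime.strptime(raw, DATE_FORMAT).date())
        self.path.pop()
        self.text = []


handler = PurchaseHandler()
xml.sax.parseString(b"<purchase><taxPoint>2003-11-19</taxPoint></purchase>", handler)
```

The path check and the date parsing are exactly the bookkeeping that a typed variant of SAX would take off the application's hands.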

But pushing schema information into that layer constitutes a serious mistake to some, including St.Laurent:

I'd be thrilled to see ASN.1 readers which produce SAX2 events and ASN.1 writers which consume SAX2 events. I'm not happy to hear notions of PSVI-like typing polluting the SAX2 space. If you want typing, find another API - and accept the costs of doing that. If the ASN.1 community wants to reach out to the XML community, it needs to create ASN.1 tools which talk to XML tools without imposing ASN.1's own and different perspective on how data should be presented.

Conclusion

Clearly the antitextualists raise deep technical concerns but also a kind of social concern for the rest of us, that is, those who think XML, as it is, is good enough; or, at least, good enough often enough that the binary variants are likely to be a waste of time, at best. The W3C's workshop report suggests that a possible outcome of a binary working group would be that the W3C chooses not to endorse a recommendation. That seems more possible than likely, however. The idea of a binary variant seems like a fairly radical proposal at what is a relatively late point in the game. It's not clear that the resulting pain and retooling efforts are really worth the gains.

Many XML proponents and users came out of various binary exchange and format camps, and they are very unwilling to return to what were for them, or so it would seem, dark days. In this case, however, given the real power of those who most seem to want a binary variant, they may have to adopt a carefully tactical plan to limit the damage, rather than preventing the fight completely.


  • Why imbed images in XML documents?
    2003-11-29 16:49:03 Michael Maron [Reply]

    Very simple question: why imbed images and other binaries in XML documents? There are hyperlinks for this!

    • Why imbed images in XML documents?
      2003-12-01 04:54:55 Tim Jansen [Reply]

      That's ok for documents on the web. But, for instance, if you create a document with your word processor and send it via email, you neither want to put all its images on an HTTP server nor do you want to send 20 files in the mail. Thus you need either to include the images in the XML directly, or some kind of standardized archive that contains the files (effectively MIME structures and DIME are both archives).




  • Performance
    2003-11-28 04:18:55 Tim Jansen [Reply]

    Most of the discussion seems to be centered around the size of the XML representation. gzip is hard to beat at that.
    But the more interesting point in a binary encoding is performance. If XML is to be used not only for communication between systems but also for internal communication inside a system (e.g. for SOAP IPC), performance becomes much more important. And the performance difference between a good binary encoding and text XML is pretty large. CWXML, which is a pretty fast parser that supports both BXML and XML, is at least 3x faster when reading BXML compared to text XML, and it is 30x faster than reading gzip'd XML (http://www.cubewerx.com/main/cwxml/).


    Besides that, I don't think it matters which serialization format is used for the XML data model. The data model is the important part, not the text. The text serialization is quite useful because text happens to be a widespread data format, so you can use existing tools to view and edit XML. I agree with the Microsoft position that having to support more than one serialization format would be a bad thing. Every XML processor needs to support the text format. But that doesn't mean that there can't be an alternative in environments where the producer and consumer are able to negotiate the format, like with SOAP over HTTP. For these cases it makes sense to have a single binary format, because otherwise each vendor will have its own format. This would create a situation where a product may be able to interoperate with those of other vendors, but it will be slower.




    • Performance
      2003-12-04 10:52:22 Toni Uusitalo [Reply]

      It might be worth mentioning that CWXML supports only Latin1 encoding. Excerpt from http://www.cubewerx.com/main/HTML/Binary_XML_Encoding.html:


      "The implementation presently uses ISO-8859-1 string encoding internally..."


      I'm not saying that binary XML isn't worth a closer look and discussion, but comparing the performance of CWXML to, for example, libxml2 (a parser with full Unicode support) is simply stupid.


      Note that if an XML parser isn't Unicode-aware, it cannot check, for example, qualified names, which the XML specification requires any conformant parser to do.


      When we see the CWXML parser (or any binary XML parser) tested against the OASIS XML test suite -- with all tests successfully passed, of course -- then we should take this speed comparison seriously.

      • Performance
        2003-12-04 11:11:17 Toni Uusitalo [Reply]

        Addition: Of course, the OASIS XML test suite would have to be modified somehow to work as a binary XML test suite.

    • Performance
      2003-11-28 04:29:55 Tim Jansen [Reply]

      BTW another problem with gzip'd XML is the lack of random access. You won't see an XML-based format for storing images, sound or video with a text-encoded or gzip'd serialization format. It would simply be too slow. A binary XML serialization with support for blobs (like BXML) would make it possible to store even these 'classic' binary formats as XML. The raw bitmap would not be split into elements, of course, but the metadata would be, allowing XQuery and XPath queries. Unless you want to have at least one other data model beside XML in the future, the use of a binary serialization format is compelling.




  • Funny Article Title
    2003-11-23 05:48:13 Terris Linenbach [Reply]

    +1000 for that article title! It's nice to see some humor bordering/framing what is a divisive topic in the XML community.


    It seems to me, based on personal experience, that SOAP and REST over HTTP are fundamentally flawed for large-data scenarios. The answer to some is "binary XML." But that doesn't have to be the case. The transport layer could deal with this issue and provide perhaps an 80% solution that everyone could agree with.


    Many SOAP/REST toolkits do support zip compression at the HTTP layer. However, from my experience, naive compression just at the HTTP layer doesn't cut the mustard.


    1. I want to transmit a lot of data (I'm in the data warehousing and analytics space)
    2. I don't want the server to have to parse all the data to know how to route my messages
    3. I don't want the client or server to deserialize the entire message into memory (this is a common problem with toolkits)


    I've found SOAP attachments (DIME, soon to be replaced by PASWA) to be very useful, albeit somewhat non-standard and hugely inconvenient. I transmit the "real" message in a gzipped attachment and the "routing" stuff (method name, etc.) in clear text. You may be laughing, but it works, and nothing else does with today's XML technology.


    If it's true that zip/gzip can't decompress into a stream without decompressing everything into a file, then clearly it's advantageous to replace zip/gzip with something else that supports streaming decompression. Surely there must be something off the shelf.
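
On the premise above: gzip, at least, is a stream format and can be decoded chunk by chunk without first writing everything to a file. The following is a minimal sketch using Python's standard `zlib` module; the function name `gunzip_chunks`, the sample payload, and the 64-byte chunk size are arbitrary choices for illustration.

```python
import gzip
import zlib


def gunzip_chunks(chunks):
    """Yield decompressed bytes as each compressed chunk arrives."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decoder.decompress(chunk)
        if data:
            yield data
    tail = decoder.flush()
    if tail:
        yield tail


payload = b"<row>42</row>" * 1000
compressed = gzip.compress(payload)

# Simulate a network stream by slicing the compressed bytes into 64-byte pieces.
pieces = (compressed[i:i + 64] for i in range(0, len(compressed), 64))
restored = b"".join(gunzip_chunks(pieces))
```

The decoder only ever holds one chunk plus its own window, so the receiver can start consuming output before the sender has finished transmitting.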


    And now, on to the glorious religious warfare.


    I was very surprised to see a post that stated that schema-based compression (e.g., Sybase's db-lib binary format) is superior to gzip compression of text. At least for his needs. I would like to see more data and research on this topic before I believe the same effect would apply to me.


    I can understand why it's more efficient to read an integer as two bytes instead of via atoi(), but I don't necessarily agree that the compression is superior because compression tends to be a matter of the content rather than the format. If a message is mostly non-repetitive, good luck compressing it! In fact, I've heard from legitimate sources from both Microsoft and Java camps that this is a red herring. In other words, this is the Holy Grail long sought after by the relational database guys, the RPC guys, and, well, just about anyone who had anything to do with distributed systems.
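
The "two bytes instead of atoi()" point can be made concrete with a small sketch. The value and the big-endian unsigned 16-bit layout are assumptions chosen for the example; a real binary format would fix the width and byte order in its schema.

```python
import struct

value = 12345

# Text encoding: decimal digits that the reader must parse, as atoi() would.
as_text = str(value).encode("ascii")
from_text = int(as_text.decode("ascii"))

# Binary encoding: a fixed two-byte big-endian unsigned integer.
as_binary = struct.pack(">H", value)
(from_binary,) = struct.unpack(">H", as_binary)
```

This shows the parsing shortcut only. It says nothing about how well either form compresses, which is the separate question raised above.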


    I guess it's good to see the binary format folks agreeing that interoperability is important. But some have been barking up that tree for a very, very long time. Here is but a taste of one example:
    http://lists.ibiblio.org/pipermail/freetds/2002q3/007960.html


    I guess the stakeholders are hoping that somehow the w3c will have the power necessary to resolve all of the territorial arguments that "my format is faster and smaller than yours."


    Again, if that was possible, XML would not exist.


    Anyway, more power to them and good luck!

    • Funny Article Title
      2003-11-27 13:27:49 Anthony Coates [Reply]

      Yes, compression based on the (XML) Schema can indeed be much better than pure textual compression. This applies to data XML where the same message, with little or no structural variation, is transmitted many times. In this situation, the XML markup can be 80% of characters in the message. ZIP/GZIP have to compress the element/attribute names. Schema-based methods produce compress/decompressor pairs (on a per-Schema basis) that already know what the element/attribute names are, and so no bandwidth is wasted on encoding them. For many data messages, this makes a huge difference. For document XML the gains would be less, but for data XML it can be very worthwhile. Cheers, Tony.
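
A toy illustration of the argument: when both ends already share the schema, the element names never need to be transmitted at all. The message, the field layout (4-byte id, 2-byte quantity, 4-byte price), and the use of `struct` as a stand-in for a schema-generated compressor are all assumptions; this is not a benchmark and not any particular product's format.

```python
import gzip
import struct

# Hypothetical data message; both ends are assumed to know its structure.
xml_text = b"<trade><id>1001</id><qty>250</qty><price>1875</price></trade>"

# General-purpose compression still has to encode the element names.
gzipped = gzip.compress(xml_text)

# "Schema-aware" encoding: only the three values, in an agreed order and width.
packed = struct.pack(">IHI", 1001, 250, 1875)

sizes = {"xml": len(xml_text), "gzip": len(gzipped), "schema": len(packed)}
```

For one tiny message gzip's fixed header and trailer dominate, so the comparison here is deliberately lopsided; over long runs of similar messages gzip does much better, and the real gap has to be measured on real traffic.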

  • Binary XML would make my working life easier
    2003-11-22 03:47:48 Anthony Coates [Reply]

    Let's be honest - there are a *lot* of applications which involve small XML messages sent infrequently. For these, I don't see much to gain from binary XML. On the other hand, there are some areas, such as digital video and finance, where bandwidth is still a major issue. These areas need a compressed method for sending XML, because cheap limitless bandwidth hasn't arrived as quickly as some people expected (particularly on the private circuits used by the financial community).


    For my work in finance, the size of textual XML is the biggest barrier that I come across, and I come across it a lot. Where people are sensitive to what the bandwidth costs, the more compression the better. You can ZIP/GZIP your XML, and that helps, but I'm finding that Schema-sensitive compression (which is what binary XML comes down to in many cases) is typically 5x better, and that's a big saving.


    You also need a binary format that supports streaming decompression. ZIP doesn't (the index is at the end of the file, as I remember). If you are sending XML files with millions of records (as I want to be able to), you don't want to have to decompress that into a multi-gigabyte file if you can avoid it. So a compressed format that streams into SAX is great. I'm also not opposed to the enhancement of SAX to support Schema datatypes. It makes no great sense in a data application to have a compressed format that knows the difference between an integer and a string, but which then decompresses everything into strings so that the application can turn some of them back into integers. So SAX+datatypes would be a great addition. It's not what the document world wants, but it would be a good thing indeed for the data world.
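
"A compressed format that streams into SAX" can be approximated today with gzip and a stock SAX parser, as in this sketch. The `RecordCounter` handler and the in-memory sample document are assumptions for illustration; a real feed would come from a file or socket, and the values still arrive as strings, which is the limitation SAX+datatypes is meant to remove.

```python
import gzip
import io
import xml.sax
from xml.sax.handler import ContentHandler


class RecordCounter(ContentHandler):
    """Counts <record> elements without keeping any of them in memory."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "record":
            self.count += 1


# Build a compressed sample in memory; a real feed would be a file or socket.
document = b"<records>" + b"<record>1</record>" * 10000 + b"</records>"
compressed = io.BytesIO(gzip.compress(document))

counter = RecordCounter()
with gzip.GzipFile(fileobj=compressed) as stream:
    # xml.sax.parse reads from the file-like object incrementally,
    # so the decompressed document is never materialized as one file.
    xml.sax.parse(stream, counter)
```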


    So, remember, there are some areas that really are disadvantaged by text-only transports for XML. Certainly we don't all need it all of the time. But some of us do need it some of the time.

  • The issue is standardizing a "binary XML" for interoperability
    2003-11-21 01:43:05 Michael Rys [Reply]

    The problem is not necessarily "binary XML". The problem is the notion of making it an additional interoperability standard.


    I gave the presentation of the Microsoft position at the W3C workshop above. And we certainly do not see a value in standardizing a "binary XML" for interoperability (hint: nice to have references to sources, but it may be good to also read them). Having more than one interoperability standards format (even if they claim to be "the same"), fragments the interop story and thus is counter-productive.


    There is value for binary representations of Infosets, XQuery data models etc. for internal processing (database storage, close-coupled transport from storage to APIs and XML feeds). However, these formats will want to be highly optimized for the given architecture and performance scenarios; and these formats are not interested to sacrifice this for the sake of interop. Instead, the APIs and XML itself provide the interop layer.

    • The issue is standardizing a "binary XML" for interoperability
      2003-11-22 13:58:37 Bob Wyman [Reply]

      Michael Rys wrote:
      "However, these formats will want to be highly optimized for the given architecture and performance scenarios; and these formats are not interested to sacrifice this for the sake of interop. "
      It just isn't that "binary." It is simply not true that *everyone* who wants a binary compact encoding is so bit-sensitive that they would refuse to sacrifice some compression in order to get interop and reusable tool sets. In fact, I believe that a large number of people who need compression or faster parsing would view ASN.1 defined encodings as "good enough" 80% solutions and understand that the 20% is worth paying as the cost of interop and a vastly expanded tool set.


      bob wyman


    • The issue is standardizing a "binary XML" for interoperability
      2003-11-21 07:41:13 Michael Champion [Reply]

      I agree with Dare and Michael Rys that the article misrepresented Microsoft's position!


      I'm a bit wishy washy on this. I found Sun's presentation quite interesting; to paraphrase "We want to move our customers from RMI to SOAP, but they're resisting because there is a 10x performance degradation between RMI and JAX-RPC. We did some prototypes substituting ASN.1 for XML text as the SOAP serialization format and got the performance back up to RMI." MS and IBM countered that there are a LOT of other reasons besides XML text parsing that might explain that, and a LOT of room for optimization of XML text parsing that may get the performance back without sacrificing interop.


      I'm wondering about Michael Rys' point: "However, these formats will want to be highly optimized for the given architecture and performance scenarios; and these formats are not interested to sacrifice this for the sake of interop." There is a technical question of whether there are feasible application and platform-neutral formats that are dramatically faster to parse and more compact to store and transmit than XML text is. Clearly it would be a mistake to go off and start a W3C working group to standardize such a thing, but some collaborative research to see if it is possible seemed like a good idea to the workshop participants.


      If, and only if, further research indicates that something like this is feasible, would concrete requirements be specified. At that point it would be interesting to debate the question of whether XML users (broadly defined to include the Infoset-based technologies such as XSLT, XPath, XQuery, DOM) would be better or worse off with alternative Infoset serialization standards.


      Assuming for the sake of argument that a fast/compact alternative is feasible, I don't think the case for one and only one serialization is as open and shut as the MS folks believe. The overwhelming majority of real XML applications don't interoperate in any meaningful way -- perhaps because they require a shared schema, or because they use different character encodings (only UTF-8 and UTF-16 must be supported by XML parsers), or because much more "semantic" information than is encoded in a schema must be shared before data can be usefully processed. Adding an alternative serialization to the mix won't "break" much in the real world, as bad as it sounds in the abstract.

      • The issue is standardizing a "binary XML" for interoperability
        2003-11-21 12:32:01 Michael Rys [Reply]

        As I said at the workshop: If there is a format that covers all the use cases and has such a huge benefit that it outweighs the cost, then I personally (and probably Microsoft) can be swayed to change the position. The only problem I see is that we have done internal research on binary infoset representations for some while now and have not found one that addresses all the requirements in an interoperable way. But then, others may have a better idea and prove me wrong...


        And to Kendall: Thanks for fixing the error. Although now most of the replies seem to be context-free since the offending text does not appear above anymore (thanks to Dare who has it copied below).

  • Who Wants Binary XML?
    2003-11-20 13:42:52 Dare Obasanjo [Reply]

    Kendall Clark wrote


    "Many XML proponents and users came out of various binary exchange and format camps, and they are very unwilling to return to what were for them, or so it would seem, dark days. In this case, however, given the real power of those who most seem to want a binary variant -- including Sun, IBM, and Microsoft -- they may have to adopt a carefully tactical plan to limit the damage, rather than preventing the fight completely"


    I'm curious as to where you came to the conclusion that Microsoft is one of the parties pushing the W3C to come up with a binary variant of XML.


    My reply got too long so I've posted it on my weblog at


    http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=a2065106-11b5-4239-824c-5dcc1b525415

  • I'm not an "antitextualist..." I'm "encoding-neutral."
    2003-11-20 09:21:22 Bob Wyman [Reply]

    Grumble... I really don't like the implication that I'm an "anti-textualist." The reality is that I spend all day, every day, working on XML based systems and see great value in XML. On the other hand, my systems are "encoding independent" so they work just as well with binary encodings as they do with XML. Thus, I'd rather be characterized as "encoding neutral"... My pushing for binary is so that I have the choice available when it makes sense -- often, binary encodings don't make sense...


    Kendall, thanks for taking the time to summarize the continuing discussions re SAX and ASN.1 defined binary encodings. You've done a good job; however, I do wish that you had been able to record some of the consensus on issues that appears to have been generated from the discussion. Most important is the fact that it has been established that ASN.1 encodings can interchange transparently with SAX2 based systems, without API extensions, in both "no-schema" and "shared schema" environments. Thus, from a programmer's point of view, as long as they access their data streams via SAX2 or a similar API, they have no need to care whether they are working with textual XML or the same content encoded using an ASN.1 defined binary encoding.


    bob wyman


  • Is it really so bad?
    2003-11-20 02:58:46 Rinie Kervel [Reply]

    If 2 requirements could be fulfilled:
    - ziplike: encoding to and from binary form is lossless and reproduces exactly the text document
    - API / conceptual no change in processing (you process a binary document, but can think of it as the textual XML document)


    - So for machine processing you get speed and low memory usage.
    - For human readability/ debugging you use the 'unzip' utility


    • Is it really so bad?
      2003-11-24 01:19:53 James Fuller [Reply]

      a few ruminations;


      - transmitting xml in a binary format is a type of transport


      - saving xml in a binary format is a type of long term storage format


      - xml has clearly been defined as text based


      - processing a binary structure that contains xml is something fundamentally 'different' than processing xml, and trying to preserve some sort of analog with xml processing is a bit strange and I think a very deep rathole


      In any event, most of the requirements linked to 'binary' xml come down to the need to optimise...early optimisation is a definite antipattern.


      The loss in 'human readability' in designing and implementing such binary xml applications will add significant penalties. Let's put our trust that hardware will eventually solve the problem, it always has in the past.




    • Is it really so bad?
      2003-11-20 22:21:15 Mark Wilcox [Reply]

      First -- there's a really good article in XML Developers Journal this month on same topic.


      Second -- I think most of this has to do with a valid question -- how to best transmit XML over a network. The overhead of HTTP for SOAP may be a real limitation.


      ASN.1 is nice for transport because it's very easy to map ASN.1 to XML Schemas (because both do essentially the same thing -- map named values to data types).


      I think as things like traditional Web Services grow, we'll eventually see a SOAP stack on the NIC just as we've seen similar optimizations of the TCP stack implementations.


      Remember at one time, people didn't want TCP because it was too slow. This was one of the ideas of the OSI protocol suite.


      TCP eventually won because it was easier and less resource intensive.


      I have a feeling XML as text will do the same but ASN.1 will be used for cases where you need the optimizations of binary data transfer. Just as we see now on the Net (for example HTTP is all ASCII, LDAP uses ASN.1)
      Mark