Article:
 Binary Killed the XML Star?
Subject: Funny Article Title
Date: 2003-11-23 05:48:13
From: Terris Linenbach

+1000 for that article title! It's nice to see some humor bordering/framing what is a divisive topic in the XML community.


It seems to me, based on personal experience, that SOAP and REST over HTTP are fundamentally flawed for large-data scenarios. The answer, to some, is "binary XML." But that doesn't have to be the case. The transport layer could deal with this issue and provide perhaps an 80% solution that everyone could agree with.


Many SOAP/REST toolkits do support zip compression at the HTTP layer. However, in my experience, naive compression at the HTTP layer alone doesn't cut the mustard. My requirements:


1. I want to transmit a lot of data (I'm in the data warehousing and analytics space)
2. I don't want the server to have to parse all the data to know how to route my messages
3. I don't want the client or server to deserialize the entire message into memory (this is a common problem with toolkits)


I've found SOAP attachments (DIME, soon to be replaced by PASWA) to be very useful, albeit somewhat non-standard and hugely inconvenient. I transmit the "real" message in a gzipped attachment and the "routing" stuff (method name, etc.) in clear text. You may be laughing, but it works, and nothing else does with today's XML technology.
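
To make the split concrete, here is a minimal sketch of the idea, not of DIME itself. The framing (a 4-byte length prefix, a clear-text routing header, then a gzipped body) and the names `pack_message`, `read_routing`, and `open_payload` are assumptions made up for illustration.

```python
import gzip
import io
import struct

def pack_message(routing: bytes, payload: bytes) -> bytes:
    # Hypothetical frame layout: 4-byte big-endian header length,
    # clear-text routing header, then the gzipped payload.
    return struct.pack(">I", len(routing)) + routing + gzip.compress(payload)

def read_routing(stream) -> bytes:
    # Reads only the routing header; the payload is left unread on the stream.
    (length,) = struct.unpack(">I", stream.read(4))
    return stream.read(length)

def open_payload(stream):
    # Wraps the rest of the stream; data is decompressed incrementally as it is read.
    return gzip.GzipFile(fileobj=stream, mode="rb")

frame = pack_message(b'<route method="LoadFacts"/>', b"<row>1</row>" * 1000)
stream = io.BytesIO(frame)
routing = read_routing(stream)               # a router needs only this part
first_row = open_payload(stream).read(12)    # the receiver pulls rows on demand
```

The point of the layout is that a router can stop after `read_routing` and forward the remaining bytes untouched, while the final receiver reads the body in pieces instead of materializing it all at once.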


If it's true that zip/gzip can't decompress into a stream without decompressing everything into a file, then clearly it's advantageous to replace zip/gzip with something else that supports streaming decompression. Surely there must be something off the shelf.
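
For what it's worth, the gzip format (a DEFLATE stream with a small header and trailer) can be decompressed a chunk at a time. A minimal sketch using Python's standard `zlib` module follows; the helper name `iter_decompressed` and the 64-byte chunk size are assumptions for illustration. This says nothing about multi-file zip archives, which are a different container format.

```python
import gzip
import zlib

def iter_decompressed(chunks):
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        out = decompressor.decompress(chunk)
        if out:
            yield out
    tail = decompressor.flush()
    if tail:
        yield tail

original = b"<row><id>1</id></row>" * 5000
compressed = gzip.compress(original)

# Simulate a network stream by feeding 64 bytes at a time.
pieces = (compressed[i:i + 64] for i in range(0, len(compressed), 64))
restored = b"".join(iter_decompressed(pieces))
```

Nothing is written to a file here, and the full compressed input is never required up front; the join at the end is only there to check the result.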


And now, on to the glorious religious warfare.


I was very surprised to see a post that stated that schema-based compression (e.g., Sybase's db-lib binary format) is superior to gzip compression of text. At least for his needs. I would like to see more data and research on this topic before I believe the same effect would apply to me.


I can understand why it's more efficient to read an integer as two bytes instead of via atoi(), but I don't necessarily agree that the compression is superior because compression tends to be a matter of the content rather than the format. If a message is mostly non-repetitive, good luck compressing it! In fact, I've heard from legitimate sources from both Microsoft and Java camps that this is a red herring. In other words, this is the Holy Grail long sought after by the relational database guys, the RPC guys, and, well, just about anyone who had anything to do with distributed systems.
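
The "content rather than format" point is easy to check with a small experiment. This is a sketch using Python's `zlib`; the sample data and the `ratio` helper are made up for illustration and are no substitute for measurements on real messages.

```python
import random
import zlib

def ratio(data: bytes) -> float:
    # Compressed size divided by original size; lower means better compression.
    return len(zlib.compress(data)) / len(data)

rng = random.Random(0)  # fixed seed so the run is repeatable
repetitive = b"<row><id>1</id><qty>100</qty></row>" * 2000
noisy = bytes(rng.getrandbits(8) for _ in range(len(repetitive)))

print(f"repetitive: {ratio(repetitive):.3f}")
print(f"noisy:      {ratio(noisy):.3f}")
```

Both inputs are the same length, but only the repetitive one shrinks meaningfully; the random bytes come out about the same size as they went in.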


I guess it's good to see the binary format folks agreeing that interoperability is important. But some have been barking up that tree for a very, very long time. Here is but a taste of one example:
http://lists.ibiblio.org/pipermail/freetds/2002q3/007960.html


I guess the stakeholders are hoping that somehow the W3C will have the power necessary to resolve all of the territorial arguments that "my format is faster and smaller than yours."


Again, if that were possible, XML would not exist.


Anyway, more power to them and good luck!


  • Funny Article Title
    2003-11-27 13:27:49 Anthony Coates [Reply]

    Yes, compression based on the (XML) Schema can indeed be much better than pure textual compression. This applies to data XML where the same message, with little or no structural variation, is transmitted many times. In this situation, the XML markup can be 80% of the characters in the message. ZIP/GZIP have to compress the element/attribute names. Schema-based methods produce compressor/decompressor pairs (on a per-Schema basis) that already know what the element/attribute names are, and so no bandwidth is wasted on encoding them. For many data messages, this makes a huge difference. For document XML the gains would be less, but for data XML it can be very worthwhile. Cheers, Tony.
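
A rough sketch of that effect on a single small data message. The message, the field layout, and the `encode`/`decode` helpers are invented for illustration; a real schema-based tool would generate the codec from the XML Schema rather than rely on a hand-written `struct` layout, so treat this as a stand-in, not as any particular product's method.

```python
import gzip
import struct

xml = b"<trade><id>12345</id><qty>100</qty><price>101.25</price></trade>"
values = (12345, 100, 101.25)

# Hypothetical fixed layout that both ends agree on in advance:
# unsigned 32-bit id, unsigned 32-bit qty, 64-bit float price.
LAYOUT = struct.Struct(">IId")

def encode(trade):
    # Only the values travel; the element names are implied by the layout.
    return LAYOUT.pack(*trade)

def decode(blob):
    return LAYOUT.unpack(blob)

# Share of the XML text that is markup rather than data.
data_chars = sum(len(str(v)) for v in values)
markup_share = 1 - data_chars / len(xml)

packed = encode(values)
print("markup share of XML:", round(markup_share, 2))
print("XML bytes:", len(xml))
print("gzipped XML bytes:", len(gzip.compress(xml)))
print("schema-aware bytes:", len(packed))
```

On a message this short, gzip has little repetition to work with and still has to carry the tag names plus its own header and trailer, whereas the agreed layout carries only the values. Over a long stream of identical messages gzip would do much better, so the gap depends heavily on message size and batching.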

