Dictionaries and Datagrams

January 24, 2001

Leigh Dodds

This week XML-DEV examined two aspects relating to the textual encoding of XML: verbosity and multilingual elements.

Text Compression

It seems that many developers, used to dealing with binary data formats, are still uncomfortable about embracing text formats like XML. Yet the received wisdom is that switching to a binary format does not offer many advantages. Surprised by this counter-intuitive viewpoint, Mark Papiani caused XML-DEV to debate the merits of binary encodings.

An often quoted disadvantage of XML/HTTP is that it will be less efficient than a binary protocol. It seems intuitive that compressed formats would exhibit better network performance than textual transmission. But perhaps this depends on the size of the chunks of data to be transferred and the processing necessary to encode/decode the data?

This prompted David Megginson to provide some anecdotal evidence supporting the use of text formats.

I recall one of the presenters at XTech last year had been experimenting with using a Java-serialized DOM rather than the source XML document, and was surprised to find that the Java-serialized DOM was both considerably larger than the *uncompressed* XML and considerably slower to load.

Eugene Kuznetsov observed that this is hardly surprising given that parse trees are generally larger than unparsed document instances.

That shouldn't be surprising -- generally speaking a parse tree is larger than the text input to the parser. (And Java serialization doesn't win any awards for efficiency in either time or space).

However, a binary representation of the same data (specified in ASN.1 and encoded using BER, say) would be much smaller and more efficient.

Caroline Clewlow noted that the size increase depends on the type of content being parsed. Adding quantitative data to the debate, David Megginson gave a worked example of binary versus text encoding. His long message is worth digesting in its entirety (Peter Murray-Rust would describe it as an "XML Jewel"), and its conclusions are worth repeating here:

Even if you could write a one-pass binary-format parser that could run, say, 20% faster than the best XML parsers in the same language, that advantage would make almost no difference to total application time, since, in my experience, actual XML parsing (as opposed to building object trees, DOM or otherwise) accounts for well under 5% of execution time for even the simplest real-world applications. A 20% improvement in parsing time would give you at most a 1% improvement in application execution time. If the document is coming over the net rather than from a local drive, even that small advantage will probably be lost in network latency.

In summary, then, there are some good arguments for writing a navigable XML-like binary format to allow large documents to be processed with a DOM without loading the whole document into memory. I can see very little argument for using a binary format simply for space or efficiency.
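Megginson's arithmetic can be checked in a few lines. The following is only a sketch of the reasoning, not anything posted to the list: the function name and parameters are illustrative, and it assumes the 20% figure means a 20% reduction in time spent parsing.

```python
def overall_improvement(parse_fraction: float, parse_speedup: float) -> float:
    """Fraction of total run time saved when only the parsing step gets faster.

    parse_fraction: share of total execution time spent parsing (0..1)
    parse_speedup:  fraction by which parsing time is reduced (0..1)
    """
    return parse_fraction * parse_speedup


# Figures from the quoted message: parsing is under 5% of run time,
# and the hypothetical binary parser saves 20% of that.
saving = overall_improvement(parse_fraction=0.05, parse_speedup=0.20)
print(f"{saving:.1%}")  # prints 1.0%
```

Because 5% is an upper bound on the parsing share, 1% is an upper bound on the saving, which is the point of the "at most" in the quote.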

Acknowledging the verbosity of XML encoding, Ken North observed that there were many other significant issues to consider in distributed applications.

In one early XML/EDI example, an EDIFACT message ballooned from < 1K to 11K when coded as XML/EDI. However, that's not nearly the most serious bottleneck in distributed applications.

Network latency, flawed application design (e.g., not pooling database connections), poorly-designed databases, and poorly-optimized queries all contribute a greater performance penalty than the difference between using text or binary message formats.

Noting the interoperability benefits of XML, Danny Ayers pointed out that it may also help decouple system components.

In my opinion the interoperability afforded by XML far [outweighs] any minor performance hit, though this is entirely dependent on sensible implementation - you don't want to be converting to and from XML several times over a data path.

The introduction of a binary format raises the spectre of coupling between sections of a system. Using XML means there is a clearly defined interface, so any changes to individual parts of a system can be carried out in relative isolation.

While it seems that there are few performance gains to be had from a binary XML format, there are grounds for compressing XML during data exchange. Mark Beddow highlighted redundancy as one exploitable property of XML.

The apparent textual redundancy of xml tagging means tokenising compressors can really get to work...
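Beddow's point is easy to try out. The sketch below is an illustration only: it uses Python's standard zlib module (DEFLATE) as a stand-in for "tokenising compressors" in general, and the document it compresses is a made-up, deliberately repetitive one, so the ratio it prints says nothing about real-world data.

```python
import zlib

# Hypothetical document: the same tags repeated around small values.
records = "".join(
    f"<record><id>{i}</id><name>item{i}</name></record>" for i in range(500)
)
xml_text = f"<records>{records}</records>".encode("utf-8")

# DEFLATE replaces repeated byte sequences (here, mostly the tags)
# with short back-references.
compressed = zlib.compress(xml_text)

ratio = len(compressed) / len(xml_text)
print(len(xml_text), len(compressed), f"{ratio:.2f}")
```

How much is gained depends heavily on how much of the document is markup rather than content, which echoes Clewlow's observation above.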

The interested reader may care to look at a previous XML Deviant article, "Good Things Come In Small Packages," which reviewed an earlier discussion of compression and binary coding techniques. In addition to this, Stefano Mazzocchi's binary serialisation of SAX and Rick Jelliffe's STAX (Short TAgged XML) are worth examining. STAX is a lightweight compression technique, which Jelliffe believes could form one end of a spectrum of possible approaches.

I think it would be good to have (something like) this kind of ultra-low-end compression available (i.e. as a matter of compression negotiation), because I think many servers are [too] busy to compress data going out (STAX can be generated by the XML-generating API, and read directly into a SAX stream).

I think it would be useful to have several different compression methods widely deployed to suit different situations-- STAX fitting into the extreme low-end.

If anyone is interested in taking this further, I think it would be good. And it is probably the kind of small infrastructure upgrades that could be fun and doable for open-source and collaborative development.

Translation Dictionaries

One property of XML rarely commented upon is the language in which its schemas are expressed, though the Deviant reported on an earlier discussion relating to internationalisation of DTDs and Schemas ("Speaking Your Language"). The debate on XML verbosity prompted Don Park to raise the issue of long element names, noting that a standard like XML-DSIG is precluded from use in mobile, bandwidth-limited applications. Park went on to consider the use of abbreviated tag names, outlining some potential topics for discussion.

  1. should schemas be expanded or an alternate version be used?
  2. should a new namespace be defined or old namespace be reused?
  3. what role does RDDL play?
  4. should there be a dynamic abbreviation mechanism? (no, imho)
  5. how should abbreviated version of existing standards be created?
  6. should there be standard rules for abbreviating tag names?

Park later provided an example abbreviated version of the XML-DSIG DTD. This exchange moved Simon St. Laurent to share some thoughts on the use of 'dictionary resources', which have been suggested by the recent RDDL activity.

It seems like there are a substantial number of cases where 1-1 equivalence actually happens in the world - abbreviation and translation being the two largest. I'm pondering (haven't yet built) a thesaurus processor, which lets you feed in a set of rules and specify which set applies, and then run it over documents.

It does less than XSLT and carries less freight than XML Schema equivalence classes, which seems like a good thing to me. I suspect it won't be that hard to implement as a SAX filter, XSLT transform, or DOM processor, though I'm still getting started.
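St. Laurent had not yet built his processor, so the following is not his design. It is a minimal sketch of the SAX filter option he mentions, written with Python's standard xml.sax modules. The class name, the `translate` helper, and the abbreviation table are all hypothetical (the short names are not taken from Park's DTD), and the sketch assumes namespace processing is switched off, so it matches on element names exactly as they appear in the document.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class ThesaurusFilter(XMLFilterBase):
    """Rename elements using a one-to-one table; pass everything else through."""

    def __init__(self, parent, mapping):
        super().__init__(parent)
        self._mapping = mapping

    def startElement(self, name, attrs):
        # Names missing from the table are left unchanged.
        super().startElement(self._mapping.get(name, name), attrs)

    def endElement(self, name):
        super().endElement(self._mapping.get(name, name))


def translate(xml_text, mapping):
    """Run xml_text through the filter and return the re-serialised document."""
    out = io.StringIO()
    filt = ThesaurusFilter(xml.sax.make_parser(), mapping)
    filt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    filt.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


# Hypothetical abbreviation table, for illustration only.
abbreviations = {"Signature": "Sig", "SignedInfo": "SI"}
print(translate("<Signature><SignedInfo>x</SignedInfo></Signature>", abbreviations))
```

Since the table is one-to-one, inverting the dictionary gives the expansion step, and the same filter would serve for translating element names between natural languages.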

Eric van der Vlist was able to provide an example of what such a dictionary resource, or translation table, might look like.

The most promising aspect of this discussion is that, while it covers some old ground (for XML-DEV at least), it comes at a time when the list is already collaboratively producing RDDL, arguably an important piece of the puzzle. Although some of that ground is already a year old, the potential for progress may be greater now, and there is more concrete experience to draw upon.