Article:
 Binary XML, Again
Subject: You're 3/4 right
Date: 2005-01-14 19:51:39
From: David_Mertz
Response to: You're 3/4 right

A 30:1 ratio with gzip actually isn't a big deal, not when the data is highly redundant and structured. But you can improve things a bit by massaging the XML slightly before feeding it to gzip.


I wrote a couple of articles on this several years back (for Intel and IBM). The idea: rather than define a custom binary format that every tool needs to understand, simply perform a reversible transformation on the XML (i.e. compression) for the storage and transmission steps. The application writing the XML and the application reading it then need to know nothing about the compression. That's all pipelined, in a way invisible to the ends.
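To make the "reversible transformation" point concrete, here is a minimal sketch using Python's standard gzip module. The names pack and unpack and the sample document are made up for illustration; they are not from the articles. The writer hands over plain XML and the reader gets plain XML back, with the compression confined to the middle.

```python
import gzip

def pack(xml_text: str) -> bytes:
    """Storage/transmission side: compress the serialized XML."""
    return gzip.compress(xml_text.encode("utf-8"))

def unpack(blob: bytes) -> str:
    """Receiving side: undo the transformation before the reader sees it."""
    return gzip.decompress(blob).decode("utf-8")

# Hypothetical, highly redundant document standing in for real data.
doc = "<log>" + "<entry host='example.org' status='200'/>" * 1000 + "</log>"

wire = pack(doc)             # what actually gets stored or sent
assert unpack(wire) == doc   # the reader sees exactly what the writer produced
print(len(doc.encode("utf-8")), "bytes of XML ->", len(wire), "bytes on the wire")
```

Neither end needs an XML-specific binary format; any reversible step could be swapped in for gzip here.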


This is generally the same idea as what XMill does. However, the stuff in my articles is public domain and not restricted by any patents. See the articles at:
http://www-106.ibm.com/developerworks/xml/library/x-matters13.html
http://www-106.ibm.com/developerworks/xml/library/x-matters19/


Actually, you can find better-formatted copies of the Intel versions at:


http://gnosis.cx/publish/tech_index_ids.html


One test file I used in my article was a weblog. Compression results (in part):


3021921 weblog.xml
66994 weblog.xml.bz2
115524 weblog.xml.gz
83152 weblog.restruct


The last one is where I do a bit of streamable massaging of the XML (i.e. grouping like tags together with info on how to put them back in the original order) prior to running it through Gzip.
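As a rough illustration of "grouping like tags together with info on how to put them back", here is a simplified sketch. It is not the code from the articles, and it is not streaming: it splits the document into tags and text runs, keeps the tag sequence as a skeleton, files each text run under the tag that precedes it, and gzips the result. The names restructure and restore, the JSON container, and the sample log are all assumptions made for this example.

```python
import gzip
import json
import re
from collections import defaultdict

TAG = re.compile(r"(<[^>]*>)")  # capturing group, so split() keeps the tags

def restructure(xml_text: str) -> bytes:
    pieces = TAG.split(xml_text)        # [text, tag, text, tag, text, ...]
    skeleton = pieces[1::2]             # tags in original document order
    streams = defaultdict(list)         # text runs grouped by preceding tag
    streams[""].append(pieces[0])       # text before the first tag
    for tag, text in zip(pieces[1::2], pieces[2::2]):
        streams[tag].append(text)
    payload = json.dumps({"skeleton": skeleton, "streams": streams})
    return gzip.compress(payload.encode("utf-8"))

def restore(blob: bytes) -> str:
    data = json.loads(gzip.decompress(blob).decode("utf-8"))
    cursors = {tag: iter(texts) for tag, texts in data["streams"].items()}
    out = [next(cursors[""])]
    for tag in data["skeleton"]:        # replay tags, pulling text back in order
        out.append(tag)
        out.append(next(cursors[tag]))
    return "".join(out)

# Hypothetical weblog-like sample.
sample = "<log>" + "".join(
    f"<entry><host>10.0.0.{i % 7}</host><path>/page/{i}</path>"
    f"<status>200</status></entry>"
    for i in range(500)
) + "</log>"

packed = restructure(sample)
assert restore(packed) == sample        # the transformation is reversible
print(len(sample), len(gzip.compress(sample.encode("utf-8"))), len(packed))
```

Whether the grouped form beats plain gzip depends on the data and on how the streams are laid out, so the sketch prints the sizes rather than promising a winner.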


A moral I take from this is that while my technique is moderately clever, plain old gzip ALREADY gets almost 30:1 (about 26:1 on this file). Bzip2 is very slow, but does better still. My restructuring pass is fast, though: fast enough to embed in routers to handle realtime traffic (see the above URLs).
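For anyone who wants the ratios spelled out, this small calculation uses only the byte counts listed above. The variable names are arbitrary.

```python
# Byte counts copied from the listing above.
sizes = {
    "weblog.xml": 3021921,
    "weblog.xml.bz2": 66994,
    "weblog.xml.gz": 115524,
    "weblog.restruct": 83152,
}

original = sizes["weblog.xml"]
ratios = {
    name: original / size
    for name, size in sizes.items()
    if name != "weblog.xml"
}

for name, ratio in ratios.items():
    print(f"{name:16s} {ratio:5.1f}:1")
# weblog.xml.bz2    45.1:1
# weblog.xml.gz     26.2:1
# weblog.restruct   36.3:1
```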

