30:1 with gzip actually isn't a big deal. Not if the data is highly redundant and structured. But you can improve things a bit by massaging the XML slightly before feeding it to Gzip.
I wrote a couple of articles on this several years back (for Intel and IBM). The idea: rather than define a custom binary format that every tool needs to understand, simply perform a reversible transformation on the XML (i.e. compression) for the storage and transmission steps. The XML-writing application and the XML-reading application then need to know nothing about the compression. It's all pipelined, in a way invisible to the ends.
This is generally the same idea as what XMill does. However, the stuff in my articles is public domain and not restricted by any patents. See the articles at:
http://www-106.ibm.com/developerworks/xml/library/x-matters13.html
http://www-106.ibm.com/developerworks/xml/library/x-matters19/
Actually, you can find better-formatted copies of the Intel versions at:
http://gnosis.cx/publish/tech_index_ids.html
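To make the "invisible to the ends" point concrete, here is a minimal Python sketch. It is not code from the articles: `pack` and `unpack` are hypothetical names, plain gzip stands in for whatever reversible transform you pick, and the sample log is made up.

```python
import gzip
import xml.etree.ElementTree as ET

def pack(xml_bytes: bytes) -> bytes:
    """Reversible transform applied just before storage or transmission."""
    return gzip.compress(xml_bytes)

def unpack(blob: bytes) -> bytes:
    """Inverse transform applied just after retrieval or receipt."""
    return gzip.decompress(blob)

# The writing application produces ordinary XML (made-up sample data).
root = ET.Element("log")
for i in range(1000):
    entry = ET.SubElement(root, "entry", host="10.0.0.1", status="200")
    entry.text = f"/page{i % 7}.html"
xml_out = ET.tostring(root)

# The pipeline compresses it and later decompresses it.
wire = pack(xml_out)
xml_in = unpack(wire)

# The reading application parses ordinary XML again.
entries = ET.fromstring(xml_in).findall("entry")
print(len(xml_out), len(wire), len(entries))
```

Neither end ever touches the compressed bytes, so swapping gzip for a smarter transform only changes `pack` and `unpack`.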
One test file I used in my article was a weblog. Compression results (in part; sizes in bytes):
3021921  weblog.xml
  66994  weblog.xml.bz2
 115524  weblog.xml.gz
  83152  weblog.restruct
The last one is where I do a bit of streamable massaging of the XML (i.e. grouping like tags together with info on how to put them back in the original order) prior to running it through Gzip.
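A toy Python sketch of the "group like tags together, plus info to restore the order" idea follows. This is an illustration under loud assumptions, not the algorithm from the articles: `restructure` and `rebuild` are hypothetical names, it parses the whole document with ElementTree (so it is not streamable), it ignores attributes and tail text, and it uses JSON as a convenient container before gzip.

```python
import gzip
import json
import xml.etree.ElementTree as ET
from collections import defaultdict

def restructure(xml_bytes):
    """Split a document into a tag-order skeleton plus per-tag text groups."""
    skeleton = []               # "+tag" opens an element, "-" closes one
    groups = defaultdict(list)  # tag -> its text contents, in document order

    def walk(elem):
        skeleton.append("+" + elem.tag)
        groups[elem.tag].append(elem.text or "")
        for child in elem:
            walk(child)
        skeleton.append("-")

    walk(ET.fromstring(xml_bytes))
    return skeleton, dict(groups)

def rebuild(skeleton, groups):
    """Inverse of restructure: replay the skeleton, pulling text per tag."""
    cursors = {tag: iter(texts) for tag, texts in groups.items()}
    stack, root = [], None
    for token in skeleton:
        if token == "-":
            root = stack.pop()  # the last element popped is the document root
        else:
            tag = token[1:]
            elem = ET.SubElement(stack[-1], tag) if stack else ET.Element(tag)
            elem.text = next(cursors[tag]) or None
            stack.append(elem)
    return root

# Made-up weblog-like sample: repetitive structure, varying field values.
doc = b"<log>" + b"".join(
    b"<entry><host>10.0.0.%d</host><path>/page%d.html</path>"
    b"<status>200</status></entry>" % (i % 5, i % 7)
    for i in range(500)
) + b"</log>"

skeleton, groups = restructure(doc)
packed = gzip.compress(json.dumps([skeleton, groups]).encode())

# Reverse the pipeline and confirm nothing was lost.
skeleton2, groups2 = json.loads(gzip.decompress(packed))
restored = ET.tostring(rebuild(skeleton2, groups2))
print(len(doc), len(gzip.compress(doc)), len(packed))
```

The point of the grouping is that all the hosts sit next to each other, all the paths sit next to each other, and so on, which gives the compressor longer runs of similar material. How much that helps depends on the data, so measure on your own files.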
A moral I take from this is that while my technique is moderately clever, plain old Gzip ALREADY gets about 26:1 on this file, not far off that 30:1. Bzip2 is very slow, but does better still. My restructuring pass is fast, though: fast enough to embed in routers to handle realtime traffic (see the above URLs).
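For anyone who wants the ratios rather than raw byte counts, this short Python snippet works them out from the sizes listed above. Only the dictionary layout and variable names are mine.

```python
# Sizes in bytes, copied from the listing above.
sizes = {
    "weblog.xml": 3021921,
    "weblog.xml.bz2": 66994,
    "weblog.xml.gz": 115524,
    "weblog.restruct": 83152,
}

original = sizes["weblog.xml"]
ratios = {
    name: original / size
    for name, size in sizes.items()
    if name != "weblog.xml"
}

for name, ratio in ratios.items():
    print(f"{name:16} {ratio:5.1f}:1")
```

That comes to roughly 26:1 for plain gzip, 36:1 for the restructured version and 45:1 for bzip2.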