The Cost of XML
In this week's column, I cover two debates that consider the cost of XML. In the first discussion, the cost is that of file size and processing overhead. In the second, it's actual dollars charged for access to a web service. Also, watch out for the special twilight zone moment as we find ourselves considering CSV files as a serious option.
The overhead or otherwise of using XML is once again a hot topic, as we have seen in recent XML-DEV discussions. The strong resurgence of the debate leads me to consider that XML might be crossing the threshold of yet another order of magnitude in adoption, causing a rush of reconsideration of old issues.
Tedd Sperling, a newcomer to XML, asked how best to model tabular data in XML:
In everything I have read, it appears that every chunk of content must be encapsulated by tags, such as:
But what about streams of data, like from a x/y recorder where one may have thousands of pieces of data? Is there some way to wrap this data into a series of comma delimited fields, such as:
<data> 123.456, 234.567, ... </data>
A detailed answer came from no less than Steve DeRose, who gets 10 SGML Old Guard points for managing to mention SHORTREF by the second paragraph of his reply. DeRose goes on to sum up why tagging each data item is useful, even though it seems like tremendous overhead.
1: If your data is "text files that are literally tens of thousands of characters in length", that is small enough that the overhead won't disturb most software running even on a cell phone. If we were talking many millions or billions of *records*, then this would be more of an issue (as it is for some users).
2: If you want the data formatted by CSS or XSL-FO, or transformed by XSLT, or whatever, having all the data in one syntax that the applications *already* know about is much easier than rewriting the applications or working around them to add some syntax (like commas) that they *don't* know about. You'll never have to debug the XML parser you use to parse all those "<data>" tags, but you will spend a lot of time if you try to introduce a new syntax in your process.
3: Any text file that contains zillions of instances of a certain string is necessarily very compressible...
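DeRose's second point can be sketched in a few lines of Python. This is only an illustration: the element names readings and value are hypothetical (the question shows only the data element), and the two helper functions are not from the thread. With one tag per item the XML parser does all the work, while the comma-packed form needs a second, hand-written parsing step.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents; only <data> appears in the original question.
tagged = "<readings><value>123.456</value><value>234.567</value></readings>"
packed = "<data> 123.456, 234.567 </data>"


def read_tagged(xml_text):
    """Every number is its own element, so the XML parser finds them all."""
    root = ET.fromstring(xml_text)
    return [float(item.text) for item in root.findall("value")]


def read_packed(xml_text):
    """The parser returns one opaque string; splitting on commas is
    extra syntax that the application has to handle itself."""
    root = ET.fromstring(xml_text)
    return [float(field) for field in root.text.split(",") if field.strip()]


print(read_tagged(tagged))
print(read_packed(packed))
```

Both functions return the same list of numbers. The difference is that generic tools such as XSLT or CSS can address the individual value elements in the first form, but see only a single text node in the second.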
DeRose then went on to give some figures that show, as we have heard before, that XML compression is reasonably competitive with compression of more basic delimited formats.
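A rough way to try the compressibility claim for yourself is sketched below in Python. It does not reproduce DeRose's figures: the data is synthetic (seeded pseudo-random readings), the value element name is an assumption, and zlib stands in for whichever compressor was used in the thread.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
values = [f"{rng.uniform(0, 1000):.3f}" for _ in range(10_000)]

# The same readings, once with a tag pair per item and once comma-delimited.
xml_text = "<data>" + "".join(f"<value>{v}</value>" for v in values) + "</data>"
csv_text = ",".join(values)

xml_raw = xml_text.encode("utf-8")
csv_raw = csv_text.encode("utf-8")

xml_z = zlib.compress(xml_raw, 9)
csv_z = zlib.compress(csv_raw, 9)

raw_ratio = len(xml_raw) / len(csv_raw)
packed_ratio = len(xml_z) / len(csv_z)

print(f"raw:        XML {len(xml_raw)} bytes, CSV {len(csv_raw)} bytes, ratio {raw_ratio:.2f}")
print(f"compressed: XML {len(xml_z)} bytes, CSV {len(csv_z)} bytes, ratio {packed_ratio:.2f}")
```

On data like this, the repeated tags cost the compressor very little, so the XML-to-CSV size ratio after compression is noticeably smaller than the ratio before it. Real exports with richer markup may behave differently.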
And then this seemingly well-trodden debate went just a little wild. Enter Stephen Beller, who repeated DeRose's experiment with a spreadsheet. He saved data from Excel into both XML and CSV:
The XML file was 840MB, the CSV 34MB -- a 2,500% difference. Compressed, the XML file was 2.5MB, the CSV 0.00015MB (150KB) -- a 1,670% difference.
Equally dramatic is the time it took to uncompress and render the files as an Excel spreadsheet: It took about 20 minutes with the XML file; the CSV took 1 minute -- a 2,000% difference.
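The percentages quoted above are simple ratios, which a few lines of Python can check. The function name is illustrative only. Note the assumption in the second call: the compressed CSV size is given both as 0.00015MB and as 150KB, and only the 150KB reading (0.15MB, taking 1MB as 1,000KB) agrees with the quoted 1,670%.

```python
def ratio_percent(larger, smaller):
    """Express how many times bigger `larger` is than `smaller`, as a percentage."""
    return larger / smaller * 100


raw = ratio_percent(840, 34)            # uncompressed sizes in MB
compressed = ratio_percent(2.5, 0.15)   # assumes 150KB, i.e. 0.15MB
as_printed = ratio_percent(2.5, 0.00015)  # the literal 0.00015MB figure
timing = ratio_percent(20, 1)           # load times in minutes

print(f"raw:        {raw:,.0f}%")
print(f"compressed: {compressed:,.0f}%")
print(f"as printed: {as_printed:,.0f}%")
print(f"timing:     {timing:,.0f}%")
```

The first result is about 2,471%, which rounds to the quoted 2,500%, the second is about 1,667%, and the timing figure is exactly 2,000%. The literal 0.00015MB reading would give a ratio well over a million percent, so it is most likely a slip in the original message.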
Reasonable people will accept some performance difference between XML and CSV, but a spreadsheet "filled with a single-digit number" is not the world's most realistic test, and the XML export is likely to contain much more metadata than the CSV export. As Tim Bray implied, Excel's "Save as XML" isn't quite the same as having designed a schema for one's data.
In a further exchange, Bill Kearney made the same point about Excel's XML format and also offered the viewpoint that XML's self-documenting nature will stand the test of time better than bald CSV. So just why, asked Kearney, is Beller arguing for CSV?
Beller's response indicates that he accepts the extra power of XML but bemoans the "greater consumption of resources during transport and parsing." And it gets worse, we are told:
And when you throw in all sorts of attributes and formatting instructions, the consumption climbs even more. Hence, the XML backlash. We'd be wise, IMO, to recognize this trade-off and act accordingly.
By now you are probably as agog as I am to find out what, after six years, we really ought to be doing. I'll delay no longer:
There is an elegant solution, which involves using CSV data in novel ways, but it's a proprietary process and this is not the right venue to discuss it.
Elegance! CSV! Proprietary processes!
Bill Kearney certainly wasn't joining the line to pay royalties. Besides, the argument was getting very silly, he said. "What next, railing against using Unicode?" Kearney's quip was just a little more depressingly likely than it was funny. I certainly recall enough U.S.-based developers vociferously unaccepting of the need for anything other than ASCII. But what can you do when seemingly self-evident truths are denied by blinkered zealots?
I'll leave the last word on this strange debate to Mike Kay:
You have totally missed the point, Steve. The benefit of XML is that we no longer have to reinvent clever ways of representing complex data, and can exercise our innovative skills at higher level of the system where it gives a greater return.
With all that said, I can't suppress a somewhat morbid desire to see how XML's expressiveness can be packed into comma-separated value files and still remain "elegant." Do CSVs dream of qnames?
An interesting debate blew up this week in the weblog of open source developer Alex Graveley. A programmer working with the GNOME desktop platform, Graveley wanted to create a system-tray notification program that worked with eBay's web services to notify users of the status of their active eBay bids.
Unfortunately, Graveley ran afoul of the current pricing and registration requirements around eBay's service. It seems that even if you join up to eBay's developer program yourself, at a cost of $100, the users of your software must also pay eBay to be able to use your program. Graveley thought this counterproductive and somewhat at odds with eBay's "viral" business model:
Of course the only option for most developers (open or proprietary) given these restrictions is to screen-scrape, completely defeating the stated purpose of the Developer Program.
It's amazing that a large company, built largely around a viral business model can be this hypocritical.
In response to Graveley's post, Ryan Thiessen reckoned that the reason the web service use is so expensive is to contain usage:
I think eBay is just trying to give a monetary incentive for developers to use as few API calls as possible to reduce the load for eBay's servers, which is different tha[n] not allowing open source applications because of any perceived quality difference.
EBay's web services evangelist, Jeff McManus, joined in the conversation, agreeing with Thiessen's diagnosis. McManus points to an entry on his own weblog, where the topic of open source programs against the eBay API is discussed more fully. McManus' position is that he doesn't see why developers should object to paying, as he perceives eBay as being akin to a telephone operator. If you want to use the service, in whatever way, you pay the bill.
In a subsequent post, Graveley reiterates that it's not just a matter of the $100 fee for the developer, but that all users of the software will face a similar fee.
It isn't a one-time fee. It's a per-user $100 fee, plus a multi-phase disconnected registration process that cannot be automated. How much of a percentage drop in purchases could you expect if Ebay charged $100 for a user's first purchase, no matter what?
Really, what do you have to lose by opening up the read-only methods for all to use for free? I mean, it isn't as if people aren't screen-scraping already.
Amazon and Google have similar, though free, read-only APIs, and it's not beyond imagination that for simple things such as checking on bid status, eBay might introduce a similar service. The likely reason this hasn't happened so far is the possible effect of the load on eBay's servers.
This discussion highlights the fact that companies must be wary when introducing public web services. But we also know from cases such as Amazon's that public web services can be fantastic success stories. I sense there really is an opportunity for eBay here if they can come up with a cheaper solution for offering web service access.
This week's announcements are taken from the RDF Interest list, due to a lack of XML-specific announcements.
A somewhat strange XML/RDF vocabulary for describing "resources," which is what I thought RDF did anyway...
Sean Palmer has developed a second implementation of Tim Berners-Lee's N3 notation for RDF.
A change in Microsoft's XML team, but it looks like Software AG's given up on XML ... Apple's chance to influence XQuery ... lest we forget Engelbart ... a perverse brain teaser for the holidays ... 148 messages to XML-DEV last week, 26% XQuery bickering ... Sean McGrath neatly sums up XML vs RDF.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.