XML 2005: Tipping Sacred Cows

November 23, 2005

Micah Dubinko

If there's an overarching theme to XML Annoyances, it's a simple imperative: think! The point isn't to rant against the system, tempting though that might be. During the expansion phase of XML, early adopters converged on a specific set of practices, conventions, and "common knowledge." We'll get to a few specifics in a moment, but for now we'll affectionately call these things sacred cows.

Now, rather than technological expansion, XML is undergoing user base expansion; as a mainstream technology, XML has users of all different stripes and levels of previous experience. Lots of readers are new enough to XML to have recently run into a sacred cow or two. Without the shared formative experiences of some of the old XML pros, new users just get annoyed, usually turning to books, mailing lists, or co-workers for an explanation. All too often, the response new users get is a varyingly polite request to accept things as they are.

I disagree on principle. Often, pushing against a sacred cow yields a surprising new insight or at least a better understanding of a complex reality. With that attitude, I attended this year's major U.S. XML conference, XML 2005 in Atlanta, with a theme of "Syntax to Semantics."

Overall, the conference was less about fireworks and controversy and more about thoughtful contemplation of a maturing technology. Perusing the schedule-at-a-glance, one thing that jumps out is the sheer breadth of topics. XML is everywhere: thesauri and higher education, calendaring, health care and pharma, applications and modeling, hazardous waste management and emergency alerting protocols, financial services, and even artificial intelligence.

As Kurt Cagle writes, XML technology has become the software industry. Indeed, a number of the talks were about the integration of XML into mainstream languages. John Schneider's Wednesday talk on ECMAScript for XML (E4X) was a prime example. Available already in recent builds of Firefox and ActionScript 3, among other places, the language includes native XML support, enabling code like the following:

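 <script type="text/javascript; e4x=1">
   // An XML literal is a first-class value in E4X
   var x = <a> <b>Hello</b> <c>world!</c> </a>;
   // Child elements are reachable with ordinary dot notation
   alert(x.b);  // shows "Hello"
 </script>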

There are lots of additional features and conveniences. I highly recommend that JavaScript programmers take a look. Other languages are moving in a similar direction. Several presentations talked about XML features added to VB9, C#, and Java.

Another feature of this year's conference was a noticeably reduced level of anti-W3C XML Schema ranting, both in sessions and in the hallways. Bob DuCharme's Wednesday talk Your Schema and the Industry-Standard Schema included overviews of both W3C XML Schema and the RELAX NG language. Both have their place. Tommie Usdin's Thursday talk W3C XML Schema, RELAX NG, Schematron, or DTD: How's a User to Choose? went even further, forcefully arguing that individuals and individual projects are better off pursuing a multiple-schema language strategy from the start.

Querying: One Language or Three?

Which brings us to one of our sacred cows: for decades we've had SQL for relational databases, and soon we'll have XQuery for general XML, and SPARQL for RDF. (For more about SPARQL, check out Leigh Dodds' tutorial.) Commonly accepted wisdom is that these different query systems, each tightly coupled to its underlying data model, are all necessary rather than redundant. Erik Meijer's Wednesday talk XML Programming Refactored (The Return of the Monoids), after pointing out that DOM in Dutch means "brain-dead," went on to describe the advantages of adding XML directly into programming languages, much as described above. Then the payoff: what if it were possible to construct a generalized query language, loosely coupled enough to work with any underlying data model? The mathematical basis for this was monoids. The presentation didn't actually define this fairly abstract term, only skipping from trivial examples like <Integer, +, 0> or <Boolean, AndAlso, True> to a fully worked representation of a generalized query. Erik's dynamic presentation style is such that I was not able to copy down the full example before he had moved on to the next slide. Whatever the details, it's a valuable topic in that it gets listeners to question their assumptions and see in new ways.
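Since the term went undefined, here is a rough sketch of the idea in plain JavaScript. A monoid is a set of values with an associative combining operation and an identity element. The first two objects mirror the examples from the talk; everything else (the names sumMonoid, allMonoid, listMonoid, and foldMap) is my own illustration, not a reconstruction of Erik's slides. The point is that one generic fold can aggregate, test, or filter, depending only on which monoid it is handed.

```javascript
// A monoid: values, an associative combine operation, and an identity element.
// These two mirror the talk's <Integer, +, 0> and <Boolean, AndAlso, True>.
const sumMonoid = { empty: 0, combine: (a, b) => a + b };
const allMonoid = { empty: true, combine: (a, b) => a && b };

// Hypothetical third instance (not from the talk): arrays under concatenation.
const listMonoid = { empty: [], combine: (a, b) => a.concat(b) };

// Map each item into the monoid, then combine the results left to right.
// An empty input yields the monoid's identity element.
function foldMap(monoid, f, values) {
  return values.reduce((acc, v) => monoid.combine(acc, f(v)), monoid.empty);
}

const data = [1, 2, 3];

// The same function acts as an aggregate, a predicate check, and a filter.
const total = foldMap(sumMonoid, (x) => x, data);                   // 6
const allPositive = foldMap(allMonoid, (x) => x > 0, data);         // true
const bigOnes = foldMap(listMonoid, (x) => (x > 1 ? [x] : []), data); // [2, 3]

console.log(total, allPositive, bigOnes);
```

Nothing in foldMap knows whether the values came from a table, an XML tree, or a triple store, which is presumably the appeal for a query language that is loosely coupled to its data model.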

On the other hand, Jim Melton's talk later the same day, SQL, XQuery, and SPARQL: What's Wrong with This Picture? offered a counterpoint. Jim elaborated some of the different assumptions underlying queries against a relational system as opposed to an RDF triple store. His conclusion, applying his vast experience to the situation: SPARQL is not Yet Another Query Language. It has its role and purpose.

XML Infrastructure Developments

In his keynote, Microsoft's Soumitra Sengupta posited that innovation tends to migrate above the level where convergence happens. Coverage of ongoing work on top of the XML core certainly demonstrated these kinds of developments.

Joe Gregorio's Tuesday talk The Atom Publishing Protocol: Publishing Web Content with XML and HTTP gave an overview of protocols, starting with XML-RPC while "the ink on XML was barely dry," through other HTTP POST-centric protocols, up through Atom, now a proposed standard as of August 2005. The key advantage of the Atom protocol comes through the proper use of HTTP, including GET, PUT, and DELETE.
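To make "proper use of HTTP" concrete, here is a minimal sketch of how operations on an entry line up with HTTP methods, loosely following the protocol drafts. The URLs and the atomRequest helper are hypothetical, and the code only builds request descriptions; it sends nothing over the network.

```javascript
// Hypothetical collection URL, for illustration only.
const COLLECTION = "http://example.org/blog/entries";
const ATOM_HEADERS = { "Content-Type": "application/atom+xml" };

// Describe the HTTP request for one operation on an entry.
// New entries are POSTed to the collection; an existing entry is then
// read, replaced, or removed at its own URL.
function atomRequest(action, entryUrl, body) {
  switch (action) {
    case "create":
      return { method: "POST", url: COLLECTION, headers: ATOM_HEADERS, body };
    case "read":
      return { method: "GET", url: entryUrl };
    case "update":
      return { method: "PUT", url: entryUrl, headers: ATOM_HEADERS, body };
    case "delete":
      return { method: "DELETE", url: entryUrl };
    default:
      throw new Error("Unknown action: " + action);
  }
}

const entryUrl = COLLECTION + "/42"; // hypothetical entry location
console.log(atomRequest("update", entryUrl, "<entry>...</entry>").method); // PUT
```

Contrast this with a POST-centric protocol, where all four operations would travel as POST requests to a single endpoint and the real verb would be buried in the payload.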

Jon Bosak's Thursday talk UBL Update went over the latest developments around the Universal Business Language. By defining the details of common business payloads, UBL seeks not to replace entrenched EDI systems, which have a transaction cost of around $5, but rather paper systems, which have a transaction cost of around $30, especially when error-prone re-keying is involved. He called UBL an exercise in "brute standardization," just getting the necessary parties together, then hammering out an agreement on the details. UBL 1.0 has been final since November 2004, and the first round of 2.0 schemas will come out before 2006. On the deployment front, Denmark now mandates UBL, saving hundreds of millions of euros per year, and Sweden's rollout isn't far behind.

A huge audience packed the room to see Brian Jones of Microsoft give the Thursday session Microsoft Office Open XML Formats. The new Office 12 will use the new zipped XML formats by default, with new extensions: .docx instead of .doc, .pptx instead of .ppt, .xlsx instead of .xls, and so on. Microsoft will provide back-patches for file compatibility for Office versions back to Office 2000. It turns out that the zip format adds a great deal to file robustness, due to the way that files are stored internally. As a demo, Brian used a hex editor to chop the last few thousand bytes off a *.docx file, which corrupted it as a zip file, but Office was still able to recover all the primary content and some of the styles. The session didn't discuss any potential licensing issues with the formats.

Other interesting developments are happening around microformats, a topic previously covered by XML Annoyances. On Wednesday, the W3C's Dan Connolly presented Semantic Web Calendaring: RDF Calendar, hCalendar, and GRDDL. One benefit of microformats is that the information they contain can be readily transformed into RDF statements, and the GRDDL specification provides instructions on how to convert ordinary XML into RDF.

Kurt Cagle's Thursday talk Binding the Graphical Web (Component and Data Bindings with XBL, XHTML and SVG) covered past and present developments around various XML binding technologies, including a thoughtful discussion on different classes of abstraction. Key quote: "all programming is a metaphor."

Pipelines and Functional XML

More food for thought. Any conversation involving, say, XSLT and XInclude will probably lead to discussion on the need for an XML processing model and the lack of ordering that afflicts current solutions. The W3C has entertained various proposals for pipeline languages, on what Henry Thompson calls the ODTAO model: One Damn Thing After Another. Henry's Thursday talk Functional XML: A New Approach to XML Processing breaks from tradition. The key observation is that many kinds of XML processing can be expressed like a mathematical function, with an output infoset expressed in terms of an input infoset. These functions can be combined in various ways, using functional techniques that are likely familiar to any XSLT programmer, whether or not they recognize them as "functional programming."
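As a rough illustration of the general idea (not of Henry's actual notation, which I won't try to reproduce), here is a sketch in plain JavaScript. Plain objects stand in for an infoset, each processing step is a pure function from tree to tree, and a pipeline is simply the composition of those functions. All of the names here (mapTree, upperCaseText, renameElement, pipeline, serialize) are invented for the example.

```javascript
// Minimal stand-in for an infoset: elements are { name, children },
// text nodes are { text }.
const doc = {
  name: "a",
  children: [
    { name: "b", children: [{ text: "Hello" }] },
    { name: "c", children: [{ text: "world!" }] },
  ],
};

// Rebuild the tree bottom-up, applying f to every node. The input is never mutated.
function mapTree(f, node) {
  if (node.children === undefined) return f(node);
  return f({ ...node, children: node.children.map((child) => mapTree(f, child)) });
}

// Two processing steps, each a function from input tree to output tree.
const upperCaseText = (tree) =>
  mapTree((n) => (n.text !== undefined ? { text: n.text.toUpperCase() } : n), tree);
const renameElement = (from, to) => (tree) =>
  mapTree((n) => (n.name === from ? { ...n, name: to } : n), tree);

// Composing steps gives another tree-to-tree function.
const pipeline = (...steps) => (input) => steps.reduce((tree, step) => step(tree), input);

// Naive serializer for display only; it does not escape special characters.
function serialize(node) {
  if (node.text !== undefined) return node.text;
  return `<${node.name}>${node.children.map(serialize).join("")}</${node.name}>`;
}

const transform = pipeline(upperCaseText, renameElement("c", "d"));
console.log(serialize(transform(doc))); // <a><b>HELLO</b><d>WORLD!</d></a>
```

Because transform is itself just a function, it can be passed to pipeline again as a step in a larger process, which is the kind of combination the functional view makes natural.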

A completely different facet of application pipelines came from Amazon, in Steve Rabuchin's Wednesday talk Opening Up: Sharing Data and Technology as a Growth Strategy. In general terms, he talked about the value of customers doing things with Amazon data and technology that Amazon itself wouldn't have thought of. Primarily, though, the talk was about Amazon Mechanical Turk, named after a famous hoax. It's what they call "artificial artificial intelligence," that is, applying the intelligence of real humans to computing tasks, via web services. The implications of this technology and the resulting marketplace are only just beginning to be understood.

A few other talks fell into the interesting-but-otherwise-hard-to-categorize category. Lotfi Belkhir's Tuesday talk XML Marks the Spot: XML Helps Move Knowledge from Books to Bytes was a product presentation on a book scanner. The technology uses gentle vacuum grippers to turn pages, while high-resolution cameras take images. The system overall is gentler than human hands, and targeted at institutions like the Open Content Alliance that are working through a backlog of bringing 500 years' worth of books into the digital age. Incidentally, the system uses XML throughout, for both configuration and job files, and as a possible output format.

Finally, another presentation that left people thinking was Sam Ruby's Thursday talk "Just" Use XML. He outlined commonly occurring problems in XML, as uncovered through interoperability testing he played a part in. The format was a rapid-fire listing of pitfalls, peppered with vigorous audience feedback.

Things Overheard

Keeping with established tradition, this article will close with some things overheard at the conference:

"It may crash, but I can get it going again pretty quick."

"If the tools we use are too complicated, then they become part of the problem."

"So, where do the zombies come in?"

"I assume it's because he is an android."

"OWLs have relationships."

"Is anyone here from (company name censored)? No? Good, that gives me license to be a little more . . . factual."

"You still owe me a definition of exactness."

"Comprehensive understanding remains a looong waaay off."

"So of course it didn't work, right?"

"Anything done by a committee will have action items."

(Speaker to a fellow employee in the audience) "He's doing bad things!"

(From a vendor on the exhibit floor) "You look like you haven't had any sleep."

(In response to calls of "increase your font size, please") "Everybody wants to be an art director."

On a personal note, congratulations to this year's XML Cup winners, Mike Kay and Norm Walsh. Lauren Wood announced that she'll be retiring as chair of the XML 200x conference series, and she'll be missed. David Megginson, who gave an excellent closing keynote, will chair XML 2006.