Is XML 2.0 Under Development?

January 10, 2007

Micah Dubinko

XML, one has to admit, has been pretty successful. Despite having a sufficient quantity of annoyances to merit a dedicated column on these pages, XML has powered applications almost anywhere--anywhere except the web, if recent murmurings are an indication.

Douglas Crockford, summarizing his talk and resulting hallway conversations at the XML 2006 conference, mentions numerous voices proclaiming that XML on the Web is dead. Some accept this statement, some insist that XML is the one metalanguage to rule them all, and others say, "It still has a role on the server. If we go around saying it's dead, people might start looking for better alternatives." This isn't the first time XML has been declared dead on the Web: back in 2004, Mark Pilgrim made a similar proclamation.

One factor neglected in those statements, however, is the mobile-centric web, where various modularization-based variants of XHTML have quietly lived up to their original premise. Individual browsers vary widely in quality of implementation, but the language itself, including core concepts of strict well-formedness, distinct layering (yeah, I'm talking about you, document.write), and straightforward usability over the Web are alive and healthy in mobile. In that environment, XHTML continues steady advancement, marching over the ashes of WML.

Outside of mobile, though, things look different. It would be a gross exaggeration to say that XHTML was overtaking HTML in practice on the general web. Why the difference? One possible answer is that mobile is a better fit for early adopters; the advantages inherent in XML provide the most bang-for-the-buck there. Folks outside of mobile fit into more of a laggard profile, but eventually they'll come around to the obviously better way. After all, JavaScript has strict syntax rules, with correspondingly harsh consequences for failing to meet them, all of which hasn't slowed down Ajax and similar client-side developments.

It's interesting to ponder why the mobile corners of the Web have bought into XHTML faster than the rest of the Web. Comments from readers are welcome at the end of this article. But the second question remains: does the acceptance of JavaScript mean a victory for the "draconian" error-handling proponents, including strict well-formedness? I think not. There are many ways to write bad pages, but a properly-constructed page will continue to function in some capacity in the event of a script error (or a similar case of a browser with script disabled). A web page with a catastrophic JavaScript error is still a web page, not an error message. In fact, over time, browser manufacturers have made script errors less and less visible, pushing them back into a hidden error console.

So scripting takes an interesting middle ground between draconian processing--the page must be well-formed, else an error message--and chaotic tag soup. Effectively, instead of simply labeling various conditions as a fatal error, and instead of leaving everything wide open for conflicting implementation interpretation, the combination of markup+script defines a processing model that attempts to continue in the face of inevitable errors. This turns out to work much better on the Web. Instead of saying that XML on the Web is dead, it might be more accurate to say that well-formedness on the Web is dead. (Again, with mobile somehow holding out as an exception).
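The contrast between draconian and lenient processing can be sketched with Python's standard library, which happens to ship one parser of each temperament. This is only an illustration, not anything the browser vendors actually use: the XML parser treats a mismatched tag as a fatal error, while the HTML parser records what it can and keeps going.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Tag soup: <b> is never closed -- a fatal error under XML's
# well-formedness rules, routine input for an HTML parser.
soup = "<p>Hello <b>world</p>"

# Draconian processing: the XML parser halts with an error message.
try:
    ET.fromstring(soup)
except ET.ParseError as err:
    print("XML parser gave up:", err)

# Lenient processing: the HTML parser continues in the face of the error,
# collecting every start tag it manages to recognize.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed(soup)
print("HTML parser saw:", collector.tags)  # ['p', 'b']
```

The lenient parser produces a usable result from broken input; the draconian one produces only an error message. That, in miniature, is the trade-off the article describes.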

Which brings us back around to the discussion about SGML simplification 2.0. There are plenty of ways to think about XML, but from a historical bent, it's a derivative--a simplification as originally conceived--of SGML. But even before XML was written down, the seeds of a simplified subset of SGML were already sown in a backwater language known as HTML. HTML 2.0, for instance, claims that "HTML documents are SGML documents with generic semantics that are appropriate for representing information from a wide range of domains." In practice, no widely-used browser ever processed HTML via an SGML parser, though several online validators did and perhaps continue to do so, piling on to the confusion. No, in practice, every widely-used browser has a custom, hacked parser that is neither XML nor SGML, but something less elegant and more battle-scarred.

Browser manufacturers were designing and implementing these quirky parsing techniques in parallel with, and separately from, the design and implementation of XML. (As a quick recap, the first draft of HTML came out in 1993, the first beta of Netscape Navigator in 1994, HTML 2.0 in 1995, the first written draft of XML in 1996, and XML 1.0 in early 1998, in the heat of the browser wars.) As a result, never the twain met. Browser vendors in particular paid attention to all the stupid, incoherent, and counterproductive ways of unstructured authors, who rapidly copied each other's questionable habits through the magic of view source. If anything, XML railed against sloppy practice, enshrining even stricter rules, putting it even farther from existing practice on the Web.

So interestingly, the seeds of XML 2.0 were sown, watered, and germinating even before XML 1.0 was a twinkle in Tim Bray's eye. As mentioned previously in the column, Tim Berners-Lee has blogged about a restart of the work for HTML, influenced largely by the WHATWG, an invitation-only group of browser vendors and interested parties. This will involve messy syntax discussions. Effectively, this is attempt 2.0 at simplifying SGML rules; call it XML2 with the same unofficial naming scheme as "HTML5".

This important point bears repeating: XML2, whatever it is ultimately called, is already under development, has been for some time, and is on its way to being an official W3C Recommendation track document.

This effort, with web-scale implications, has received relatively little attention. In part, this is because the syntax specification isn't a separate document, as is the case with XML, but rather incorporated as one section of a broader spec. This raises some interesting questions.

Is HTML on the Web a special case? Is it possible to have a single general-purpose syntax used for HTML and other things? Or to put it another way, given a language capable of expressing HTML semantics in a way somewhat compatible with present practice on the Web, would that language be useful in other contexts, like publishing, content management, or information exchange? If the answer is yes, the Web would be better served with XML2 embodied as a separate, standalone specification. Imagine if XML had been first specified as a subsection of XHTML. Would it have achieved as much? The hallmark of a well-designed technology is that it gets used in a way the originators didn't envision. This is unlikely to happen with XML2 if the group working on it comes from too narrow a background. Fortunately, people are starting to notice. For example, Sam Ruby has been blogging about his experiences submitting syntax comments, and, as a result, is eliminating some of the arbitrary incompatibility between XML and XML2.

According to TimBL's blog, the new Working Groups will pursue parallel development along HTML and XHTML tracks. This will create further tensions. One thing emphatically rejected by the new SGML-simplifiers is namespace processing, discussed previously. It's difficult to see at this point how far this will get. For example, what issues will arise in an attempt to embed SVG (which in turn embeds XLink) into HTML? Will the entire XML toolset need to be forked for HTML use? What about namespace-centric technologies like XML Schema or XQuery? The group has some difficult decisions ahead.
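To see what is at stake in rejecting namespace processing, consider a compound document: XHTML with an embedded SVG island. A minimal sketch using Python's standard-library XML parser (again purely illustrative) shows that a namespace-aware toolchain addresses the SVG content only through its full namespace URI; a namespace-free HTML parse has no equivalent concept, which is exactly why the existing XML toolset might need forking for HTML use.

```python
import xml.etree.ElementTree as ET

# A compound document: XHTML with an embedded SVG island,
# each vocabulary identified by its namespace.
doc = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">
      <circle cx="5" cy="5" r="4"/>
    </svg>
  </body>
</html>"""

root = ET.fromstring(doc)

# A namespace-aware query: the SVG circle is addressed by its full
# namespace URI. Strip namespace processing from the pipeline and
# this query has no meaning.
SVG = "{http://www.w3.org/2000/svg}"
circles = root.findall(f".//{SVG}circle")
print("SVG circles found:", len(circles))  # 1
```

Every namespace-centric technology downstream--XML Schema, XQuery, XLink--leans on this same addressing mechanism, which is why dropping it is not a small decision.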

Likely, as I write this, the charters for the new HTML work are being worked on in a W3C members-only area. If you are associated with a W3C member organization and care about XML, now would be the best time to make your opinions known. Speak now or forever hold your annoyance. Or at least until XML3.