Versioning Problems

December 19, 2001

Leigh Dodds

This week, the XML-Deviant reviews the debate over the publication of the first Working Draft of XML 1.1.

XML 1.1

Since before the XML-Deviant column started in January 2000, many in the community have expressed an interest in a new version of XML. Some considered XML too complex and wanted it simplified--a recurring theme over the last two years. Others have wanted more substantial revisions to incorporate other specifications that have become part of the XML "core." This is especially true of Namespaces in XML: there is a strong case for merging it with the XML 1.0 specification, since the combination forms the basis for much later work.

Yet the most recent (and so far only) revision to the XML Recommendation came in October 2000, when the W3C incorporated a number of errata into the specification and released it as XML 1.0 2nd Edition--ignoring desires to see an attempt at something more daring.

This week saw the publication of the first Working Draft of XML 1.1. XML developers may well have felt their pulses rise at the pointy-bracketed pleasures that such a draft might contain. Disappointing, then, to discover the dismal taste of Blueberry on their palates.

A Taste of Blueberry

Regular XML-Deviant readers will remember that Blueberry was a sour summer fruit this year, causing a great deal of debate on the XML-DEV mailing list. The Blueberry requirements had a couple of aims: to extend XML's Unicode support to Unicode 3.1, and to accommodate some other changes to the set of legal characters that would make life easier for some IBM mainframe users. XML is currently based on Unicode 2.0, and a number of new scripts added in later revisions of Unicode are not currently legal in some parts of XML markup, particularly tag names. This prevents truly internationalized markup.
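The mainframe accommodation amounts to treating NEL (U+0085), the EBCDIC newline, as a line end. As a rough sketch, the expanded line-end normalization along the lines XML 1.1 eventually specified looks like this (the function name is mine):

```python
def normalize_line_ends(text):
    """Normalize XML 1.1-style line ends to a single LF (U+000A)."""
    # Two-character sequences first, so CR+LF and CR+NEL each collapse
    # to a single LF rather than two.
    for seq in ('\r\n', '\r\u0085'):
        text = text.replace(seq, '\n')
    # Then any remaining lone CR, NEL (U+0085), or LINE SEPARATOR (U+2028).
    for ch in ('\r', '\u0085', '\u2028'):
        text = text.replace(ch, '\n')
    return text

print(normalize_line_ends('a\u0085b\r\nc'))  # 'a\nb\nc'
```

An XML 1.0 processor, by contrast, normalizes only CR and CRLF, which is why mainframe-generated text needs a conversion step today.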

The debate, inevitably, ended up with some polarized viewpoints: those who believed that full internationalization was a worthwhile goal in itself, regardless of implementation costs, and those who saw those costs as too high considering the small returns involved. Alternatives were presented: IBM mainframe users could incur the cost to upgrade their text editors, and add some simple character conversions when producing XML from legacy systems; and international users would have to accept less freedom in choosing tag names than they have in the character content they can mark up with those tags.

A different approach to meeting the requirements was also suggested by James Clark: remove XML's dependencies on Unicode entirely, making character checking a matter of validity rather than well-formedness. This is roughly where the Blueberry debate dried up, although an amended specification released towards the end of September included a slightly expanded introduction.

Nice Idea, Poorly Realized

The XML 1.1 Working Draft is, then, the first concrete attempt to realize the XML Blueberry requirements. A new version number has been defined to mark the fact that the basic notion of well-formedness will change. Not surprisingly, a few people have felt slightly shortchanged by the contents of the specification, because the revision label "1.1" suggests a great deal more. Michael Kay was the first to express surprise that Namespaces were still not to be included in the revision.

The reaction from others was more extreme. Rick Jelliffe signed off a message to the list with ROFL (Rolls On Floor Laughing) and urged developers to simply "ignore it."

I see XML 1.1 is out, and it is so crazy that it is funny. My considered recommendation is we all have a good laugh, and then forget about it.

Jelliffe highlighted what he saw as several laughable aspects of the draft, paying particular attention to some of the consequences of the changes to accommodate more of Unicode. Certainly one side effect is some rather weird and wonderful ways of naming elements.

Tim Bray also posted a lengthy series of comments on the specification, stating in his opening remarks that:

  1. The principle of decoupling the XML spec from successive revisions of Unicode is the only sensible way forward.
  2. If no consensus can be built around the details of this set of changes, it would be acceptable to declare defeat and go on with XML 1.0 2nd Ed. as-is. This would be a regrettable outcome, but not fatal at a deep level.

Bray disagreed with many changes, including the addition of the IBM line-ending character, and, more fundamentally, he questioned the entire approach adopted by the draft.

There really needs to be some deep discussion in this document of why this alternative was chosen. When I look at some of the wildly unlikely things that are allowed to appear in names, the obvious question is: Why not rely on the Unicode properties database? In particular, this allows lots of Name characters that are not in fact Unicode characters at all and probably never will be.
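Bray's complaint is easy to demonstrate. The draft takes an "everything but" approach to name characters; as a sketch, the NameStartChar rule in the form it eventually stabilized can be modeled directly (the ranges below are the production's, the helper function is mine):

```python
# The broad NameStartChar ranges proposed for XML 1.1: almost any
# code point may begin a name, whether or not Unicode has assigned it.
NAME_START_RANGES = [
    (ord(':'), ord(':')), (ord('A'), ord('Z')), (ord('_'), ord('_')),
    (ord('a'), ord('z')), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

def is_name_start(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NAME_START_RANGES)

# Ethiopic syllable HA, added in Unicode 3.0, becomes a legal name start:
print(is_name_start('\u1200'))   # True
# ...but so does U+0378, a code point Unicode has never assigned:
print(is_name_start('\u0378'))   # True
```

The second result is exactly what Bray objects to: the rule admits code points that are not Unicode characters at all, where a lookup in the Unicode properties database would not.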

In his earlier posting, Rick Jelliffe also suggested that using the Unicode properties database would be a cleaner approach.

While the subsequent discussion rehashed much of the previous Blueberry debate, particularly the back-and-forth over the "correct" way to handle line breaks, there is definite support for revising the relationship between Unicode and XML. Interestingly, members of the Ethiopic XML Working Group have published a fascinating document making the case for fully native markup. The document asserts that simple transliteration of element names isn't sufficient and can create ambiguities in the markup. They also suggest that the .NET initiative will drive adoption of XML, and hence increase the need for native markup. Telemedicine is one usage scenario where they see XML adding value in developing nations.

A different approach to evaluating the impact of the specification was suggested by Eric van der Vlist who noted that:

. . . [it] presented an opportunity to test the versioning of XML on a limited change and there are probably lots of things to learn from this first version change.

Van der Vlist reviewed the implications of XML 1.1 with respect to other W3C Recommendations, including XSLT, XPath, and W3C XML Schema, highlighting a number of ambiguities and areas in which these specifications will be impacted. No doubt one of the lessons to be learned here is that changes can ripple outwards in all sorts of unexpected directions, particularly with interrelated specifications.

An Alternate Proposal

Sufficiently dissatisfied with XML 1.1 as proposed, Rick Jelliffe was moved to start preparing an alternative, inviting other list members to provide feedback. Jelliffe's proposal takes a similar tack to James Clark's earlier suggestion.

The basic idea of this is, following the idea attributed to James Clark, that we may as well put in some kind of layer to bring out character issues in XML. Actually, I take the reverse [position]: we pull out character issues to make a lightweight version of XML. Where the current draft is very wrong is in that it throws out the naming rules entirely, rather than shifting them to where they are appropriate: as part of validation.

The proposal not only meets all the Blueberry requirements; Jelliffe claimed it also simplifies XML and even improves parsing rates for non-ASCII names. Not usually a supporter of simplification exercises, Jelliffe was moved to justify his reasoning.

Why am I proposing simplification when I am often on the ultra-conservative side? Well, it is simplification of parsing techniques that brings out something that was designed into XML 1.0: Whitespace and delimiters are all that really is needed for parsing. And we already have a mode for debugging and QA of XML: validation. Name-checking (and its vitally important side effect, that transcoding is verified by name-checking) can be made part of validation without sacrificing much. We are not changing the language, just refactoring where checks should occur in a way that better suits high-volume processing and small devices.
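Jelliffe's claim that "whitespace and delimiters are all that really is needed for parsing" can be illustrated with a minimal sketch (attributes and entities are omitted): a tokenizer that treats anything between delimiters as a name, deferring all character-class checks to a later validation layer.

```python
import re

# Delimiter-driven tokenizing: a name is whatever sits between '<', '/',
# '>', and whitespace. No table of legal name characters is consulted;
# under Jelliffe's scheme, name checking happens during validation.
TAG = re.compile(r'<\s*(/?)\s*([^\s/>]+)\s*(/?)\s*>|([^<]+)')

def tokenize(doc):
    tokens = []
    for m in TAG.finditer(doc):
        close, name, empty, text = m.groups()
        if text is not None:
            tokens.append(('text', text))
        elif close:
            tokens.append(('end', name))
        elif empty:
            tokens.append(('empty', name))
        else:
            tokens.append(('start', name))
    return tokens

# Ethiopic element names tokenize without any Unicode tables at all:
print(tokenize('<ሀ>ጤና</ሀ>'))
```

Because the tokenizer never consults a character table, it is indifferent to which Unicode revision the names come from, which is the point of the refactoring: the expensive, version-sensitive checks move to the validation step.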

The proposal has been well received. A long-term advocate of simplification and refactoring, Mike Champion highlighted that Jelliffe's proposal delegates the handling of character encoding issues to those who know best: the Unicode Consortium.

I like the *concept* that XML 1.1 will get the W3C out of the business of defining semantics for character codes. Good design (and management) practice suggests delegating decisions about details to the experts, and it would seem that the Unicode folks are the experts here.

Plans in the Pipe

Organizations getting on with what they do well is a common theme this week, with the announcement that ISO has formed a working group to define a Document Schema Definition Language (DSDL). DSDL has the potential to make several schema languages ISO standards.

DSDL will define a pipeline processing model that will rationalize the current range of XML schema languages under a common architecture. The draft overview document decomposes DSDL into seven pieces:

  1. A pipeline framework

  2. Grammar-oriented schema languages (based on RELAX NG)

  3. Primitive data type semantics (based on Part 2 of W3C XML Schema)

  4. Path-based integrity constraints (based on Schematron)

  5. Object-oriented schema languages (based on W3C XML Schema)

  6. Information item manipulation, e.g., defaulting attributes

  7. Namespace-aware processing with DTD syntax
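The pipeline idea behind this decomposition is straightforward: each part acts as an independent check, and a document flows through all of them, accumulating errors. A hypothetical sketch (the stage names and the toy document shape are invented for illustration, not drawn from the DSDL draft):

```python
# Toy pipeline: each stage is a callable that inspects a document and
# returns a list of error strings. Stages are independent, so grammar
# checks, datatype checks, and path constraints can be mixed freely.
def grammar_check(doc):
    # Stand-in for a RELAX NG-style structural check.
    return [] if doc.get('root') == 'invoice' else ['unexpected root element']

def datatype_check(doc):
    # Stand-in for a W3C XML Schema Part 2-style datatype check.
    amount = doc.get('amount', '')
    return [] if amount.isdigit() else ['amount is not an integer']

def run_pipeline(doc, stages):
    errors = []
    for stage in stages:
        errors.extend(stage(doc))
    return errors

print(run_pipeline({'root': 'invoice', 'amount': '42'},
                   [grammar_check, datatype_check]))   # []
```

Because each stage is self-contained, adding a Schematron-style path constraint is just another callable in the list, which is what makes a common architecture across schema languages plausible.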

Makoto Murata, James Clark, and Rick Jelliffe are already on board to edit their respective sections, with Ken Holman acting as the overall editor.

Some of the most interesting, recent work in XML schemas has involved exactly this kind of multi-schema interaction. (See the Schemarama discussion covered in an earlier Deviant for some background on this.) Pipeline architectures, e.g., XPipe, are also a hot topic at the moment.

DSDL, combining these two themes, has real potential to become the basic processing model for many XML applications, of which the W3C's own effort is but one facet. (No pun intended!) It is ironic that "glacially slow" ISO seems to have embraced this direction more quickly than the W3C, which seems to become repeatedly mired in controversy. It will be interesting to see whether the new year will herald further changes in the respective roles of the two organizations.