XML 1.1: Here We Go Again

October 23, 2002

Despite the frequent and usually accurate complaints that XML specifications and standards are insufficiently layered, there is a sort of conceptual stack of technologies which together constitute the architecture of the Web. In this week's XML-Deviant column I report on developments in XML, the base layer of the Web's architecture.

XML 1.1 Candidate Recommendation

We find XML at or near the bottom of the stack. Stability in the base is crucial to any sound architectural design; turbulence in the base harms not just a design but its implementation, too. We should expect XML, then, to change very slowly, if it changes at all. The ideal outcome for the 1.0 version of the XML specification was for it to have been left untouched, forgotten in some dusty corner of the W3C for a good five years.

We almost made it. The fifth anniversary of the first edition of XML 1.0 is next February. Of course, the XML Namespaces specification should count as a change to XML 1.0. Its being promulgated as a separate document didn't significantly reduce the turbulence caused by its adoption, and the only sort of change namespaces can reasonably be thought to be is a major one. So, while it's clear that XML 1.0 has flaws, a good argument can be made that, for the sake of stability and maturity at higher levels, XML 1.0 should be left alone for, say, another three years.

But reasonable people disagree. Not only has the decision been reached to work on a 1.1 revision of the XML spec, but a Candidate Recommendation has already been released (there is also a Namespaces in XML 1.1 Last Call Working Draft). Conceptual tinkering with XML 1.0, merely carrying out thought experiments, isn't harmful. Actually issuing a new version of the specification may be. After all, a new version of XML will cause a great many people to spend a considerable amount of time updating documents, parsers and other tools, documentation, training materials, books, presentations, and so on. In this case XML's ubiquity counts against it, rather than for it.

So in this case, as in many technological cases, making a decision is a matter of balancing costs and benefits. There are at least three questions to answer -- first, should XML 1.0 be revised now; second, having been revised, should anyone implement it; third, having been implemented, should anyone adopt it? No single logic covers all three questions. The cost-benefit analysis is significantly different depending on whether you're a specification writer, a parser or other tool supplier, or an XML end user.

From 1.0 to 1.1

In the case of specification writers -- members of the XML Core Working Group -- the decision to revise XML was reached some time ago, and the rest of us are just now starting to deal with the repercussions. Second-guessing that decision is valuable, but I'll have to save it for another time. In the remainder of this column I want to consider the other two questions: what does the cost-benefit analysis look like for XML tool suppliers, and what does it look like for XML end users (that is, programmers)?

Before we can make an assessment, we need to understand the most important changes in XML 1.1, most of which have to do with Unicode, line-ending conventions, character sets and encodings, and other areas of general confusion and benign neglect. I confess that I don't pay a lot of attention to these sorts of issues. And my attachment to Unicode, for example, is far more politically than technically motivated. Apparently I am not alone. Rick Jelliffe suggests as much during the debate about XML 1.1: "Character, encoding and normalization issues are simply too hard for programmers to do. XML provides the only real gateway where these things can be handled transparently, to shield the programmer from having to be aware of them, (to a great extent)". Or, as he puts it later in the debate, "Normalization is definitely a good thing. There should be more of it, especially by other people".

Elliotte Rusty Harold arranges the changes in three groups. First,

C0 control characters such as form feed, vertical tab, BEL, and DC1 through DC4 (whatever those are) are now allowed in XML text. However, they must be escaped as character references. They cannot be included literally in data. Nulls, thankfully, are still forbidden.

Next,

The C1 control characters such as BPH, IND, NBH, and PU1 are no longer allowed as literals in XML text. They too must now be escaped as character references. For the first time this means that some well-formed XML 1.0 documents are not well-formed XML 1.1 documents. The exception, of course, is IBM's holy grail of NEL, which will be allowed in literal XML text, just to make life difficult for every text editor on the planet except those from IBM mainframes.
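To make the first two groups of changes concrete, here is a minimal Python sketch of what a serializer or checker might do. The function names are hypothetical, and the character ranges are taken from Harold's description above (C0 controls other than tab, line feed, and carriage return; C1 controls other than NEL), not from a close reading of the Candidate Recommendation, so treat them as assumptions. The code works on already-decoded text and does not handle the usual escaping of & and <.

```python
NEL = 0x85  # NEL (next line): the one C1 control still allowed as a literal


def escape_c0_controls(text: str) -> str:
    """Replace C0 control characters with numeric character references.

    Tab, line feed and carriage return are left alone because XML already
    allows them literally. NUL is rejected because it remains forbidden.
    """
    literal_ok = {0x09, 0x0A, 0x0D}
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            raise ValueError("NUL is not allowed in XML, even as a reference")
        if cp < 0x20 and cp not in literal_ok:
            out.append(f"&#x{cp:X};")  # e.g. form feed becomes &#xC;
        else:
            out.append(ch)
    return "".join(out)


def find_literal_c1_controls(text: str) -> list[tuple[int, str]]:
    """Return (offset, 'U+XXXX') for each literal C1 control other than NEL.

    Assumption: any hit here marks text that was acceptable as literal
    data in XML 1.0 but would need a character reference in XML 1.1.
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if 0x80 <= ord(ch) <= 0x9F and ord(ch) != NEL
    ]
```

For example, escape_c0_controls("page one\x0cpage two") gives "page one&#xC;page two", and find_literal_c1_controls("a\x82b\x85c") reports only the U+0082 at offset 1, leaving the NEL alone.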

Those sets of changes seem to be uncontroversial, in the sense that most everyone agrees that they are changes between 1.0 and 1.1. Not so the issue of Unicode character normalization, about which there is disagreement. According to Harold,

Unicode character normalization should be performed on XML documents, unless you don't feel like it, in which case you can ignore it. This almost makes sense. Basically it says that parsers may change an e followed by a combining accent acute into the single character é if they want to or the client asks for it. The details are quite complicated, but at least it's optional. However, I still worry that this is a source of interoperability problems, especially when it comes to names of elements and attributes. For instance, a normalizing validator might accept documents a non-normalizing validator would reject.

Part of the disagreement over Unicode character normalization centers around its status in the 1.1 specification. Is it permitted or required? John Cowan, the editor of XML 1.1, took issue with Harold's characterization of Unicode character normalization: "XML 1.1 says that parsers should check normalization, not that they should perform it. So a parser that sees an e followed by a combining acute should report the lack of normalization to the calling application. This is a most important distinction. XML generators should generate normalized output; XML accepters should check normalization".
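The distinction Cowan draws, checking versus performing, can be sketched in a few lines of Python using the standard unicodedata module. This assumes that plain NFC is the normalization form in question; the specification's own definition of normalization is more involved and is not reproduced here. The function names are hypothetical.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Check only: True if the text is already in NFC. Changes nothing."""
    return unicodedata.normalize("NFC", text) == text


def to_nfc(text: str) -> str:
    """Perform: return the NFC form of the text."""
    return unicodedata.normalize("NFC", text)


decomposed = "e\u0301"  # e followed by a combining acute accent
composed = "\u00e9"     # the single precomposed character é

# A checking parser would report the first string and accept the second.
print(is_nfc(decomposed), is_nfc(composed))

# A normalizing tool would silently turn one into the other.
print(to_nfc(decomposed) == composed)

# Compared as raw strings, the two spellings are different names, which
# is the interoperability worry Harold raises for elements and attributes.
print(decomposed == composed)
```

Under Cowan's reading, a parser does the equivalent of is_nfc and tells the application; under Harold's, it may do the equivalent of to_nfc on the application's behalf.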

Counting Costs, Weighing Benefits

Both Daniel Veillard and Tim Bray, developers of one of the fastest and one of the first XML parsers respectively, suggest that character normalization is problematic. Veillard said that he's worried about "the cost of normalizing on-the-fly" and, further, that "the algorithms I found in the Unicode annexes were just scary (in term of complexity and memory requirement) ... that cost is better done once at generation time". "I really understand the desire to clean up the normalization picture," Bray said, "but I think the cost is high and the nondeterministic behavior specified by 1.1 is a problem".

Many programmers are in the most enviable position of all: they can, for the most part, simply ignore XML 1.1. It's safe to assume that most use of XML is internal. In a closed environment, where the same entity controls both the producers and the consumers of XML, the distinction between XML 1.0 and 1.1 makes no difference at all. Unless or until they or their technological partners need something that XML 1.1 adds, they can pretty safely ignore it, at least until their preferred XML parser has been upgraded to support 1.1. Eric van der Vlist represents this point of view: "Speaking for myself as a XML user, I clearly see the cost (I need to upgrade all my XML applications to use XML 1.1 since a XML 1.0 parser will not read a XML 1.1 document and a XML 1.0 "writer" will write only XML 1.0 documents), but I don't see any use case for the XML 1.1 in my own applications (I have never met a NEL and have never needed to define names beyond what possible with XML 1.0)".
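The cost Eric describes comes down to the version number in the XML declaration. As a rough illustration, a pipeline built on a 1.0-only parser could look at the declared version before deciding what to do with a document. The Python sketch below is hypothetical: it works on already-decoded text, ignores byte order marks and encoding detection, and assumes that a document without a declaration should be treated as XML 1.0.

```python
import re

# Matches the version pseudo-attribute at the very start of the document.
_XML_DECL_VERSION = re.compile(
    r"""^<\?xml\s+version\s*=\s*(["'])([^"']+)\1"""
)


def declared_xml_version(document: str) -> str:
    """Return the version named in the XML declaration.

    Falls back to "1.0" when there is no declaration (an assumption of
    this sketch, not something a 1.0 parser checks for you).
    """
    match = _XML_DECL_VERSION.match(document)
    return match.group(2) if match else "1.0"


def readable_by_1_0_only_tool(document: str) -> bool:
    """True if a tool that only understands XML 1.0 should attempt a parse."""
    return declared_xml_version(document) == "1.0"
```

A consumer that wants to fault as rarely as possible would use a check like this to route 1.1 documents to a newer parser; a closed environment that never produces 1.1 would never see the second branch taken.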

Eric goes on to distinguish three types of XML application. First, there are internal or closed-environment applications, for which Eric suggests a careful neutrality about the migration to XML 1.1 tools. I suspect that in practical terms that neutrality will amount to apathy, which is perfectly acceptable. Next, there are applications whose point is to provide XML; the goal there is to make that XML as widely available as possible, which will tend to count against migrating to XML 1.1 until it's ubiquitous, or at least until it has reached the tipping point toward ubiquity. Last, there are applications whose point is to consume XML; the goal there is to consume as much XML as possible, that is, to fault on receiving XML as rarely as possible. In these cases, the tendency will be to lead the migration to XML 1.1.

So, not only is there no single logic covering the cost-benefit analysis of XML 1.1 for spec writers, tool makers, and end user programmers, there's also no single logic covering that analysis for programmers. How you parse the costs and benefits depends in large measure upon what sort of applications you are responsible for maintaining or creating.

As for the cost-benefit analysis for XML tool makers, it will depend on whether they have their own internal reasons for migrating or not migrating, reasons which will often be strictly economic in nature. It will also depend on what sort of relationship the tool maker has with the users of the tools. Open source tool makers have a different relationship to their end users than do proprietary vendors. The pressures for migrating to newer technologies are aligned differently for each group. It's not at all clear that there are very many useful, generalizable things to say about how tool makers will make these decisions. One thing that does seem clear is that the algorithmic complexity of Unicode character normalization, raised by Veillard and Bray, may well be a heavy cost for open source tool makers to pay, depending especially on the maturity and sophistication of the Unicode support in their implementation languages.

In some sense the success or failure of XML 1.1 rests with the middle group, the vendors and tool makers. As Michael Kay said, "A lot depends on the major parsers, though. If they decide their users aren't interested in XML 1.1 so they're not going to rush to implement it, then obviously XML 1.1 is dead. If they decide they're going to implement it whether users want it or not, then people will gradually adopt it without really noticing they have done so". And so it goes.