June 27, 2001
XML-DEV has been the site of a rather juicy debate this week, concerning one of the W3C's latest Working Drafts: a revision of XML 1.0 codenamed "Blueberry".
Did You Say Blueberry?
"Blueberry" is a W3C Working Draft that captures the requirements for a forthcoming revision to the XML specification. It has two origins. The first is a Note submitted to the W3C in March by IBM ("The [NEL] Newline Character"). This Note highlighted the fact that the NEL character, used as a line terminator on some IBM mainframes, is not supported by XML. The Note proposed that the specification be amended to make the character legal as a line ending, thereby making it easier for some mainframe users to edit XML documents using native editors.
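To make the NEL issue concrete, here is a minimal Python sketch of line-end handling. The first function follows the XML 1.0 rule that CR LF and a lone CR are both passed on as LF. The second function is only an assumption about what a Blueberry-style rule might look like, with NEL (U+0085) treated as one more line end; the function names are invented for illustration and do not come from the draft.

```python
import re

NEL = "\u0085"  # the NEL (NEXT LINE) control character

def normalize_xml10(text: str) -> str:
    """XML 1.0 line-end handling: CR LF and a lone CR both become LF."""
    return re.sub(r"\r\n?", "\n", text)

def normalize_with_nel(text: str) -> str:
    """Hypothetical Blueberry-style handling: NEL also counts as a line end."""
    return normalize_xml10(text).replace(NEL, "\n")

sample = "<a>" + NEL + "<b/>\r\n</a>"
print(repr(normalize_xml10(sample)))     # NEL survives untouched
print(repr(normalize_with_nel(sample)))  # NEL has become LF
```

Under the first function the NEL character is left in the text, so an XML 1.0 parser sees it as an ordinary character rather than as a line break. That mismatch is what the IBM Note set out to address.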
The second driver for the Blueberry revision is the ongoing development of the Unicode standard. XML gives authors the freedom to use any Unicode character within the text (PCDATA) of an XML document, but strictly and explicitly limits the characters that can be used to name an XML element. The addition of new characters, and particularly support for many new language scripts (e.g. Amharic, Burmese) in Unicode 3.1, means that these naming limitations have become increasingly restrictive. In effect, some XML users cannot produce fully native markup: they are limited to using (say) Latin characters in names, but can use any character within element content.
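The gap between content and names can be illustrated with a short Python sketch. The range table below is a deliberately small, hand-picked subset of the XML 1.0 name-start ranges, not the full table from the specification, and the function name is made up for this example. It is enough to show the shape of the problem: a name-start check built from a fixed list of ranges has no way to accept characters from scripts that were not in the list when it was written.

```python
# Illustrative SUBSET of the XML 1.0 name-start ranges (assumption: chosen
# for brevity; the real specification lists many more ranges).
NAME_START_RANGES_SUBSET = [
    (0x0041, 0x005A),  # A-Z
    (0x0061, 0x007A),  # a-z
    (0x00C0, 0x00D6),  # accented Latin letters
    (0x00D8, 0x00F6),  # more accented Latin letters
    (0x4E00, 0x9FA5),  # CJK ideographs
]

def is_name_start(ch: str) -> bool:
    """Return True if ch may begin a name under the subset table above."""
    if ch in "_:":
        return True
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NAME_START_RANGES_SUBSET)

print(is_name_start("p"))       # True: a Latin letter
print(is_name_start("\u1230"))  # False: an Ethiopic syllable, as used in Amharic
```

The same Ethiopic character is perfectly legal inside element content, which is why the restriction bites only when authors want element and attribute names in their own script.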
The Blueberry requirements are aimed at removing this limitation, providing support for the NEL character as a line end, and suggesting that the ongoing relationship of XML and Unicode be reviewed by the XML Core Working Group.
The publication of the draft has drawn a great deal of comment from the members of XML-DEV, not a great deal of which was positive. Elliotte Rusty Harold was particularly vociferous in challenging the need for a revision at all, attacking IBM for its perceived inability to bring its software in line with the existing specification and arguing at length that the requirement for fully native markup is greatly overstated. In one of many postings on this issue, Harold called for a clear appraisal of the costs and benefits that XML Blueberry will bring:
...what's needed is a rational comparison of the real advantages and disadvantages of both approaches. How big are the benefits that are gained by this? How big are the disadvantages? Clearly both exist.
I don't think the potential benefits outweigh the disadvantages. Splitting XML into multiple incompatible implementations strikes me as a very bad thing. And make no mistake, this is just the first step, not the last. Unicode's got at least two more major iterations left in it that will force changes in XML parsers if we tie XML that closely to Unicode. It's not just blueberry but raspberry and blackberry too, and maybe other flavors!
What do we get in exchange for imposing major costs of transitions on developers around the world? Tags can now be written in a few extra scripts. And note that I say scripts, not languages. Many of the languages listed in the reqs have well-established traditions in other scripts as well.
Many others agreed with Harold, including Tim Bray, who posted a lengthy message, worth reading in its entirety, that provides a clear analysis of the impact of the specification. While supporting the underlying motivation, Bray ultimately believed the costs would be too high:
To cast it in the starkest possible light: Is it a reasonable trade-off to say that we will live with an incorrect interpretation of Unicode in certain specific areas, with the consequences of complicating the lives of mainframe users and impoverishing the tools available to worthy users of certain minority languages, to achieve the benefit of keeping XML monolithic and unitary? Yes, it's reasonable. I might be convinced that it's wrong, but it's a reasonable argument that needs to be addressed. Corollary: it's not enough simply to say "Blueberry is more correct per Unicode thus we have to do it, end of debate."
So I think it would be appropriate, in this discussion, to have some people in the mainframe trenches give us a briefing on the scale and the difficulty of the problems they face, and for some of our i18n gurus to highlight the problems faced by an XML language designer who wants to use one of the newly-added languages.
On the other side, we should consider the practicalities and costs of upgrading (or not) the installed base in the face of the deployment of data encoded in XML Blueberry.
Despite suggestions from both Bray and Harold, those who might benefit most from Blueberry have yet to make their presence felt, at least in public on XML-DEV. For example, it would be interesting to hear from mainframe users exactly how much benefit Blueberry would bring and what problems they would encounter if it were not pursued. As usual, the W3C has set up a public mailing list, www-xml-blueberry-comments, for interested parties to comment on the draft.
John Cowan, the Blueberry editor, has already been using the list to collate interesting comments from the XML-DEV discussion. Cowan has been attempting to fend off much of the negative feedback and has posted a draft algorithm, still to be reviewed by the WG, that would allow XML Blueberry documents to be backwards compatible -- an important requirement if the work is to proceed.
In fairness, not all of the comment has been negative. Vincent-Olivier Arsenault was among those who supported the Blueberry requirements:
This revision is indeed NECESSARY as (I think) XML should have a greater (if not complete) independence from any encoding specification and delegate it (all) to UNICODE. Thus, the key requirement to me would be (quoting from the June 20 WD requirement list) : "The working group shall consider the issue of future updates to Unicode."
As for the "they can write Latin markup anyways" argument, I don't see how we could EVER discriminate ANY cultural particularity (even if they SEEM obscure to us or to so-called "experts", lets not repeat the rfc822 mistake) by denying to its adherents their ability to create markup in the way they want. Isn't it just [like] imposing your own line-ending method except on a cultural level?
Other XML-DEV members were more interested in the prospect of a revision to XML than in the specific Blueberry requirements. Several suggested that a more wide-ranging revision ought to be considered. David Brownell mused that a larger revision might help avoid what he termed "death by a thousand cuts." Eric van der Vlist believed that the costs associated with upgrading software to support Blueberry could be mitigated if some additional benefits were gained:
There is a (huge) fixed cost to any breaking of backward compatibility and it doesn't seem reasonable to do it for a single feature.
If we need to create a new XML version, we should probably try to package several "minor" changes to make sure that the benefits outweigh the cost.
Daniel Veillard was also seeking more of a morsel than Blueberry is offering, but acknowledged that the effort would be greater:
Since the bulk of the newly specifications produced are using XML-1.0 + Namespace + Infoset as their base platform, the idea of making a bag of those, plus the updates to Unicode and label it with version="2.0" would have the merit to put upfront the real level of processing and standard required. Not all applications would require it, but the gap is large enough the it's worth going through the revision process. But it would not require efforts on the same scale than what XML Blueberry would require, it's larger.
John Cowan was quick to put these suggestions to rest, however:
The last thing the Core WG wants is to be deluged with requests for worthy improvements to XML. Hence the meta-requirement to change nothing that we don't need to change to achieve our stated goals.
It's not yet clear how Blueberry documents will be identified to parsers. The obvious option would be to mark this revision as a new release of XML and use the version pseudo-attribute in the XML declaration. Several people considered this to be the best way forward. John Cowan indicated that the Working Group had not yet decided one way or the other, but he didn't rule out this possibility. He also mentioned that the group was considering a new pseudo-attribute in the XML declaration to signal the version of Unicode used by the document's contents.
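As a rough illustration of the version-based option, the Python sketch below reads the version pseudo-attribute from the XML declaration and picks a rule set accordingly. Everything specific here is an assumption: the value "1.1" is only a placeholder for whatever label a Blueberry revision might carry, the function and rule-set names are invented, and a real parser would also have to deal with byte encodings and reject malformed declarations.

```python
import re

# Matches an XML declaration at the very start of the document and captures
# the quoted value of its version pseudo-attribute.
_DECL = re.compile(r'^<\?xml\s+version\s*=\s*(["\'])([^"\']*)\1')

def declared_version(document: str) -> str:
    """Return the declared version, assuming "1.0" when there is no declaration."""
    match = _DECL.match(document)
    return match.group(2) if match else "1.0"

def pick_rules(document: str) -> str:
    """Choose a (hypothetical) rule set based on the declared version."""
    # "1.1" is a placeholder label, not something the draft has settled on.
    return "blueberry" if declared_version(document) == "1.1" else "xml-1.0"

print(pick_rules('<?xml version="1.0"?><doc/>'))
print(pick_rules("<?xml version='1.1' encoding='UTF-8'?><doc/>"))
print(pick_rules("<doc/>"))
```

The appeal of this approach is that an existing parser that checks the version value can refuse a document it does not understand, instead of silently misreading it.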
Some contributors favored an alternative approach that relies on the encoding pseudo-attribute in the declaration instead. This would name a "special" encoding indicating that the document uses the NEL character as a line terminator. Parsers would then support this encoding as necessary. However, this proposal doesn't address the requirement for additional name characters, and it appears to be favored mainly by those who believe that expansion to be unnecessary.
The differences may seem cosmetic at first, but this may well turn out to be the first new version of XML, the current "second edition" being mainly collected errata. From the debate so far, it's clear that this is not the XML 1.1 that many were hoping for. Although the changes seem marginal in nature (after all, the majority of XML users are unlikely to benefit from them), their likely impact is significant. Whether the reaction is a side effect of the success of XML or a knee-jerk response to yet another change in an already rapidly developing area, only time will tell.