The Long, Long Arm of SGML

November 5, 2003

SGML's Influence, XML's Anxiety

Writing about XML invites a certain humanistic pretension, for several reasons. First, all this talk of universality and frictionless information exchange carries, for those of us who inhabit certain parts of the ideological landscape, uncanny echoes of what Christians used to call (before they learned about language and gender bias) the "brotherhood of man". Sure, in the end, it's all just bits and bytes, ones and zeroes; but in the XML world all that Matrixesque machine austerity comes packaged in a particular set of high tones and lofty ideals.

Second, since XML gets used in the production and dissemination of documents, artifacts which are still and will long be the primary bearers -- second only to natural persons, one suspects -- of cultural transmission, it tends to attract (or reward?) technical people with humanities backgrounds and training. Sure, in the end, the data wonks are likely to take over, and then we humanities-derived folks will be out on our collective asses; but for now, at least, we tend to fight them to a healthy draw.

Thus it is that, when reviewing the recent debate about Tim Bray's UTF-8+names proposal, I couldn't help but think of it in terms of an important theory of poetry, the one promulgated by Yale literary critic and professional academic contrarian Harold Bloom, in his book The Anxiety of Influence. Since this is an XML column, I won't go into much detail about Bloom's theory other than to give a synopsis: Bloom argues that poetry (and, by extension, all fiction and, perhaps, all of the creative arts) is formed in the struggle between generations of poets; that is, younger poets struggle to overcome the sense of anxiety brought about by the unbearable influence of the older, stronger poet. The primary means of enacting this struggle is for the younger poet to creatively misread, misconstrue, and mistake the work of the stronger, older poet. In these acts of creative erring, the younger poet forms a space within which some act of poetic originality, that is, of poetic independence, may be achieved.

What's the connection, you may be asking, between a theory of poetry and Tim Bray's proposal? The connection is the long, long arm of SGML.

Some significant percentage of the pain suffered by the XML development community over the past five years is directly attributable to dealing with the legacy of SGML. It has, in other words, turned out to be much harder and much more complex to do "SGML on the Web" than many people thought it would be. A considerable amount of the early traction seized by XML was due to the confluence of two forces: first, the technical maturity of SGML; second, the early to middle years of exuberance about the Web itself.

In various ways then, XML has really been about trying to overcome the legacy of SGML. Perhaps "overcome" isn't quite right; perhaps "modify and contemporize" is better? At any rate, XML has been driven in part by a sense that SGML had things right, but not just right, and that work remains to be done to overcome SGML's failings.

What About All These Funny Characters?

Tim Bray's recent proposal -- presented in IETF RFC form, no less -- for fixing the "funny character" issue in XML is a case in point. In XML you have two choices for entering Unicode characters that you can't easily type. First, you can use a numeric character reference (NCR), an &#...; construct built from the character's Unicode code point; strictly speaking it is a character reference rather than an entity, though it looks like one. The problem with NCRs is that they aren't very memorable; they're rather anti-mnemonic, in fact. Second, you can use an internal parsed entity, which is basically a binding, declared in a DTD, between a name and an arbitrary replacement string. So, for example, one can declare a set of bindings between human-friendly names and NCRs; the producer of an XML document can then use the friendly form, which the parser turns into the NCR form. Anyone who's done any real work with SGML knows about and has used such sets of entities.
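To make the two options concrete, here is a small Python sketch using the standard library's xml.etree.ElementTree. The sample documents and the helper names parse_text and is_well_formed are invented for illustration; the point is only that the friendly name works when a DTD declares it, and that the same reference makes a DTD-less document fail to parse.

```python
import xml.etree.ElementTree as ET

# A document whose internal DTD subset binds the friendly name "eacute"
# to the NCR for U+00E9. Both reference styles appear in the content.
WITH_DTD = (
    '<!DOCTYPE p [\n'
    '  <!ENTITY eacute "&#xE9;">\n'
    ']>\n'
    '<p>caf&eacute; and caf&#xE9;</p>'
)

# The same friendly name with no DTD to declare it.
WITHOUT_DTD = '<p>caf&eacute;</p>'


def parse_text(document):
    """Parse a document and return the text of its root element."""
    return ET.fromstring(document).text


def is_well_formed(document):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True
```

With the DTD in place, both references come out as the same character; without it, only the NCR form survives, which is exactly the inconvenience the proposals discussed here are trying to remove.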

As Bray puts it, "...these techniques in XML were inherited directly from SGML." But part of the struggle to overcome the legacy of SGML has been to find ways to do without DTDs. "For a variety of reasons," Bray says, "authors increasingly wish to avoid the use of DTDs, but still want to retain the convenience and readability of internal parsed entities." There is, in truth, a world of struggle to overcome the legacy of SGML packed into this disarmingly simple little sentence. Obviously the big move on this front was XML's introduction of the idea of well-formedness as against validity. As we all know by now, well-formed XML instances don't require DTDs. Other XML technologies replace aspects of DTD functionality, including W3C XML Schema and RELAX NG, XInclude, and the ongoing work on xml:id.

Bray's proposal is simple enough, really. He suggests adding another character set, one which is a very strict superset of UTF-8, which he calls UTF-8+names. Basically, the UTF-8+names character set is the UTF-8 character set plus a set of replacements, which are sequences that begin with "&", have some other character string, and end with ";". The character string enclosed by "&" and ";" is something Bray calls the "replacement name", and it is a representation of a Unicode character sequence which he calls the "replacement value". Thus, when using the UTF-8+names character set in an XML instance, one can use character sequences which look for all the world just like ol' SGML entities -- &uuml; -- but which are, in fact, simply containers of replacement names representing replacement values.
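The decoding step this implies can be sketched in a few lines of Python. Treat it as an illustration under stated assumptions, not as Bray's specification: the function name decode_utf8_plus_names is made up, Python's html.entities.name2codepoint table merely stands in for whatever list of replacement names the draft defines, and leaving XML's five predefined entities (and any unknown name) untouched is my guess at sensible behavior.

```python
import re
from html.entities import name2codepoint

# Assumption: XML's predefined entities are left for the XML parser itself.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# A replacement: "&", a replacement name, then ";".
_REPLACEMENT = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_utf8_plus_names(data):
    """Decode bytes as UTF-8, then swap known replacement names for
    their replacement values (hypothetical table: name2codepoint)."""
    text = data.decode("utf-8")

    def substitute(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # not ours to replace; pass through
        return chr(name2codepoint[name])

    return _REPLACEMENT.sub(substitute, text)
```

Because the substitution happens while the bytes are being turned into characters, the XML parser downstream never sees the friendly name at all, only the character it stands for. That is the sense in which the trick works "without touching XML".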

Bray's proposal met with fairly vigorous reaction. Seairth Jacobs' reaction (seconded by Elliotte Rusty Harold), that Bray might want to consider a format other than one which stuck so carefully to SGML's legacy format, is an interesting one, and it highlights the sense in which SGML is still the thing that XML people are often reacting to and against. Why not, as Jacobs said, a "@name;" or "#name;" form?

What does Bray's proposal show? That, as he puts it, "it is technically feasible to provide named characters without touching XML by using an alternative encoding of Unicode." That's a useful showing, but it's not clear that it's a viable way to move past SGML for this particular issue. Even more to the point, Bray adds that "there is no realistic prospect of adding entity declaration to any of the modern schema facilities or of somehow shoehorning it into XML itself in a DTD-less way."

A Competing Proposal

There's another proposal floating around XML-DEV lately, but it's not really a competitor, inasmuch as Bray's proposal was really just a thought experiment. Richard Tobin's proposal, which I think to be relatively sane and even a bit clever, uses XML namespaces to declare entities in XML attributes, thus:

xmlent:eacute="&#xe9;"

"&eacute;" is thus replaced by "&#xe9;", the relevant NCR, within the element on which this attribute exists. There's also an XML entity file version, using the attribute xmlentfile, the value of which is one or more URIs.

In addition to ridding ourselves of one of the last remaining needs for DTDs, Tobin's proposal is element-scoped, which means that an XML fragment carrying both its entity references and the declarations of those entities becomes much easier to drop into other XML fragments or instances. That's very handy.
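The element scoping is easier to see in code than in prose. What follows is a rough Python sketch, not Tobin's implementation: it uses a regular-expression scanner rather than a real parser (so comments, CDATA sections, and ">" inside attribute values will confuse it), it treats the prefix xmlent literally instead of resolving a namespace, and the function name expand_scoped_entities is invented. It keeps a stack of declarations, one level per open element, and rewrites a reference only when a declaration is in scope.

```python
import re

# One token is an end tag, a start (or empty-element) tag, or an entity reference.
TOKEN = re.compile(
    r"</(?P<end>[^\s>]+)\s*>"
    r"|<(?P<start>[^\s/>!?]+)(?P<attrs>[^>]*?)(?P<empty>/?)>"
    r"|&(?P<ref>[A-Za-z][\w.-]*);"
)
# Assumption: declarations look like xmlent:NAME="VALUE" with a literal prefix.
DECLARATION = re.compile(r'xmlent:(?P<name>[\w.-]+)\s*=\s*"(?P<value>[^"]*)"')
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def expand_scoped_entities(text):
    """Replace entity references with values declared on enclosing elements."""
    scopes = [{}]  # innermost scope is last
    out = []
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(text[pos:m.start()])
        pos = m.end()
        if m.group("end"):
            if len(scopes) > 1:
                scopes.pop()  # leaving the element ends its declarations
            out.append(m.group(0))
        elif m.group("start"):
            scope = dict(scopes[-1])  # inherit from the parent element
            for d in DECLARATION.finditer(m.group("attrs")):
                scope[d.group("name")] = d.group("value")
            if not m.group("empty"):
                scopes.append(scope)
            out.append(m.group(0))
        else:
            name = m.group("ref")
            if name in PREDEFINED:
                out.append(m.group(0))
            else:
                out.append(scopes[-1].get(name, m.group(0)))
    out.append(text[pos:])
    return "".join(out)
```

Given two sibling p elements of which only the first carries xmlent:eacute="&#xe9;", the sketch rewrites &eacute; inside the first and leaves it alone inside the second. A fragment that carries its own declarations therefore behaves the same wherever it is pasted.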

Whatever the proposal, whether Bray's or Tobin's or someone else's, there seems to be renewed energy among the members of XML-DEV, and perhaps among the XML development community at large, to take up once more the struggle against the remaining vestiges of SGML's legacy, including the DTD. I'm not sure that a world without DTDs will be a better world, but it will be a new one. And it will have been achieved by means of a struggle with our predecessors and precursors.