The Long, Long Arm of SGML

by Kendall Grant Clark
November 05, 2003

SGML's Influence, XML's Anxiety

Writing about XML invites a certain humanistic pretension, for several reasons. First, all this talk of universality and frictionless information exchange bears uncanny allusions, for those of us who inhabit certain parts of the ideological landscape, to what Christians used to call (before they learned about language and gender bias) the "brotherhood of man". Sure, in the end, it's all just bits and bytes, ones and zeroes; but in the XML world all that Matrixesque machine austerity comes packaged in a particular set of high tones and lofty ideals.

Second, since XML gets used in the production and dissemination of documents, artifacts which are still and will long be the primary bearers -- second only to natural persons, one suspects -- of cultural transmission, it tends to attract (or reward?) technical people with humanities backgrounds and training. Sure, in the end, the data wonks are likely to take over, and then we humanities-derived folks will be out on our collective asses; but for now, at least, we tend to fight them to a healthy draw.

Thus it is that, when reviewing the recent debate about Tim Bray's UTF-8+names proposal, I couldn't help but think of it in terms of an important theory of poetry, the one promulgated by Yale literary critic and professional academic contrarian Harold Bloom, in his book The Anxiety of Influence. Since this is an XML column, I won't go into much detail about Bloom's theory other than to give a synopsis: Bloom argues that poetry (and, by extension, all fiction and, perhaps, all of the creative arts) is formed in the struggle between generations of poets; that is, younger poets struggle to overcome the sense of anxiety brought about by the unbearable influence of the older, stronger poet. The primary means of enacting this struggle is for the younger poet to creatively misread, misconstrue, and mistake the work of the stronger, older poet. In these acts of creative erring, the younger poet forms a space within which some act of poetic originality, that is, of poetic independence, may be achieved.

What's the connection, you may be asking, between a theory of poetry and Tim Bray's proposal? The connection is the long, long arm of SGML.

Some significant percentage of the pain suffered by the XML development community over the past 5 years is directly attributable to dealing with the legacy of SGML. It has, in other words, turned out to be much harder, much more complex to do "SGML on the Web" than many people thought it would be. A considerable amount of the early traction seized by XML was due to the confluence of two forces: first, the technical maturity of SGML; second, the early to middle years of exuberance about the Web itself.

In various ways then, XML has really been about trying to overcome the legacy of SGML. Perhaps "overcome" isn't quite right; perhaps "modify and contemporize" is better? At any rate, XML has been driven in part by a sense that SGML had things right, but not just right, and that work remains to be done to overcome SGML's failings.

What About All These Funny Characters?

Tim Bray's recent proposal -- presented in IETF RFC form, no less -- for fixing the "funny character" issue in XML is a case in point. In XML you have two choices for entering Unicode characters that aren't on your keyboard. First, you can use a numeric character reference (NCR), which is an &#...; construct formed from a character's Unicode code point. The problem with NCRs is that they aren't very memorable; they're rather anti-mnemonic, in fact. Second, you can use an internal parsed entity, which is basically a binding, declared in a DTD, between a name and a replacement string. So, for example, one can declare a set of bindings between human-friendly names and NCRs; thus the producer of an XML document can use the friendly form, which gets turned into the NCR form. Anyone who's done any real work with SGML knows about and has used such sets of entities.
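To make the two choices concrete, here is a minimal sketch in Python, using the standard library's ElementTree parser. The document, the element name p, and the entity name eacute are illustrative choices, not anything taken from Bray's draft. The internal subset binds a friendly name to an NCR, and the parser expands both forms to the same character.

```python
import xml.etree.ElementTree as ET

# An illustrative document: the internal DTD subset declares an internal
# parsed entity "eacute" whose replacement text is the NCR for U+00E9.
DOCUMENT = """<!DOCTYPE p [
  <!ENTITY eacute "&#xE9;">
]>
<p>caf&eacute; or caf&#xE9;</p>"""

root = ET.fromstring(DOCUMENT)

# Both the named entity reference and the bare NCR arrive as the same character.
text = root.text
print(text)
```

The friendly form only works because the declaration travels with the document; take away the DOCTYPE and the reference to eacute has nothing to resolve against.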

As Bray puts it, "...these techniques in XML were inherited directly from SGML." But part of the struggle to overcome the legacy of SGML has been to find ways to do without DTDs. "For a variety of reasons," Bray says, "authors increasingly wish to avoid the use of DTDs, but still want to retain the convenience and readability of internal parsed entities." There is, in truth, a world of struggle to overcome the legacy of SGML packed into this disarmingly simple little sentence. Obviously the big move on this front was XML's introduction of the idea of well-formedness as against validity. As we all know by now, well-formed XML instances don't require DTDs. Other XML technologies replace aspects of DTD functionality, including W3C XML Schema and RELAX NG, XInclude, and the ongoing work on xml:id.
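The gap that well-formedness leaves behind can be sketched in a few lines of Python. This assumes the standard library's ElementTree parser as a stand-in for "current tools"; the helper name parses and the sample strings are hypothetical. With no DTD at all, an NCR is fine, but a named entity reference is rejected.

```python
import xml.etree.ElementTree as ET


def parses(source: str) -> bool:
    """Return True if the parser accepts the source, False if it rejects it."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


# No DOCTYPE in either document.
NUMERIC = "<p>caf&#xE9;</p>"   # numeric character reference: needs no declaration
NAMED = "<p>caf&eacute;</p>"   # named entity reference: nothing declares "eacute"

print(parses(NUMERIC), parses(NAMED))
```

This is the convenience authors lose when they drop the DTD: the mnemonic names go with it, and only the anti-mnemonic NCRs remain.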

Bray's proposal is simple enough, really. He suggests adding another character set, one which is a very strict superset of UTF-8, which he calls UTF-8+names. Basically, the UTF-8+names character set is the UTF-8 character set plus a set of replacements, which are sequences that begin with "&", have some other character string, and end with ";". The character string enclosed by "&" and ";" is something Bray calls the "replacement name", and it is a representation of a Unicode character sequence which he calls the "replacement value". Thus, when using the UTF-8+names character set in an XML instance, one can use character sequences which look for all the world just like ol' SGML entities -- &uuml; -- but which are, in fact, simply containers of replacement names representing replacement values.
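The idea of pushing names down into the encoding layer can be sketched as a decoding step that runs before any XML parser sees the text. What follows is a rough Python approximation under stated assumptions, not Bray's specification: the function name decode_utf8_plus_names is hypothetical, the name table is borrowed from Python's html.entities module (the HTML 4 entity names) because the draft's own table isn't reproduced here, and leaving XML's five predefined entities alone is my assumption.

```python
import re
from html.entities import name2codepoint  # stand-in name table (assumption)

# XML's predefined entities are left for the XML parser to handle (assumption).
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# A replacement: "&", a replacement name, ";". NCRs such as &#xE9; do not match.
_REPLACEMENT = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_utf8_plus_names(data: bytes) -> str:
    """Decode UTF-8, then swap each known replacement name for its value."""
    text = data.decode("utf-8")

    def substitute(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # not ours to replace; pass through untouched
        return chr(name2codepoint[name])

    return _REPLACEMENT.sub(substitute, text)


decoded = decode_utf8_plus_names(b"<p>M&uuml;ller &amp; Co</p>")
print(decoded)
```

By the time the parser runs, the replacement is already an ordinary character, so no entity declaration is needed and XML itself is untouched.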

Bray's proposal met with a fairly vigorous reaction. Seairth Jacobs' response (seconded by Elliotte Rusty Harold), that Bray might want to consider a format other than one which stuck so carefully to SGML's legacy syntax, is an interesting one, and it highlights the sense in which SGML is still the thing that XML people are often reacting to and against. Why not, as Jacobs said, a "@name;" or "#name;" form?

What does Bray's proposal show? That, as he puts it, "it is technically feasible to provide named characters without touching XML by using an alternative encoding of Unicode." That's a useful showing, but it's not clear that it's a viable way to move past SGML for this particular issue. Even more to the point, Bray adds that "there is no realistic prospect of adding entity declaration to any of the modern schema facilities or of somehow shoehorning it into XML itself in a DTD-less way."

A Competing Proposal

There's another proposal floating around XML-DEV lately, but it's not really a competitor, inasmuch as Bray's proposal was really just a thought experiment. Richard Tobin's proposal, which I think to be relatively sane and even a bit clever, uses XML namespaces to declare entities in XML attributes, thus:

xmlent:eacute="&#xe9;"

"&eacute;" is thus replaced by "&#xe9;", the relevant NCR, within the element on which this attribute exists. There's also an XML entity file version, using the attribute xmlentfile, the value of which is one or more URIs.

Tobin's proposal would rid us of one of the last remaining needs for DTDs. And because the declarations are scoped to elements, an XML fragment which uses entities carries the declarations of those entities along with it, so it becomes easier to include in other XML fragments or instances. That's very handy.
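The element scoping can be sketched as a pre-parse pass in Python. This is a deliberately naive approximation of the idea as described above, not Tobin's actual specification: the function name expand_scoped_entities and the namespace URI urn:example:xmlent are hypothetical, tags are found with a regular expression (so comments or CDATA sections containing ">" would confuse it), and references inside attribute values are left alone.

```python
import re
import xml.etree.ElementTree as ET

TOKEN = re.compile(r"(<[^>]+>)")                       # splits tags from text
DECL = re.compile(r'xmlent:([A-Za-z][\w.-]*)="([^"]*)"')  # xmlent:name="value"
REF = re.compile(r"&([A-Za-z][\w.-]*);")               # &name; (not NCRs)


def expand_scoped_entities(source: str) -> str:
    """Replace &name; using xmlent: declarations on enclosing elements."""
    scopes = [{}]  # stack of name -> replacement tables, innermost last
    out = []
    for token in TOKEN.split(source):
        if token.startswith("</"):
            scopes.pop()                      # leaving an element's scope
            out.append(token)
        elif token.startswith("<") and not token.startswith(("<?", "<!")):
            scope = dict(scopes[-1])          # inherit the outer declarations
            scope.update(DECL.findall(token))
            if not token.endswith("/>"):
                scopes.append(scope)          # empty elements open no scope
            out.append(token)
        elif token.startswith("<"):
            out.append(token)                 # declarations, PIs, comments
        else:
            table = scopes[-1]
            out.append(REF.sub(lambda m: table.get(m.group(1), m.group(0)), token))
    return "".join(out)


SOURCE = ('<doc xmlns:xmlent="urn:example:xmlent">'
          '<p xmlent:eacute="&#xe9;">caf&eacute;</p><q>caf&eacute;</q></doc>')
expanded = expand_scoped_entities(SOURCE)
print(expanded)
```

Only the reference inside p is rewritten to the NCR; the one inside q is out of scope and stays as it was, which is why a fragment can be lifted out along with its own declarations.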

No matter which proposal wins out, whether Bray's or Tobin's or someone else's, there seems to be fresh energy among the members of XML-DEV, and perhaps among the XML development community at large, for renewing the struggle to overcome the remaining vestiges of SGML's legacy, including the DTD. I'm not sure that a world without DTDs will be a better world, but it will be a new one. And it will have been achieved by means of a struggle with our predecessors and precursors.




  • Mismatch between parsing and generation
    2003-12-15 09:43:33 John Cowan

    A roadblock to the use of DTD internal subsets for entities in the present design is that XML parsers are only guaranteed to process the internal subset and are free to ignore any external subset, whereas XSLT can only generate a reference to an external subset on output and has no facilities for regenerating any internal subset. So a pipeline of XML transformations quickly loses the entity information and has no provision for re-creating it.

  • Basically right, but overreading a bit
    2003-11-10 09:54:05 Wendell Piez

    I think Kendall has this basically right, but there's also a bit of overreading which gets in the way. What he says about the SGML legacy in XML points to a significant aspect of the problem without actually revealing it -- in fact it's rather masked by the "Anxiety of Influence" analogy ... which also has it only partly right. It's true that every generation of engineers dismisses the challenges and disparages the solutions of the preceding generation (while being indelibly imprinted by them); but there's also more going on here.


    The requirement for human-legible representations of non-keyboard (non-displayable) characters exactly straddles the line between XML-document-as-lexical-instance (unparsed) and XML-document-as-model (parsed, probably into a tree) that so disconcerts XMLers. There is only one solution to this, namely to provide the processor with a mapping of external representations to internal structures. All the proposals are variations of this:


    Tim Bray - standardize a mapping and build it into the tools
    Richard Tobin - declare the mappings in namespaced attributes, not in a DTD
    old-fashioned - use a DTD, internally if you want to go standalone (what's the big deal?)


    The differences amount to differences in (a) required infrastructure, and (b) level of standardization -- but none of them get to the heart of the matter, which is that much as XML developers want to reduce the role of XML-as-lexical-instance in favor of the purity of the model, they (we) just can't get away from the fundamental requirement addressed by XML (or SGML before it) in the first place -- to represent something as complex as that model (to say nothing of the real-world documents or objects that that model seeks to represent!) in something as lo-fi as a stream of 7bit ASCII characters.


    Many SGML features that are so disparaged or derided by XMLers make much more sense if you look at this requirement in the context of systems with 4MHz processors and 640Kb of RAM (which, as some of us recall, is as much as we will ever need) ... in that world, you really want a DTD to configure lexical aspects of markup such as tag omissibility or DATATAG, which XML has decided there's no call for. Consequently, SGML was much more willing to see the lexical instance as a primary artifact (not just a temporary serialization of the "real thing") and much more ecumenical with respect to processing models than is XML. (Anyone remember the Desperate Perl Hacker? Tree-based namespaces did him in: R.I.P.)


    My own prediction is that this particular proposal won't really go places -- DTDs, especially given internal subsets, just aren't that broken -- but that the underlying issue won't disappear either. It's only evident, now, in brushfires like this one; but as long as XML is still growing into the application spaces where the tree model is sufficient -- which is to say, as long as we can manage to forego our needs for even more complex kinds of representation such as overlapping structures -- there'll be enough to keep us busy, and the worst stresses and strains will remain potential.


    In the meantime, general entities, and the kind of bridge-to-the-serializer solution pioneered by Zarella and Tony, will be enough.


    SGML is Dead! Long Live SGML!


  • xmlchar
    2003-11-08 03:59:08 Anthony Coates

    Another alternative is the 'xmlchar' library, which Zarella Rendon & I based on some earlier musings of Tim's. It is just a set of XML elements, one for each special character, that you can embed in your documents. Then there is an XSLT stylesheet that lets you do a final post-process to convert the xmlchar elements into numeric character references. Simplistic, but it works. So far it only covers the HTML special characters, but that contains 80% of the special characters used by 80% of the people. See
    http://sourceforge.net/projects/xmlchar/

    • xmlchar
      2003-11-08 04:01:14 Anthony Coates

      Oh yes, nearly forgot, Zarella and I did an XML.com article about it a while back.
      http://www.xml.com/pub/a/2003/01/02/xmlchar.html

  • The Competing Proposal
    2003-11-06 03:37:14 d e

    Unless I'm missing something, there is a flaw in this. Current tools would not see these documents as being well formed, as the entity reference is not defined in any DTD. If it used a different syntax, without significance in XML 1.1, it could be backward compatible. This would be similar to the way XML namespaces do not break well-formedness when parsed by non-namespace aware applications.


    Also, the elements (typically the root) could become cluttered with these attributes. And there is no way to import a set of them, such as the entities used in HTML.