XHTML is the Most Important XML Vocabulary

May 21, 2003

Taking the long view of recent technology, XHTML may be the most important XML vocabulary ever created. What I mean is not that XHTML will be the most widely deployed XML vocabulary, though if we take the long view, it could be. What I mean is that XHTML puts XML's reputation -- and, by extension, the W3C's reputation -- on the line to a greater degree than any other XML vocabulary. (Which, if true, makes XHTML 2.0's relative absence in discussion on XML-DEV a puzzle...)

There are, as XML.com readers are fully aware, thousands of XML vocabulary projects proposed, underway, or completed. They range from the simple to the sublime; well, probably not sublime, but at least crucial to some larger technical or social endeavor. But XHTML is the most crucial, for the reputations of both XML and the W3C, because it is the most visible, the most document-centric, and the most central to the future health and vitality of the Web itself.

As I wrote in an XML-Deviant column last summer ("XHTML 2.0: The Latest Trick"), ordinary Web designers and content creators quickly learned and used early versions of HTML because they were reasonably easy to grasp, and the reward for learning them was substantial. A reasonably computer-literate person can still learn to create XHTML 1.1 documents with reasonable effort and within a reasonable time. Even if it takes a week of evenings to become comfortable with the main features of XHTML, that's a small investment to make for a relatively big return.

The Web's success, then, is due in part to the simplicity and generality of HTML. The ongoing success of the Web will be in part a function of maintaining a positive balance between how difficult and how empowering it is to learn XHTML. Some form of HTML, eventually XHTML, will always be the most common type of Web content; people will keep writing it by hand, building user interfaces with it, trying, succeeding, failing to scrape useful information from it, and so on. Any part of the Web's infrastructure with such a long future life cycle deserves careful, attentive, community shepherding.

XHTML 2.0 Continues to Evolve

The HTML Working Group released a new draft of XHTML 2.0 at the beginning of May. It is a draft which displays evidence that community feedback can make a difference to the development of a specification. In what follows I briefly comment on some of the most interesting bits of the new XHTML 2.0 draft.

The arrival of RELAX NG. Perhaps the most welcome development, particularly from the perspective of XML-DEV geeks, is the appearance of a normative RELAX NG schema for XHTML 2.0. This development is welcome because it signals a growing acceptance of RELAX NG -- a non-W3C schema specification language -- within the working groups of the W3C. It is also welcome because XHTML is among the most document-centric of all XML vocabularies, and having RELAX NG's fittingness for such vocabularies on display is a good thing.

The Edit Collection. The most striking difference between the Web as it evolved and the Web as it was intended (by, among others, Tim Berners-Lee) is the read-only nature of the Web for most of its users. In other words, the early vision of the Web was of a read-and-write medium, and the earliest Web browsers were built as read-and-write tools.

XHTML 2.0's section 6.4 "Edit Collection" adds back some support for Web content editing. The collection, according to the new draft, "allows elements to carry information indicating how, when and why content has changed." Particular XHTML 2.0 elements (including inline elements like <span>) can have an edit attribute, which can have one of four permissible values: inserted, deleted, changed, moved. One of these values, deleted, carries with it a "default presentation" which, in CSS terms, is display: none. For those of us for whom XHTML is or will be an editorial workflow document format, the Edit Collection is a move in the right direction.
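
As a rough sketch of how a consuming tool might honor that default presentation, consider the following. It uses Python's standard xml.etree.ElementTree; the sample fragment, the omission of any namespace, and the helper names visible_text and list_edits are illustrative assumptions, not anything taken from the draft.

```python
import xml.etree.ElementTree as ET

# The four values the draft permits for the edit attribute.
EDIT_VALUES = {"inserted", "deleted", "changed", "moved"}

# Hypothetical fragment, written without a namespace to keep the sketch short.
FRAGMENT = (
    '<p>The meeting is on <span edit="deleted">Tuesday</span>'
    '<span edit="inserted">Wednesday</span> at noon.</p>'
)


def visible_text(elem):
    """Return elem's text, skipping any subtree marked edit="deleted"."""
    if elem.get("edit") == "deleted":
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(visible_text(child))
        # Text that follows a child belongs to the parent, so it is always kept.
        parts.append(child.tail or "")
    return "".join(parts)


def list_edits(root):
    """Collect (edit value, text) pairs for every element carrying a valid edit attribute."""
    return [
        (e.get("edit"), e.text or "")
        for e in root.iter()
        if e.get("edit") in EDIT_VALUES
    ]


if __name__ == "__main__":
    root = ET.fromstring(FRAGMENT)
    print(visible_text(root))
    print(list_edits(root))
```

Run against the fragment, visible_text drops "Tuesday" and keeps "Wednesday", while list_edits reports both changes. A real user agent would presumably leave the hiding to its style sheet; the point here is only that the edit values are easy to act on.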

The return of style. Surely the most hotly contested XHTML 2.0 change was an early draft's removal of the style attribute, which allows CSS designers to apply local style code to XHTML constructs. The debate between those who wanted to remove and those who wanted to preserve the style attribute highlighted a fundamental cleft in the XHTML community between -- to put it not too tendentiously -- markup geeks and presentation weenies. Each side got a bit nasty during the debate, causing no small amount of schadenfreude among bemused onlookers. (The anti-style attribute position was most aptly argued by Ian Hickson in -- whether you agree with it or not -- a classic mailing list post in January of this year.)

The HTML Working Group has demonstrated, however, that it knows how to listen to community squabbles, and it has restored the style attribute in the latest XHTML 2.0 draft. I suspect, however, that we have not heard the last word on this issue, and I wouldn't be at all surprised if the style attribute finds itself out in the cold again at some point.

The revenge of the nerds: <blockcode>. Moving on to an issue nearer to my geeky heart, the Working Group has added an analogue of the venerable <blockquote> just for programmers: <blockcode>. My only complaint is that the similar element names mean a bit of my HTML muscle memory is going to have to be retrained. If you squint hard enough, <blockcode> is syntactic and semantic sugar for <pre><code>-sequences. It can carry a class attribute, which may be used to indicate the type of code contained in the block. I suspect that this is probably semantically underdetermined, but first things first. Even though this new feature is of no interest to the great majority of non-programming XHTML users, I can't help but think that it's one of my personal favorites. I look forward to being able to do stuff like this:

<blockcode class="http://www.python.org/">
from mailbox import UnixMailbox
from email import message_from_file; import sys

mbox = UnixMailbox(open(sys.argv[1], 'r'), message_from_file)
new_mbox = open(sys.argv[2], 'w')
substring = sys.argv[3]

for message in mbox:
    if message['subject'].find(substring) != -1:
        new_mbox.write(message.as_string(unixfrom=1))
new_mbox.close()
</blockcode>
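
To make the "sugar" claim concrete, here is a minimal sketch that rewrites <pre><code> sequences as <blockcode>. It assumes un-namespaced markup and carries the attributes of <code> (such as class) over to the new element; the function name pre_code_to_blockcode is hypothetical, and the draft itself defines no such conversion.

```python
import xml.etree.ElementTree as ET


def pre_code_to_blockcode(root):
    """Rewrite, in place, each <pre> whose only content is one <code> child."""
    # Materialize the list first so the tree is not modified while iterating.
    for pre in list(root.iter("pre")):
        children = list(pre)
        wraps_only_code = (
            len(children) == 1
            and children[0].tag == "code"
            and not (pre.text or "").strip()
            and not (children[0].tail or "").strip()
        )
        if not wraps_only_code:
            continue  # leave ordinary <pre> blocks alone
        code = children[0]
        pre.remove(code)
        pre.tag = "blockcode"
        pre.attrib.update(code.attrib)  # e.g. class, naming the kind of code
        pre.text = code.text
        for child in list(code):
            pre.append(child)  # keep any inline markup inside the code
    return root


if __name__ == "__main__":
    doc = ET.fromstring(
        '<div><pre><code class="python">print("hi")</code></pre>'
        "<pre>plain text</pre></div>"
    )
    pre_code_to_blockcode(doc)
    print(ET.tostring(doc, encoding="unicode"))
```

Only a <pre> that wraps a single <code> is converted; a bare <pre> is left as it is, since nothing says its content is code.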

The return of <cite>. Paralleling the return of the style attribute, the cite element has also returned to the latest XHTML 2.0 draft. Though its removal was not as hotly contested as the removal of style, <cite> definitely has its fans and supporters, and I number myself among them. That is ironic since, in my experience, <cite> is by far the bit of HTML most often misused by XML.com authors. It isn't used very often, but when XML.com authors use it, it's almost always misused as if it were <citation>. The cite element takes a cite attribute, which I think would be better named "source", but that's merely a quibble.

Caption, Glorious Caption! I have been using HTML of one variety or another since 1995, and I have most frequently lamented the lack of a generic way to mark up a caption for images. As newspaper and other hard media geeks know, editorial images just about always demand some kind of captioning text, usually containing image metadata of some kind or another: author, date, copyright, etc. In these editorial contexts, the lack of a caption construct has meant faking it with redundant and vague table-and-paragraph constructs. The advent of CSS has alleviated the pain here somewhat, but it's long past time that a first-class caption construct was added to XHTML.

I am very pleased to report that the latest XHTML 2.0 draft contains a provision for a caption element, which may reside within either table or object elements. I applaud this rational, simplifying, and long overdue addition. There is more than enough evidence of the utility and need for exactly this sort of addition.

XHTML 2.0 is headed in the right direction, even if you're among those who think that, for example, the style attribute should die a horrible death. Sometimes W3C working groups do not have much of an active user community with which to have a dialog about their work. But in those lucky cases where there is such a community, working groups do well to pay careful attention to what its members want and say. This general rule is even more important in the case of XHTML. Despite the widespread pessimism about XHTML's deployment, it is far, far too important to be left in the hands of a working group alone.