XML.com: XML From the Inside Out

Extending the Web: XHTML Modularization

January 16, 2002

Many ... have indicated that they wish to subset and extend HTML ... They want to do this to accommodate device-specific functionality, to limit the content that is sent to smaller-footprint devices, or to enhance their ability to produce useful Internet content ... The best way to satisfy these requirements ... is to define a framework that can be used to develop markup languages derived from HTML. Once defined, the framework would be used as a means for defining extensions to XHTML, and as a set of building blocks that markup language designers could use to bring the extensions together with the base into a cohesive whole. -- XHTML Modularization: An Overview

Content, Markup Languages, and the Web

In the beginning, ordinary Web designers and content creators knew HTML, and it was good. Or good enough. A reasonably computer-literate person could learn to create HTML documents with a reasonable effort and within a reasonable time. If it took a week of evenings to become comfortable with the main features of HTML, that was a small investment to make for a big return.

The Web's success, then, is due in part to the simplicity and generality of HTML. Much the same can be said, at a different level of complexity, about HTTP, but HTTP's simplicity matters to far fewer people: there will always be more people writing pages for the Web than building servers for it. HTML is one of the key reasons the Web went from dozens to billions of documents very quickly. But HTML's simplicity and generality are not without cost.

As markup languages go, HTML is pretty easy for people to handle, but some of the features that make it easy for people make it hard for machines (that is, hard for other people using machines to do automated handling). While the document semantics that HTML provides hit the sweet spot for a particular range of texts, its very generality makes it ill-suited to capturing the semantics of genre texts (say, the ubiquitous FAQ) or the kinds of application-oriented data HTML is often used to communicate.

Since the W3C's long-term vision for the Web was always something like what is now called the Semantic Web, machine-processable documents with domain-specific semantics have perpetually been just over the horizon. But the goal of an extensible Web lingua franca wasn't only the residue of AI dreams in the brains of the Web's creators. Eventually a sense of frustration emerged among content creators, too. While HTML was mostly what they needed, sometimes it was too general. Why settle for an elaborate, nested, presentational <TABLE>-<UL>-<P> structure when what you really want to write (because what you really want to say) is

    What is the international legal test for using force in
    The classic test for self-defensive force comes from Daniel
    Webster's diplomatic notes in the Caroline case. In order for an
    anticipatory use of force to be legitimate self-defense, there
    must, Webster wrote, be a "necessity of self-defense, instant,
    overwhelming, leaving no choice of means, and no moment for
    deliberation"; the force used must not be "unreasonable or
    excessive"; it must be "limited by that necessity and kept clearly
    within it".

No one would prefer to write an elaborate, fragile presentational markup structure if they could instead write what they mean directly. The FAQ is just one of many genre and domain documents, each of which has specific semantics, that need to be published on the Web.
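
To make the contrast with presentational markup concrete, here is a minimal sketch of why semantic FAQ markup is easier for machines to handle. The element names (`faq`, `entry`, `question`, `answer`) and the placeholder text are assumptions made up for illustration; they are not taken from any published vocabulary or from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical FAQ vocabulary: these element names are assumptions
# chosen for illustration, not part of any standard.
FAQ_XML = """\
<faq>
  <entry>
    <question>Placeholder question text?</question>
    <answer>Placeholder answer text.</answer>
  </entry>
</faq>
"""


def extract_pairs(xml_text):
    """Return (question, answer) tuples from the hypothetical FAQ markup."""
    root = ET.fromstring(xml_text)
    return [
        (entry.findtext("question"), entry.findtext("answer"))
        for entry in root.iter("entry")
    ]


pairs = extract_pairs(FAQ_XML)
for question, answer in pairs:
    print(f"Q: {question}\nA: {answer}")
```

Because the markup says what each piece of text is, pulling out every question takes a few lines. Doing the same against a nested table-and-list layout would mean guessing which cell or list item plays which role.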

Early in the development and marketing of XML, content creators were told that XML was the fix for the HTML-extensibility problem -- countless articles assured them that by using XML, they could create "all the tags that were needed". Much of the earliest evangelization and resultant buy-in of XML happened because content creators, and their communities and organizations, took XML to be a palliative for their frustrations about extensibility.

However, under pressure from corporations, standards bodies, and communities of programmers -- each of which exerted a different kind of pressure -- XML has become, and is still becoming, a complicated family of technologies, one beyond the grasp of many content creators, most of whom don't know or don't want to know a DOM from a SAX event handler from a parameter entity.

Redeeming a Promise of XML

So, for content creators, XHTML -- the W3C's "reformulation" of HTML 4 as an XML application -- is in fact what XML was in market-speak: a way to semantically extend the Web's lingua franca by adding domain- and genre-specific elements and attributes. While the primary way of pitching XHTML to date has been to emphasize its capacity to do small-device customization, this boils down to domain-specific extensibility in the end, even if the mode of extensibility is subtraction of elements and attributes, rather than addition.

Extending XHTML is analogous to designing a domain-specific language (DSL). A DSL is a programming language meant to solve a range of problems within a particular domain. A casual search of the Web reveals DSLs for graphics, computer security, Web caching, process scheduling, simulations, device drivers, and dozens of other domains. Sometimes they are implemented from scratch, that is, in C or another systems language; sometimes they are embedded, that is, implemented in or by some higher-level, general-purpose language.

The lines of our little modular calculus (to misappropriate writer Samuel R. Delany's term of art) run this way: XML is a general-purpose language with which the W3C has implemented XHTML, which is a platform for embedding content-domain-specific markup languages within the Web's lingua franca.

Paul Hudak, a computer scientist at Yale, proposed a process for building domain-specific languages, and it adapts pretty well to extending XHTML, to building a content-domain-specific markup language:

  1. Choose a content domain for which a markup language is needed
  2. Develop an informal model that accurately captures the semantics of that content domain
  3. Create an XHTML module that represents the informal model
  4. Write instances of the new content-domain-specific markup language
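
The four steps above can be sketched in miniature. This is only a toy stand-in: a real XHTML module is written as a DTD fragment under the Modularization framework, not as a Python dictionary. The FAQ element names and the `violations` helper are hypothetical, chosen to show how an informal model (step 2) becomes something that instances (step 4) can be checked against.

```python
import xml.etree.ElementTree as ET

# Step 2, as an assumption: an informal model of the FAQ domain, mapping
# each hypothetical element to the child elements it may contain.
FAQ_MODEL = {
    "faq": {"entry"},
    "entry": {"question", "answer"},
    "question": set(),
    "answer": set(),
}


def violations(element, model):
    """Return messages for elements that break the informal model."""
    if element.tag not in model:
        return [f"unknown element <{element.tag}>"]
    problems = []
    allowed = model[element.tag]
    for child in element:
        if child.tag not in allowed:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            # The child is permitted here, so check its own content too.
            problems.extend(violations(child, model))
    return problems


# Step 4: two instances, one that follows the model and one that does not.
good = ET.fromstring(
    "<faq><entry><question>Q?</question><answer>A.</answer></entry></faq>"
)
bad = ET.fromstring("<faq><question>Q?</question></faq>")

print(violations(good, FAQ_MODEL))  # []
print(violations(bad, FAQ_MODEL))   # ['<question> not allowed inside <faq>']
```

The sketch checks only which elements may nest inside which. Attributes, ordering, and required content are left out, and a real module definition would have to address them.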
