Extending the Web: XHTML Modularization

January 16, 2002

Many ... have indicated that they wish to subset and extend HTML ... They want to do this to accommodate device-specific functionality, to limit the content that is sent to smaller-footprint devices, or to enhance their ability to produce useful Internet content ... The best way to satisfy these requirements ... is to define a framework that can be used to develop markup languages derived from HTML. Once defined, the framework would be used as a means for defining extensions to XHTML, and as a set of building blocks that markup language designers could use to bring the extensions together with the base into a cohesive whole. -- XHTML Modularization: An Overview

Content, Markup Languages, and the Web

In the beginning, ordinary Web designers and content creators knew HTML, and it was good. Or good enough. A reasonably computer-literate person could learn to create HTML documents with a reasonable effort and within a reasonable time. If it took a week of evenings to become comfortable with the main features of HTML, that was a small investment to make for a big return.

The Web's success, then, is due in part to the simplicity and generality of HTML. While much the same can be said, at a different level of complexity, about HTTP, it can't be said for as many people. There will always be more people writing pages for the Web than building servers for it. HTML is one of the key reasons the Web went from dozens to billions of documents very quickly. But HTML's simplicity and generality are not without cost.

As markup languages go, HTML is pretty easy for people to handle, but some of the features that make it easy for people make it hard for machines (that is, hard for other people using machines to do automated handling). While the document semantics that HTML provides hit the sweet spot for a particular range of texts, its very generality makes it ill-suited to capturing the semantics of genre texts (say, the ubiquitous FAQ) or of the kinds of application-oriented data HTML is often used to communicate.

Since the W3C's long-term vision for the Web was always something like what is now called the Semantic Web, machine-processable documents with domain-specific semantics have always seemed to be just over the horizon. But the goal of an extensible Web lingua franca wasn't only the residue of AI dreams in the brains of the Web's creators. Eventually a sense of frustration emerged among content creators, too. While HTML was mostly what they needed, sometimes it was too general. Why settle for an elaborate, nested, presentational <TABLE>-<UL>-<P> structure when what you really want to write (because what you really want to say) is


<faqItem>
  <question>
    What is the international legal test for using force in
    self-defense?
  </question>
  <answer>
    The classic test for self-defensive force comes from Daniel
    Webster's diplomatic notes in the Caroline case. In order for an
    anticipatory use of force to be legitimate self-defense, there
    must, Webster wrote, be a "necessity of self-defense, instant,
    overwhelming, leaving no choice of means, and no moment for
    deliberation"; the force used must not be "unreasonable or
    excessive"; it must be "limited by that necessity and kept clearly
    within it".
  </answer>
</faqItem>

No one would prefer to write an elaborate, fragile presentational markup structure if they could instead write what they mean directly. The FAQ is just one of many genre and domain documents, each of which has specific semantics, that need to be published on the Web.

Early in the development and marketing of XML, content creators were told that XML was the fix for the HTML-extensibility problem -- countless articles assured them that by using XML, they could create "all the tags that were needed". Much of the earliest evangelization and resultant buy-in of XML happened because content creators, and their communities and organizations, took XML to be a palliative for their frustrations about extensibility.

However, under pressure from corporations, standards bodies, and communities of programmers -- which exerted different kinds of pressure -- XML became, and is becoming, a complicated family of technologies, one beyond the grasp of many content creators, most of whom don't know or don't want to know a DOM from a SAX event handler from a parameter entity.

Redeeming a Promise of XML

So, for content creators, XHTML -- the W3C's "reformulation" of HTML 4 as an XML application -- is in fact what XML was in market-speak: a way to semantically extend the Web's lingua franca by adding domain- and genre-specific elements and attributes. While the primary way of pitching XHTML to date has been to emphasize its capacity to do small-device customization, this boils down to domain-specific extensibility in the end, even if the mode of extensibility is subtraction of elements and attributes, rather than addition.

Extending XHTML is analogous to designing a domain-specific language (DSL). A DSL is a programming language meant to solve a range of problems within a particular domain. A casual search of the Web reveals DSLs for graphics, computer security, Web caching, process scheduling, simulations, device drivers, and dozens of others. Sometimes they are implemented from scratch, that is, in C or another system language; sometimes they are embedded, that is, implemented in or by some higher-level, general-purpose language.

The lines of our little modular calculus (to misappropriate writer Samuel R. Delany's term of art) run this way: XML is a general-purpose language with which the W3C has implemented XHTML, which is a platform for embedding content-domain-specific markup languages within the Web's lingua franca.

Paul Hudak, a computer scientist at Yale, proposed a process for building domain-specific languages, and it adapts pretty well to extending XHTML, to building a content-domain-specific markup language:

1. Choose a content domain for which a markup language is needed.
2. Develop an informal model that accurately captures the semantics of that content domain.
3. Create an XHTML module that represents the informal model.
4. Write instances of the new content-domain-specific markup language.

A Brief Overview of XHTML Module Creation

Adapting Hudak's DSL-creation process is useful, but step 3 is inaccurate in a way worth remarking. Because XHTML is really a collection of modules fitted together to act like HTML 4, one can equally embed a new module within XHTML, thus using XHTML like a host language, or embed some XHTML modules into a separate markup language, thus using XHTML as an integration tool.

The modularization of XHTML makes both patterns of embedding possible, but in what follows I concentrate on extending rather than integrating XHTML. Extending XHTML involves nontrivial XML DTD hacking. The W3C aims to make XHTML modules implementable using W3C XML Schema as well, but for many people W3C XML Schema hackery is more off-putting than XML DTD hackery.

The easiest way to extend XHTML is to do so informally, without any of this machinery. The catch is that the XHTML specification requires XHTML instances to be valid; there is no well-formed but not valid XHTML. If you add elements or attributes to XHTML, but do not go through the DTD or Schema hacking contortions to extend XHTML formally, the resulting instances may be well-formed XML, but they aren't formally XHTML. Depending on what you need, that may or may not matter.
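To make the informal route concrete, here is a minimal sketch of such an instance. It assumes the FAQ elements shown earlier and borrows the namespace URI that the hypothetical unfaq module uses later in this article; the document title is invented for illustration. The instance is well-formed and namespace-aware, but no XHTML DTD declares the unfaq elements, so it would not validate.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Well-formed XML, but not valid (and so not formally) XHTML -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:unfaq="http://www.un.org/XML/XHTML/faq/1.0/">
  <head>
    <title>Use of Force FAQ</title>
  </head>
  <body>
    <!-- Extension elements dropped in without any DTD support -->
    <unfaq:faqItem>
      <unfaq:question>
        What is the international legal test for using force in
        self-defense?
      </unfaq:question>
      <unfaq:answer>
        The classic test for self-defensive force comes from Daniel
        Webster's diplomatic notes in the Caroline case.
      </unfaq:answer>
    </unfaq:faqItem>
  </body>
</html>
```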

A DTD specifies elements, element attributes, and element content models. It follows, then, that an XHTML extension module specifies elements, element attributes, or element content models. You extend XHTML by adding elements or attributes, by modifying content models, or by some combination of these. The concrete implementation of an XHTML module requires both a qname (qualified name) module, which does namespace handling, and a declaration module, which holds the element, element attribute, and content model declarations. The declaration module uses the parameter entities declared in the qname module.

The qname Module

Through clever use of INCLUDE and IGNORE sections, the qname module declares all the qualified names of the XHTML module, including whether or not XML namespaces are used. A qname module contains at least five parameter entities (see Norm Walsh's "What is XML?" for a refresher on parameter entities), plus one for each new element the module declares; the names of these parameter entities are formed with the name of the module being defined.

So, for example, if you were building an XHTML module for FAQs for the United Nations, you might name your module "unfaq" and put the following parameter entities into the qname module.

First, unfaq.prefixed, whose default value is "%NS.prefixed;", declares whether or not unfaq's elements are to be used with XML namespace prefixed names. NS.prefixed, the parameter entity that unfaq.prefixed refers to, defaults to IGNORE.

Second, unfaq.xmlns, which contains unfaq's namespace URI, http://www.un.org/XML/XHTML/faq/1.0/.

Third, unfaq.prefix, which contains unfaq's default prefix string, which is used when prefixing is turned on: "unfaq".

Fourth, unfaq.pfx, whose value is %unfaq.prefix; followed by a colon if prefixing is turned on; otherwise its value is the empty string.

Fifth, unfaq.xmlns.extra.attrib, which has as its value the declaration of any XML namespace attributes for any namespaces used in the unfaq module.

Sixth, for every element defined by unfaq, the qname module contains a parameter entity that holds the qualified name. If unfaq declares three new elements -- faqItem, question, and answer -- its qname module would have parameter entities unfaq.faqItem.qname (value: %unfaq.pfx;faqItem), unfaq.question.qname (value: %unfaq.pfx;question), and unfaq.answer.qname (value: %unfaq.pfx;answer). Thus, if prefixing is turned on, the faqItem element will be written as <unfaq:faqItem>, otherwise it will be <faqItem>.
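Taken together, the six kinds of parameter entity might be sketched as the following qname module. This is an illustration modeled on the pattern in the W3C's modularization examples, not a tested module: the file name unfaq-qname-1.mod is hypothetical, and %URI.datatype; is assumed to have been declared already by XHTML's datatypes module.

```xml
<!-- unfaq-qname-1.mod: hypothetical qname module for unfaq -->

<!-- Namespace URI of the module -->
<!ENTITY % unfaq.xmlns "http://www.un.org/XML/XHTML/faq/1.0/" >

<!-- Prefixing is off unless NS.prefixed was set to INCLUDE earlier -->
<!ENTITY % NS.prefixed "IGNORE" >
<!ENTITY % unfaq.prefixed "%NS.prefixed;" >

<!-- Default prefix string -->
<!ENTITY % unfaq.prefix "unfaq" >

<!-- unfaq.pfx is "unfaq:" when prefixing is on, empty otherwise.
     The first declaration of an entity wins, so the fallback
     applies only when the conditional section is ignored. -->
<![%unfaq.prefixed;[
<!ENTITY % unfaq.pfx "%unfaq.prefix;:" >
]]>
<!ENTITY % unfaq.pfx "" >

<!-- Namespace declaration attribute, needed only when prefixing is on.
     URI.datatype is assumed to come from XHTML's datatypes module. -->
<![%unfaq.prefixed;[
<!ENTITY % unfaq.xmlns.extra.attrib
    "xmlns:%unfaq.prefix; %URI.datatype; #FIXED '%unfaq.xmlns;'" >
]]>
<!ENTITY % unfaq.xmlns.extra.attrib "" >

<!-- Qualified names of the three elements -->
<!ENTITY % unfaq.faqItem.qname  "%unfaq.pfx;faqItem" >
<!ENTITY % unfaq.question.qname "%unfaq.pfx;question" >
<!ENTITY % unfaq.answer.qname   "%unfaq.pfx;answer" >
```

The paired declarations rely on a DTD rule: when a parameter entity is declared more than once, the first declaration is binding. That is what lets an INCLUDE or IGNORE switch choose between two values.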

The declaration Module

The declaration module of an XHTML extension module contains the actual declarations of all elements, element attributes, and element-content models which together constitute the module. In the case of the hypothetical UN FAQ module, the declaration module would contain declarations for the three elements, faqItem, question, and answer. The ATTLIST for each element, in addition to the attributes required by the content domain itself, must also contain a reference to the parameter entity %NS.decl.attrib; if prefixing is turned on. If prefixing is turned off, the ATTLIST must instead include the specific namespace information for the module.

The trick to building the declaration module is to remember to make declarations about the parameterized structures from the qname module; that is, the declaration module looks like an ordinary XML DTD, except that it uses qualified names via parameter entities. For example,


<!ELEMENT %unfaq.faqItem.qname;  ( %unfaq.question.qname;,
                                   %unfaq.answer.qname; )       >
<!ATTLIST %unfaq.faqItem.qname;
          %unfaq.xmlns.attrib;                                  >

<!ELEMENT %unfaq.question.qname; ( #PCDATA )                    >
<!ATTLIST %unfaq.question.qname;
          %unfaq.xmlns.attrib;                                  >

<!ELEMENT %unfaq.answer.qname;   ( #PCDATA )                    >
<!ATTLIST %unfaq.answer.qname;
          %unfaq.xmlns.attrib;                                  >
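The declarations above refer to %unfaq.xmlns.attrib;, which has to be declared before it is used. One plausible way to do so, following the pattern in the W3C's own examples, is to let a single parameter entity cover both the prefixed and the unprefixed case. The sketch below assumes that unfaq.prefixed and unfaq.xmlns come from the qname module and that %NS.decl.attrib; and %URI.datatype; are supplied by the XHTML framework.

```xml
<!-- Placed at the top of the declaration module, before the ATTLISTs -->

<!-- Prefixing on: use the framework's collected namespace declarations -->
<![%unfaq.prefixed;[
<!ENTITY % unfaq.xmlns.attrib "%NS.decl.attrib;" >
]]>

<!-- Prefixing off: this fallback applies, fixing the default namespace
     on each unfaq element to the module's own namespace URI -->
<!ENTITY % unfaq.xmlns.attrib
    "xmlns %URI.datatype; #FIXED '%unfaq.xmlns;'" >
```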

The New DTD

The last step is to create the actual DTD machinery for the XHTML extension. It references both the XHTML modules and the new modules which contain the new semantics.

The new model module must be combined with XHTML's other model modules in a new DTD. The various qname modules corresponding to each model module must be collected into a new module, the qualified names collection, which contains all the qualified names for the extended XHTML markup language. The qualified names collection module must reference the qname module for each extension module defined; in this case, unfaq. It must also contain the declaration of the XHTML.xmlns.extra.attrib parameter entity as the collection of unfaq.xmlns.extra.attrib parameter entities.
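For a language with a single extension module, the qualified names collection can be very short. The following sketch assumes hypothetical file names (unfaq-qname-collection.mod for the collection, unfaq-qname-1.mod for the qname module) and an entity name, unfaq-qname.mod, chosen only for illustration.

```xml
<!-- unfaq-qname-collection.mod: hypothetical qualified names collection -->

<!-- Pull in the qname module of each extension module; here, only unfaq -->
<!ENTITY % unfaq-qname.mod SYSTEM "unfaq-qname-1.mod" >
%unfaq-qname.mod;

<!-- Collect the namespace declaration attributes of every extension.
     With several extension modules, each module's
     xmlns.extra.attrib entity would be listed in this value. -->
<!ENTITY % XHTML.xmlns.extra.attrib
    "%unfaq.xmlns.extra.attrib;" >
```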

In addition to a file constituting the collection of qname modules, the new markup language needs a driver file, which is read by validating parsers and other XML tools in order to validate the new markup language. The driver file must declare a parameter entity, XHTML.version, the value of which is the formal public identifier for the newly created markup language; for example,


"-//United Nations//DTD XHTML Frequently Asked Questions Extension 1.0//EN"

The driver file must contain a declaration of the parameter entity xhtml-qname-extra.mod, the value of which points to the qnames collection module for the newly created markup language. Next, the driver must contain a declaration of the parameter entity xhtml-model.mod, the value of which points to the model module for the new markup language. Finally, the driver points to the declaration module of the new markup language as well as to the XHTML modules themselves, including the module that holds all the assorted modularization machinery, the XHTML Modularization Framework Module.
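Put together, a driver along these lines might look like the sketch below. Every file name in it is an assumption: the unfaq files are hypothetical, and the XHTML modules are assumed to be local copies of the W3C's module files. A real driver would include many more XHTML modules than the single one shown.

```xml
<!-- unfaq-xhtml.dtd: hypothetical driver file -->

<!-- Formal public identifier of the new markup language -->
<!ENTITY % XHTML.version
    "-//United Nations//DTD XHTML Frequently Asked Questions Extension 1.0//EN" >

<!-- Leave extension elements unprefixed; use INCLUDE to prefix them -->
<!ENTITY % NS.prefixed "IGNORE" >

<!-- Qualified names collection for the extension -->
<!ENTITY % xhtml-qname-extra.mod SYSTEM "unfaq-qname-collection.mod" >

<!-- Content model module for the new markup language -->
<!ENTITY % xhtml-model.mod SYSTEM "unfaq-model-1.mod" >

<!-- Modularization framework, which reads the entities declared above -->
<!ENTITY % xhtml-framework.mod SYSTEM "xhtml-framework-1.mod" >
%xhtml-framework.mod;

<!-- XHTML modules to include; only the text module is shown -->
<!ENTITY % xhtml-text.mod SYSTEM "xhtml-text-1.mod" >
%xhtml-text.mod;

<!-- Declaration module for the extension -->
<!ENTITY % unfaq.mod SYSTEM "unfaq-1.mod" >
%unfaq.mod;
```

The order matters: the entities that the framework consults have to be declared before the framework module is referenced, and the declaration module comes after it because it depends on entities the framework defines.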

Conclusions and Other Conceits

Even though I have talked about XHTML in terms of some of the early marketing promises of XML, as well as the needs and expectations of content creators, there are many other contexts within which XHTML is valuable, including small and nonstandard device profiles, as well as the maintenance needs of moving the Web from classic HTML as lingua franca to a still uncertain future. In this article I have tried to give a very brief overview of what it's like to employ the XHTML modularization framework in an extension-superset direction. While I have left many details undiscussed, I think it should be clear that extending XHTML requires a good working knowledge of the mechanics of XML DTDs. A growing number of very detailed tutorials are available from many places on the Web.

Resources

• Shane McCarron, How to create XHTML Family modules and markup languages for fun and profit

• Nicholas Chase, Modularization of XHTML

• W3C HTML Working Group, Modularization of XHTML

• Zvon's XHTML 1.0 Reference

• The XHTML-L mailing list