The Evolution of Web Documents

October 2, 1997

Dan Connolly, Rohit Khare, and Adam Rifkin

The Ascent of XML

Abstract

HTML is the ubiquitous data format for Web pages; most information providers are not even aware that there are other options. But now, with the development of XML, that is about to change. Not only will the choices of data formats become more apparent, but they will become more attractive as well. Although XML succeeds HTML in time, its design is based on SGML, which predates HTML and the Web altogether. SGML was designed to give information managers more flexibility to say what they mean, and XML brings that principle to the Web. Because it allows the development of custom tagsets, we can think of XML as HTML without the "training wheels." In this article, we trace the history and evolution of Web data formats, culminating in XML. We evaluate the relationship of XML, HTML, and SGML, and discuss the impact of XML on the evolution of the Web.

1. World Wide Markup Language

The hypertext markup language is an SGML format.

--Tim Berners-Lee, 1991,
in "About HTML"

The idea that structured documents could be exchanged and manipulated if published in a standard, open format dates back to multiple efforts in the 1960s. In one endeavor, a committee of the Graphic Communications Association (GCA) created GenCode to develop generic typesetting codes for clients who used multiple vendors to typeset a variety of data. GenCode allowed them to maintain an integrated set of archives despite the fact that records were set in multiple types.

In another effort, IBM developed the Generalized Markup Language (GML) to address its large internal publishing problems, which ranged from manuals and press releases to legal contracts and project specifications. GML was designed so the same source files could be processed to produce books, reports, and electronic editions.

GML had a "simple" input syntax for typists, including the <> and </> tags we recognize today. Of course, GML also permitted lots of "cheating." Markup minimization allowed typists to elide obvious tags. Though these documents were easy for humans to type and read, they were not well suited for general purpose processing (that is, for computer applications). In fact, because very few document types were required at the time, people wrote special compilers, bound to each particular kind of document, to handle inputting appropriate data formats.

As more document types emerged--each requiring specially suited tagsets--so too did the need for a standard way to publish and manipulate each Document Type Definition (DTD). In the early 1980s, representatives of the GenCode and GML communities joined to form the American National Standards Institute (ANSI) committee on Computer Languages for the Processing of Text; their goal was to standardize ways to specify, define, and use markup in documents.

SGML, the Standard Generalized Markup Language, was published as ISO 8879 in 1986 [17]. Developed for defining and using portable document formats, it was designed to be formal enough to allow proofs of document validity, structured enough to handle complex documents, and extensible enough to support management of large information repositories. While SGML might seem a victim of "design by committee" to the casual observer, it was successful in furnishing an interchange language that could be used to manipulate and exchange text documents.

By the late 1980s, SGML had caught on in organizations such as CERN,[1] where in a laboratory in Switzerland, a holiday hacker borrowed the funny-looking idiom for his then-new hypertext application. Indeed, Tim Berners-Lee, inventor of the World Wide Web, picked an assortment of markup tags from a sample SGML DTD used at CERN. In NeXUS, the original Web browser and editor, he used tags, style sheets for typesetting, and one more "killer feature": links.

By the time Mosaic took off in 1993, people were stretching the limits of HTML--using it as a hammer to bang nails everywhere. But HTML is not a hammer--even HTML 4.0, released in July 1997 [22], furnishes only a limited tagset--and no single tagset will suffice for all of the kinds of information on the Web.

Starting in 1992, HTML evolved from a somewhat ad-hoc syntax to a conforming SGML application. This did not happen for free, and it involved some rather ugly compromises. However, it was clearly worth the effort. Not only does it give the specifications a solid foundation; the intent was also that Web tools would implement HTML as a special case of generic SGML and stylesheet support. That way, changes to HTML could be dynamically propagated into the tools by just updating the DTD and stylesheets. This proved to be an idea before its time: the engineering cost was significant and the information providers did not have the necessary experience to take advantage of the extra degrees of freedom.

In 1992, the Web was not ready for a powerful, generic markup language: in its nascent stage, the Web needed one small tagset--suitable for most of its intended documents and simple enough for the authoring community to understand. That small tagset is HTML.

Basing HTML on SGML was the first step in bringing the SGML community to the World Wide Web: at that point, forward-looking companies began to shift their agendas to unite SGML with the Web [24]. Using SGML on the Web is risky. Because SGML has lots of optional features, the sender and receiver have to agree on some set of options. The engineering costs are compounded because the SGML specification does not follow accepted computer-science conventions for the description of languages [18]. For implementers, the specification is hard to read and contains many costly special cases.

The stage is set for XML, the Extensible Markup Language [10], which addresses the engineering complexity of SGML and the limitations of the fixed tag set in HTML.

2. Community-Wide Markup Languages

"When I use a word," Humpty Dumpty said, in a rather scornful tone, "it means just what I choose it to mean--neither more nor less."

--Lewis Carroll,
Through the Looking Glass

For any document to communicate successfully from author to readers, all parties concerned must agree on what the words they choose mean. Semantics can only be interpreted within the context of a community. For example, millions of HTML users worldwide agree that <B> means bold text, or that <H1> is a prominent top-level document heading. The same cannot be said, though, for the date 8-7-97, which reflects local culture. Or for <font FACE=Arial>, which is only usable by Microsoft Windows systems. The larger the community, the weaker the shared context; the smaller and more focused the community, the stronger the shared context becomes.

HTML is currently the only common tagset Web users can rely upon. Furthermore, HTML cannot be extended unilaterally, since the shared definition is maintained by a central standardization process that publishes new editions like 2.0, 3.2, and 4.0. Since semantics depend on shared agreements between readers and writers about the state of the world, there is a place for community-specific definitions. XML lets communities express their ontologies as Document Type Definitions, which decentralizes the control of specialized markup languages. The emergence of richly annotated data structures catalyzes new applications for storing, sharing, and processing ideas.

2.1 Semantic Markup

Descriptive markup indicates the role or meaning of some part of a document. While <H1> is a generic expression of importance, <WARNING TYPE=Fire_hazard> is a much more specific expression. Calling the former "structure" and the latter "semantics" is indeed a matter of semantics, but it seems clear that the more specific the markup, the more meaningful the document and the less potential for confusion.

An ontology codifies the concepts that are noteworthy to a community so that everyone has a common level of understanding upon which future knowledge exchange can proceed. The reverse phenomenon is equally powerful: mastery of the jargon confers membership in the community. In this sense, community recapitulates ontology--but the tools to express private agreements have been late in coming. Communities are mirrored by ontology: when a large community has to use a single ontology, its value is diluted to the least common denominator (as exemplified by HTML itself).

When communities collide, ontological misunderstandings can develop for several reasons. Sometimes it is a matter of context, like the legal interpretation of "profit" according to national accounting and tax rules. Sometimes it is a matter of perception, like "offensive language" in a Platform for Internet Content Selection (PICS) content-rating [23]. Sometimes it is a matter of alternative jargon: "10BaseT cable" to a programmer is "Category 5 twisted pair" to a lineman. Sometimes it is a matter of intentional conflation, like Hollywood "profit," which refers to both the pile of cash in the studio account and the losses recorded in an actor's residuals.

The best remedy is to codify private ontologies that serve to identify the active context of any document. This is the ideal role for a well-tempered DTD. Consider two newspapers with specific in-house styles for bylines, captions, company names, and so on. Where they share stories on a wire service, for example, they can identify it as their story, or convert it according to an industry-wide stylebook. As competing DTDs are shared among the community, semantics are clarified by acclamation [15]. Furthermore, as DTDs themselves are woven into the Web, they can be discovered dynamically, further accelerating the evolution of community ontologies.

2.2 Generating New Markup Languages

XML was designed to provide an easy-to-write, easy-to-interpret, and easy-to-implement subset of SGML. It was not designed to provide a "one Markup Language fits all" DTD, or a separate DTD for every tag. It was designed so that certain groups could create their own particular markup languages that meet their needs more quickly, efficiently, and (IMO) logically. It was designed to put an end once and for all to the tag-soup wars propagated by Microsoft and Netscape.

--Jim Cape, in a post to
comp.infosystems.www.authoring.html on June 3, 1997

As the Web evolved, people and companies indeed found themselves extending the HTML tagset to perform special tasks. A rich marketplace of server-side-includes and macro-preprocessing extensions to HTML demonstrates that users understand the benefit of using local markup conventions to automate their in-house information management practices. And the cost of "dumbing down" to HTML is becoming more apparent as more organizations go beyond information dissemination to information exchange.

The fundamental problem is that HTML is not unilaterally extensible. A new tag potentially has ambiguous grammar (is it an empty element or does it need an end-tag?), ambiguous semantics (no metadata about the ontology it is based on), and ambiguous presentation (especially without stylesheet hooks). Instead, investing in SGML offers three critical features:

Extensibility

Authors can define new elements, containers, and attribute names at will.

Structure

A DTD can constrain the information model of a document. For example, a Chapter might require a Title element, an Author list, and one or more Paragraphs.

Validation

Every document can be checked for validity, that is, for conformance to the structure mandated by its DTD. Furthermore, even without a DTD, a document can be checked for well-formedness.

XML is a simplified (but strict) subset of SGML that maintains the SGML features for extensibility, structure, and validation. XML is a standardized text format designed specifically for transmitting structured data to Web applications. Since XML aims to be easier to learn, use, and implement than full SGML, it will have clear benefits for World Wide Web users. XML makes it easier to define and validate document types, to author and manage SGML-compliant documents, and to transmit and share them across the Web. Its specification is less than a tenth of the size of SGML86's. XML is, in short, a worthy successor in the evolutionary sense.

The "well-formed" versus "valid" distinction is an important one. Since one can always extract and reflect the document structure from the document itself without its DTD, DTD-less documents are already self-describing containers. A DTD simply provides a tool for deciding whether the structure implicit in the body of the document matches the explicit structure (known in the vernacular as "validity"). This relationship is closely analogous to the interface/implementation separation in components; in the XML model, the DTD is the interface and the body is the implementation. We discuss the implications in Section 3.1.
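The point that structure can be recovered without a DTD can be illustrated with a minimal Python sketch. It assumes only the standard library's xml.etree.ElementTree, which checks well-formedness but does not validate against a DTD. The helper names is_well_formed and outline, and the sample markup, are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def outline(text):
    """Recover the element structure from the document body alone."""
    def walk(elem, depth):
        yield "  " * depth + elem.tag
        for child in elem:
            yield from walk(child, depth + 1)
    return list(walk(ET.fromstring(text), 0))

doc = "<report><title>My Document</title><section><p>Text</p></section></report>"
print(is_well_formed(doc))                    # True
print(is_well_formed("<p>Some <b>text</p>"))  # False: <b> is never closed
print("\n".join(outline(doc)))
```

Deciding whether that recovered structure is the one a community agreed on is the separate job of validation against a DTD.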

The working draft for XML 1.0 provides a complete specification in several parts: the extensible markup language itself [7], methods for associating hypertext linking [8], and forthcoming stylesheet mechanisms for use with XML. From the XML specification, we observe that expressive power, teachability, and ease of implementation were all major design considerations. And although XML is not backward-compatible with existing HTML documents, we note that documents that are HTML 4.0-compliant can easily be converted to XML.

In addition to modifying the syntax and semantics of document tag annotations, XML also changes our linking model by allowing authors to specify different types of document relationships: new linking technology allows the management of bidirectional and multiway links, as well as links to a span of text (within the same or other documents), as a supplement to the single point linking afforded by HTML's existing HREF-style anchors.

2.3 Leveraging Community-Wide Markup

If we accept that community-specific DTDs can represent an ontology and that XML makes it cost-effective to deploy them, then XML-formatted data has the potential to catalyze new applications for capturing, distributing, and processing knowledge [19].

Two communities using XML to capture field-specific knowledge have already chalked up early victories: the Chemical Markup Language (CML) [21] and the Mathematical Markup Language (MathML) [16]. Storing and distributing information in XML databases in conjunction with Extensible Linking Language (XLL) can ease data import and data export problems, facilitate aggregation from multiple sources (data warehousing), and enable interactive access to large corpuses.

Web Automation promises the most dramatic leverage, though. Tools like webMethods' Web Interface Definition Language (WIDL) [2] bridge the gap between legacy Web data and structured XML data. WIDL encourages the extraction of information from unstructured data (such as HTML tables and forms) to produce more structured, meaningful XML reports; furthermore, employing WIDL one can synthesize information already stored as structured data into new reports using custom programming, linking, and automated information extrapolation. Manipulating XML-formatted data leverages a cleaner, more rigorous object model for accessing entities within a document, when compared with the Document Object Model's references to windows, frames, history lists, and formats [6].

3. On the Coevolution of HTML and XML

We will now discuss in a little more detail the Struggle for Existence.

--Charles Darwin,
On the Origin of Species

Now that we have compared the values of HTML for global markup needs and XML for community-specific markup, let's see how all this pans out in practice. How will HTML adapt to the presence of XML?

It will not be an either-or choice between HTML and XML; you do not have to plan for a Flag Day when your shop stops using HTML and starts using XML. Instead, as HTML tools evolve to support the whole range of XML, your choices will expand with them. Just as the value to information providers is becoming evident, the cost of generic markup is going down because XML is considerably simpler than SGML. In addition, the complementary piece of infrastructure, stylesheets, is finally being deployed.

If a browser (or editor or other tool) supports stylesheets [12], support for individual tags does not have to be hardcoded. If you decide to add <part-number> tags to your documents and specify in a stylesheet that part-numbers should display in bold, a browser with stylesheet support can follow those directions. But, while it's clear how to add a <part-number> tag to XML, what about HTML?
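To make the stylesheet idea concrete, here is a minimal Python sketch of a renderer that knows nothing about individual tags and takes its display rules from a separate table. The STYLESHEET mapping, the render function, the sample part number, and the asterisk convention for bold are all assumptions made for illustration; real stylesheet languages are far richer.

```python
import xml.etree.ElementTree as ET

# Hypothetical stylesheet: element name -> display rule.
STYLESHEET = {"part-number": "bold"}

def render(elem, stylesheet):
    """Render an element tree as plain text, marking bold spans with asterisks."""
    text = elem.text or ""
    for child in elem:
        # Each child's own content, then the text that follows its end-tag.
        text += render(child, stylesheet) + (child.tail or "")
    if stylesheet.get(elem.tag) == "bold":
        return "*" + text + "*"
    return text

doc = ET.fromstring("<p>Order <part-number>A-113</part-number> today.</p>")
print(render(doc, STYLESHEET))  # Order *A-113* today.
```

Changing how part numbers display means editing the table, not the renderer, which is the property the paragraph above describes.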

3.1 Platforms and Borders: Well-Formed HTML

HTML is built on the platform of SGML. The borders of HTML were originally set by the IETF in 1995, subsequently expanded by W3C in 1996 with HTML 3.2, and again in 1997 with HTML 4.0. But so far, the borders of HTML fit within the borders of SGML.

However, only part of the ground inside the SGML borders is fertile--the XML part. The rest is too expensive to maintain. Although some of HTML is sitting on that infertile ground, it should be a simple task to move a document from that crufty ground to the arable XML territory using the following rules:

Match every start-tag with an end-tag.
Replace > by /> at the end of empty tags.
Quote all your attribute values.
Use only lowercase for tag names and attribute names.

By the same token, it should be a simple task to move the HTML specification onto the XML platform. Let's look at those steps a bit more closely.

Consider the following:

 <p> Some Text <my-markup> More Text.

Is More Text inside the my-markup element or not? The SGML answer is: you have to look in the DTD to see whether my-markup is an empty element, or whether it can have content. The XML answer is: don't do that. Make it explicit, one way or another:

 
        <p> Some Text <my-empty-thing/> More Text.</p> 
        <p> Some Text <my-container> More Text. </my-container> </p>

Hence, rule one: match every start-tag with an end-tag. That's right, every p, li, dd, dt, tr, and td start-tag needs a matching end-tag. If you are still using a text editor to write HTML, you can take this as a hint to start looking at direct manipulation authoring tools, or at least text editors with syntax support for this sort of thing.

Rule two says that br, hr, and img elements turn into:

 
          <p> a line break: <br />, a horizontal rule: <hr />, 
          and an image: <img src="foo"/> </p>

Rule three takes the guesswork out of attribute value syntax. In HTML, quotation is required only in some cases, but it is always allowed. In XML, it is simply required.

Rules one through three only predict the evolution of the specifications, and rule four is especially uncertain. It may turn out that "use only lowercase" is changed to "use only uppercase"--it depends on how HTML adapts to the rules about case sensitivity in XML. XML tag names and attribute names can use a wide variety of Unicode characters, and matching uppercase and lowercase versions of these characters is not as simple as it is in ASCII. As of this writing, the Working Group has decided to punt on the issue, so that names compare by exact match only.
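Rules two through four are mechanical enough to sketch in code. The following Python example, which assumes only the standard library's html.parser module, re-emits HTML tags with lowercase names, quoted attribute values, and "/>" on the empty elements named above. The class and function names are hypothetical. The sketch deliberately does not attempt rule one: it will not supply end-tags that the author left out, and it silently drops comments and declarations.

```python
from html import escape
from html.parser import HTMLParser

EMPTY = {"br", "hr", "img"}  # the empty elements named in the text

class XmlSerializer(HTMLParser):
    """Re-emit HTML tags in XML-friendly form (rules two through four only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lowercased tag and attribute names (rule four).
        parts = [tag]
        for name, value in attrs:
            value = name if value is None else value  # e.g. a bare "compact"
            parts.append('%s="%s"' % (name, escape(value, quote=True)))  # rule three
        closer = " />" if tag in EMPTY else ">"  # rule two
        self.out.append("<" + " ".join(parts) + closer)

    def handle_endtag(self, tag):
        if tag not in EMPTY:  # empty elements were already closed with "/>"
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

def to_xml_syntax(html_text):
    parser = XmlSerializer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)

print(to_xml_syntax('<P ALIGN=center>A break:<BR>and an image: <IMG SRC=foo></P>'))
# <p align="center">A break:<br />and an image: <img src="foo" /></p>
```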

3.2 Licensed to Tag

According to the official rules, extending HTML is the exclusive privilege of the central authorities. But everybody's doing it in various underground ways: they use server-side includes, preprocessing extensions with <if> <then> <else> tags, and so on. Even Robin Cover, maintainer of the most comprehensive SGML bibliography on the Web, admits in [11]:

An experimental approach is being used in markup--exploiting the behavior of HTML browsers whereby unrecognized tags are simply ignored. If the non-HTML tags are causing problems in your browser, please let me know.

Once HTML and XML align, there will be legitimate alternatives to all these underground practices. You can add your <part-number> and <abstract> tags with confidence, knowing that your markup will be supported.

In fact, you have two choices regarding the level of confidence: you can make well-formed documents just by making sure your tags are balanced, there are no missing quotes, etc. A lot of tools check only at this level.

On the other hand, if you want more support than that from your tools, you will have to keep your documents valid: you will have to remember to put a <title> in every document, an alt attribute on every <img> element, and so on. In that case, adding tags to a document also requires creating a modified DTD.

For example, you might write:

 
        <?xml version="1.0"?> 
        <!DOCTYPE report SYSTEM "html-report.dtd"> 
        <report><title>My Document</title>
          <abstract><p>...</p></abstract> 
          <section><h1>Introduction</h1> 
            ... 
          </section>
        </report>

where html-report.dtd contains:

        <!ENTITY % html4 SYSTEM 
        "http://www.w3.org/TR/WD-html40-970917/sgml/HTML4-strict.dtd">
        %html4; 
        <!ELEMENT report (title, abstract, section*)> 
        <!ELEMENT abstract (p*)> 
        <!ELEMENT section (h1, (%block;)*)>

Then you can validate that the document is not just any old HTML document, but it has a specific technical report structure for consistency with the other technical reports at your site. And you can use stylesheet-based typesetting tools to create professional looking PostScript or Portable Document Format (PDF) renditions.
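A validating parser would enforce the report content model directly from the DTD. As a rough illustration of what that check amounts to for the top-level rule (title, abstract, section*), the following Python sketch hard-codes the rule as a regular expression over the names of the child elements. The function name and the sample documents are hypothetical, and the code is a stand-in for DTD validation, not an implementation of it.

```python
import re
import xml.etree.ElementTree as ET

# Hand-coded equivalent of the content model (title, abstract, section*).
REPORT_MODEL = re.compile(r"title abstract( section)*")

def report_structure_ok(text):
    """Check that a <report> has exactly the children the content model allows."""
    root = ET.fromstring(text)
    if root.tag != "report":
        return False
    children = " ".join(child.tag for child in root)
    return REPORT_MODEL.fullmatch(children) is not None

good = ("<report><title>My Document</title><abstract><p>...</p></abstract>"
        "<section><h1>Introduction</h1></section></report>")
bad = "<report><abstract><p>...</p></abstract></report>"  # no title

print(report_structure_ok(good), report_structure_ok(bad))  # True False
```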

3.3 Mix and Match, Cut and Paste

Not everyone who wants something different from standard HTML has to write his or her own DTD. Perhaps, in the best of Internet and Web tradition, you can leverage someone else's work. Perhaps you would like to mix elements of HTML with elements of DocBook [3] or a Dublin Core [9] DTD.

Unfortunately, achieving this mixture with DTDs is very awkward. Yet the ability to combine independently developed resources is an essential survival property of technology in a distributed information system. The ability to combine filters and pipes and scripts has kept UNIX alive and kicking long past its expected demise. In his keynote address at Seybold San Francisco [5], Tim Berners-Lee called this powerful notion "intercreativity."

Combining DTDs that were developed independently exposes limitations in the design of SGML for things like namespaces, subclassing, and modularity and reuse in general.

There is a great tension between the need for intercreativity and these limitations in SGML DTDs. One strategy under discussion is to introduce qualified names a la Modula, C++, or Java into XML. For example, you might want to enrich your home page by the use of an established set of business card element types. This strategy suggests markup like this:

 
        <xml::namespace href="http://bcard.org/9801"
          as="bcard" />
        <html> <head><title>Dan's Home Page and Business Card</title>
        </head> 
          <body> 
            <bcard::card>
              <h1><bcard::name>Dan Connolly
              </bcard::name><br /> 
                <bcard::title>Grand 
                  Poobah</bcard::title></h1>
              <p>Phone: <bcard::phone>555-1212</bcard::phone></p> 
            </bcard::card> 
            <p>...</p> 
          </body> 
        </html>

This markup is perfectly well-formed, but the strategy does not address DTD validation. Another strategy for mixing and matching elements is to use SGML Architectures [20]. Or perhaps a more radical course of research is needed to rethink the connection between tag names, element names, and element types [4].
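The following Python sketch shows how a tool might resolve qualified names of this kind. It assumes the xmlns:prefix attribute and single-colon form understood by the standard library's xml.etree.ElementTree, rather than the draft xml::namespace notation in the example above, so the sample document is an adaptation of that example and not a copy of it.

```python
import xml.etree.ElementTree as ET

BCARD = "http://bcard.org/9801"  # namespace URL from the example above

doc = """<html xmlns:bcard="http://bcard.org/9801">
  <body>
    <bcard:card>
      <h1><bcard:name>Dan Connolly</bcard:name></h1>
      <p>Phone: <bcard:phone>555-1212</bcard:phone></p>
    </bcard:card>
  </body>
</html>"""

root = ET.fromstring(doc)
# ElementTree expands each prefix to "{namespace-URL}local-name",
# so business-card elements cannot collide with HTML element names.
card = root.find(".//{%s}card" % BCARD)
name = card.find(".//{%s}name" % BCARD).text
phone = card.find(".//{%s}phone" % BCARD).text
print(name, phone)  # Dan Connolly 555-1212
```

As with the markup above, this only establishes well-formedness and unambiguous names; it says nothing about validating the mixed document against a DTD.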

3.4 The Future Standardization of XML

The language designer should be familiar with many alternative features designed by others, and should have excellent judgment in choosing the best and rejecting any that are mutually inconsistent . . . One thing he should not do is to include untried ideas of his own. His task is consolidation, not innovation.

--C.A.R. Hoare

If it seems that XML is moving very fast, look again. The community is moving very fast to exploit XML, but the momentum against changes to XML itself is tremendous. XML is not a collection of new ideas; it is a selection of tried-and-true ideas. These ideas are implemented in a host of conforming SGML systems, and employed in truly massive SGML document repositories. Changes to a technology with this many dependencies are evaluated with utmost care.

XML is essentially just SGML with many of the obscure features thrown out (Appendix A of the specification lists SUBDOC, RANK, and quite a few others). The result is much easier to describe, understand, and implement, despite the fact that every document that conforms to the XML specification also conforms to the SGML specification.

Almost.

In a few cases, the design of SGML has rules that would be difficult to explain in the XML specification. And they prohibit idioms that are quite useful, such as multiple <!ATTLIST ...> declarations for the same element type. In these cases, the XML designers have participated in the ongoing ISO revision of SGML. The result is the WebSGML Technical Corrigendum [14]--a sort of "patch" to the SGML standard.

Every document that conforms to the XML specification does indeed conform to SGML-as-corrected, and the W3C XML Working Group and the ISO Working Group have agreed to keep that constraint in place.

So the wiggle-room in the XML specification is actually quite small. The W3C XML Working Group is considering a few remaining issues, and they release drafts for public review every month or so. The next step in the W3C process, after the Working Group has addressed all the issues they can find, is for the W3C Director to issue the specification as a Proposed Recommendation and call for votes from the W3C membership. Based on the outcome of the votes, the Director will then decide whether the document should become a W3C Recommendation, go back to the Working Group for further review, or be canceled altogether.

Outside the core XML specification, there is much more working room. The XLL specification [8] is maturing, but there are still quite a few outstanding issues. And, work on the eXtensible Stylesheet Language (XSL) is just beginning.

4. The Ascent of XML in the Evolution of Knowledge from Information

Node content must be left free to evolve.

--Tim Berners-Lee, 1991, in "About Document Formats," http://www.w3.org/DesignIssues/Formats.html

The World Wide Web Consortium, the driving force behind XML, sees its mission as leading the evolution of the Web. In the competitive market of Internet technologies, it is instructive to consider how the Web trounced competing species of protocols. Though it shared several adaptations common to Internet protocols, such as "free software spreads faster," "ASCII systems spread faster than binary ones," and "bad protocols imitate; great protocols steal," it leveraged one unique strategy: self-description. The Web can be built upon itself. Universal Resource Identifiers (URIs), machine-readable data formats, and machine-readable specifications can be knit together into an extensible system that assimilates any competitors. In essence, the emergence of XML on the spectrum of Web data formats caps the struggle toward realizing the original vision of the Web by its creators.

The designers of the Web knew that it must adapt to new data formats, so they appropriated the MIME Content Type system. On the other hand, some types were more equal than others: the Web prefers HTML over PDF, Microsoft Word, and myriad others, because of a general trend over the last seven years of Web history from stylistic formatting to structural markup to semantic markup. Each step up in the Ascent of Formats adds momentum to Web applications, from PostScript (opaque, operational, formatting); to troff (readable, operational, formatting); to Rich Text Format (RTF) (readable, extensible, formatting); to HTML (readable, declarative, limited descriptive semantics like <ADDRESS>); now to XML; and on to intelligent metadata formats such as PICS labels.

The Web itself is becoming a kind of cyborg intelligence: human and machine, harnessed together to generate and manipulate information. If automatability is to be a human right, then machine assistance must eliminate the drudge work involved in exchanging and manipulating knowledge, as indicated by MIT Laboratory for Computer Science Director Michael Dertouzos [13]. As Douglas Adams described [1], the shift from structural HTML markup to semantic XML markup is a critical phase in the struggle to transform information space into a universal knowledge network.

Acknowledgments

This paper is based on our experience over several years of working with the Web community. Particular plaudits go to our colleagues at the World Wide Web Consortium, including Tim Berners-Lee; the teams at MCI Internet Architecture and Caltech Infospheres; and the group at webMethods, especially Charles Allen.

References

[1] Adams, Douglas. The Hitchhiker's Guide to the Galaxy, Ballantine Books, 1979.
[2] Allen, Charles. "Automating the Web with WIDL," World Wide Web Journal 2: 4, Autumn 1997. Available at http://www.webmethods.com/technology/widl.html
[3] Allen, Terry, and Eve Maler. DocBook Version 3.0 Maintainer's Guide, O'Reilly and Associates, 1997. Available at http://www.oreilly.com/davenport/
[4] Akpotsui, E., V. Quint, and C. Roisin. "Type Modelling for Document Transformation in Structured Editing Systems," Mathematical and Computer Modelling 25: 4, 1997, pp. 1-19.
[5] Berners-Lee, Tim. Keynote Address, Seybold San Francisco, February 1996. Available at http://www.w3.org/Talks/9602seybold/slide6.htm
[6] Bosak, Jon. "XML, Java, and the Future of the Web," 1997. Available at http://sunsite.unc.edu/pub/sun-info/standards/xml/why/xmlapps.htm
[7] Bray, Tim, Jean Paoli, and C.M. Sperberg-McQueen. "Extensible Markup Language (XML): Part I Syntax," World Wide Web Consortium Working Draft (Work in Progress), August 1997. Available at http://www.w3.org/TR/WD-xml-lang.html
[8] Bray, Tim, and Steve DeRose. "Extensible Markup Language (XML): Part II. Linking," World Wide Web Consortium Working Draft (Work in Progress), July 1997. Available at http://www.w3.org/TR/WD-xml-link.html
[9] Burnard, Lou, Eric Miller, Liam Quin, and C.M. Sperberg-McQueen. "A Syntax for Dublin Core Metadata: Recommendations from the Second Metadata Workshop," 1996. Available at http://www.uic.edu/~cmsmcq/tech/metadata.syntax.html
[10] Connolly, Dan, and Jon Bosak. "Extensible Markup Language (XML)," W3C Activity Group page, 1997. Available at http://www.w3.org/XML/
[11] Cover, Robin. "SGML Page: Caveats, Work in Progress," 1997. Available at http://www.sil.org/sgml/caveats.html
[12] Culshaw, Stuart, Michael Leventhal, and Murray Maloney. "XML and CSS," World Wide Web Journal 2: 4, Autumn 1997. Available at http://shoal.w3.org/w3j-xml/cssxml/grifcss.htm
[13] Dertouzos, Michael. What Will Be, HarperEdge, 1997.
[14] Goldfarb, Charles F. Proposed TC for WebSGML Adaptations for SGML, ISO/IEC JTC1/SC18/WG8, WG8 Approved Text, September 1997. Available at http://www.sgmlsource.com/8879rev/n1929.htm
[15] Hale, Constance. Wired Style: Principles of English Usage in the Digital Age, Hardwired Publishing, June 1997.
[16] Ion, Patrick, and Robert Miner. "Mathematical Markup Language," W3C Working Draft, May 1997. Available at http://www.w3.org/pub/WWW/TR/WD-math
[17] ISO 8879:1986. Information Processing--Text and Office Systems--Standard Generalized Markup Language (SGML), 1986.
[18] Kaelbling, Michael J. "On Improving SGML," Electronic Publishing: Origination, Dissemination and Design (EPODD) 3: 2, May 1990, pp. 93-98.
[19] Khare, Rohit, and Adam Rifkin. "XML: A Door to Automated Web Applications," IEEE Internet Computing 1: 4, July/August 1997, pp. 78-87. Available at http://www.cs.caltech.edu/~adam/papers/xml/x-marks-the-spot.html
[20] Kimber, W. Eliot, and ISOGEN International Corp. "An Approach to Literate Programming With SGML Architectures," July 1997. Available at http://www.isogen.com/papers/litprogarch/litprogarch.html
[21] Murray-Rust, Peter. "Chemical Markup Language (CML)," Version 1.0, January 1997. Available at http://www.venus.co.uk/omf/cml/
[22] Raggett, Dave, Arnaud Le Hors, and Ian Jacobs. "HTML 4.0 Specification," World Wide Web Consortium Working Draft (Work in Progress), September 1997. Available at http://www.w3.org/TR/WD-html40/
[23] Resnick, Paul, and Jim Miller. "PICS: Internet Access Controls without Censorship," Communications of the ACM, Volume 39, 1996, pp. 87-93. Available at http://www.w3.org/pub/WWW/PICS/iacwcv2.htm
[24] Rubinsky, Yuri, and Murray Maloney. SGML and the Web: Small Steps Beyond HTML, Charles F. Goldfarb Series on Open Information Management, Prentice Hall, 1997.

About the Authors

Dan Connolly

connolly@w3.org

Dan Connolly is the leader of the W3C Architecture Domain. His work on formal systems, computational linguistics, and the development of open, distributed hypermedia systems began at the University of Texas at Austin, where he received a B.S. in Computer Science in 1990. While developing hypertext production and delivery software in 1992, he began contributing to the World Wide Web project, and in particular, the HTML specification. He presented a draft at the First International World Wide Web Conference in 1994 in Geneva, and edited the draft until it was published as the HTML 2.0 specification, Internet RFC1866, in November 1995. Today he is the chair of the W3C HTML Working Group and a member of the W3C XML Working Group. His research interest is the value of formal descriptions of chaotic systems like the Web, especially in the consensus-building process.

Rohit Khare

khare@alumni.caltech.edu

Rohit Khare is a member of the MCI Internet Architecture staff in Boston, MA. He was previously on the technical staff of the World Wide Web Consortium at MIT, where he focused on security and electronic commerce issues. He has been involved in the development of cryptographic software tools and Web-related standards development. Rohit received a B.S. in Engineering and Applied Science and in Economics from California Institute of Technology in 1995. He will enter the Ph.D. program in Computer Science at the University of California, Irvine in Fall 1997.

Adam Rifkin

adam@cs.caltech.edu

Adam Rifkin received his B.S. and M.S. in Computer Science from the College of William and Mary. He is presently pursuing a Ph.D. in computer science at the California Institute of Technology, where he works with the Caltech Infospheres Project on the composition of distributed active objects. His efforts with Infospheres have won best paper awards both at the Fifth IEEE International Symposium on High Performance Distributed Computing in August 1996, and at the Thirtieth Hawaii International Conference on System Sciences in January 1997. He has done Internet consulting and performed research with several organizations, including Canon, Hewlett-Packard, Reprise Records, Griffiss Air Force Base, and the NASA-Langley Research Center.