Building the Annotated XML Specification

September 12, 1998

Tim Bray

The design of XML 1.0 stretched over 20 months ending in February 1998, with input from a couple of hundred of the world's best experts in the areas of markup, publishing, and Web design. The result of that work, the XML 1.0 Specification, is a highly condensed document that contains little or no information about how it came to read the way it does.

Even before the release of XML 1.0, it became obvious that some parts of the spec were self-explanatory, while others were causing headaches for its users.

The Annotated XML Specification addresses both of these problems. It supplements the basic specification, first with historical background and explanation of how things came to be the way they are, and second with detailed explanations of the portions of the spec that have proved difficult. Commercially, it has been a success; in its first month on the Web, it had over 100,000 page views from over 26,000 unique Internet addresses. It remains, by a substantial margin, the most popular item available at the XML.com site.

This article explains how I created the Annotated XML Specification. If you haven't looked at it, you might want to give it a glance before reading about it, or even better, open it in another browser window while you read about it here.

How the Annotated XML Specification Works

The architecture of the system, illustrated below, is simple enough:

Graphic representation of the document/creation structure.

The XML 1.0 specification is accessed in read-only mode. For convenience, I keep a local copy in a file called xml.xml. All the annotations live in another single file (about 25% larger than the XML specification itself) named notes.xml. A Java program, based on my Lark processor, reads both xml.xml and notes.xml and builds in-memory tree-structured representations of both. After processing all the links in notes.xml, the program writes out a file called target.html, which is the annotated version of the spec, and a large number of small files, each containing one of the annotations. The rest of this article outlines what's in those files and what the program does.

Design Choices for the Annotation

The first question I had to face in constructing the annotation was whether or not to use the highly-unfinished XLink and XPointer technologies, currently under development in the World Wide Web Consortium (W3C) XML Activity. XLink and XPointer have two large advantages: they are built for XML and can point at arbitrary locations inside a document.

On the other hand, neither spec is nearly finished, and they have changed (in syntax, if not at a conceptual level) from draft to draft. Furthermore, there were (in early 1998) neither commercial nor freeware implementations available. So if I were going to use this technology, I was going to have to write all the software myself. I decided to go ahead with XLink and XPointer, and while the syntax described here is about a year behind the latest drafts, I believe that what I've done is conceptually in tune with the current thinking, so the syntax will be easy to upgrade once the spec settles down.

XLink: A Quick Review

The XLink spec is concerned with recognizing which elements are being used as links, and with giving those links some useful structure and properties. HTML has a very simple solution to this set of problems: all linking elements have to be named A, and they have to point at one "resource", and the resource's address is found in the HREF= attribute.

<a href="resource_address">

XLink tries to be more general than HTML: any element can serve as a link, and is identified as such by using a magic reserved attribute, xml:link=. There are two kinds of XLinks, identified by the values xml:link="simple" and xml:link="extended". In the annotation, I only used the extended flavor, so I won't discuss simple XLinks here. Extended XLinks can have a bunch of useful associated information:

in-line
If a linking element is "in-line", the element itself is one of the ends of the link. All HTML links are in-line, as are all the links in the Annotated Spec. Out-of-line links are really the Wild Blue Yonder of hypertext theory and practice.
labels
XLinks can have labels, both machine-readable (provided in the role= attribute) and human-readable (in the title= attribute).
behavior
XLink includes some tools for controlling the behavior of a link when it's being followed. In HTML, links really have only one behavior; you are at one page, you follow a link, then you're somewhere else. Since I had to use the Web as it is today to deliver the annotations, I wasn't able to get fancy with behaviors.

The x Element

In the Annotated Spec, I used an element named x as the chief linking element. If you were to dress it up with all the necessary attributes, one of these elements would look like this:

<x
  xml:link="extended"
  inline="true"
  content-role="commentary"
  content-title="Annotation" >
  ... contents of the linking element go here ...
</x>

If every one of the 312 annotations had to carry around all those attributes, the Annotated Spec would be hard to write and to work with. Fortunately, XML has attribute defaulting, so I could provide all these attributes just once, in the document header:


<!DOCTYPE Annotations [
 <!ELEMENT x (here|spec)+>
 <!ATTLIST x
   xml:link       CDATA  #FIXED "extended"
   inline         CDATA  #FIXED "true"
   content-role   CDATA  #FIXED "commentary"
   content-title  CDATA  #FIXED "Annotation"
   id             ID     #REQUIRED> ]>
<Annotations>
<x id="first-link-id"> ... content of first link ... </x>
<x id="second-link-id"> ... content of second link ... </x>
...
</Annotations>
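
Attribute defaulting can be observed with any XML parser that reads the internal DTD subset. The following is a minimal sketch, assuming the DOM parser bundled with the JDK rather than Lark; the class name AttributeDefaults and the cut-down DTD are illustrative and are not taken from the Annotated Spec's own files.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeDefaults {
    // A cut-down annotation file (hypothetical): the DTD supplies the link attributes once.
    static final String XML =
        "<!DOCTYPE Annotations [\n"
      + " <!ELEMENT Annotations (x)+>\n"
      + " <!ELEMENT x (#PCDATA)>\n"
      + " <!ATTLIST x\n"
      + "   xml:link     CDATA #FIXED 'extended'\n"
      + "   inline       CDATA #FIXED 'true'\n"
      + "   content-role CDATA #FIXED 'commentary'\n"
      + "   id           ID    #REQUIRED> ]>\n"
      + "<Annotations><x id='first-link-id'>text</x></Annotations>";

    // Parse the document and return its first x element.
    static Element firstLink() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return (Element) doc.getElementsByTagName("x").item(0);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element x = firstLink();
        // None of these link attributes appear on the element in the instance;
        // the parser fills them in from the ATTLIST declaration.
        System.out.println(x.getAttribute("xml:link"));
        System.out.println(x.getAttribute("content-role"));
        System.out.println(x.getAttribute("id"));
    }
}
```

The x element in the instance carries only an id, yet the program sees the fixed values as if they had been typed on every element.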

The real role of the x element is to hold one here element and a bunch of spec elements. The here element actually contains the text of the annotation, while each of the spec elements points at a location in the XML spec where the annotation applies. Most of the annotations apply to only one location, but there are a few that attach to many places. For example, there is an annotation saying that DTD keywords (such as DOCTYPE, SYSTEM, ELEMENT, and ATTLIST) must be in upper-case; this annotation is attached to the first definition of each of these keywords.

The here Element

This element is what XLink calls a "Locator" - it serves as one end of the extended link, and contains the annotation. It has a lot of attributes, most of which are defaulted and don't actually appear in the body of the document. Here's the declaration, with all those attributes:


 <!ELEMENT here ANY>
 <!ATTLIST here
  xml:link CDATA #FIXED "locator"
  actuate  CDATA #FIXED "auto"
  show     CDATA #FIXED "replace"
  role     CDATA #FIXED "annotation"
  title    CDATA #REQUIRED
  href     CDATA #FIXED "here()"
  index    CDATA #IMPLIED>

Here's what all those attributes mean:

xml:link="locator"
tells the processing program that this is a locator
actuate="auto"
specifies that the link should be processed as soon as it's found - this differs from the Web browser behavior of just displaying the link (as underlined blue text) then waiting for the user to activate it.
show
tells the processor that the result of following this link should replace the target of the previous one that was followed.
role
tells the processor that this here element contains the annotation, not a pointer into the spec.
title
provides a human-readable label for this link - this is required to be present since it is used to generate the title in the Web implementation.
href
points to the annotation, which in the Annotated Spec is just the content of this element.
index
not part of the XLink apparatus - used to build an index of all the annotations; if it's not provided, the title value is used.

The content is declared as ANY and contains text marked up with HTML tags. Since this annotation was designed for Web delivery, and this content was designed to be read by humans, I felt that HTML was adequate to meet my formatting needs. HTML also had the advantage that I didn't have to write code to convert it for delivery. So far, I've found HTML perfectly satisfactory for this particular application. However, I do not draw the conclusion that HTML is going to be the right presentation solution for every, or even most, hypertext applications. Since the annotation is an XML document, the HTML has to be well-formed, to allow processing with XML-processor based tools.

Here's one of the here elements:


<here title='The Document Entity is Special'
 index='Document Entity, Special Status Of'>
<p>The differences between the document entity and
any other external parsed entity are:</p>
<ol><li>The document entity can begin with an
<Sref href='&h;dt-xmldecl'>XML declaration</Sref>,
other external parsed entities with a
<Sref href='&h;NT-TextDecl'>text declaration</Sref>.
</li>
<li>The document entity can contain a
<Sref href='&h;dt-doctype'>document type
declaration</Sref>.</li></ol></here>

Note that there are a few magic non-HTML elements mixed in; in this case, the Sref element, which is used to encode a pointer back into the XML specification. The Annotated Spec also uses Xref (external reference) and Nref (reference to another annotation) elements. The pointer in the href attribute uses an entity reference, &h;, which contains the URL for the XML spec; this is a good idea since there are hundreds of these URLs in the Annotated Spec, and the location I read the XML spec from might change.
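
The entity trick can be sketched as follows, assuming the JDK's built-in DOM parser. The class name EntityBase and the example.org URL are placeholders; the actual replacement text of &h; in the annotation file is not given in this article.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class EntityBase {
    // Parse a tiny document whose DTD declares the base URL once, as entity "h",
    // and return the href of its Sref element after entity expansion.
    static String expandedHref(String baseUrl) {
        String xml =
            "<!DOCTYPE here [ <!ENTITY h '" + baseUrl + "'> ]>\n"
          + "<here><Sref href='&h;dt-xmldecl'>XML declaration</Sref></here>";
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element sref = (Element) doc.getElementsByTagName("Sref").item(0);
            return sref.getAttribute("href");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder URL, not the address used for the real Annotated Spec.
        System.out.println(expandedHref("http://example.org/xml.html#"));
    }
}
```

Moving the spec to a new address then means changing one entity declaration instead of hundreds of href values.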

The spec Element

This is another XLink "Locator", which contains a pointer into the XML spec, indicating what part of the spec the annotation is there to annotate. Here's its declaration:


<!ELEMENT spec EMPTY>
<!ATTLIST spec
  xml:link CDATA #FIXED "locator"
  actuate  CDATA "user"
  show     CDATA "replace"
  role     (Using|History|Tech|Misc|Example) "Misc"
  title    CDATA "Into XML Specification"
  href     CDATA #REQUIRED>

The only really interesting attributes are role, saying which kind of annotation it is, and href, which contains the URL pointing into the spec. The allowed values of role correspond to the (U), (H), (T), (M), and (E) symbols that mark the annotations in the spec.

Here's an example, not just of a spec element, but of a whole x element with its spec and here children:


<x id='RfC1808URI'>
<spec role='Using' href='&s;id(RFC1808)'/>
<here title='RFC 1808 URL'>
<p><Xref href='ftp://ds.internic.net/rfc/rfc1808.txt'>
ftp://ds.internic.net/rfc/rfc1808.txt</Xref></p>
</here></x>

In this example, the target of the reference is easily identified, since it is just a bibliographic entry that has an id attribute.

XPointer: A Quick Review

In an XLink Locator, there is an href= attribute that gives the URI identifying an end of the link. An XPointer is a string of characters that is used after the # "fragment separator" character in that href= value. It points into the XML document by treating it as a tree structure and identifying numbered child and descendant nodes.

The best XPointers are those that are based on "ID Attributes", that is to say attributes that have been declared to have a unique value. These are easy for an XML Processor to find and traverse to, but there is a problem in that you don't know which attributes are so declared unless you are prepared to read the whole DTD. This means, if you read the XML spec carefully, that to be sure, you have to use a validating XML processor. In my case, I was able to get away with using Lark, my non-validating processor, simply by assuming that any attribute whose name was id was an ID attribute.
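
The "any attribute named id is an ID" shortcut amounts to one walk over the tree that fills a lookup table. Here is a minimal sketch, assuming the standard Java DOM API instead of Lark's own tree classes; IdIndex and the sample document are made up for illustration.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class IdIndex {
    // Hypothetical stand-in for a fragment of the spec.
    static final String SAMPLE =
        "<spec><div id='sec-prolog-dtd'><p>text</p></div>"
      + "<bibl id='RFC1808'>RFC 1808</bibl></spec>";

    // Record every element carrying an attribute named "id", without consulting a DTD.
    static void collect(Element e, Map<String, Element> ids) {
        if (e.hasAttribute("id")) {
            ids.put(e.getAttribute("id"), e);
        }
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                collect((Element) n, ids);
            }
        }
    }

    // Parse the text and build the id-to-element table.
    static Map<String, Element> index(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Map<String, Element> ids = new HashMap<>();
            collect(doc.getDocumentElement(), ids);
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, Element> ids = index(SAMPLE);
        System.out.println(ids.get("RFC1808").getTagName());
    }
}
```

The shortcut is only safe when the document really does reserve the name id for ID attributes, which is the assumption stated above.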

XPointer provides quite a few different verbs for selecting objects inside the document tree; in the Annotated Spec I was able to get away with using only the id, descendant, child, and string verbs. Furthermore, I could have got by without using the string operator. This raises the question: if something as complex as the Annotated Spec can be constructed with just these operators, do we really need all the others?

Here are some interesting examples of XPointers from the Annotated Spec; they have been set up to point into the indicated part of the HTML version of the XML spec.


id(NT-Mixed)child(1,rhs)string(1,"ATA",4)
id(sec-xml-wg)descendant(18,name)string(1,"gue",4)
id(sec-guessing)child(8,p)string(1,XML.,4)
id(sec-prolog-dtd)descendant(2,vcnote)
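
Each of these XPointers is a chain of verb(arguments) steps, so splitting one into steps is straightforward. The following is a rough sketch of such a splitter, not the parser used for the Annotated Spec; the class XPointerSteps is hypothetical, and it assumes that arguments never contain commas or parentheses, which holds for the examples above but not for XPointers in general.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XPointerSteps {
    // One step: a verb such as "child" plus its comma-separated arguments.
    static final class Step {
        final String verb;
        final List<String> args;
        Step(String verb, List<String> args) { this.verb = verb; this.args = args; }
        @Override public String toString() { return verb + args; }
    }

    // verb, then a parenthesized argument list with no nested parentheses.
    private static final Pattern STEP = Pattern.compile("([A-Za-z]+)\\(([^()]*)\\)");

    static List<Step> parse(String xpointer) {
        List<Step> steps = new ArrayList<>();
        Matcher m = STEP.matcher(xpointer);
        int end = 0;
        while (m.find()) {
            // Steps must follow one another with nothing in between.
            if (m.start() != end) {
                throw new IllegalArgumentException("Bad XPointer: " + xpointer);
            }
            List<String> args = new ArrayList<>();
            for (String a : m.group(2).split(",")) {
                // Trim whitespace and strip one pair of surrounding double quotes.
                args.add(a.trim().replaceAll("^\"|\"$", ""));
            }
            steps.add(new Step(m.group(1), args));
            end = m.end();
        }
        if (end != xpointer.length()) {
            throw new IllegalArgumentException("Bad XPointer: " + xpointer);
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(parse("id(NT-Mixed)child(1,rhs)string(1,\"ATA\",4)"));
    }
}
```

A real implementation would need a proper tokenizer for quoted strings, but the step list it produces would have the same shape.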

I actually authored each of the 312 XPointers in the Annotated Spec by hand, finding IDs, counting children, and matching strings. At the summer 1998 XML Developers' Day conference, David Megginson showed how I could have programmed GNU Emacs (which is what I use for editing anyhow) to construct these automatically at the touch of a key. The lesson is that any sensible XML editing environment ought to make it easy to construct this kind of hyperlink.

Flipping the Links

Once I had authored all these hyperlinks, I faced the problem of how to display them. Unfortunately, in 1998 there weren't any Web browsers that could do the job, and in fact, I wanted this to be usable by anyone with a basic HTML browser. So my program had to turn the links around; instead of one XML file containing hundreds of links into another, I needed one HTML file (the annotated spec) containing hundreds of links, each to one little annotation file. All this was done in Java, as noted above. The rest of this article gives a step-by-step description of the program, and may be a bit challenging if you're not a Java programmer.

Step 1: Parse the Annotations

First, the program creates an instance of the Lark parser, and in one step, reads the annotations file:


  annot = new Lark();
  r = new XmlInputStream(new FileInputStream(args[0]));
  System.err.println("Parsing Annotations...");
  annot.buildTree(true);
  annot.saveText(true);
  aroot = annot.readXML(h, r);

After this operation, the variable aroot points to the root of the annotations document tree.

Step 2: Put the XLinks in a Vector

This code runs through the annotations tree and puts all the XLinks that it finds in the variable xlinks:


  Vector xlinks = new Vector();
  findXLinks(aroot, xlinks);
  System.out.println("xlinks: "+xlinks.size());
  ...
private static void findXLinks(Element e, Vector v)
...
    XLink link = XLink.isLink(e, sIDs);
    if (link != null) v.addElement(link);
...
    children = e.children();
    for (i = 0; i < children.size(); i++)
    {
      child = children.elementAt(i);
      if (child instanceof Element)
        findXLinks((Element) child, v);
    }

This code looks at each element, then calls itself recursively on that element's children. It uses the routine isLink to determine whether an element is an XLink:


  public static XLink isLink(Element e, Hashtable ids)
  ...
    String form = e.attributeValue("xml:link");
    if (form == null) return null;
    else if (form.equals("simple") || form.equals("extended"))
    {
      XLink link = new XLink(ids);
      link.loadFromElement(e, ids);
      return link;
    }
    ...

An element is an XLink if it has an attribute xml:link= whose value is either simple or extended.
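
The same recursive search can be written against the standard Java DOM API. This is only an approximation of the idea, under the assumption that a plain list of elements is enough; the class LinkFinder and the sample document are invented for the illustration and do not reproduce the XLink class described here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class LinkFinder {
    // Hypothetical annotations file: two extended links and one ordinary element.
    static final String SAMPLE = "<Annotations>"
        + "<x xml:link='extended' id='a'>"
        + "<spec xml:link='locator' href='spec.xml#id(sec-a)'/>"
        + "<here xml:link='locator' title='A'>note</here></x>"
        + "<x xml:link='extended' id='b'/>"
        + "<p>not a link</p></Annotations>";

    // An element counts as a link when xml:link is "simple" or "extended".
    static boolean isLink(Element e) {
        String form = e.getAttribute("xml:link"); // empty string when the attribute is absent
        return form.equals("simple") || form.equals("extended");
    }

    // Check this element, then recurse into its element children.
    static void collect(Element e, List<Element> found) {
        if (isLink(e)) {
            found.add(e);
        }
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                collect((Element) n, found);
            }
        }
    }

    static List<Element> findLinks(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<Element> found = new ArrayList<>();
            collect(doc.getDocumentElement(), found);
            return found;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (Element link : findLinks(SAMPLE)) {
            System.out.println(link.getAttribute("id"));
        }
    }
}
```

Locators (the spec and here children) are deliberately not reported as links in their own right; they belong to the extended link that contains them.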

Each XLink, once identified, is stored into an XLink object for later re-use; these objects are what populate the xlinks vector mentioned above. The function which loads up the XLink data structures is the same for the extended and locator linking elements, since their makeup is almost identical:


private void loadFromElement(Element e, Hashtable ids)
  ...
  // load up role, content-role, etc. fields from attribute values
  mRole = e.attributeValue("role");
  ...
  splitHref(); // save the HRef and build an XPointer if there is one
  Vector children = e.children();
  for (i = 0; i < children.size(); i++)
  {
      if ( // check xml:link att to see if this child is a locator
      {
        XLink link = new XLink(ids);
        link.loadFromElement((Element) child, ids);
        mLocators.addElement(link);

Step 3: Traverse the XLinks

Once the xlinks vector is filled up, we run through it, traversing those whose role is not "annotation" - in effect, this means we traverse only the spec XPointers:


findXLinks(aroot, xlinks);
...
System.err.println("Traversing...");
for (i = 0; i < xlinks.size(); i++)
{
  XLink link = (XLink) xlinks.elementAt(i);
  targets = link.traverseExcept("annotation");
  if (targets.size() == 0)
  {
    System.out.println("Dangling Annotation!");
    link.dump(System.out);
  }

The routine traverseExcept returns the list of targets that the corresponding XPointer points at. If that list is of size 0, it means that the XPointer is broken and doesn't point at anything. Since I was constructing the XPointers by hand, I saw this message a lot during the construction of the Annotated Spec.

The example above doesn't show the actual traversal of a single link; here's the code which does that:


private void doTraverse(Vector ret, Element linkingElement)
  ...
  if (// I haven't already parsed this instance
  {
    // parse the instance
    System.err.println("Parsing target...");
    Lark xml = new Lark();
    ...
    sTargetRoot = xml.readXML(h, r);
    System.err.println("Done.");
    }
  }
  ret.addElement(sTargetRoot);
  // if there's an XPointer, traverse that
  if (mXP != null)
    mXP.traverse(ret);

This code first looks to see whether the URL in the XLink has already been parsed. If it hasn't, it makes a new parser and parses that document. Since in this case, all the links are into the XML spec, this code only gets executed once, causing the parsing of the XML spec when the first spec locator element is traversed.

The code above is putting the results of each traversal in the vector ret. First of all, it inserts a pointer to the root of the target XML document, then checks to see if the XLink contains an XPointer (in the Annotated Spec, they all do) and if so, traverses that. The code for that is here:


public void traverse(Vector ret)
  ...
  // an XPointer is stored as an array of steps in an obvious way
  for (i = 0; i < mSteps.size(); i++)
  {
    step = (XPStep) mSteps.elementAt(i);
    kw = step.kw();
    switch (kw)
    {
    case sDescendant:
      for (j = 0; j < old.size(); j++)
      {
        start = (Element) old.elementAt(j);
        inOrder(start, ret, step, 0);
      }
      break;

    case sString:
      for (j = 0; j < old.size(); j++)
      {
        start = (Element) old.elementAt(j);
        searchForString(start, ret, step, 0);
      }
      break;

    case sID:
      ret.addElement(mIDs.get(step.type()));
      break;

    case sChild:
      ...
          for (k = 0; k < children.size() && !done; k++)
          {
            o = children.elementAt(k);
            if (step.match(o))
      ...
    case sRoot: case sHere: case sDitto: case sHTML:
    case sAncestor: case sPreceding: case sPSibling:
    case sFollowing: case sFSibling: default:
      if (kw < sMaxKW)
        throw new Exception("Can't do '" + sKeywords[kw] + "' yet");
      else
        throw new Exception("Bogus KW value "+kw);

As the comment says, each XPointer object contains an array with each element being one of the XPointer's steps. First, look for a moment at the sChild code above. It uses the method step.match() to check whether one step matches a particular child.

The code for sDescendant also uses that code, but of course has to do an in-order traversal of the whole subtree. Note how few of the XPointer verbs are implemented.
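
The difference between the child and descendant steps is easy to see in a small example. The sketch below uses the standard Java DOM API and leans on getElementsByTagName for the document-order walk, which is a substitute for the hand-written traversal in the listing above, not a copy of it; StepWalker and the sample markup are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StepWalker {
    // Hypothetical fragment: two p children, plus one p nested inside a note.
    static final String SAMPLE =
        "<div id='sec'><p>one</p><note><p>two</p></note><p>three</p></div>";

    static Element root(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // child(n,name): the nth (1-based) direct child element with that name, or null.
    static Element child(Element start, int n, String name) {
        int seen = 0;
        for (Node c = start.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element && ((Element) c).getTagName().equals(name)) {
                if (++seen == n) {
                    return (Element) c;
                }
            }
        }
        return null;
    }

    // descendant(n,name): the nth (1-based) element with that name anywhere
    // below 'start', counted in document order, or null.
    static Element descendant(Element start, int n, String name) {
        NodeList all = start.getElementsByTagName(name);
        return (n >= 1 && n <= all.getLength()) ? (Element) all.item(n - 1) : null;
    }

    public static void main(String[] args) {
        Element div = root(SAMPLE);
        System.out.println(child(div, 2, "p").getTextContent());      // second p child
        System.out.println(descendant(div, 2, "p").getTextContent()); // second p in document order
    }
}
```

With this sample, the second p child is the one containing "three", while the second p descendant is the one nested inside note.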

This part of the code, following all the links and making the connections between the annotation and the spec, takes almost no time to run. Some of the elapsed time in running this program is spent in parsing the annotations file and the XML spec, but most of it goes into the task of loading these fairly large complex documents into trees in memory. This burns immense amounts of memory; the two files are together less than 500K in size, but the two trees in memory use well over 10 megabytes. There are some tricks that could be used to make the trees more compact, but even so, the lesson is that fully-parsed XML documents burn a lot of memory. This is one reason why it's a good idea to do as much work as possible through a stream API such as SAX. However, the annotation in particular and XPointer processing in general are examples of jobs that it would be really hard to do without having a tree in memory.
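
For contrast, here is what a streaming job looks like: a minimal sketch that counts elements of a given name through SAX without ever building a tree. It assumes the SAX2-style DefaultHandler class that ships with current JDKs, which is newer than the SAX available when the Annotated Spec was built; StreamCount is a made-up name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class StreamCount {
    // Count start tags with the given name; memory use does not grow with document size.
    static int countElements(String xml, String name) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals(name)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println(countElements("<a><b/><c><b/></c></a>", "b"));
    }
}
```

A count like this needs no memory of what came before, which is exactly what an XPointer such as child(8,p) followed by a string match does need.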

Step 4: Decorate the Target

At the end of all this work, the vector targets contains a list of all the locations in the XML spec that have annotations pointing at them. Next, the code runs through the XML spec and, for each element that is the target of a link, it adds a new child which is an HTML A element with an href= pointer pointing at the appropriate annotation file. This element is added as the last child of the annotated element, so the annotation symbol will appear at the end of the target. This is an arbitrary choice that worked fine for the Annotated Spec, but may not be appropriate for many other applications. If the XPointer points into the middle of a chunk of text, more work is required; the text has to be split into two nodes, and the HTML A element inserted between them:


for (k = 0; k < targets.size(); k++)
{
  Object o = targets.targetAt(k);
  Element lE = link.linkingElement();
  if (o instanceof Element)
  {
    ((Element) o).addChild(makeAnchor(lE, targets.roleAt(k)));
  }
  else if (o instanceof Text)
  {
    // OK, we're going to have to split this text node
    splitText((Text) o, targets.offsetAt(k),
      lE, targets.roleAt(k));
  }
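
The text-splitting case can be sketched with the standard Java DOM API, whose Text.splitText method does the node surgery. This is an illustration under that assumption, not the splitText routine called in the listing above; AnchorSplice, the file name notes/Mixed.html, and the (T) label are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

class AnchorSplice {
    // Split 'text' at 'offset' and insert an A element between the two halves.
    static Element insertAnchor(Text text, int offset, String href, String label) {
        Document doc = text.getOwnerDocument();
        // After the call, 'text' keeps the characters before the offset and
        // 'tail' (its new next sibling) holds the rest.
        Text tail = text.splitText(offset);
        Element a = doc.createElement("A");
        a.setAttribute("href", href);
        a.appendChild(doc.createTextNode(label));
        text.getParentNode().insertBefore(a, tail);
        return a;
    }

    // Build <p>#PCDATA content</p> and splice an anchor in after "#PCDATA".
    static Element sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element p = doc.createElement("p");
            Text t = doc.createTextNode("#PCDATA content");
            p.appendChild(t);
            doc.appendChild(p);
            insertAnchor(t, 7, "notes/Mixed.html", "(T)");
            return p;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element p = sample();
        System.out.println(p.getChildNodes().getLength());
        System.out.println(p.getTextContent());
    }
}
```

The paragraph ends up with three children: the leading text, the anchor, and the trailing text.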

Once all these extra elements have been spliced into the XML spec, writing out the annotated version is pretty easy. I already had a large amount of Java code that converts the XML to HTML, which is used to generate the HTML versions of the spec that you can find at the W3C site and elsewhere. This code works by having a Java class for each element; all I had to do was to add a new class to handle the new spliced-in A elements, which involved very little work aside from putting in the (H)-style decorations depending on the role attribute:


  System.err.println("Done.  Writing target...");

  xroot = XLink.currentRoot();
  out = new PrintStream(new FileOutputStream("target.html"));
  if (xroot != null)
    printer.writeHTML(xroot, out, false);
  out.close();
  System.err.println("Done.  Writing notes...");

Step 5: Write the Annotations

Writing out the annotations is pretty easy too. They'd been placed in a vector named notes constructed earlier in the process. For each one, we open a file, dump in the HTML text (with a bit of extra decoration) and the job's done.


    System.err.println("Done.  Writing notes...");

    for (i = 0; i < notes.size(); i++)
    {
      if (notes.elementAt(i) instanceof Element)
      {
        child = (Element) notes.elementAt(i);
        title = child.attributeValue("id");
        out = new PrintStream(new FileOutputStream(
                    "notes/" + title + ".html"));
        out.println("<HTML><HEAD><TITLE>"
              + title + "</TITLE>");
        ...

We use the value of the id attribute for the filename; that's why it is #REQUIRED.
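
The one-file-per-annotation idea can be sketched with java.nio.file, which postdates the code shown here. In this rough sketch the map of id to HTML body stands in for the notes vector, and the class NoteWriter and the temporary directory are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

class NoteWriter {
    // Write one small HTML file per annotation, named after the annotation's id.
    static void writeNotes(Path dir, Map<String, String> notes) {
        try {
            Files.createDirectories(dir);
            for (Map.Entry<String, String> e : notes.entrySet()) {
                String html = "<HTML><HEAD><TITLE>" + e.getKey() + "</TITLE></HEAD>\n"
                            + "<BODY>" + e.getValue() + "</BODY></HTML>\n";
                Files.write(dir.resolve(e.getKey() + ".html"),
                            html.getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Write a single sample note into a fresh temporary directory and return that directory.
    static Path writeSample() {
        try {
            Path dir = Files.createTempDirectory("notes");
            Map<String, String> notes = new LinkedHashMap<>();
            notes.put("RfC1808URI", "<p>ftp://ds.internic.net/rfc/rfc1808.txt</p>");
            writeNotes(dir, notes);
            return dir;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        Path dir = writeSample();
        System.out.println(Files.exists(dir.resolve("RfC1808URI.html")));
    }
}
```

Because the id doubles as the filename, two annotations with the same id would silently overwrite each other, which is one more reason to declare it as an ID attribute.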

Conclusion

How Much Work Was It?

Building the Annotated Spec required writing quite a lot of Java code. Here's a summary of the lines of code required, which is a poor way to measure the amount of work. I didn't keep track of the amount of time it took, because I was writing and debugging the code as I wrote the annotations, and doing both of these things in parallel with a lot of traveling and lecturing and consulting.

Lines of code   Java File       Function
403             Annotate.java   Mainline, bookkeeping
52              Link.java       All the information describing a link from the annotation file into the XML spec
276             XLink.java      XLink processing
82              XPStep.java     XPointer step processing
303             XPointer.java   XPointer processing
1116            Total

This is nowhere near being a complete implementation of XLink and XPointer. It contains just enough logic to solve this particular application's problems.

Lessons

From a commercial point of view, the Annotated Spec is a huge success. We should be cautious in drawing conclusions from this exercise, since the average hypertext creator is not going to regard writing 1100-plus lines of Java as a normal part of the authoring process. I think, though, that some useful lessons do emerge from this work:

  • Hypertext annotation is a useful technique for adding value to complex reference texts.
  • The basic design of XLink and XPointer seems, in practice, to be sound, in terms of providing the machinery necessary to build a sophisticated, usable hypertext.
  • While this project did not involve a general-purpose implementation of XLink/XPointer, a very large part of the necessary logic was roughed-in without uncovering any unforeseen engineering problems.
  • Implementing XPointer, at the moment, requires loading all of a parsed document into an in-memory tree. This consumes excessive memory for large documents, so a "virtual tree" facility, allowing tree-walking without actual memory loading, will likely be essential for successful industrial implementations of XML hypertexts.
  • We need a solution to the problem of identifying ID attributes without having to use a validating processor.
  • We need some serious debate on the selection of verbs to be included in the XPointer specification. Quite likely, there is a case for having one or two more verbs than I used in putting together the Annotated Spec. On the other hand, it seems probable that XPointers would be improved by removing one or two of the many existing verbs.
  • The task of generating XPointers should be automated, not done by hand.
  • The Annotated Spec had to be drastically transformed into conventional HTML for delivery, since there was no widely-deployed software available with the capability of displaying this type of hypertext. There is an important open question as to whether, in general, complex hypertexts will require extensive processing for delivery.

Copyright © Tim Bray, 1998. All rights reserved.