Building the Annotated XML Specification
The design of XML 1.0 stretched over 20 months ending in February 1998, with input from a couple of hundred of the world's best experts in the areas of markup, publishing, and Web design. The result of that work, the XML 1.0 Specification, is a highly condensed document that contains little or no information about how it came to read the way it does.
Even before the release of XML 1.0, it became obvious that some parts of the spec were self-explanatory, while others were causing headaches for its users.
The Annotated XML Specification addresses both of these problems. It supplements the basic specification, first with historical background and explanation of how things came to be the way they are, and second with detailed explanations of the portions of the spec that have proved difficult. Commercially, it has been a success; in its first month on the Web, it had over 100,000 page views from over 26,000 unique Internet addresses. It remains, by a substantial margin, the most popular item available at the XML.com site.
This article explains how I created the Annotated XML Specification. If you haven't looked at it, you might want to give it a glance before reading about it, or even better, open it in another browser window while you read about it here.

The XML 1.0 specification is accessed in read-only mode. For convenience, I keep a local copy in a file called xml.xml. All the annotations live in another single file (about 25% larger than the XML specification itself) named notes.xml. A Java program, based on my Lark processor, reads both xml.xml and notes.xml and builds in-memory tree-structured representations of both. After processing all the links in notes.xml, the program writes out a file called target.html, which is the annotated version of the spec, and a large number of small files, each containing one of the annotations. The rest of this article outlines what's in those files and what the program does.
The first question I had to face in constructing the annotation was whether or not to use the highly-unfinished XLink and XPointer technologies, currently under development in the World Wide Web Consortium (W3C) XML Activity. XLink and XPointer have two large advantages: they are built for XML and can point at arbitrary locations inside a document.
On the other hand, neither spec is nearly finished, and they have changed (in syntax, if not at a conceptual level) from draft to draft. Furthermore, there were (in early 1998) neither commercial nor freeware implementations available. So if I were going to use this technology, I was going to have to write all the software myself. I decided to go ahead with XLink and XPointer, and while the syntax described here is about a year behind the latest drafts, I believe that what I've done is conceptually in tune with the current thinking, so the syntax will be easy to upgrade once the spec settles down.
The XLink spec is concerned with recognizing which elements are being used as links, and with giving those links some useful structure and properties. HTML has a very simple solution to this set of problems: all linking elements have to be named A, and they have to point at one "resource", and the resource's address is found in the HREF= attribute.
<A HREF="resource_address">
XLink tries to be more general than HTML: any element can serve as a link, and is identified as such by using a magic reserved attribute, xml:link=. There are two kinds of XLinks, identified by the values xml:link="simple" and xml:link="extended". In the annotation, I only used the extended flavor, so I won't discuss simple XLinks here. Extended XLinks can have a bunch of useful associated information:
In the Annotated Spec, I used an element named x as the chief linking element. If you were to dress it up with all the necessary attributes, one of these elements would look like this:
<x xml:link="extended" inline="true" content-role="commentary" content-title="Annotation" > ... contents of the linking element go here ... </x>
If every one of the 312 annotations had to carry around all those attributes, the Annotated Spec would be hard to write and to work with. Fortunately, XML has attribute defaulting, so I could provide all these attributes just once, in the document header:
<!DOCTYPE Annotations [
<!ELEMENT x (here|spec)+>
<!ATTLIST x
  xml:link CDATA #FIXED "extended"
  inline CDATA #FIXED "true"
  content-role CDATA #FIXED "commentary"
  content-title CDATA #FIXED "Annotation"
  id ID #REQUIRED>
]>
<Annotations>
<x id="first-link-id"> ... content of first link ... </x>
<x id="second-link-id"> ... content of second link ... </x>
...
</Annotations>
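As a rough illustration of attribute defaulting at work, here is a minimal sketch that parses a cut-down version of the header above and reads back attributes that never appear in the document body. It uses the JDK's built-in DOM parser rather than Lark, and the class name, method name, and simplified content models are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; it is not part of the Annotated Spec software.
final class AttributeDefaultingDemo {
    // A cut-down annotations file: only id is written on the x element itself.
    static final String SAMPLE =
        "<!DOCTYPE Annotations [\n"
      + "<!ELEMENT Annotations (x)+>\n"
      + "<!ELEMENT x (#PCDATA)>\n"
      + "<!ATTLIST x\n"
      + "  xml:link     CDATA #FIXED \"extended\"\n"
      + "  content-role CDATA #FIXED \"commentary\"\n"
      + "  id           ID    #REQUIRED>\n"
      + "]>\n"
      + "<Annotations><x id=\"first-link-id\">text</x></Annotations>";

    // Returns the value of the named attribute on the first x element.
    static String attributeOfFirstX(String xml, String attribute) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element x = (Element) doc.getElementsByTagName("x").item(0);
            return x.getAttribute(attribute);
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse sample", ex);
        }
    }

    public static void main(String[] args) {
        // Both values come from the ATTLIST declaration, not from the element.
        System.out.println(attributeOfFirstX(SAMPLE, "xml:link"));
        System.out.println(attributeOfFirstX(SAMPLE, "content-role"));
    }
}
```

The parser here is non-validating, yet it still reads the internal DTD subset and supplies the declared defaults, which is what makes the short form of the x element practical.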
The real role of the x element is to hold one here element and a bunch of spec elements. The here element actually contains the text of the annotation, while each of the spec elements points at a location in the XML spec where the annotation applies. Most of the annotations apply to only one location, but there are a few that attach to many places. For example, there is an annotation saying that DTD keywords (such as DOCTYPE, SYSTEM, ELEMENT, and ATTLIST) must be in upper-case; this annotation is attached to the first definition of each of these keywords.
The here element is what XLink calls a "Locator" - it serves as one end of the extended link, and contains the annotation. It has a lot of attributes, most of which are defaulted and don't actually appear in the body of the document. Here's the declaration, with all those attributes:
<!ELEMENT here ANY>
<!ATTLIST here
  xml:link CDATA #FIXED "locator"
  actuate CDATA #FIXED "auto"
  show CDATA #FIXED "replace"
  role CDATA #FIXED "annotation"
  title CDATA #REQUIRED
  href CDATA #FIXED "here()"
  index CDATA #IMPLIED>
Here's what all those attributes mean:
The content is declared as ANY and contains text marked up with HTML tags. Since this annotation was designed for Web delivery, and this content was designed to be read by humans, I felt that HTML was adequate to meet my formatting needs. HTML also had the advantage that I didn't have to write code to convert it for delivery. So far, I've found HTML perfectly satisfactory for this particular application. However, I do not draw the conclusion that HTML is going to be the right presentation solution for every, or even most, hypertext applications. Since the annotation is an XML document, the HTML has to be well-formed, to allow processing with XML-processor based tools.
Here's one of the here elements:
<here title='The Document Entity is Special' index='Document Entity, Special Status Of'>
<p>The differences between the document entity and any other external parsed entity are:</p>
<ol><li>The document entity can begin with an <Sref href='&h;dt-xmldecl'>XML declaration</Sref>, other external parsed entities with a <Sref href='&h;NT-TextDecl'>text declaration</Sref>. </li>
<li>The document entity can contain a <Sref href='&h;dt-doctype'>document type declaration</Sref>.</li></ol></here>
Note that there are a few magic non-HTML elements mixed in; in this case, the Sref element, which is used to encode a pointer back into the XML specification. The Annotated Spec also uses Xref (external reference) and Nref (reference to another annotation) elements. That pointer (in the href attribute) uses an entity reference, &h;, which contains the URL for the XML spec; this is a good idea since there are hundreds of these URLs in the Annotated spec, and the location I read the XML spec from might change.
The spec element is another XLink "Locator", which contains a pointer into the XML spec, indicating what part of the spec the annotation is there to annotate. Here's its declaration:
<!ELEMENT spec EMPTY>
<!ATTLIST spec
  xml:link CDATA #FIXED "locator"
  actuate CDATA "user"
  show CDATA "replace"
  role (Using|History|Tech|Misc|Example) "Misc"
  title CDATA "Into XML Specification"
  href CDATA #REQUIRED>
The only really interesting attributes are role, whose allowed values (Using, History, Tech, Misc, and Example) say which kind of annotation it is, and href, which contains the URL pointing into the spec.
Here's an example, not just of a spec element, but of a whole x element with its spec and here children:
<x id='RfC1808URI'>
<spec role='Using' href='&s;id(RFC1808)'/>
<here title='RFC 1808 URL'>
<p><Xref href='ftp://ds.internic.net/rfc/rfc1808.txt'>
ftp://ds.internic.net/rfc/rfc1808.txt</Xref></p>
</here></x>
In this example, the target of the reference is easily identified, since it is just a bibliographic entry that has an id attribute.
In an XLink Locator, there is an href= attribute that gives the URI identifying an end of the link. An XPointer is a string of characters that is used after the # "fragment separator" character in that href= value. It points into the XML document by treating it as a tree structure and identifying numbered child and descendant nodes.
The best XPointers are those that are based on "ID Attributes", that is to say attributes that have been declared to have a unique value. These are easy for an XML Processor to find and traverse to, but there is a problem in that you don't know which attributes are so declared unless you are prepared to read the whole DTD. This means, if you read the XML spec carefully, that to be sure, you have to use a validating XML processor. In my case, I was able to get away with using Lark, my non-validating processor, simply by assuming that any attribute whose name was id was an ID attribute.
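The shortcut described above, treating every attribute named id as an ID, can be sketched as a single tree walk that fills a lookup table. This is only an illustration on the JDK's DOM API, not Lark; the class and method names are hypothetical, and keeping the first element seen for a duplicated id is a choice made for the example.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch; assumes any attribute literally named "id" is an ID.
final class IdIndexSketch {
    // Walks the subtree under e and records each element carrying an id attribute.
    static void collect(Element e, Map<String, Element> ids) {
        if (e.hasAttribute("id")) {
            // Keep the first element seen if an id value is (wrongly) repeated.
            ids.putIfAbsent(e.getAttribute("id"), e);
        }
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                collect((Element) n, ids);
            }
        }
    }

    // Parses the document text and returns its id-to-element table.
    static Map<String, Element> index(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Map<String, Element> ids = new HashMap<>();
            collect(doc.getDocumentElement(), ids);
            return ids;
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse document", ex);
        }
    }

    public static void main(String[] args) {
        Map<String, Element> ids = index(
            "<spec><div id='sec-intro'><p>Hi</p></div><bibl id='RFC1808'/></spec>");
        System.out.println(ids.get("RFC1808").getTagName());
    }
}
```

With a table like this in hand, an id(...) step becomes a single lookup, with no DTD needed at the cost of trusting the naming convention.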
XPointer provides quite a few different verbs for selecting objects inside the document tree; in the Annotated Spec I was able to get away with using only the id, descendant, child, and string verbs. Furthermore, I could have got by without using the string operator. This raises the question: if something as complex as the Annotated Spec can be constructed with just these operators, do we really need all the others?
Here are some interesting examples of XPointers from the Annotated Spec; they have been set up to point into the indicated part of the HTML version of the XML spec.
id(NT-Mixed)child(1,rhs)string(1,"ATA",4)
id(sec-xml-wg)descendant(18,name)string(1,"gue",4)
id(sec-guessing)child(8,p)string(1,XML.,4)
id(sec-prolog-dtd)descendant(2,vcnote)
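To make the shape of these pointers concrete, here is a minimal sketch that splits an XPointer of this draft-era form into its steps, each a keyword plus a list of arguments. The class names and the regular expression are assumptions for illustration, and the sketch deliberately ignores arguments that contain commas or parentheses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for strings such as id(NT-Mixed)child(1,rhs)string(1,"ATA",4).
final class XPointerParseSketch {
    // One step: a keyword such as id, child, descendant, or string, plus its arguments.
    static final class Step {
        final String keyword;
        final List<String> args;

        Step(String keyword, List<String> args) {
            this.keyword = keyword;
            this.args = args;
        }

        @Override
        public String toString() {
            return keyword + args;
        }
    }

    // keyword(arguments); arguments may not themselves contain ')'.
    private static final Pattern STEP = Pattern.compile("([A-Za-z]+)\\(([^)]*)\\)");

    static List<Step> parse(String xpointer) {
        List<Step> steps = new ArrayList<>();
        Matcher m = STEP.matcher(xpointer);
        int pos = 0;
        while (m.find()) {
            if (m.start() != pos) {
                throw new IllegalArgumentException("unexpected text at offset " + pos);
            }
            List<String> args = new ArrayList<>();
            for (String raw : m.group(2).split(",")) {
                String arg = raw.trim();
                // Strip one pair of surrounding double quotes, if present.
                if (arg.length() >= 2 && arg.startsWith("\"") && arg.endsWith("\"")) {
                    arg = arg.substring(1, arg.length() - 1);
                }
                args.add(arg);
            }
            steps.add(new Step(m.group(1), args));
            pos = m.end();
        }
        if (pos != xpointer.length()) {
            throw new IllegalArgumentException("unexpected text at offset " + pos);
        }
        return steps;
    }

    public static void main(String[] args) {
        for (Step step : parse("id(NT-Mixed)child(1,rhs)string(1,\"ATA\",4)")) {
            System.out.println(step);
        }
    }
}
```

Reading the first example this way gives three steps: find the element whose ID is NT-Mixed, take its first rhs child, then locate a string inside it.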
I actually authored each of the 312 XPointers in the Annotated Spec by hand, finding IDs, counting children, and matching strings. At the summer 1998 XML Developers' Day conference, David Megginson showed how I could have programmed GNU Emacs (which is what I use for editing anyhow) to construct these automatically at the touch of a key. The lesson is that any sensible XML editing environment ought to make it easy to construct this kind of hyperlink.
First, the program creates an instance of the Lark parser, and in one step, reads the annotations file:
annot = new Lark();
r = new XmlInputStream(new FileInputStream(args[0]));
System.err.println("Parsing Annotations...");
annot.buildTree(true);
annot.saveText(true);
aroot = annot.readXML(h, r);
After this operation, the variable aroot points to the root of
the annotations document tree.
This code runs through the annotations tree and puts all the XLinks that it finds in the variable xlinks:
Vector xlinks = new Vector();
findXLinks(aroot, xlinks);
System.out.println("xlinks: "+xlinks.size());
...
private static void findXLinks(Element e, Vector v)
...
XLink link = XLink.isLink(e, sIDs);
if (link != null) v.addElement(link);
...
children = e.children();
for (i = 0; i < children.size(); i++)
{
child = children.elementAt(i);
if (child instanceof Element)
findXLinks((Element) child, v);
}
This code looks at each element, then calls itself recursively on that
element's children. It uses the routine isLink to determine
whether an element is an XLink:
public static XLink isLink(Element e, Hashtable ids)
...
String form = e.attributeValue("xml:link");
if (form == null) return null;
else if (form.equals("simple") || form.equals("extended"))
{
XLink link = new XLink(ids);
link.loadFromElement(e, ids);
return link;
}
...
An element is an XLink if it has an attribute xml:link= whose
value is either simple or extended.
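For readers without Lark to hand, the same recursive search can be approximated on the JDK's DOM API. This is a sketch under stated assumptions, not the program's actual code: the class and method names are hypothetical, and it relies on the default, non-namespace-aware parser so that xml:link can be looked up by its literal name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical DOM-based counterpart of findXLinks and isLink.
final class FindLinksSketch {
    // An element is a link if xml:link is "simple" or "extended".
    static boolean isLink(Element e) {
        String form = e.getAttribute("xml:link"); // empty string when absent
        return form.equals("simple") || form.equals("extended");
    }

    // Checks the element itself, then recurses into its element children.
    static void findLinks(Element e, List<Element> out) {
        if (isLink(e)) {
            out.add(e);
        }
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                findLinks((Element) n, out);
            }
        }
    }

    // Convenience wrapper: parse the text and return the id of every link found.
    static List<String> linkIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<Element> links = new ArrayList<>();
            findLinks(doc.getDocumentElement(), links);
            List<String> ids = new ArrayList<>();
            for (Element link : links) {
                ids.add(link.getAttribute("id"));
            }
            return ids;
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse document", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(linkIds(
            "<Annotations>"
          + "<x xml:link='extended' id='a'><here xml:link='locator'/></x>"
          + "<x xml:link='extended' id='b'/>"
          + "</Annotations>"));
    }
}
```

Locator children such as here are visited but not collected, since their xml:link value is neither simple nor extended.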
Each XLink, once identified, is stored into an XLink object for later re-use; these objects are what populate the xlinks vector mentioned above. The function which loads up the XLink data structures is the same for the extended and locator linking elements, since their makeup is almost identical:
private void loadFromElement(Element e, Hashtable ids)
...
// load up role, content-role, etc. fields from attribute values
mRole = e.attributeValue("role");
...
splitHref(); // save the HRef and build an XPointer if there is one
Vector children = e.children();
for (i = 0; i < children.size(); i++)
{
if ( // check xml:link att to see if this child is a locator
{
XLink link = new XLink(ids);
link.loadFromElement((Element) child, ids);
mLocators.addElement(link);
Once the xlinks vector is filled up, we run through it, traversing those whose role is not "annotation" - in effect, this means we traverse only the spec XPointers:
findXLinks(aroot, xlinks);
...
System.err.println("Traversing...");
for (i = 0; i < xlinks.size(); i++)
{
XLink link = (XLink) xlinks.elementAt(i);
targets = link.traverseExcept("annotation");
if (targets.size() == 0)
{
System.out.println("Dangling Annotation!");
link.dump(System.out);
}
The routine traverseExcept returns the list of targets that
the corresponding XPointer points at. If that list is of size 0, it means
that the XPointer is broken and doesn't point at anything. Since I was
constructing the XPointers by hand, I saw this message a lot during the
construction of the Annotated Spec.
The example above doesn't show the actual traversal of a single link; here's the code which does that:
private void doTraverse(Vector ret, Element linkingElement)
...
if (// I haven't already parsed this instance
{
// parse the instance
System.err.println("Parsing target...");
Lark xml = new Lark();
...
sTargetRoot = xml.readXML(h, r);
System.err.println("Done.");
}
}
ret.addElement(sTargetRoot);
// if there's an XPointer, traverse that
if (mXP != null)
mXP.traverse(ret);
This code first looks to see whether the URL in the XLink has already been
parsed. If it hasn't, it makes a new parser and parses that document.
Since in this case, all the links are into the XML spec, this code
only gets executed once, causing the parsing of the XML spec when the first
spec locator element is traversed.
The code above is putting the results of each traversal in the vector ret. First of all, it inserts a pointer to the root of the target XML document, then checks to see if the XLink contains an XPointer (in the Annotated Spec, they all do) and if so, traverses that. The code for that is here:
public void traverse(Vector ret)
...
// an XPointer is stored as an array of steps in an obvious way
for (i = 0; i < mSteps.size(); i++)
{
step = (XPStep) mSteps.elementAt(i);
kw = step.kw();
switch (kw)
{
case sDescendant:
for (j = 0; j < old.size(); j++)
{
start = (Element) old.elementAt(j);
inOrder(start, ret, step, 0);
}
break;
case sString:
for (j = 0; j < old.size(); j++)
{
start = (Element) old.elementAt(j);
searchForString(start, ret, step, 0);
}
break;
case sID:
ret.addElement(mIDs.get(step.type()));
break;
case sChild:
...
for (k = 0; k < children.size() && !done; k++)
{
o = children.elementAt(k);
if (step.match(o))
...
case sRoot: case sHere: case sDitto: case sHTML:
case sAncestor: case sPreceding: case sPSibling:
case sFollowing: case sFSibling: default:
if (kw < sMaxKW)
throw new Exception("Can't do '" + sKeywords[kw] + "' yet");
else
throw new Exception("Bogus KW value "+kw);
As the comment says, each XPointer object contains an array with each element
being one of the XPointer's steps.
First, look for a moment at the sChild code above.
It uses the method step.match() to check whether one step
matches a particular child.
The code for sDescendant also uses that code, but of course has to do an in-order traversal of the whole subtree. Note how few of the XPointer verbs are implemented.
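The child and descendant steps can be sketched compactly on the JDK's DOM API. This is an illustration only, with hypothetical class and method names; it assumes the step arguments are a 1-based occurrence count and an element name, and that descendants are counted in document order.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch on the JDK DOM; not the article's XPointer.java.
final class StepSketch {
    // child(n, name): the n-th (1-based) child element called name, or null.
    static Element child(Element start, int n, String name) {
        int seen = 0;
        for (Node c = start.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element && c.getNodeName().equals(name) && ++seen == n) {
                return (Element) c;
            }
        }
        return null;
    }

    // descendant(n, name): the n-th (1-based) descendant element called name.
    // getElementsByTagName returns matching descendants in document order.
    static Element descendant(Element start, int n, String name) {
        NodeList all = start.getElementsByTagName(name);
        return (n >= 1 && n <= all.getLength()) ? (Element) all.item(n - 1) : null;
    }

    // Parses the text and returns the document's root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse document", ex);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot(
            "<div><p id='p1'><name id='n1'/></p><note/><p id='p2'><name id='n2'/></p></div>");
        System.out.println(child(root, 2, "p").getAttribute("id"));
        System.out.println(descendant(root, 2, "name").getAttribute("id"));
    }
}
```

A null result here plays the role of the empty target list in the program: the step matched nothing, so the pointer dangles.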
This part of the code, following all the links and making the connections between the annotation and the spec, takes almost no time to run. Some of the elapsed time in running this program is spent in parsing the annotations file and the XML spec, but most of it goes into the task of loading these fairly large complex documents into trees in memory. This burns immense amounts of memory; the two files are together less than 500K in size, but the two trees in memory use well over 10 megabytes. There are some tricks that could be used to make the trees more compact, but even so, the lesson is that fully-parsed XML documents burn a lot of memory. This is one reason why it's a good idea to do as much work as possible through a stream API such as SAX. However, the annotation in particular and XPointer processing in general are examples of jobs that it would be really hard to do without having a tree in memory.
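As a contrast with the tree-building approach, the following sketch counts linking elements through SAX without keeping any of the document in memory. It uses the JDK's SAX parser; the class and method names are hypothetical, and the default non-namespace-aware configuration is assumed so that xml:link can be looked up by its qualified name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical streaming counter; nothing is retained except the running total.
final class SaxCountSketch {
    static int countLinks(String xml) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Same test as isLink, applied as each start tag streams past.
                String form = atts.getValue("xml:link");
                if ("simple".equals(form) || "extended".equals(form)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse document", ex);
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println(countLinks(
            "<Annotations><x xml:link='extended'/><x xml:link='extended'/></Annotations>"));
    }
}
```

Finding the links streams well; resolving a pointer such as descendant(18,name) against another document is where a tree, or at least a good deal of buffered state, becomes hard to avoid.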
At the end of all this work, the vector targets contains a list of all the locations in the XML spec that have annotations pointing at them. Next, the code runs through the XML spec and, for each element that is the target of a link, it adds a new child which is an HTML A element with an href= pointer pointing at the appropriate annotation file. This element is added as the last child of the annotated element, so the annotation symbol will appear at the end of the target. This is an arbitrary choice that worked fine for the Annotated Spec, but may not be appropriate for many other applications. If the XPointer points into the middle of a chunk of text, more work is required; the text has to be split into two nodes, and the HTML A element inserted between them:
for (k = 0; k < targets.size(); k++)
{
Object o = targets.targetAt(k);
Element lE = link.linkingElement();
if (o instanceof Element)
{
((Element) o).addChild(makeAnchor(lE, targets.roleAt(k)));
}
else if (o instanceof Text)
{
// OK, we're going to have to split this text node
splitText((Text) o, targets.offsetAt(k),
lE, targets.roleAt(k));
}
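The text-splitting case can be sketched with the standard DOM call Text.splitText, which divides a text node at an offset and leaves both halves in the tree. This is not the program's splitText routine; the class name, method name, and sample href are hypothetical, and the target is assumed to be an element whose first child is the text node in question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical sketch of splicing an anchor into the middle of a text node.
final class SplitTextSketch {
    // Splits the root element's first text child at offset and inserts
    // an A element between the two halves. Returns the modified root.
    static Element annotate(String xml, int offset, String href) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element target = doc.getDocumentElement();
            Text text = (Text) target.getFirstChild();
            // After this call, text holds the head and tail holds the rest.
            Text tail = text.splitText(offset);
            Element anchor = doc.createElement("A");
            anchor.setAttribute("href", href);
            target.insertBefore(anchor, tail);
            return target;
        } catch (Exception ex) {
            throw new IllegalStateException("could not annotate text", ex);
        }
    }

    public static void main(String[] args) {
        Element p = annotate("<p>PCDATA here</p>", 6, "notes/example.html");
        // Lists the three resulting children: text, A, text.
        for (Node n = p.getFirstChild(); n != null; n = n.getNextSibling()) {
            System.out.println(n.getNodeName() + " " + n.getNodeValue());
        }
    }
}
```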
Once all these extra elements have been spliced into the XML spec, writing
out the annotated version is pretty easy.
I already had a large amount of Java code that converts the XML to HTML, which
is used to generate the HTML versions of the spec that you can find at
the W3C site and elsewhere.
This code works by having a Java class for each element; all I had to do was
to add a new class to handle the new spliced-in A elements,
which involved very little work aside from putting in the
style decorations that depend on the role attribute:
System.err.println("Done. Writing target...");
xroot = XLink.currentRoot();
out = new PrintStream(new FileOutputStream("target.html"));
if (xroot != null)
printer.writeHTML(xroot, out, false);
out.close();
System.err.println("Done. Writing notes...");
Writing out the annotations is pretty easy too. They'd been placed in a vector named notes constructed earlier in the process. For each one, we open a file, dump in the HTML text (with a bit of extra decoration) and the job's done.
System.err.println("Done. Writing notes...");
for (i = 0; i < notes.size(); i++)
{
if (notes.elementAt(i) instanceof Element)
{
child = (Element) notes.elementAt(i);
title = child.attributeValue("id");
out = new PrintStream(new FileOutputStream(
"notes/" + title + ".html"));
out.println("<HTML><HEAD><TITLE>"
+ title + "</TITLE>");
...
We use the value of the id attribute for the filename; that's
why it is #REQUIRED.
| Lines of code | Java File | Function |
|---|---|---|
| 403 | Annotate.java | Mainline, bookkeeping |
| 52 | Link.java | All the information describing a link from the annotation file into the XML spec |
| 276 | XLink.java | XLink processing |
| 82 | XPStep.java | XPointer step processing |
| 303 | XPointer.java | XPointer processing |
| 1116 | Total | |
This is nowhere near being a complete implementation of XLink and XPointer. It contains just enough logic to solve this particular application's problems.
From a commercial point of view, the Annotated Spec is a huge success. We should be cautious in drawing conclusions from this exercise, since the average hypertext creator is not going to regard writing 1100-plus lines of Java as a normal part of the authoring process. I think, though, that some useful lessons do emerge from this work:
Copyright © Tim Bray, 1998. All rights reserved.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.