A New Old Angle on XML
August 29, 2001
One of the things that has made XML famous is the angle bracket: the pleasing symmetry of the <tag> opened and the </tag> closed. Angle brackets are key to the particular strength of XML, a uniform and universally agreed syntax. It's hardly surprising, then, that the angle bracket has a cult thing going. They're handy for logos and geeky puns (this column not excepted). Everybody loves angle brackets.
Advocates of simplification heaved a sigh of relief when XML locked down the basic syntax of markup. Its predecessor, SGML, had the scary ability to switch angle brackets out and do very weird things with the syntax. How nice that we can all agree.
Except that when you've got an angle bracket shaped hammer, everything starts looking like a nail. Surely nobody in their right mind can find pleasure in this XML-ification of a simple if ... then construct:
<prog:if test="a > b" xmlns:prog="http://myneatlang.com/">
  <prog:then> ... </prog:then>
  <prog:else> ... </prog:else>
</prog:if>
Despite the emergence of a few such programming languages with XML syntax, it's pretty clear that the angle bracket doesn't yield any of its much-vaunted "human readable" benefits here. The exception proving my rule, of course, is XSLT -- but you won't find anyone claiming that the most attractive thing about XSLT is its syntax!
Even XSLT didn't go all the way with XML syntax of course, introducing the first non-XML syntax into the XML canon: XPath. (Yes, I omit DTDs here, there's been a contract out on them ever since XML 1.0 was published.) It's pretty obvious that expanding even modest XPath expressions into an XML syntax would lead to unmanageable stylesheets. XPath's filesystem-like metaphor for navigating an XML document works pretty well. Of course, an XML syntax for XPath has been argued for, the main reason being that XML processing machinery like DOM would be able to process XPath too -- the hammer and nail argument again.
This is perhaps a not unreasonable position: for every new syntax we introduce, the situation worsens in two ways. First, we need to write code to process the new notation. Secondly, the user of the technology needs to learn the new notation. Such arguments are not overwhelming, however. The purpose of the terser XPath notation is really to help the user in the first place. So writing a small amount of code to process it can be traded off happily against the benefits to the user. Further, the pain of learning a new notation is only partially linked to the syntax. Both the verbose XML syntax I linked above and the W3C recommended XPath syntax embody the same notions: the user must learn what XPath means (dare I say the s-word?) in order to use it effectively. It's not hard to see that a verbose XML syntax could actually obscure the easy acquisition of a language's semantics.
Other W3C work has recognized this pragmatic use of little languages embedded in XML. Perhaps the best example of this is Scalable Vector Graphics, which has constructs like this:
<path d="M9.777,9.958c1.306,0,1.529,1.601,1.529,2.752c0, 0.997-0.224,2.625-1.529,2.639V9.958z M6.913,17.905h2.499c1.741,0,4.816-0.449, 4.816-5.334c0-3.201-1.757-5.251-5.014-5.251H6.913v10.586z"/>
The contents of the d attribute describe the path of a line in a diagram. At the time of SVG's development, some objected that the path information was not easily processable with the XSLT hammer, as it wasn't in XML syntax. With hindsight I think we'll all be very grateful that the SVG Working Group did not yield to the pressure to place paths in XML armor.
Eschewing the mere inclusion of small sections of non-XML, several people have chosen to go all the way with alternative syntaxes for XML. As I mentioned above, SGML had no hangups about leaving the house without wearing angle brackets, and one early alternative XML syntax owes much to its SGML heritage. PYX, developed by Sean McGrath, is a line-based notation for XML, which makes processing easy with regular expressions and line-oriented tools such as sed and grep. PYX borrows much from SGML's ESIS format.
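To make the line-based idea concrete, here is a minimal sketch of an XML-to-PYX conversion in Python. It assumes the commonly described PYX line prefixes ("(" for a start tag, "A" for an attribute, "-" for character data, ")" for an end tag) and ignores processing instructions. The names PyxHandler and xml_to_pyx are hypothetical; this is not McGrath's own tooling.

```python
import xml.sax


class PyxHandler(xml.sax.ContentHandler):
    """Collect one PYX-style line per SAX event."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def startElement(self, name, attrs):
        self.lines.append("(" + name)
        # Attributes follow the start-tag line, one per line.
        for key in attrs.getNames():
            self.lines.append("A" + key + " " + attrs.getValue(key))

    def characters(self, content):
        # The notation is line-based, so literal newlines are escaped.
        self.lines.append("-" + content.replace("\n", "\\n"))

    def endElement(self, name):
        self.lines.append(")" + name)


def xml_to_pyx(text):
    """Return the PYX-style lines for a small XML document (sketch only)."""
    handler = PyxHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.lines
```

With one event per line, pulling out all character data is a matter of keeping the lines that start with "-", which is exactly the kind of job grep does well.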
PYX's main utility is in the processing of XML. There have been no serious challenges to the notion of the XML 1.0 syntax for purposes of interchange. It's generally at the processing or the document creation stage that alternative notations have their value: either as in PYX's case, to take advantage of existing infrastructure, or to make life easier and less error-prone for humans.
The simplicity or otherwise of XML schema languages has been a hot topic this year, and work in the area has yielded another instance of alternative syntax. RELAX NG ("relaxing") is the union of RELAX and TREX -- XML schema languages created by Murata Makoto and James Clark respectively -- now being developed under the aegis of an OASIS technical committee. RELAX NG offers a simpler approach than the W3C's XML Schema, though both technologies use an XML syntax (in fairness, RELAX NG also has a narrower scope than W3C XML Schema, aiming as it does at document validation only).
Not satisfied with RELAX NG's existing simplicity, James Clark recently posted an alternative, non-XML syntax for the schema language. Clark states that his main motivation was to improve the readability of schemas. The new syntax uses familiar constructs such as p1 | p2 to replace <choice> p1 p2 </choice>, and is at its most valuable in complex type declarations. In one example in his document, a 14-line definition is reduced to two lines, without introducing the kind of terse obscurity the Perl programming language is famous for.
Tim Berners-Lee has also been investigating non-XML syntaxes. It is often said that one of the main obstacles to the greater success of RDF over the last two years has been its syntax. It's certainly true that if you want to say something like "This person's name is Fred", it gets a bit painful as you have to write something like
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:p="http://www.example.org/personal_details#">
  <rdf:Description about="http://www.example.org/people#fred">
    <p:GivenName>Fred</p:GivenName>
  </rdf:Description>
</rdf:RDF>
The W3C has a strong culture of discussion and development using IRC, which is line-based. This seems to have been a contributing factor in the development of "Notation 3" (N3), a line-based syntax for RDF. In N3 the above example might look more like:
@prefix p: <http://www.example.org/personal_details#> .
<http://www.example.org/people#fred> p:GivenName "Fred" .
In his design note on N3, Berners-Lee describes it as "an academic exercise in language designed for a human-readable and scribblable language". As one of the obstacles to the deployment of widespread metadata is getting people to write it in the first place, it may be that a readily "scribblable" syntax can help.
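As a rough illustration of how little machinery the N3 example needs, the following Python sketch expands the prefixed names in it into full URIs and returns subject, predicate, object triples. It handles only the two statement forms shown in the example (@prefix declarations and simple triples with a quoted literal); it is an assumption-laden toy, not an N3 parser, and the names PREFIX, TRIPLE, expand and n3_triples are invented for this sketch.

```python
import re

# "@prefix p: <uri> ." declarations
PREFIX = re.compile(r"@prefix\s+(\w*):\s*<([^>]*)>\s*\.")
# A term is <uri>, prefix:local, or (object position only) a quoted literal.
TERM = r'<[^>]*>|\w*:\w+'
TRIPLE = re.compile(
    r'(' + TERM + r')\s+(' + TERM + r')\s+(' + TERM + r'|"[^"]*")\s*\.'
)


def expand(term, prefixes):
    """Turn <uri> or prefix:local into a full URI; leave literals quoted."""
    if term.startswith("<"):
        return term[1:-1]
    if term.startswith('"'):
        return term
    prefix, local = term.split(":", 1)
    return prefixes[prefix] + local


def n3_triples(text):
    """Return (subject, predicate, object) tuples for simple N3 statements."""
    prefixes = dict(PREFIX.findall(text))
    body = PREFIX.sub("", text)  # drop declarations before matching triples
    return [
        tuple(expand(term, prefixes) for term in match.groups())
        for match in TRIPLE.finditer(body)
    ]
```

Each resulting triple carries the same information as one property element in the RDF/XML form, which is what makes a mechanical translation between the two notations possible.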
Clark's RELAX NG syntax and N3 both have explicit translations to their XML representations and are intended for use before document interchange ever takes place. A problem XML programmers often face, however, is embedded non-XML syntax at the interchange stage itself. It's this difficulty that gives rise to complaints about the non-XML nature of SVG's paths, for instance.
From a design point of view, it just seems messy to have to parse twice: once for XML, and then once for a little language embedded in attribute or text content. Life would be easier if everything could appear as SAX events or DOM nodes. Rather than forcing XML syntax to a silly degree, Simon St. Laurent has come up with an interesting solution to this problem.
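The two-pass pattern looks something like the following Python sketch: an XML parser recovers the element and its d attribute, and a regular expression then tokenizes the path data inside it. The TOKEN pattern is a simplified assumption that covers command letters and plain numbers; it is not a complete implementation of the SVG path grammar, and path_tokens is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

# Either a single command letter or a (possibly signed) decimal number.
TOKEN = re.compile(r"[A-Za-z]|[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?")


def path_tokens(svg_fragment):
    """Parse twice: once for the XML, once for the embedded little language."""
    element = ET.fromstring(svg_fragment)   # first pass: XML syntax
    return TOKEN.findall(element.get("d"))  # second pass: path data


tokens = path_tokens('<path d="M6.913,17.905h2.499c1.741,0,4.816-0.449"/>')
```

Nothing in the first pass knows that the attribute value has structure of its own; that knowledge lives entirely in the second, hand-written step.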
In his Regular Fragmentations work, St. Laurent has created a SAX filter (code that can insert events into, or delete events from, the stream) that performs mappings from regular expressions into XML, projecting an XML structure onto textual content. For example, a date written <date>2001-08-29</date> could be translated into an expanded form, with separate child elements for the year, month, and day, for the processing application. Since this processing happens as part of the parsing process, the expanded form is never seen in the serialized XML. This technique has potential for simplifying XML processing, as it changes everything into a nail just so that we can bash it with our XML hammer.
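The general idea of such a filter can be sketched with Python's standard SAX classes. This is not St. Laurent's implementation: the class name DateFragmenter, the hard-coded date pattern, and the child element names year, month, and day are all assumptions made for illustration. The filter sits between the parser and the application, buffers the text inside a date element, and replaces it with child element events when the pattern matches.

```python
import io
import re
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl

DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})$")


class DateFragmenter(XMLFilterBase):
    """Hypothetical filter: expands <date> text into year/month/day events."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self._buffer = None  # holds text chunks while inside <date>

    def startElement(self, name, attrs):
        if name == "date":
            self._buffer = []
        super().startElement(name, attrs)

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)  # hold back text inside <date>
        else:
            super().characters(content)

    def endElement(self, name):
        if name == "date" and self._buffer is not None:
            text = "".join(self._buffer)
            self._buffer = None
            match = DATE.match(text.strip())
            if match:
                # Emit synthetic element events in place of the raw text.
                for child, value in zip(("year", "month", "day"), match.groups()):
                    super().startElement(child, AttributesImpl({}))
                    super().characters(value)
                    super().endElement(child)
            else:
                super().characters(text)  # no match: pass text through
        super().endElement(name)


def fragment(xml_text):
    """Run the filter and serialize the events the application would see."""
    out = io.StringIO()
    reader = DateFragmenter(xml.sax.make_parser())
    reader.setContentHandler(XMLGenerator(out, "utf-8"))
    reader.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()
```

The serialization in fragment is only there to make the event stream visible; in normal use the downstream handler would consume the year, month, and day events directly, and the document on disk would stay in its compact form.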
XML's syntax is its strongest asset. That doesn't mean, however, that we have to take the naïve approach to getting benefit from it and bludgeon everything into angle bracket armor. There are many times when data may not be ours to change or is simply better suited to a different syntax. Whether through little languages or translators, these alternative syntaxes can in fact strengthen XML by making it more usable and understandable.