Escaped Markup Considered Harmful

August 20, 2003

Norman Walsh

XML is pretty simple. There's plenty of complexity to be found if you go looking for it: if you want, for example, to validate or transform or query it. But elements and attributes in well formed combinations have become the basis for an absolutely astonishing array of projects.

Recently I've encountered a design pattern (or antipattern, in my opinion) that threatens the very foundation of our enterprise. It's harmful and it has to stop.

A Little Historical Background

It was not always obvious that the idea of well formed markup, and the draconian approach to error detection that the XML 1.0 Recommendation requires, was going to catch on. Well, it wasn't obvious to me, anyway; I won't speak for anyone else.

The technical merits of well formedness are easily understood. Well formedness allows a parser to discover, easily and unambiguously, the logical structure of an XML file. Before XML, we had SGML, which has all sorts of rules for markup minimization. They made writing a parser hard. Very hard. And SGML had its cousin HTML. Most HTML documents weren't really valid SGML documents; beyond all of the parsing issues that full SGML support would have brought, the applications that consumed HTML had completely ad hoc rules for recovering from markup errors.

All of this complexity for parser writers and ambiguity in interpretation of broken markup had a benefit. It made hand authoring of markup a lot easier. In SGML you could leave off the quotes around some attribute values, omit start and end tags in some places, and rely on the devious little SHORTREF mini-language, if you were so inclined. And in HTML, you could throw just about any tag soup at the browser and it'd do something. With a little random fiddling, you could probably get it to do something that looked right, at least in some browsers.

XML came along and said, "Nope. Too hard, too costly, too difficult. You're going to do your markup just like this, with almost no minimization, and if you don't get it exactly right, applications aren't allowed to recover from your errors. If it isn't well formed, it isn't XML."

And for a moment, we held our collective breath.

The moments ticked by. The right vendors agreed to support XML; the necessary folks in the user community looked at the possibility of a future where powerful applications were easy to write and agreed that the trade-off in markup ease was well made. XML passed the first, and perhaps most important, hurdle. It was off and running.

A few years later there are growing pains. We can all point to this specification or that one and claim that it would have been better if it'd been done some other way. But I think few would argue that it hasn't been a success story on the whole. As I said, now we've got an absolutely astonishing array of powerful, open, flexible, adaptable tools at our disposal.

And we have them because XML must be well formed.

Thus it came as a surprise to me when I discovered that the RSS folks were supporting a form of escaped markup. Webloggers often publish a list of their recent entries in RSS and online news sites often publish headlines with it. Like most XML technologies, there's enough flexibility in it to suit a much wider variety of purposes than I could conveniently summarize here.

Surprise became astonishment when I discovered that the folks working on the successor to RSS weren't going to explicitly outlaw this ugly hack. When I discovered that this hack was leaking into another XML vocabulary, FOAF, I became outright concerned.

What is Escaped Markup?

Escaped markup is just what it sounds like: markup that has been escaped so that it isn't markup anymore. If you write XML documents that have less-than signs or ampersands in content, you're already familiar with escaped markup.

In RSS, it often looks like this:

<description><![CDATA[
Some description of an article about
<a href="http://www.w3.org/TR/REC-xml">XML</a> that
contains a link and a <br> element.]]>
</description>

It is important to realize that this is precisely the same as:

<description>
Some description of an article about
&lt;a href="http://www.w3.org/TR/REC-xml"&gt;XML&lt;/a&gt; that
contains a link and a &lt;br&gt; element.
</description>

The notion that CDATA sections confer some special, literalist semantics on the escaped markup is incorrect. While it is technically possible for an application to distinguish which form of escaping was used, it would be wrong to establish meaning based on the form. To an XML application, CDATA escaping is generally indistinguishable from other forms of escaping.
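To make the equivalence concrete, here is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree parser. The two strings are shortened versions of the examples above, and the variable names are only illustrative. It checks that both forms produce the same text node and that neither produces any child elements.

```python
import xml.etree.ElementTree as ET

# The same content, escaped two different ways (shortened for illustration).
cdata_form = (
    '<description><![CDATA[A link to '
    '<a href="http://www.w3.org/TR/REC-xml">XML</a> and a <br> element.]]>'
    '</description>'
)
entity_form = (
    '<description>A link to '
    '&lt;a href="http://www.w3.org/TR/REC-xml"&gt;XML&lt;/a&gt;'
    ' and a &lt;br&gt; element.'
    '</description>'
)

cdata_elem = ET.fromstring(cdata_form)
entity_elem = ET.fromstring(entity_form)

# Both parse to one text node; the "a" and "br" tags are just characters.
same_text = cdata_elem.text == entity_elem.text
child_counts = (len(cdata_elem), len(entity_elem))
print(same_text, child_counts)
```

A parser that reports its results this way gives the consuming application no hint about which form of escaping the author used.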

Now there's nothing wrong with escaped markup, as long as it means what it says. Namely:

Some description of an article about <a href="http://www.w3.org/TR/REC-xml">XML</a> that contains a link and a <br> element.

But, perversely, most RSS applications render that markup like this:

Some description of an article about XML that contains a link and a
element.

A convention has developed that says the contents of at least some and perhaps all elements in RSS are "unescaped" and then rendered. This opens a horrible back door in the whole XML markup picture.
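The back door can be sketched in a few lines of Python. This is an assumption about what a typical consumer does, not a description of any particular RSS tool: the XML parser reports a single text node, and the application then hands that text to an HTML parser (here the standard library's html.parser, with a hypothetical TagCollector helper) that finds markup the XML parser never saw.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Hypothetical helper: records the start tags found in a string."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


item = ET.fromstring(
    '<description>A link to '
    '&lt;a href="http://www.w3.org/TR/REC-xml"&gt;XML&lt;/a&gt;'
    ' and a &lt;br&gt; element.</description>'
)

# As far as XML is concerned, the element has no children at all.
xml_children = len(item)

# "Unescape and render": the text node is reparsed as HTML.
collector = TagCollector()
collector.feed(item.text)
collector.close()
hidden_tags = collector.tags
```

Nothing in the XML layer (a schema, a DTD, an XPath query, a stylesheet) can see or constrain the tags collected in hidden_tags.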

Escaped Markup Doesn't Work

There appear to be two arguments in favor of escaped markup:

  1. Aggregators are using XML, in the form of RSS, to combine data sources. Aggregators are tools or companies that build RSS feeds from a wide variety of sources. You might, for example, subscribe to a feed that shows the top ten news stories from selected major news outlets.

    The aggregators might argue that they're just using XML as a transport protocol and have no control over the actual content. The content that they're aggregating may or may not be well formed so they have to do something with the markup. There's a further argument that they don't have any interest in the actual content, that it's just shuffled off to some other application for rendering, and that it's better and more efficient to store the content as opaque text nodes.

    I don't think these arguments come close to justifying the solution that's been adopted:

    1. Escaping markup, particularly with CDATA sections, just doesn't work. There are other things that might be wrong that would make the documents not well formed. There are Unicode characters that are forbidden, there are encoding issues for the characters that are allowed, and there are sequences of characters that must be avoided (e.g., "]]>"). Not to mention the fact that CDATA sections don't nest.

    2. There are better ways of escaping content. First of all, if the content you encounter is well formed XML, no escaping is necessary. If it isn't well formed XML, then it must be HTML. No application is allowed to accept a document that purports to be XML but is not well formed. There are well understood ways to turn HTML into XHTML (or well formed XML). I'd even prefer stripping all the markup entirely to this escaped markup "solution".

      The argument about opacity doesn't fly either. Just because some applications don't care about the content of the aggregated feed is a poor excuse for putting it inside a black box that can't be opened by any rational XML application.

    3. If it's really important to escape the markup, if it's impractical to convert it to well formed XML, or if the cost of parsing the nested markup is too high, use base64 encoding.

      That would have two distinct advantages: first, it would actually work, which is always a nice feature, since it would handle arbitrary characters; second, it would very clearly not be a format designed for human authoring.

      I think the most dangerous part of this whole escaped markup kludge is that it encourages naive authors and programmers to adopt this style in other applications.

  2. Escaped markup allows authors to put HTML and other content into elements where the schema or DTD says that only text is allowed.

    I'm sorry: an obvious, compelling, and irrefutable argument against allowing escaped markup is that it allows authors to put HTML and other content into elements where the schema or DTD says that only text is allowed.
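The failure of CDATA wrapping and the base64 alternative, both raised under the first argument above, can be sketched as follows. This is a minimal illustration in Python, assuming the standard library's xml.etree.ElementTree and base64 modules; the payload and the two wrapper functions are hypothetical. The payload contains the sequence "]]>" and a form feed (U+000C), a character that XML 1.0 forbids.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical tag soup containing "]]>" and a forbidden control character.
payload = 'tag soup <br> if (a[b[0]]>1) \x0c'


def wrap_in_cdata(text):
    # Naive escaping: assumes the text never contains "]]>".
    return '<description><![CDATA[' + text + ']]></description>'


def wrap_in_base64(text):
    # Base64 output uses only characters that are safe in XML content.
    encoded = base64.b64encode(text.encode('utf-8')).decode('ascii')
    return '<description>' + encoded + '</description>'


# The CDATA section ends at the first "]]>", so the rest of the payload
# spills into the document and the result is not well formed.
try:
    ET.fromstring(wrap_in_cdata(payload))
    cdata_ok = True
except ET.ParseError:
    cdata_ok = False

# The base64 form parses cleanly and round-trips every character.
element = ET.fromstring(wrap_in_base64(payload))
decoded = base64.b64decode(element.text).decode('utf-8')
```

The encoded form is also plainly opaque: nobody will mistake it for something to author by hand or to render by "unescaping".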

Escaped Markup Is Harmful

The idea of escaping markup goes against the fundamental grain of XML. If this hack spreads to other vocabularies, we'll very quickly find ourselves mired in the same bugward-compatible tag soup from which we have struggled so hard to escape.

And evidence suggests that it's already spreading. Not long ago, the question of escaped markup turned up in the context of FOAF. The FOAF specification condones no such nonsense, but one of the blogging tools that produces FOAF reacted to a user's insertion of HTML markup into the "bio" element by escaping it. The tool vendor in question was quickly persuaded to fix this bug.

Escaped Markup Must Stop

There is clear evidence that the escaped markup design will spread if it isn't checked. If it spreads far enough before it's caught, it will become legacy. Some vendors will be forced to continue to support this abomination by simple economics. And it won't be their fault; it'll be ours for not killing the virus before it could spread.