Escaped Markup Considered Harmful

by Norman Walsh
August 20, 2003

XML is pretty simple. There's plenty of complexity to be found if you go looking for it: if you want, for example, to validate or transform or query it. But elements and attributes in well formed combinations have become the basis for an absolutely astonishing array of projects.

Recently I've encountered a design pattern (or antipattern, in my opinion) that threatens the very foundation of our enterprise. It's harmful and it has to stop.

A Little Historical Background

It was not always obvious that the idea of well formed markup, and the draconian approach to error detection that the XML 1.0 Recommendation requires, was going to catch on. Well, it wasn't obvious to me, anyway; I won't speak for anyone else.

The technical merits of well formedness are easily understood. Well formedness allows a parser to discover, easily and unambiguously, the logical structure of an XML file. Before XML, we had SGML, which has all sorts of rules for markup minimization. They made writing a parser hard. Very hard. And SGML had its cousin HTML. Most HTML documents weren't really valid SGML documents; beyond all of the parsing issues that full SGML support would have brought, the applications that consumed HTML had completely ad hoc rules for recovering from markup errors.

All of this complexity for parser writers and ambiguity in interpretation of broken markup had a benefit. It made hand authoring of markup a lot easier. In SGML you could leave off the quotes around some attribute values, omit start and end tags in some places, and rely on the devious little SHORTREF mini-language, if you were so inclined. And in HTML, you could throw just about any tag soup at the browser and it'd do something. With a little random fiddling, you could probably get it to do something that looked right, at least in some browsers.

XML came along and said, "Nope. Too hard, too costly, too difficult. You're going to do your markup just like this, with almost no minimization, and if you don't get it exactly right, applications aren't allowed to recover from your errors. If it isn't well formed, it isn't XML."

And for a moment, we held our collective breath.

The moments ticked by. The right vendors agreed to support XML, and the necessary folks in the user community looked at the possibility of a future where powerful applications were easy to write and agreed that the trade-off in markup ease was well made. XML passed the first, perhaps most important, hurdle. It was off and running.

A few years later there are growing pains. We can all point to this specification or that one and claim that it would have been better if it'd been done some other way. But I think few would argue that it hasn't been a success story on the whole. As I said, now we've got an absolutely astonishing array of powerful, open, flexible, adaptable tools at our disposal.

And we have them because XML must be well formed.

Thus it came as a surprise to me when I discovered that the RSS folks were supporting a form of escaped markup. Webloggers often publish a list of their recent entries in RSS and online news sites often publish headlines with it. Like most XML technologies, there's enough flexibility in it to suit a much wider variety of purposes than I could conveniently summarize here.

Surprise became astonishment when I discovered that the folks working on the successor to RSS weren't going to explicitly outlaw this ugly hack. When I discovered that this hack was leaking into another XML vocabulary, FOAF, I became outright concerned.

What is Escaped Markup?

Escaped markup is just what it sounds like: markup that has been escaped so that it isn't markup anymore. If you write XML documents that have less-than signs or ampersands in content, you're already familiar with escaped markup.

In RSS, it often looks like this:

<description><![CDATA[
Some description of an article about
<a href="http://www.w3.org/TR/REC-xml">XML</a> that
contains a link and a <br> element.]]>
</description>

It is important to realize that this is precisely the same as:

<description>
Some description of an article about
&lt;a href="http://www.w3.org/TR/REC-xml"&gt;XML&lt;/a&gt; that
contains a link and a &lt;br&gt; element.
</description>

The notion that CDATA sections confer some special, literalist semantics on the escaped markup is incorrect. While it is technically possible for an application to distinguish which form of escaping was used, it would be wrong to establish meaning based on the form. CDATA escaping is generally indistinguishable from other forms of escaping.
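As a minimal sketch of that equivalence, assuming Python's standard xml.etree.ElementTree parser (a choice made here for illustration, not something the article relies on), both forms can be parsed and compared. The variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The CDATA form from the example above.
cdata_form = """<description><![CDATA[
Some description of an article about
<a href="http://www.w3.org/TR/REC-xml">XML</a> that
contains a link and a <br> element.]]>
</description>"""

# The same content escaped with entity references.
entity_form = """<description>
Some description of an article about
&lt;a href="http://www.w3.org/TR/REC-xml"&gt;XML&lt;/a&gt; that
contains a link and a &lt;br&gt; element.
</description>"""

cdata_elem = ET.fromstring(cdata_form)
entity_elem = ET.fromstring(entity_form)

# The parser reports the same character data for both forms.
print(cdata_elem.text == entity_elem.text)

# Neither element has child elements: the "markup" is only text.
print(len(cdata_elem), len(entity_elem))
```

With this parser, the application cannot tell which form of escaping the author used, and in both cases the a and br "tags" arrive as ordinary characters rather than as elements.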

Now there's nothing wrong with escaped markup, as long as it means what it says. Namely:

Some description of an article about <a href="http://www.w3.org/TR/REC-xml">XML</a> that contains a link and a <br> element.

But, perversely, most RSS applications render that markup like this:

Some description of an article about XML that contains a link and a
element.

A convention has developed that says the contents of at least some and perhaps all elements in RSS are "unescaped" and then rendered. This opens a horrible back door in the whole XML markup picture.
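To make the back door concrete, here is a rough sketch of what an "unescape and render" consumer ends up doing. It assumes Python's standard library; the names payload, is_well_formed, and TagCollector are hypothetical. The text that comes out of the description element is not well formed XML (the br is never closed), so it can only be interpreted by a second, forgiving tag-soup parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The character data an XML parser hands back for the description above.
payload = ('Some description of an article about '
           '<a href="http://www.w3.org/TR/REC-xml">XML</a> that '
           'contains a link and a <br> element.')


def is_well_formed(fragment):
    """Return True if the fragment parses as XML inside a wrapper element."""
    try:
        ET.fromstring("<wrapper>" + fragment + "</wrapper>")
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Collect start tag names the way a lenient HTML consumer would."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(payload)
collector.close()

print(is_well_formed(payload))  # False: <br> has no end tag
print(collector.tags)           # ['a', 'br']
```

The second parse happens outside anything the XML toolchain can see or check, which is the sense in which the markup has been smuggled past it.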

Escaped Markup Doesn't Work

There appear to be two arguments in favor of escaped markup:

  1. Aggregators are using XML, in the form of RSS, to combine data sources together. Aggregators are tools or companies that build RSS feeds for a wide variety of sources. You might, for example, subscribe to a feed that shows the top ten news stories from selected major news outlets.

    The aggregators might argue that they're just using XML as a transport protocol and have no control over the actual content. The content that they're aggregating may or may not be well formed so they have to do something with the markup. There's a further argument that they don't have any interest in the actual content, that it's just shuffled off to some other application for rendering, and that it's better and more efficient to store the content as opaque text nodes.

    I don't think these arguments come close to justifying the solution that's been adopted:

    1. Escaping markup, particularly with CDATA sections, just doesn't work. There are other things that might be wrong that would make the documents not well formed. There are Unicode characters that are forbidden, there are encoding issues for the characters that are allowed, and there are sequences of characters that must be avoided (e.g., "]]>"). Not to mention the fact that CDATA sections don't nest.

    2. There are better ways of escaping content. First of all, if the content you encounter is well formed XML, no escaping is necessary. If it isn't well formed XML, then it must be HTML. No application is allowed to accept a document that purports to be XML but is not well formed. There are well understood ways to turn HTML into XHTML (or well formed XML). I'd even prefer stripping all the markup entirely to this escaped markup "solution".

      The argument about opacity doesn't fly either. Just because some applications don't care about the content of the aggregated feed is a poor excuse for putting it inside a black box that can't be opened by any rational XML application.

    3. If it's really important to escape the markup, if it's impractical to convert it to well formed XML, or the penalty of parsing the nested markup is too expensive, use base64 encoding.

      That would have two distinct advantages: first, it would actually work, which is always a nice feature, since it would handle arbitrary characters; second, it would very clearly not be a format designed for human authoring.

      I think the most dangerous part of this whole escaped markup kludge is that it encourages naive authors and programmers to adopt this style in other applications.

  2. Escaped markup allows authors to put HTML and other content into elements where the schema or DTD says that only text is allowed.

    I'm sorry: an obvious, compelling, and irrefutable argument against allowing escaped markup is that it allows authors to put HTML and other content into elements where the schema or DTD says that only text is allowed.
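The points above about "]]>" and base64 can be sketched as follows, assuming Python's standard library. The sample content, the helper parses, and the encoding="base64" attribute are all hypothetical; no RSS specification is being quoted here.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical payload: it contains "]]>", a bare "<", and a form feed
# (U+000C), which is not a legal character in XML 1.0.
content = 'if (a[b[0]]>1) { write("<b>"); }\x0c'


def parses(document):
    """Return True if the document is well formed according to the parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Naive CDATA wrapping: the section ends at the first "]]>" in the content,
# so the rest of the payload spills out as broken markup.
naive = "<description><![CDATA[" + content + "]]></description>"
naive_ok = parses(naive)

# Base64 uses only characters that are always safe in XML character data.
encoded = base64.b64encode(content.encode("utf-8")).decode("ascii")
elem = ET.Element("description", {"encoding": "base64"})  # attribute name is made up
elem.text = encoded
document = ET.tostring(elem, encoding="unicode")

# The consumer decodes the text and gets the original bytes back.
round_tripped = base64.b64decode(ET.fromstring(document).text).decode("utf-8")

print(naive_ok)
print(round_tripped == content)
```

The trade-off is the one described above: the base64 form survives arbitrary characters, and nobody would mistake it for something meant to be authored by hand.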

Escaped Markup Is Harmful

The idea of escaping markup goes against the fundamental grain of XML. If this hack spreads to other vocabularies, we'll very quickly find ourselves mired in the same bugward-compatible tag soup from which we have struggled so hard to escape.

And evidence suggests that it's already spreading. Not long ago, the question of escaped markup turned up in the context of FOAF. The FOAF specification condones no such nonsense, but one of the blogging tools that produces FOAF reacted to a user's insertion of HTML markup into the "bio" element by escaping it. The tool vendor in question was quickly persuaded to fix this bug.

Escaped Markup Must Stop

There is clear evidence that the escaped markup design will spread if it isn't checked. If it spreads far enough before it's caught, it will become legacy. Some vendors will be forced to continue to support this abomination by simple economics. And it won't be their fault, it'll be ours for not killing the virus before it could spread.


  • That's not always true.
    2005-10-18 08:10:12 DCGregoryA

    As an example, we use the built-in documenting features of Visual Studio to keep duplicate copies of our updated source code. Wrapping a CDATA section around it is the only effective way to handle it, because if you use HTML escape tags they will be broken when it's rendered and displayed.


    Here's an example, I have some code that creates an HTML string that has a link to a javascript function that performs some action. It might look like :


    html += "(less than sign)a href=", etc etc etc, where the less than sign is an actual less than sign. When I put that into the example/code section, to be written as XML, I'd have to escape it so as to not form bad XML. If I use the HTML markup tags, when those are displayed in my HTML documents, it's going to show the link instead of showing that I'm constructing a link. In that case, I want to ensure the documenting tools we use know that's pure data.


    That's just one simple example. Considering the variety of uses for XML, I'm sure that when you're talking about transmitting documents and datasets as XML strings in an XML document that may have many other elements using XML strings, CDATA is a good way to show that. Using some ambiguous markup definition that isn't standardized is an ineffective and largely unnecessary way of doing things.

  • We are pawns in this horrible game
    2005-06-21 08:41:10 pgregg

    To all the people demanding a workaround to escaping: There is nothing wrong with CDATA. It is extremely useful, as long as the XML parser considers it nothing more than syntactic shorthand for &-escaping. I don't believe this article was railing against XML writers who do escaping normally. It is calling out the XML vendors who support lazy XML authors who didn't even want to bother putting their garbage inside CDATA. Now XML application vendors have to support pre-emptive escaping. That leaves the rules around CDATA totally up in the air.


    This creates huge problems, like the one I'm having right now. The XML parser on our test web server considers CDATA as simple escaping shorthand (correctly!). However, the XML parser on our new production server, for whatever reason, insists on escaping twice no matter what -- even when I set disable-output-escaping="yes". So even though I put <![CDATA[&nbsp;&nbsp;]]> in my XSLT file, it always gets rendered in the browser as the literal text &nbsp;&nbsp;, not the simple two spaces that it ought to represent in HTML source. The only workaround I have now to do escaping is the longhand way: <xsl:text disable-output-escaping="yes">&nbsp;&nbsp;</xsl:text> . That's only 10 times as many characters. So basically because of excessively lazy XML app vendors who sympathized with excessively lazy XML authors, now I can't even use CDATA. They've ruined it for all of us.


    There is no workaround -- you just have to keep doing what you're doing, and we all must hope that XML vendors withdraw from supporting this false laziness. See also http://carey.geek.nz/doc/xslt-cdata-escaping/ for additional commentary on what CDATA is really for.

  • One more time
    2003-09-21 17:01:24 Norman Walsh

    I thought I'd been pretty clear about what I thought the alternatives were. In brief:


    http://norman.walsh.name/2003/09/18/unescmarkup


  • Hmmm
    2003-09-16 11:49:31 Julian Bond

    After re-reading your article and additional comments I think I need to explain a little more.


    I *like* html in the RSS description section. I don't think I'm alone. I, like many other RSS authors, *want* to put html in the description field. If the spec, DTD or Schema says this field should be text only then it's wrong in the sense that it doesn't allow for common practice or what people want to use it for.


    So now we have a problem. How to embed arbitrary and possibly malformed html and character codes into a field in an XML document? Ideally, I would just use a toolkit that cleaned it up and applied appropriate encoding so that I could get on and write application function. Unfortunately for my particular poison of PHP, I've never found such a toolkit. The alternative would be to have an extension to the spec that provided for an attachment suitably encoded.


    I'm willing to accept the premise that escaped embedded markup is harmful. But I need an alternative that provides a solution... Have you any suggestions?

  • Follow up online
    2003-09-16 06:52:07 Norman Walsh

    I wrote some follow-up based on these comments.
    See http://norman.walsh.name/2003/09/16/escmarkup


  • Instead of embedding?
    2003-09-05 06:28:15 Richard Pinneau

    I don't want to embed markup. But I do need to achieve in-line effects and I'm not *getting* any workaround from this article. Here's the deal:
    Ok, I'm a complete newbie to XML. I've been fascinated by its potential, but I can't see how to begin practicing with its power for the kind of rich text (RichTextFormat) content that I store in FileMakerPro, at least not without escaping embedded markup, so this article sounded most interesting. However, it leaves me crippled and unable to see how to do what I want to accomplish. Maybe somebody can enlighten me.


    Suppose I've got a db for which a sample .txt export of a record might look like:


    "Smith, Ron A","Understanding XML","2001","I really found this book helpful"


    I understand that this could be in a XML db as:


    <auth>Smith, Ron A</auth><tit>Understanding XML</tit><dt>2001</dt><mynotes>I REALLY found this book helpful</mynotes>


    I don't want to make the word 'REALLY' into another *FIELD* in a database (XML or otherwise). I want it to be displayed as italic when it is displayed (e.g., in a browser).


    I'd be sold on XML if someone could show me what is the PROPER way of doing this in XML - and ESPECIALLY how to CONCEPTUALIZE what I'm supposed to be doing with this.


    Remember: I don't want *another field* (I'd like to be able to move this back and forth between XML db and MySQL and FMPro) - I want to have a way to STYLIZE, etc. (italic, emphasize, embolden, etc) text WITHIN a FIELD.


    Obviously I'm just not "getting" XML, or else those who build XML dbs think that one should not desire to stylize text?
    Thanks for enlightening a total newbie.

  • Great idea in theory
    2003-08-25 04:01:56 Jukka-Pekka Keisala

    I also hate CDATA blobs, but working in content management systems I find it very hard to do it any other way when the people who write content into the page don't really care whether tags are closed or not. So CDATA saves me from malformed Word HTML and the output of other WYSIWYG editors.


  • Sweet!
    2003-08-25 03:04:10 Oleg Tkachenko

    +1, it's definitely a virus and must be stopped.

  • What then?
    2003-08-21 13:03:42 John Vance

    Can someone please give me a pointer on *how* to embed html-elements in RSS otherwise?


    A subset of tags should be at least supported.


    Also most sites use HTML 4 and don't want to switch to XHTML so they don't lose support for older browsers. "<br/>" is not acceptable in these environments. (For example "Server-on-desktop" News Aggregators, which have to incorporate news supplied via RSS in a desktop-served HTML4 website).


    So embedding mark-up by having 2 levels, a "surface structure" (RSS) and a "deep structure" (HTML4 embedded), is the only way to solve these problems?


    Any ideas out there?

    • What then?
      2003-08-22 08:18:55 Joel Bennett

      the whole point of xhtml (1.0 transitional, anyway) is that it's supposed to be backwards compatible. If you use <br /> (note the space) I don't see what browsers are going to give you trouble with that.


      The fact is, it's ridiculous to try to use XML just to carry old-style html. If you feel you must put formatted code in there, you need to use a namespace, and valid xhtml. If you think you need to use old html 4.01 ... then you haven't put enough thought into what your RSS feed is for.


      I gotta agree that the way to fix this is for the aggregators to parse it 'right' ... when lockergnome starts getting more emails about the strange html entities in their xml feeds, maybe then they'll fix it.

  • What then?
    2003-08-21 13:01:10 John Vance

    Can someone please give me a pointer on *how* to embed html-elements in RSS otherwise?


    A subset of tags (<span>, ...) should be at least supported.


    Also most sites use HTML 4 and don't want to switch to XHTML so they don't lose support for older browsers. "<br/>" is not acceptable in these environments. (For example "Server-on-desktop" News Aggregators, which have to incorporate news supplied via RSS in a desktop-served HTML4 website).


    So embedding mark-up by having 2 levels, a "surface structure" (RSS) and a "deep structure" (HTML4 embedded), is the only way to solve these problems?


    Any ideas out there?



  • So how about some alternatives
    2003-08-21 01:20:35 Julian Bond

    I'm not arguing that embedded markup isn't wrong. I'm looking for an alternative. You write "The aggregators might argue that they're just using XML as a transport protocol and have no control over the actual content." This is a completely valid argument isn't it? If we consider RSS as purely a transport mechanism and that we actually want the markup to come out the other end intact, then what you're really arguing is that we shouldn't use XML as a transport mechanism. Exactly the same problem occurs with more formal XML-based transport mechanisms such as xmlrpc or SOAP.


    And let's not forget that RSS is the single most successful XML format ever and escaping html in the description tag has been with us since the beginning. It may be wrong, it may be ugly and it may well be harmful, but dammit, it works.


    Now look at it from the POV of the humble website developer building a community news site. You have limited control over what the users type in. You have to jump through hoops to do your best to produce well formed XML. Once in a while it goes wrong. Now who's to blame? The developer? The RSS specs? XML?


    And last time I looked CDATA was actually part of the XML spec. So the feed still validates as XML. So what is your problem exactly?


    And btw, FOAF is not an XML vocabulary, it's an RDF vocabulary. ;-)

  • what i do
    2003-08-21 01:02:44 bryan rasmussen

    I have to accept escaped markup that comes in, of course, but I do not render it as markup; I strip it out and then accept the text.


    If enough people do this then the utility of escaped markup disappears, and instead what will happen is namespaced xhtml inside these elements.


    This is already happening with some of the more technical sites providing rss feeds, the owners of which have realized the problems involved with said escaping.




  • escaped markup unavoidable
    2003-08-21 00:55:59 bryan rasmussen

    The problem of course is that the possibility of escaped markup is unavoidable, any text format has to provide a method for escaping characters meaningful to that format. It does however seem rather strange that with the xml format so many people are just wildly running around escaping the format, and then reformatting the escaped. Can anyone point to any other format where this has been a common problem?
    I can't think of any.