XML.com: XML From the Inside Out


Handling Atom Text and Content Constructs


December 07, 2005

The Atom Syndication Format (RFC 4287) came about partly for social reasons and partly for technical ones. The social reasons came down to the difficulty of reconciling the factions behind existing web feed formats. A key technical reason is that those formats were neither clear nor rigorous about how embedded content and human-readable text should be specified and interpreted. Atom fixes this deficiency, which makes life easier for those writing processing code, but it also means you should clearly understand the rules governing such constructs and, ideally, adopt reusable libraries for the purpose. In this article I discuss the forms of text and content constructs available in Atom and in recognized extensions, and how to process them.

Text and Content Representation Options

Atom 1.0 defines text constructs and content constructs. The Atom spec says:

A Text construct contains human-readable text, usually in small quantities. The content of Text constructs is Language-Sensitive.

Text constructs are limited in allowed representation and are used for the following Atom elements:

  • title
  • subtitle
  • summary
  • rights

Content constructs are used only in content elements. There are no limits to the allowed representation (as long as the well-formedness of the Atom document is not compromised).

Text Constructs

The simplest possible form of text construct is exemplified by the title in listing 1.

Listing 1: Default form of text construct

<title>One bold foot forward</title>

This is simply a convenient abbreviation of the form in listing 2, and Atom processors must treat listings 1 and 2 identically.

Listing 2: Explicitly unmarked-up plain text construct

<title type="text">One bold foot forward</title>
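The defaulting rule is easy to capture in code. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name text_construct_type is my own invention for illustration, and the Atom namespace is omitted to match the bare listings above (a real feed would declare it).

```python
import xml.etree.ElementTree as ET

def text_construct_type(element):
    """Return the declared type of a text construct, defaulting to "text"."""
    return element.get("type", "text")

# Listings 1 and 2, parsed as standalone fragments
listing_1 = ET.fromstring("<title>One bold foot forward</title>")
listing_2 = ET.fromstring('<title type="text">One bold foot forward</title>')

for title in (listing_1, listing_2):
    # Both iterations report the same type and the same character data
    print(text_construct_type(title), repr(title.text))
```

A processor that funnels every text construct through a helper like this never needs a special case for the missing attribute.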

This is plain text with no markup. Child elements are not allowed, and you should not tunnel markup through escaping either. Atom does not strictly prohibit the form in listing 3, but it does violate the spirit of the specification. The problem is that an Atom processor should never second-guess the meaning of the type attribute. Because listing 3 implicitly uses type="text", a conforming processor will present the tags as literal characters rather than interpret them as markup, which defeats the evident intent of the example.

Listing 3: Bogus (unsignalled) encoded markup in plain text construct

<title>One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>
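To see what a processor that respects the type attribute does with listing 3, consider this sketch in Python (standard library only). The function name render_plain_text is hypothetical; it stands in for whatever a processor does when it copies a type="text" construct into an HTML page.

```python
import html
import xml.etree.ElementTree as ET

listing_3 = ET.fromstring(
    "<title>One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>"
)

def render_plain_text(element):
    # type="text" means the content is literal characters, so it must be
    # escaped again before being placed in an HTML page.
    return html.escape(element.text or "")

# What the XML parser hands to the processor
print(listing_3.text)                # One <strong>bold</strong> foot forward
# What the processor writes into its HTML output
print(render_plain_text(listing_3))  # One &lt;strong&gt;bold&lt;/strong&gt; foot forward
```

The reader ends up looking at the angle brackets themselves, not at a bold word.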

If you do want to embed HTML markup as in listing 3, you should signal this fact by using type="html", as in listing 4.

Listing 4: Signalled, encoded markup in text construct

<title type="html">One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>

You can use a CDATA section to express the exact same Atom form as in listing 4, as illustrated in listing 5.

Listing 5: Signalled, encoded markup in text construct using CDATA sections

<title type="html"><![CDATA[One <strong>bold</strong> foot forward]]></title>
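A quick way to convince yourself that listings 4 and 5 are the same thing to an XML processor is to parse both and compare the results. This is a small sketch with Python's xml.etree.ElementTree; the variable names are just for illustration.

```python
import xml.etree.ElementTree as ET

listing_4 = ET.fromstring(
    '<title type="html">One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>'
)
listing_5 = ET.fromstring(
    '<title type="html"><![CDATA[One <strong>bold</strong> foot forward]]></title>'
)

# The parser reports identical character data for both forms
print(listing_4.text == listing_5.text)  # True
# Neither form has any child elements: the "markup" is only text
print(len(listing_4), len(listing_5))    # 0 0
```

In both cases the processor receives a flat string and must run an HTML parser over it as a second step before it can do anything with the strong element.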

Listings 4 and 5 are perfectly valid Atom, but such escaping makes the embedded markup a second-class citizen and complicates processing (more on this later). Some people believe that using CDATA sections, as in listing 5, skirts these issues, but CDATA sections are nothing but syntactic sugar: an XML parser reports exactly the same character data for listings 4 and 5, so the core semantic issues of escaped markup remain. If possible, I advise you to use the final form of text construct when you wish to embed markup. Rather than tunnelling the markup into encoded text, you can use XHTML directly within the construct by using type="xhtml", as in listing 6.

Listing 6: XHTML text construct

<title type="xhtml">
  <div xmlns="http://www.w3.org/1999/xhtml">
    One <strong>bold</strong> foot forward
  </div>
</title>

Yes, you must wrap the content in an XHTML div, and all that. This makes listing 6 a bit cumbersome and verbose, but it more than makes up for these shortcomings by offering a very clean layering of XML vocabularies, both of which you can be sure are not tag soup. The overhead is likely to be less imposing if you use XHTML text constructs with the typically longer content in summary. Also, if you prefer, you can declare the XHTML namespace once, on the Atom feed element, and then use the appropriate prefix (or default namespace) for all the XHTML, as in listing 7.

Listing 7: XHTML text construct using top-level namespace declaration

<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:xh="http://www.w3.org/1999/xhtml">
<title type="xhtml">
  <xh:div>One <xh:strong>bold</xh:strong> foot forward</xh:div>
</title>
</feed>

Of course, you could choose to use a prefix for Atom elements and make XHTML the default namespace, but this feels a bit backwards considering that Atom is the host vocabulary. Properly implemented processors won't care one way or another. Keeping all the namespace declarations at the top level is actually a good practice in itself, so you might consider always using the form in listing 7, at the cost of having to use prefixes on many elements.
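The claim that namespace-aware processors do not care where or how the XHTML namespace is declared can be checked directly. The sketch below, again using Python's xml.etree.ElementTree, parses one feed with a local default-namespace declaration on the div and one with a top-level prefix declaration. The helper names expanded_names and xhtml_text are hypothetical, and the feeds are cut down to a single title for brevity, so they are not complete Atom documents.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
XHTML = "http://www.w3.org/1999/xhtml"

# XHTML namespace declared locally, as a default namespace on the div
local_declaration = f"""<feed xmlns="{ATOM}">
  <title type="xhtml">
    <div xmlns="{XHTML}">One <strong>bold</strong> foot forward</div>
  </title>
</feed>"""

# XHTML namespace declared once on the feed element, used with a prefix
top_level_declaration = f"""<feed xmlns="{ATOM}" xmlns:xh="{XHTML}">
  <title type="xhtml">
    <xh:div>One <xh:strong>bold</xh:strong> foot forward</xh:div>
  </title>
</feed>"""

def expanded_names(document):
    """List every element name in {namespace-uri}local-name form."""
    return [element.tag for element in ET.fromstring(document).iter()]

def xhtml_text(document):
    """Return the text inside the title's XHTML div, ignoring inline tags."""
    div = ET.fromstring(document).find(f"{{{ATOM}}}title/{{{XHTML}}}div")
    return "".join(div.itertext())

# Prefixes vanish once names are expanded, so both feeds look the same
print(expanded_names(local_declaration) == expanded_names(top_level_declaration))
print(xhtml_text(top_level_declaration))
```

Code written against expanded names, as here, keeps working whichever prefix convention a feed producer picks.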
