Handling Atom Text and Content Constructs
The Atom Syndication Format (RFC 4287) came about in part for social reasons and in part for technical reasons. The social reasons came down to difficulties reconciling factions of existing web feed formats. One of the key technical reasons is that existing web feed formats were not clear and rigorous in specifying rules for and interpretation of embedded content and human-readable text. Atom fixes this deficiency, making things easier for those writing processing code, but it also means you should clearly understand the rules governing such constructs, and, ideally, adopt reusable libraries for the purpose. In this article I discuss the forms of text and content constructs available in Atom, and in recognized extensions, and how to process them.
Text and Content Representation Options
Atom 1.0 defines text constructs and content constructs. The Atom spec says:
A Text construct contains human-readable text, usually in small quantities. The content of Text constructs is Language-Sensitive.
Text constructs are limited in allowed representation and are used for the following Atom elements: title, subtitle, summary, and rights.
Content constructs are used only in content elements. There are no limits to the allowed representation (as long as the well-formedness of the Atom document is not compromised).
The simplest possible form of text construct is exemplified by the title in listing 1.
Listing 1: Default form of text construct
<title>One bold foot forward</title>
This is simply a convenient abbreviation of the form in listing 2, and Atom processors must treat listings 1 and 2 identically.
Listing 2: Explicitly unmarked-up plain text construct
<title type="text">One bold foot forward</title>
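The equivalence of listings 1 and 2 can be sketched in a few lines of Python using the standard library's ElementTree. This is only an illustration, not a full Atom processor: the helper name text_construct is my own, and the fragments omit the Atom namespace declaration just as the listings do.

```python
import xml.etree.ElementTree as ET

def text_construct(xml_source):
    """Return (type, text) for a text construct, treating a missing
    type attribute as "text" (hypothetical helper for illustration)."""
    elem = ET.fromstring(xml_source)
    return elem.get("type", "text"), elem.text

listing_1 = "<title>One bold foot forward</title>"
listing_2 = '<title type="text">One bold foot forward</title>'

# Both forms should yield the same type and the same text.
print(text_construct(listing_1) == text_construct(listing_2))
```

Applying the default at the point where the attribute is read means the rest of the code never has to distinguish between the two forms.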
This is unmarked-up plain text content. No actual child elements are allowed, and you should not tunnel markup through escaping either. Atom does not strictly prohibit the form in listing 3, but it does violate the spirit of the specification. The problem is that an Atom processor should never second-guess the meaning of the type attribute, and since listing 3 implicitly uses type="text", a processor will not interpret the contents as markup, as was intended for the example.
Listing 3: Bogus (unsignalled) encoded markup in plain text construct
<title>One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>
If you do want to embed HTML markup as in listing 3, you should signal this fact by using type="html", as in listing 4.
Listing 4: Signalled, encoded markup in text construct
<title type="html">One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>
You can use a CDATA section to express the exact same Atom form as in listing 4, as illustrated in listing 5.
Listing 5: Signalled, encoded markup in text construct using a CDATA section
<title type="html"><![CDATA[One <strong>bold</strong> foot forward]]></title>
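A quick way to convince yourself that the entity-escaped form and the CDATA form are the same thing to an XML parser is to parse both and compare the results. The following is a minimal sketch using Python's standard ElementTree; the variable names are my own, and the fragments omit the Atom namespace for brevity.

```python
import xml.etree.ElementTree as ET

# The same title, once with entity-escaped markup and once in a CDATA section.
escaped = ('<title type="html">'
           'One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>')
cdata = ('<title type="html">'
         '<![CDATA[One <strong>bold</strong> foot forward]]></title>')

escaped_elem = ET.fromstring(escaped)
cdata_elem = ET.fromstring(cdata)

# The parser reports identical character data for both forms...
same_text = escaped_elem.text == cdata_elem.text
# ...and neither element has any child elements.
no_children = len(escaped_elem) == 0 and len(cdata_elem) == 0

print(same_text, no_children)
```

In both cases the parser hands the application a single string that merely looks like HTML. Turning that string into real markup takes a second, HTML-aware parsing step.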
Listings 4 and 5 are perfectly valid Atom, but such escaping does make the embedded markup a second-class citizen, and will complicate processing (more on this later). Some people have a misperception that using CDATA sections, as in listing 5, skirts these issues, but it is very important to note that CDATA sections are nothing but syntactic sugar and do not in any way affect the core semantic issues of escaped markup. If possible, I advise you to use the final form of text construct if you wish to embed markup. Rather than tunnelling the markup into encoded text, you can use XHTML directly within the construct by specifying type="xhtml", as in listing 6.
Listing 6: XHTML text construct
<title type="xhtml">
  <div xmlns="http://www.w3.org/1999/xhtml">
    One <strong>bold</strong> foot forward
  </div>
</title>
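Because an XHTML construct is ordinary namespaced XML, a processor can reach the embedded markup with plain XML tools and no second parsing pass. Here is a minimal sketch, assuming Python's standard ElementTree; the helper name xhtml_construct_div is hypothetical, and the fragment omits the Atom namespace as the listing does.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def xhtml_construct_div(xml_source):
    """Return the XHTML div wrapper of an xhtml-typed construct
    (hypothetical helper for illustration)."""
    elem = ET.fromstring(xml_source)
    if elem.get("type") != "xhtml":
        raise ValueError("not an XHTML construct")
    div = elem.find("{%s}div" % XHTML_NS)
    if div is None:
        raise ValueError("XHTML construct lacks the div wrapper")
    return div

listing_6 = """<title type="xhtml">
  <div xmlns="http://www.w3.org/1999/xhtml">
    One <strong>bold</strong> foot forward
  </div>
</title>"""

div = xhtml_construct_div(listing_6)
# The markup is available as real elements...
children = [child.tag for child in div]
# ...and a plain-text rendering is just the concatenated text nodes,
# with runs of whitespace collapsed.
plain = " ".join("".join(div.itertext()).split())
print(children, plain)
```

The div itself is only a wrapper, so code that displays the title would normally work with the div's children rather than the div element.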
Yes, you must wrap the content in an XHTML div element. This makes listing 6 a bit cumbersome and verbose, but it more than makes up for these shortcomings by offering a very clean layering of XML vocabularies, both of which you can be sure are not tag soup. The overhead is likely to be less imposing if you use XHTML text constructs with the typically longer summary element. Also, if you prefer, you can declare the XHTML namespace once, on the Atom feed element, and then use the appropriate prefix (or default namespace) for all the XHTML, as in listing 7.
Listing 7: XHTML text construct using top-level namespace declaration
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:xh="http://www.w3.org/1999/xhtml">
  ...
  <title type="xhtml">
    <xh:div>
      One <xh:strong>bold</xh:strong> foot forward
    </xh:div>
  </title>
  ...
Of course, you could choose to use a prefix for Atom elements and make XHTML the default namespace, but this feels a bit backwards considering that Atom is the host vocabulary. Properly implemented processors won't care one way or another. Keeping all the namespace declarations at the top level is actually a good practice in itself, so you might consider always using the form in listing 7, at the cost of having to use prefixes on many elements.
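The claim that processors should not care which vocabulary gets the prefix is easy to check. The sketch below, again assuming Python's standard ElementTree, parses two trimmed-down feeds (not complete, valid Atom feeds) and compares the namespace-expanded element names. The a prefix and the helper name expanded_names are my own choices for illustration.

```python
import xml.etree.ElementTree as ET

# Atom as the default namespace, XHTML prefixed (the style of listing 7).
prefixed_xhtml = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:xh="http://www.w3.org/1999/xhtml">
  <title type="xhtml">
    <xh:div>One <xh:strong>bold</xh:strong> foot forward</xh:div>
  </title>
</feed>"""

# The reverse: XHTML as the default namespace, Atom prefixed.
prefixed_atom = """<a:feed xmlns:a="http://www.w3.org/2005/Atom"
        xmlns="http://www.w3.org/1999/xhtml">
  <a:title type="xhtml">
    <div>One <strong>bold</strong> foot forward</div>
  </a:title>
</a:feed>"""

def expanded_names(xml_source):
    """List every element's {namespace}local-name in document order."""
    return [elem.tag for elem in ET.fromstring(xml_source).iter()]

print(expanded_names(prefixed_xhtml) == expanded_names(prefixed_atom))
```

A namespace-aware parser resolves prefixes before your code sees the elements, so code that matches on namespace URI plus local name works with either style.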