Handling Atom Text and Content Constructs
by Uche Ogbuji
Content Constructs
The simplest content construct is illustrated in listing 8.
Listing 8: Default form of content construct
<content>
The "atom:content" element either contains or links to the content of
the entry. The content of atom:content is Language-Sensitive.
</content>
Again, this is effectively the same as if there were a type="text" attribute
on the content element. Once again there is the option of using type="html",
as in listing 9.
Listing 9: Embedded HTML content construct
<content type="html">
The &lt;code&gt;atom:content&lt;/code&gt; element either contains or links to the
content of the entry. The content of &lt;code&gt;atom:content&lt;/code&gt; is
&lt;a href="http://www.ietf.org/rfc/rfc3066.txt"&gt;Language-Sensitive&lt;/a&gt;.
</content>
You can also use CDATA sections, similarly to listing 5, or preferably use type="xhtml",
similarly to listing 6. You can also embed other textual formats if you specify
a type with a value starting with text/, in which
case the content must not have any child elements and must be text, with escaping
applied where necessary.
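To make the escaping rule for type="html" concrete, here is a minimal sketch, assuming Python 3 and only the standard library. The function names html_content_element and read_html_content are hypothetical, chosen for illustration. The sketch escapes a tag-soup fragment once for embedding, then shows that an XML parser hands the original fragment back.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape

ATOM_NS = "http://www.w3.org/2005/Atom"

def html_content_element(tagsoup):
    """Serialize tag soup as an Atom content element with type="html".

    escape() turns &, < and > into entity references, so the markup
    travels as plain character data rather than as child elements.
    """
    return '<content xmlns="%s" type="html">%s</content>' % (
        ATOM_NS, escape(tagsoup))

def read_html_content(xml_text):
    """Parse the element and return the text the XML parser recovers."""
    node = minidom.parseString(xml_text).documentElement
    node.normalize()  # merge adjacent text nodes
    return node.firstChild.data

# The fragment already contains an HTML entity (&amp;), so the serialized
# form carries it double-escaped (&amp;amp;).
tagsoup = 'See <a href="http://example.org/">this</a> &amp; <b>that'
serialized = html_content_element(tagsoup)
recovered = read_html_content(serialized)
```

Note that the unclosed b element survives the round trip untouched, which is exactly why a consumer still has to tidy the recovered text before treating it as markup.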
Atom content also allows arbitrary XML content, as long as you provide an
XML media type in the type attribute, with "XML media type" as
defined in RFC 3023.
Listing 10 shows how you would embed an SVG image in Atom content.
Listing 10: SVG as Atom content
<content type="image/svg+xml">
<svg xmlns="http://www.w3.org/2000/svg"
width="100px" height="100px">
<title>Itsy bitsy SVG</title>
<circle cx="40" cy="25" r="20" style="fill: black;"/>
<text x="10" y="80" fill="blue">Hello World</text>
</svg>
</content>
If you want to have content in-line while using any non-text and non-XML type, you must include it in Base64-encoded form. Listing 11 is a PNG image embedded as Atom content.
Listing 11: PNG as Atom content, embedded
<content type="image/png">
iVBORw0KGgoAAAANSUhEUgAAAB8AAAAqCAYAAABLGYAnAAAABmJLR0QA/wD/AP+gvaeTAAAACXBI
WXMAAAsTAAALEwEAmpwYAAAAB3RJTUUH1QwCBCUlRSCuygAAAetJREFUWMPt1j1IVmEUB/Dfo1mv
Bg1iSgjhZNGnFIhTDUXQ25x9LE2NbQ1NLe1tbbo1tjQEDk2BUzSITUapUejyvkQthXKfhvvcEs3g
vl55Fd4/HLgfz//8zzn3Ps85tBGhBc4oLuIYuvAV77CwW0F24x7mEbex+bSmu0rhEcz9R3SzzSXO
jnEWjRLChTUSt2X0Y7EF4cIWk4+WMLUD4cKmtPhHr1cgvp58/RNd2zy/VdFf2518lcJsBVkXNls2
85GKt2qpE+4nDlUk/gu1Mpk3Ksy8UbbsSxWKL5UVn6lQfGZP7vM9ecK1/Wxva1drez/f1UlmX8xw
HXTQQQcd7Ctkl8jGiMeJk8SeDe9uEI+Q1bfyYg/xNrGf7CaxRhzP77esPUy8soFXLwbIU4QaPuM6
YY04SDxAOI2MMJGIw8Q0z4c14lg+k4dr6MEAoUkcIvYlzgAO4nKK5CgmknjIUou8mvflrJ6uH+Vt
+k/09UR88rc64Xnq4Su4gzXiKMbxgHgBTzGcfEyirxivitH5LeF1yvIkXuLZptKdwQ98Q28Sf58y
msbd1NdPJMKHlPFCWgfnsIxmEo/NvHRxCB/xgng/ZfkFg1glTBPP4w3h+4agHhOW8TAvuVcpuBV8
yrmxl7iKVKm4mn/WDtqA3yOQKuHaSApTAAAAAElFTkSuQmCC
</content>
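On the consuming side, content like this has to be Base64-decoded before it is of any use. The following is a minimal sketch, assuming Python 3 and the standard library's minidom. The name decode_inline_content is hypothetical, and the payload is a stand-in built in the code (a PNG signature followed by placeholder bytes), not the image from listing 11.

```python
import base64
from xml.dom import minidom

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the eight bytes that open every PNG file

def decode_inline_content(xml_text):
    """Return (media_type, raw_bytes) for a Base64-encoded content element."""
    node = minidom.parseString(xml_text).documentElement
    node.normalize()  # merge adjacent text nodes
    media_type = node.getAttributeNS(None, "type")
    # With its default settings b64decode discards characters outside the
    # Base64 alphabet, so the line breaks inside the element are harmless.
    raw = base64.b64decode(node.firstChild.data)
    return media_type, raw

# Stand-in payload for illustration only; it is not a complete PNG image.
fake_png = PNG_SIGNATURE + b"placeholder bytes, not a real image"
encoded = base64.encodebytes(fake_png).decode("ascii")
entry_content = (
    '<content xmlns="http://www.w3.org/2005/Atom" type="image/png">\n'
    + encoded + "</content>"
)

media_type, raw = decode_inline_content(entry_content)
```

Checking raw against PNG_SIGNATURE is a cheap sanity test that the declared media type matches what was actually embedded.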
Finally, you can use any content externally sourced by specifying
a src attribute with the IRI (basically, "internationalized URI")
of the content and being sure to specify a type attribute that
is a proper media type and not one of the special text|html|xhtml values.
Listing 12 is similar to listing 11 except that the PNG file is external to
the Atom document.
Listing 12: PNG as Atom content, externally sourced
<content src="image.png" type="image/png"/>
Notice that the content element is empty. It must be so if you
use src in this way. Of course you could get tricksy (to put it
like Gollum) with data scheme URLs, which embed the content right in the URL
itself. I do not for a moment recommend a trick such as in listing 13, where
HTML is smuggled in even more diabolically than by using type="html",
but I'm exploring the breadth of cases, so there it is.
Listing 13: HTML content provided in a data scheme URL (not recommended)
<content src="data:text/html,%26lt%3Bi%3E3733t%2C%20d00d%26lt%3B/i%3E" type="text/html"/>
%26lt%3Bi%3E3733t%2C%20d00d%26lt%3B/i%3E is the URL-quoted version
of &lt;i>3733t, d00d&lt;/i>
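If you want to check what the data URL in listing 13 actually carries, the quoting can be undone with the Python 3 standard library. This is a minimal sketch. The name data_url_payload is hypothetical, and it handles only the simple, non-Base64 form of the data scheme.

```python
from urllib.parse import unquote

def data_url_payload(url):
    """Split a data scheme URL into (media_type, decoded_payload).

    Handles only the simple form "data:<media type>,<URL-quoted data>".
    """
    if not url.startswith("data:"):
        raise ValueError("Not a data scheme URL")
    header, _, quoted = url[len("data:"):].partition(",")
    return header, unquote(quoted)

media_type, payload = data_url_payload(
    "data:text/html,%26lt%3Bi%3E3733t%2C%20d00d%26lt%3B/i%3E")
print(media_type)  # text/html
print(payload)     # &lt;i>3733t, d00d&lt;/i>
```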
Approach for Processing Atom Content
To demonstrate a likely algorithm for processing all these text and content construct possibilities, listing 14 is Python code using some hypothetical functions for parsing Atom using DOM and then emitting an XML output. In effect, it shows skeleton pivot code for the boundary between one XML processing pipeline and another where the origin stage produces Atom output and the destination is some XML format (perhaps Atom as well). A real-world example of where I have used such code is in an aggregator that combines multiple Atom feeds into a single feed (an aggregator pattern). It could also be used to generate presentation XHTML from source Atom. I chose to make it skeleton code so you can feel free to substitute the XML generation toolkit of your choice, and so the algorithm can be copied more transparently to other languages such as ECMAScript, Ruby, or even XSLT.
Listing 14: Skeleton Python code for processing Atom input to produce XML output
import base64
from xml.sax.saxutils import unescape

#write_cdata, write_literal_xml, tidy, write_ext_reference and
#handle_foreign_content are the hypothetical functions supplied
#by the XML toolkit of your choice

def handle_text_construct(node):
    #Merge adjacent text nodes
    node.normalize()
    text_type = node.getAttributeNS(None, u"type")
    if text_type in [u"", u"text"]:
        write_cdata(node.firstChild.data)
    elif text_type == u"html":
        #Undo the escaping, then clean up the tag soup
        tagsoup = unescape(node.firstChild.data)
        tidied = tidy(tagsoup)
        write_literal_xml(tidied)
    elif text_type == u"xhtml":
        write_literal_xml(node.firstChild.data)
    else:
        raise TypeError("Illegal text construct type")
    return

def handle_content(node):
    content_type = node.getAttributeNS(None, u"type")
    content_src = node.getAttributeNS(None, u"src")
    if content_src:
        #For example write an XHTML object start tag
        write_ext_reference(content_src, content_type)
        return
    #Atom built-in types are handled same way as text constructs
    if content_type in [u"", u"text", u"html", u"xhtml"]:
        handle_text_construct(node)
        return
    node.normalize()
    #Check the XML type case before the text type case
    if content_type.endswith("/xml") or content_type.endswith("+xml"):
        write_literal_xml(node.firstChild.data)
    elif content_type.startswith(u"text/"):
        write_cdata(node.firstChild.data)
    else:
        #You may choose to handle such by
        #duplicating this construct, creating an entity with NDATA,
        #Using a reference with data type URL, or other means
        content = base64.b64decode(node.firstChild.data)
        handle_foreign_content(content)
    return
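The order of the checks in listing 14 can be exercised on its own. The following is a minimal sketch, assuming Python 3 and the standard library's minidom. The names content_strategy and parse_content are hypothetical, and the function only reports which branch would be taken rather than writing any output.

```python
from xml.dom import minidom

def content_strategy(node):
    """Name the branch of listing 14 that a content element would take."""
    content_type = node.getAttributeNS(None, "type")
    if node.getAttributeNS(None, "src"):
        return "external"
    if content_type in ("", "text"):
        return "text"
    if content_type in ("html", "xhtml"):
        return content_type
    # Check XML media types before text/*, so that text/xml counts as XML
    if content_type.endswith("/xml") or content_type.endswith("+xml"):
        return "xml"
    if content_type.startswith("text/"):
        return "text"
    # Anything else is expected to be Base64-encoded
    return "base64"

def parse_content(attrs):
    """Build an empty content element carrying the given attribute string."""
    xml_text = '<content xmlns="http://www.w3.org/2005/Atom" %s/>' % attrs
    return minidom.parseString(xml_text).documentElement

samples = [
    "",
    'type="html"',
    'type="image/svg+xml"',
    'type="text/xml"',
    'type="text/plain"',
    'type="image/png"',
    'type="image/png" src="image.png"',
]
strategies = [content_strategy(parse_content(attrs)) for attrs in samples]
print(strategies)
```

The text/xml sample is the reason the XML test comes first: tested in the other order, it would be written out as plain text instead of as markup.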
Using Data URLs in HTML Output
While I do not recommend tunneling tag-soup content in data scheme URLs when
expressed in a non-tag-soup format such as Atom, such URLs can be a limited
solution for one problem I've encountered in Atom processing. If you want to
tunnel tag soup to an output that can handle it (say a web browser), and the
browser understands data scheme URLs, you can skip the decoding then tidying
step for processing type="html" and just URL encode the escaped HTML into a data URL in an object element for output.
Yes, this is a very suspicious hack, but it illustrates some of the desperate measures I have had to resort to when working with Atom given the realities of ubiquitous tag soup. Specifically I was trying to write a little personal feed viewer using XSLT so that I could render feed contents in Firefox. Writing object elements with data URLs was the easiest way to tame the escaped tag soup I was getting from upstream feeds. I would never have done such a thing if the next stage in the processing pipeline was what I consider a proper XML stage, but since it was directly to a web browser at the end of the line, I took the liberty. The relevant bit of XSLT was as in listing 15.
Listing 15: Sample XSLT for hacking escaped HTML into an HTML object with data URL
<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:str="http://exslt.org/strings"
>
  <xsl:output method="html"/>
  <xsl:template match="atom:content[@type='html']">
    <object type="text/html" height="100" width="100"
      data="data:text/html,{str:encode-uri(string(.), false())}">
      Unfortunately, your browser does not support data scheme URLs,
      so you cannot view this embedded content
    </object>
  </xsl:template>
</xsl:transform>
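For readers working outside XSLT, the same hack can be sketched in Python 3 with the standard library. The names data_url_for and object_for_html_content are hypothetical. Unlike the str:encode-uri call with false() as its second argument, this sketch also percent-encodes reserved characters such as the slash, which is an assumption made here to keep the attribute value unambiguous.

```python
from urllib.parse import quote, unquote
from xml.dom import minidom
from xml.sax.saxutils import quoteattr

def data_url_for(tagsoup):
    """URL-encode tag soup into a text/html data scheme URL."""
    return "data:text/html," + quote(tagsoup, safe="")

def object_for_html_content(node, width=100, height=100):
    """Build an HTML object element whose data URL carries the tag soup."""
    node.normalize()  # merge adjacent text nodes
    # The XML parser has already undone the entry's escaping, so this is
    # the raw tag soup; it goes into the URL without any tidying.
    tagsoup = node.firstChild.data
    return '<object type="text/html" width="%d" height="%d" data=%s></object>' % (
        width, height, quoteattr(data_url_for(tagsoup)))

entry_content = (
    '<content xmlns="http://www.w3.org/2005/Atom" type="html">'
    '&lt;b&gt;bold &amp;amp; brash&lt;/b&gt;</content>'
)
node = minidom.parseString(entry_content).documentElement
html = object_for_html_content(node)
print(html)
```

As with the XSLT version, this is only appropriate when the output goes straight to a browser that understands data scheme URLs.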
At present data URLs are supported in Mozilla, Opera, Safari, and Konqueror. So far Internet Explorer does not support data URLs, which is a huge damper, but there are a lot of user requests for the feature, so it might make a surprise appearance in IE7. See, for example, comments in the IE team blog entry URLs in Internet Explorer 7.
Atomic Text Clean-up
For more on why you should avoid escaped HTML in Atom documents, see Escaped Markup Considered Harmful by Norm Walsh here on XML.com, and his follow-up Escaped Markup: What To Do Instead. Atom is of the XML family, and it's always best to keep as much as possible in the XML layer. If you do, processing will be easier and your data will be cleaner. Of course, since it's not always easy to keep the gloves on after more than a decade of tag soup on the Web, Atom makes it possible to deal with messy content without devolving completely to the chaotic content representation that marks many other web feed formats. In Atom, you at least have to properly declare your mess.
By the way, if this article interested you I'd like to invite you to join the Atom IRC channel on Freenode (#atom on irc.freenode.net), which I revived last month. We've settled down to a few regulars with people often popping in to ask quick questions or announce work in progress, but the more the merrier. Atom 1.0 is just out of the shrink-wrap and the Atom Publishing Protocol -- featured in Joe Gregorio's Restful Web column this week, Catching Up with the Atom Publishing Protocol -- is advancing towards production, so it's a great time to discuss user and implementation details in a friendly forum.
- Text is text
2005-12-08 20:14:35 philringnalda
Could you *please* correct the text around Listing 3? You are making the same mistake that several popular aggregators have already made, thinking that Atom text constructs are like RSS, where you have to make your best guess about whether the content of an element is escaped HTML or not.
Listing 3 doesn't violate the spirit of the spec, it *is* the spirit: it is plain text, and the character entity references for less-than and greater-than should be displayed as unreplaced character entity references. It is an example of HTML source, and if you and aggregators persist in claiming that it *is* HTML source, rather than an example, then there will be absolutely no way to use the less-than character in type="text".
- Text is text
2005-12-08 20:35:03 Uche Ogbuji
There is nothing to "fix". We apparently just disagree.
If you were to display the title of a feed as in listing 3 using your "text is text" philosophy (I think that especially in the age of markup languages this is a huge oversimplification) you would basically have Grandma May User staring at a mess of greater than and less than signs with words that have no grammatical reason for being there.
The spirit of the spec is that a text construct is intended for human reading. Unmarked, escaped tags are not intended for human reading. Sure there are cases where ampersands and less than signs appear in text that *is* meant for human reading, and I have no problem with the escaping in such cases.
As for listing 3, it is a silly thing to do and an unnecessary thing (you can fix it by just using type="html" or, better yet, type="xhtml"), so I stand by my admonition for developers to avoid it.
- There is text and there is text and...
2005-12-08 21:29:28 Uche Ogbuji
No one said that anyone should decode a text construct of type text. My point is precisely that it should not be decoded in any way. And as I said in the introduction to the article Atom's purpose is indeed to remove such ambiguity. So I don't understand why you want to reintroduce the ambiguity by sneaking in text. If you want to embed markup, say so in your Atom file. That's what the type attribute is for.
And all specs have a spirit. All formal communication has a spirit. It's a consequence of the fact that no human communication is unambiguous. A practical manifestation of the spirit of Atom is the fact that the Atom validator issues warnings. Such a warning can be thought of as something that violates the spirit rather than the letter of a spec.
If you have different advice to offer people with regard to type="text", you are entitled. I never told anyone markup in type="text" is illegal. I merely offered my advice to avoid the practice. I stand by that advice.
- There is text and there is text and...
2005-12-08 22:25:20 philringnalda
Certainly, avoiding markup in type="text" is good advice, but it should be good advice "because your markup will be displayed rather than being interpreted" rather than "because it's the wrong thing to do." Having misstated my position and put it too strongly, I don't expect to persuade you at this point, but I do still believe that this puts type="text" in exactly the same position as title in RSS: you simply cannot use a less-than character in it with any hope of it being correctly interpreted.
- There is text and there is text and...
2005-12-08 22:54:52 Uche Ogbuji
OK. I think this means that we don't disagree as strongly as it first seemed, but that you think I wasn't clear enough in the article itself on the reasons for avoiding embedded markup in type="text". I can accept that. I felt that my discussion and references elsewhere in the article on the subject of escaped markup would make the reasons for my advice reasonably clear, but I was probably too laconic in the text immediately leading up to listing 3. If so, I hope this thread helps tease out the matter sufficiently for readers.
Thanks.
2005-12-08 21:02:24 qdn
Yeah, not so much, Uche -- I can't say I agree with your interpretation at all. First, there's really no "spirit" of the spec; the Atom spec is very overt when it explains that text is for literal display of text, and even uses an example that demonstrates that an ampersand-encoded less-than sign should remain as such. Second, your assumption that "unmarked, escaped tags are not meant for human reading" is a bit presumptuous -- they're certainly meant for reading in many given contexts, like in programming, data streams, and the like. One of the largest purposes of Atom -- you can read this over and over in the discussions that happened right out in the open -- is to take away the ambiguity of situations like this by providing explicit types that content providers can specify, and then telling Atom consumers that they must abide by those types if they wish to faithfully represent the content. If a <title> with a type of text has an escaped less-than sign in it, then the Atom consumer must not decode it; anything less is making an assumption that's just plain wrong.
- Text is text
2005-12-08 21:39:29 Uche Ogbuji
My reply "There is text and there is text and..." above was meant to be in response to the above comment.
- Encoding of entities
2005-12-08 11:51:31 chneukirchen
Could you maybe clarify what happens to entities when they are included in the content? Several RSS readers I experimented with needed double encoding to show correctly (i.e. < although I use namespaces and XHTML.)
- Encoding of entities
2005-12-08 20:01:43 Uche Ogbuji
If you avoid type="html" you should not need double escaping in Atom. Simple escaping should do. If you do use type="html" you will need double escaping to represent characters such as less-than and ampersand.
I recommend avoiding type="html" if possible, so if you can follow this advice you should be able to avoid escaping puzzles.
- Encoding of entities
2005-12-08 11:52:46 chneukirchen
I meant &lt, of course. Escaping is the modern lash of humanity.
