XML.com: XML From the Inside Out

Handling Atom Text and Content Constructs
by Uche Ogbuji

Content Constructs

The simplest content construct is illustrated in listing 8.

Listing 8: Default form of content construct


<content>
The "atom:content" element either contains or links to the content of
the entry.  The content of atom:content is Language-Sensitive.
</content>

Again, this is effectively the same as if there were a type="text" attribute on the content element. Once again there is the option of using type="html", as in listing 9.

Listing 9: Embedded HTML content construct


<content type="html">
The &lt;code&gt;atom:content&lt;/code&gt; element either contains or links to the
content of the entry.  The content of &lt;code&gt;atom:content&lt;/code&gt; is
&lt;a href="http://www.ietf.org/rfc/rfc3066.txt"&gt;Language-Sensitive&lt;/a&gt;.
</content>

You can also use CDATA sections, similarly to listing 5, or preferably use type="xhtml", similarly to listing 6. You can also embed other textual formats if you specify a type with a value starting with text/, in which case the content must not have any child elements and must be text, with escaping applied where necessary.

Atom content also allows arbitrary XML content, as long as you provide an XML media type in the type attribute, with "XML media type" as defined in RFC 3023. Listing 10 shows how you would embed an SVG image in Atom content.

Listing 10: SVG as Atom content


<content type="image/svg+xml">
<svg xmlns="http://www.w3.org/2000/svg"
  width="100px" height="100px">
  <title>Itsy bitsy SVG</title>
  <circle cx="40" cy="25" r="20" style="fill: black;"/>
  <text x="10" y="80" fill="blue">Hello World</text>
</svg>
</content>
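
The type tests described above can be sketched in Python. This is only an illustration under stated assumptions: the function names is_xml_media_type and is_text_media_type are hypothetical, and the check uses the suffix rule from the Atom format specification (a type ending in "+xml" or "/xml", compared case-insensitively) rather than enumerating every RFC 3023 type.

```python
def is_xml_media_type(media_type):
    """Return True if the type attribute value names an XML media type."""
    normalized = media_type.strip().lower()
    return normalized.endswith("/xml") or normalized.endswith("+xml")


def is_text_media_type(media_type):
    """Return True for text/* types that are not themselves XML."""
    normalized = media_type.strip().lower()
    # text/xml starts with "text/" too, so the XML test has to win.
    return normalized.startswith("text/") and not is_xml_media_type(normalized)


for candidate in ("image/svg+xml", "text/xml", "text/plain", "image/png"):
    print(candidate, is_xml_media_type(candidate), is_text_media_type(candidate))
```

Note that text/xml counts as XML rather than as plain text, which is why the XML test is applied first.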

If you want to include content in-line while using any non-text and non-XML type, you must include it in Base64-encoded form. Listing 11 is a PNG image embedded as Atom content.

Listing 11: PNG as Atom content, embedded


<content type="image/png">
iVBORw0KGgoAAAANSUhEUgAAAB8AAAAqCAYAAABLGYAnAAAABmJLR0QA/wD/AP+gvaeTAAAACXBI
WXMAAAsTAAALEwEAmpwYAAAAB3RJTUUH1QwCBCUlRSCuygAAAetJREFUWMPt1j1IVmEUB/Dfo1mv
Bg1iSgjhZNGnFIhTDUXQ25x9LE2NbQ1NLe1tbbo1tjQEDk2BUzSITUapUejyvkQthXKfhvvcEs3g
vl55Fd4/HLgfz//8zzn3Ps85tBGhBc4oLuIYuvAV77CwW0F24x7mEbex+bSmu0rhEcz9R3SzzSXO
jnEWjRLChTUSt2X0Y7EF4cIWk4+WMLUD4cKmtPhHr1cgvp58/RNd2zy/VdFf2518lcJsBVkXNls2
85GKt2qpE+4nDlUk/gu1Mpk3Ksy8UbbsSxWKL5UVn6lQfGZP7vM9ecK1/Wxva1drez/f1UlmX8xw
HXTQQQcd7Ctkl8jGiMeJk8SeDe9uEI+Q1bfyYg/xNrGf7CaxRhzP77esPUy8soFXLwbIU4QaPuM6
YY04SDxAOI2MMJGIw8Q0z4c14lg+k4dr6MEAoUkcIvYlzgAO4nKK5CgmknjIUou8mvflrJ6uH+Vt
+k/09UR88rc64Xnq4Su4gzXiKMbxgHgBTzGcfEyirxivitH5LeF1yvIkXuLZptKdwQ98Q28Sf58y
msbd1NdPJMKHlPFCWgfnsIxmEo/NvHRxCB/xgng/ZfkFg1glTBPP4w3h+4agHhOW8TAvuVcpuBV8
yrmxl7iKVKm4mn/WDtqA3yOQKuHaSApTAAAAAElFTkSuQmCC
</content>
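
A consumer recovers the original bytes by Base64-decoding the element's text. The following is a minimal sketch using Python's standard library; the helper name decode_base64_content is hypothetical, and only the first line of the listing's data is used, so the result is the start of the PNG file rather than a complete image.

```python
import base64
from xml.dom import minidom

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_base64_content(node):
    """Decode the Base64 text inside a content element into bytes."""
    # Gather every text child in case the text is split across nodes.
    encoded = "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )
    # By default b64decode discards line breaks and other
    # characters outside the Base64 alphabet.
    return base64.b64decode(encoded)


doc = minidom.parseString(
    '<content type="image/png">\n'
    'iVBORw0KGgoAAAANSUhEUgAAAB8AAAAqCAYAAABLGYAnAAAABmJLR0QA/wD/AP+gvaeTAAAACXBI\n'
    '</content>'
)
data = decode_base64_content(doc.documentElement)
print(data.startswith(PNG_SIGNATURE))
```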

Finally, you can reference externally sourced content by specifying a src attribute with the IRI (basically, "internationalized URI") of the content, being sure to specify a type attribute that is a proper media type and not one of the special text|html|xhtml values. Listing 12 is similar to listing 11 except that the PNG file is external to the Atom document.

Listing 12: PNG as Atom content, externally sourced


<content src="image.png" type="image/png"/>

Notice that the content element is empty. It must be so if you use src in this way. Of course you could get tricksy (to put it like Gollum) with data scheme URLs, which embed the content right in the URL itself. I do not for a moment recommend a trick such as in listing 13, where HTML is smuggled in even more diabolically than by using type="html", but I'm exploring the breadth of cases, so there it is.

Listing 13: HTML content provided in a data scheme URL (not recommended)


<content src="data:text/html,%26lt%3Bi%3E3733t%2C%20d00d%26lt%3B/i%3E" type="text/html"/>

%26lt%3Bi%3E3733t%2C%20d00d%26lt%3B/i%3E is the URL quoted version of &lt;i>3733t, d00d&lt;/i>
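
The quoting can be reproduced with Python's urllib.parse, as a quick way to check the URL above. This is a sketch for illustration only; the variable names are arbitrary.

```python
from urllib.parse import quote, unquote

escaped_html = "&lt;i>3733t, d00d&lt;/i>"

# quote() leaves "/" unescaped by default, which matches the listing.
quoted = quote(escaped_html)
data_url = "data:text/html," + quoted
print(data_url)

# Reversing the quoting recovers the escaped markup.
print(unquote(quoted) == escaped_html)
```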

Approach for Processing Atom Content

To demonstrate a likely algorithm for processing all these text and content construct possibilities, listing 14 is Python code using some hypothetical functions for parsing Atom using DOM and then emitting an XML output. In effect, it shows skeleton pivot code for the boundary between one XML processing pipeline and another where the origin stage produces Atom output and the destination is some XML format (perhaps Atom as well). A real-world example of where I have used such code is in an aggregator that combines multiple Atom feeds into a single feed (an aggregator pattern). It could also be used to generate presentation XHTML from source Atom. I chose to make it skeleton code so you can feel free to substitute the XML generation toolkit of your choice, and so the algorithm can be copied more transparently to other languages such as ECMAScript, Ruby, or even XSLT.

Listing 14: Skeleton Python code for processing Atom input to produce XML output

import base64

def inner_xml(node):
    #Serialize all children of the node as XML markup
    #(toxml is the minidom spelling; other DOM libraries differ)
    return u"".join([child.toxml() for child in node.childNodes])


def handle_text_construct(node):
    #Merge adjacent text nodes
    node.normalize()
    text_type = node.getAttributeNS(None, u"type")
    if text_type in [u"", u"text"]:
        write_cdata(node.firstChild.data)
    elif text_type == u"html":
        #The parser has already resolved the escapes, so the text node
        #holds the tag soup itself; unescaping a second time would
        #corrupt markup characters the author meant literally
        tagsoup = node.firstChild.data
        tidied = tidy(tagsoup)
        write_literal_xml(tidied)
    elif text_type == u"xhtml":
        #The children are elements (the XHTML div), not text
        write_literal_xml(inner_xml(node))
    else:
        raise TypeError("Illegal text construct type")
    return


def handle_content(node):
    content_type = node.getAttributeNS(None, u"type")
    content_src = node.getAttributeNS(None, u"src")
    if content_src:
        #For example write an XHTML object start tag
        write_ext_reference(content_src, content_type)
        return
    #Atom built-in types are handled same way as text constructs
    if content_type in [u"", u"text", u"html", u"xhtml"]:
        handle_text_construct(node)
        return
    node.normalize()
    #Check the XML type case before the text type case
    if content_type.endswith(u"/xml") or content_type.endswith(u"+xml"):
        write_literal_xml(inner_xml(node))
    elif content_type.startswith(u"text/"):
        write_cdata(node.firstChild.data)
    else:
        #You may choose to handle such by
        #duplicating this construct, creating an entity with NDATA,
        #Using a reference with data type URL, or other means
        content = base64.b64decode(node.firstChild.data)
        handle_foreign_content(content)
    return
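
When adapting the skeleton to a real DOM library, it helps to check what the text node actually holds for type="html". The sketch below uses Python's xml.dom.minidom as an assumed parser and a shortened version of the listing 9 content; with it, the parser has already turned the &lt; and &gt; escapes back into angle brackets by the time the code sees the text.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<content type="html">'
    'The &lt;code&gt;atom:content&lt;/code&gt; element'
    '</content>'
)
node = doc.documentElement
# Merge adjacent text nodes, as the skeleton does.
node.normalize()

# The text node holds real angle brackets, not &lt; and &gt;.
tagsoup = node.firstChild.data
print(tagsoup)
```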

Using Data URLs in HTML Output

While I do not recommend tunneling tag-soup content in data scheme URLs when expressed in a non-tag-soup format such as Atom, such URLs can be a limited solution for one problem I've encountered in Atom processing. If you want to tunnel tag soup to an output that can handle it (say a web browser), and the browser understands data scheme URLs, you can skip the decoding and tidying steps for processing type="html" and just URL-encode the escaped HTML into a data URL in an object element for output.

Yes, this is a very suspicious hack, but it illustrates some of the desperate measures I have had to resort to when working with Atom given the realities of ubiquitous tag soup. Specifically I was trying to write a little personal feed viewer using XSLT so that I could render feed contents in Firefox. Writing object elements with data URLs was the easiest way to tame the escaped tag soup I was getting from upstream feeds. I would never have done such a thing if the next stage in the processing pipeline was what I consider a proper XML stage, but since it was directly to a web browser at the end of the line, I took the liberty. The relevant bit of XSLT was as in listing 15.

Listing 15: Sample XSLT for hacking escaped HTML into an HTML object with data URL


<xsl:transform version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:str="http://exslt.org/strings"
>
  <xsl:output method="html"/>

  <xsl:template match="atom:content[@type='html']">
    <object type="text/html" height="100" width="100"
      data="data:text/html,{str:encode-uri(string(.), false())}">
Unfortunately, your browser does not support data scheme URLs,
so you cannot view this embedded content
    </object>
  </xsl:template>

</xsl:transform>
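
For readers working outside XSLT, the same hack can be approximated in Python. This is a rough counterpart rather than an exact port: the function name html_content_to_object is hypothetical, and quote() with safe="" percent-encodes more characters than str:encode-uri does when its second argument is false().

```python
from html import escape
from urllib.parse import quote
from xml.dom import minidom

FALLBACK = "Unfortunately, your browser does not support data scheme URLs."


def html_content_to_object(node):
    """Wrap the tag soup of a type="html" content element in an object element."""
    tagsoup = "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )
    # safe="" percent-encodes every reserved character, including "/".
    data_url = "data:text/html," + quote(tagsoup, safe="")
    return '<object type="text/html" height="100" width="100" data="%s">%s</object>' % (
        escape(data_url, quote=True), FALLBACK)


doc = minidom.parseString('<content type="html">&lt;i&gt;hi&lt;/i&gt;</content>')
result = html_content_to_object(doc.documentElement)
print(result)
```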

At present data URLs are supported in Mozilla, Opera, Safari, and Konqueror. So far Internet Explorer does not support data URLs, which is a huge damper, but there are a lot of user requests for the feature, so it might make a surprise appearance in IE7. See, for example, comments in the IE team blog entry URLs in Internet Explorer 7.

Atomic Text Clean-up

For more on why you should avoid escaped HTML in Atom documents, see Escaped Markup Considered Harmful by Norm Walsh here on XML.com, and his follow-up Escaped Markup: What To Do Instead. Atom is of the XML family, and it's always best to keep as much as possible in the XML layer. If you do, processing will be easier and your data will be cleaner. Of course, since it's not always easy to keep the gloves on after more than a decade of tag soup on the Web, Atom makes it possible to deal with messy content without devolving completely to the chaotic content representation that marks many other web feed formats. In Atom, you at least have to properly declare your mess.

By the way, if this article interested you I'd like to invite you to join the Atom IRC channel on Freenode (#atom on irc.freenode.net), which I revived last month. We've settled down to a few regulars with people often popping in to ask quick questions or announce work in progress, but the more the merrier. Atom 1.0 is just out of the shrink-wrap and the Atom Publishing Protocol -- featured in Joe Gregorio's Restful Web column this week, Catching Up with the Atom Publishing Protocol -- is advancing towards production, so it's a great time to discuss user and implementation details in a friendly forum.


