
Normalizing Syndicated Feed Content
So you want to write a program to read RSS and Atom syndicated feeds. Sounds simple enough. After all, RSS stands for "Really Simple Syndication" (or "Rich Site Summary", or "RDF Site Summary", or something), and Atom is just RSS with different tag names, right? Well, not exactly.
First, you need to realize that there are multiple versions of RSS. I wrote about this a year and a half ago in my inaugural Dive Into XML article, and the problem has gotten worse. Not because the RSS specification has evolved; in fact it's frozen in its current incomprehensible state. No, the problem is that it's frozen, and various well-meaning third parties have been using its namespace support to add features to make up for perceived weaknesses in the base specification.
Second, you need to know about Atom. First proposed last summer, and still in active development, it has been implemented by a number of early adopters, including Blogger, Typepad, and LiveJournal. That means there are approximately 3 million Atom feeds in the wild. So it's important enough for you to pay attention to it, and it's not that hard to learn. It has not fractured into multiple incompatible versions, but it's early yet. These things take time.
I'm going to use XPath throughout this article to describe how to pick out the pieces of data you want from various flavors of syndicated feeds. Here are the namespace conventions I'm using:
xmlns:atom="http://purl.org/atom/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rss10="http://purl.org/rss/1.0/"
xmlns:rss09="http://my.netscape.com/rdf/simple/0.9/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:l="http://purl.org/rss/1.0/modules/link/"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
Title
Title is just that, a title. Feeds generally have titles, and then individual entries within feeds may or may not have their own titles. Entry titles are required in RSS 0.90, Netscape RSS 0.91, Userland RSS 0.91, RSS 1.0, and Atom, although in Atom the entry title may be blank. Also, RSS 2.0 entry titles are required if there is no entry description.
To find the feed-level title:
/atom:feed/atom:title (example 1)
/rdf:RDF/rss10:channel/rss10:title (example 2)
/rdf:RDF/rss10:channel/dc:title (example 3)
/rdf:RDF/rss09:channel/rss09:title (example 4)
/rss/channel/title (example 5)
/rss/channel/dc:title (example 6)
To find the entry-level title:
/atom:feed/atom:entry/atom:title (example 7, example 8, example 9)
/rdf:RDF/rss10:item/rss10:title (example 10)
/rdf:RDF/rss10:item/dc:title (example 11)
/rdf:RDF/rss09:item/rss09:title (example 12)
/rss/channel/item/title (example 13)
/rss/channel/item/dc:title (example 14)
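One way to apply the feed-level paths above is to try them in order and keep the first hit. What follows is a minimal sketch, not a definitive implementation. It uses Python's standard xml.etree.ElementTree, which only supports a subset of XPath and does not accept absolute paths on an element, so each path is split into an expected root element and a path relative to it. The names NS, FEED_TITLE_PATHS, qname, and find_feed_title are my own inventions for illustration, and I am assuming the rdf prefix is bound to the standard RDF namespace.

```python
import xml.etree.ElementTree as ET

# Prefix bindings following the article's conventions; the rdf binding
# is an assumption (the standard RDF namespace).
NS = {
    "atom": "http://purl.org/atom/ns#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss10": "http://purl.org/rss/1.0/",
    "rss09": "http://my.netscape.com/rdf/simple/0.9/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# (expected root element, path relative to that root), in lookup order.
FEED_TITLE_PATHS = [
    ("atom:feed", "atom:title"),
    ("rdf:RDF", "rss10:channel/rss10:title"),
    ("rdf:RDF", "rss10:channel/dc:title"),
    ("rdf:RDF", "rss09:channel/rss09:title"),
    ("rss", "channel/title"),
    ("rss", "channel/dc:title"),
]


def qname(prefixed):
    """Turn 'prefix:local' into ElementTree's '{uri}local' form."""
    if ":" not in prefixed:
        return prefixed
    prefix, local = prefixed.split(":", 1)
    return "{%s}%s" % (NS[prefix], local)


def find_feed_title(xml_text):
    """Return the first non-empty feed-level title, or None."""
    root = ET.fromstring(xml_text)
    for root_name, path in FEED_TITLE_PATHS:
        if root.tag != qname(root_name):
            continue
        node = root.find(path, NS)
        # node.text only covers text before any child element, so inline
        # XHTML titles need the Atom content model handling as well.
        if node is not None and node.text:
            return node.text.strip()
    return None
```

The entry-level lookup follows the same pattern with the entry-level paths swapped in.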
Let's talk about the Atom content model. It's a hybrid model;
values can be included in a feed in several ways. The two main things
to know are the type and mode attributes.
The mode attribute will tell you how the data is encoded,
and the type attribute will tell you the content type
(once you've decoded the value). That's not as complicated as it
sounds. Here are examples of the most common usage:
<title>A plain text title</title>
The most common kind of title, simply plain text. The @type attribute is omitted and defaults to "text/plain".
<title type="text/html" mode="escaped">A title with &lt;em&gt;embedded markup&lt;/em&gt; in it</title>
Also common, especially in Blogger feeds. The @type attribute is set to "text/html" to indicate that the title contains HTML markup, and @mode is set to "escaped" to indicate that the markup is included as entity-encoded text.
<title type="application/xhtml+xml" mode="xml"><div xmlns="http://www.w3.org/1999/xhtml">A title with <em>inline markup</em> in it</div></title>
The @type attribute is set to "application/xhtml+xml" to indicate that the title contains XHTML markup, and @mode is set to "xml" to indicate that the XHTML is included inline in its own namespace, without an additional level of entity-encoding.
Other types are possible but rare enough to ignore at this point.
Several Atom elements share this content model, including:
/atom:feed/atom:title
/atom:feed/atom:tagline
/atom:feed/atom:copyright
/atom:feed/atom:info
/atom:feed/atom:entry/atom:title
/atom:feed/atom:entry/atom:summary
/atom:feed/atom:entry/atom:content
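Since all of these elements share one content model, one decoding routine can serve them all. Here is a rough sketch of the type and mode handling described above, again using Python's standard xml.etree.ElementTree. The function name decode_atom_value is hypothetical, and I am assuming that a missing mode attribute means "xml". Any other mode is left unhandled here.

```python
import xml.etree.ElementTree as ET


def decode_atom_value(element):
    """Return (content_type, decoded_text) for an Atom content element."""
    content_type = element.get("type", "text/plain")
    mode = element.get("mode", "xml")  # assumption: "xml" is the default mode
    if mode == "escaped":
        # The XML parser has already turned &lt;em&gt; back into <em>,
        # so element.text is the decoded value.
        return content_type, element.text or ""
    if mode == "xml":
        # Inline markup: leading text plus each serialized child element
        # (ET.tostring includes the text that trails each child).
        parts = [element.text or ""]
        parts.extend(ET.tostring(child, encoding="unicode") for child in element)
        return content_type, "".join(parts)
    raise ValueError("unhandled mode: %r" % mode)
```

Note that ElementTree chooses its own namespace prefixes when serializing inline XHTML, so the markup comes back equivalent but not byte-for-byte identical to what was in the feed.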
Now let's talk about the RSS 2.0 content model. Not so much a
content model as a series of unhappy accidents. For example, in RSS
2.0, it's unclear whether title can contain HTML markup.
The RSS 2.0 specification is silent on the issue. The specification
author has, at different times, publicly stated that it's permitted
and that it's not permitted; neither statement has made it into the
specification or into an official erratum. Either way, you are guaranteed
to get it wrong an unknown percentage of the time.
Alternate link
The alternate link is a link to a different representation of the
content of the feed or the content of the entry. For weblogs, the
feed link generally points to the home page of the weblog, and the
entry link generally points to the "permalink" (permanent archive page
for the entry). The representation pointed to is generally an HTML
page, and the alternate link is generally an http:// URL.
But for non-weblog uses of syndication, the alternate link could point
to some other kind of document. Some formats allow you to specify the
type of document, others do not.
To find the feed-level alternate link:
/atom:feed/atom:link[@rel="alternate" and @type="text/html"]/@href (example 15)
/atom:feed/atom:link[@rel="alternate" and @type="application/xhtml+xml"]/@href (example 16)
/rdf:RDF/rss10:channel/rss10:link (example 17)
/rdf:RDF/rss10:channel/dc:relation/@rdf:resource (example 18)
/rdf:RDF/rss10:item/l:link[@l:rel="permalink" and @l:type="text/html"]/@rdf:resource (example 19)
/rdf:RDF/rss10:item/l:link[@l:rel="permalink" and @l:type="application/xhtml+xml"]/@rdf:resource (example 20)
/rdf:RDF/rss09:channel/rss09:link (example 21)
/rss/channel/link (example 22)
/rss/channel/dc:relation/@rdf:resource (example 23)
/rss/channel/item/l:link[@l:rel="permalink" and @l:type="text/html"]/@rdf:resource (example 24)
/rss/channel/item/l:link[@l:rel="permalink" and @l:type="application/xhtml+xml"]/@rdf:resource (example 25)
To find the entry-level alternate link:
/atom:feed/atom:entry/atom:link[@rel="alternate" and @type="text/html"]/@href (example 26)
/atom:feed/atom:entry/atom:link[@rel="alternate" and @type="application/xhtml+xml"]/@href (example 27)
/rdf:RDF/rss10:item/rss10:link (example 28)
/rdf:RDF/rss10:item/@rdf:about (example 29)
/rdf:RDF/rss10:item/l:link[@l:rel="permalink" and @l:type="text/html"]/@rdf:resource (example 30)
/rdf:RDF/rss10:item/l:link[@l:rel="permalink" and @l:type="application/xhtml+xml"]/@rdf:resource (example 31)
/rdf:RDF/rss09:item/rss09:link (example 32)
/rss[@version="2.0"]/channel/item/guid[not(@isPermaLink)] (example 33)
/rss[@version="2.0"]/channel/item/guid[@isPermaLink="true"] (example 34)
/rss/channel/item/link (example 35)
/rss/channel/item/l:link[@l:rel="permalink" and @l:type="text/html"]/@rdf:resource (example 36)
/rss/channel/item/l:link[@l:rel="permalink" and @l:type="application/xhtml+xml"]/@rdf:resource (example 37)
/rss[@version="2.0"]/channel/item/comments (example 38)
Be aware of relative URIs. In Atom feeds, link URIs can be relative, as defined in the XML Base specification.
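To make that concrete, here is a small sketch of resolving relative Atom link URIs against any xml:base attributes in scope. It walks the tree from the root down, carrying the current base URI along, because ElementTree elements do not know their parents. The function name resolved_links is hypothetical, and using urllib.parse.urljoin for the resolution step is my choice, not something the XML Base specification names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

ATOM_LINK = "{http://purl.org/atom/ns#}link"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def resolved_links(element, base=""):
    """Yield every atom:link href under element as a resolved URI."""
    # An xml:base on this element may itself be relative to the inherited base.
    base = urljoin(base, element.get(XML_BASE, ""))
    if element.tag == ATOM_LINK and element.get("href") is not None:
        yield urljoin(base, element.get("href"))
    for child in element:
        yield from resolved_links(child, base)
```

If the feed declares no xml:base at all, passing the URL the feed was fetched from as the initial base is a reasonable starting point.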
There is widespread confusion over which element is an entry's
alternate link in RSS 2.0. RSS was invented before there was
widespread use of "permalinks" at all, and the original use of
the /rss/channel/item/link element was to point to an
external article. As full-content syndication became more prevalent,
and more people started producing their own content and syndicating it
on their own site, that element came to be used as the permalink. But
RSS 2.0 introduces an /rss/channel/item/guid element,
which, by default, acts as a permalink. But it can also be used as an
opaque unique identifier that is not a URL (or even a URI), if
the isPermaLink attribute is set to false.
It is not clear what role the /rss/channel/item/link
element now plays in RSS 2.0, and many people still use it as a
permalink, partly because aggregators were slow to
support guid.
To make matters worse for syndication consumers, there is no
guidance in the specification about what to do if an entry contains
both a link and a guid, as seen in these popular
New York Times feeds. Which takes precedence? The specification
is silent on this issue. My own rule of thumb is to
give guid precedence, since it's the newer element and
its usage is almost certainly intentional; but if you look closely at
those New York Times feeds, you'll see that link is
actually a better permalink, since it contains extra query
string parameters that allow anyone to read the article without
registering and storing New York Times cookies. And some people
use link to point to external articles
and guid to point to internal permalinks. This is
largely an unresolvable issue; pick one, and know that you will get it
wrong an unknown percentage of the time.
Using /rss/channel/item/comments as a permalink is
unusual; as a consumer, I would only look for it if everything else
was missing, and even then I would verify that it was actually
an http:// URL (some feeds use an email address to allow
comments by email).
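For what it's worth, my rule of thumb for RSS 2.0 entries can be sketched in a few lines. This is one possible precedence order, not a correct one, since there isn't a correct one: guid first (unless it says it is not a permalink), then link, then comments as a last resort. The function name rss2_entry_link is hypothetical, and accepting https:// as well as http:// for the comments check is an assumption on my part.

```python
import xml.etree.ElementTree as ET


def rss2_entry_link(item):
    """Guess the alternate link of an RSS 2.0 <item> element, or None."""
    guid = item.find("guid")
    # guid acts as a permalink unless isPermaLink is explicitly not "true".
    if guid is not None and guid.text and guid.get("isPermaLink", "true") == "true":
        return guid.text.strip()
    link = item.find("link")
    if link is not None and link.text:
        return link.text.strip()
    comments = item.find("comments")
    if comments is not None and comments.text:
        url = comments.text.strip()
        # Some feeds put an email address here, so only accept web URLs.
        if url.startswith(("http://", "https://")):
            return url
    return None
```

Swapping the first two checks gives you the opposite policy, which would serve you better on feeds like the New York Times ones and worse on others.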
Summary and Full Content
This is the third and most contentious element of a syndicated
feed, and the one which has suffered the most from the slings and
arrows of history. In the original versions of RSS (RSS 0.90,
Netscape RSS 0.91, and Userland RSS 0.91),
the /rss/channel/item/description element was a plain
text summary of the article linked to by
the /rss/channel/item/link.
However, RSS 0.92 made two major and backwardly incompatible
changes. First, it made all the entry elements optional; second, it
allowed description to contain HTML markup and contain
the full HTML content of the entry. But this usage (like everything
else in RSS 0.92) was optional, and many people continued to
use description to provide plain text summaries. Which
is fine until you realize that there is no way to programmatically
distinguish between an HTML description (one that contains HTML
markup) and a plain text description that talks about HTML markup. Is
"History of the <blink> tag" a plain text summary of an article
that talks about the history of the <blink> tag, or is it HTML?
No way to know.
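A short sketch shows why. Once an XML parser has decoded the description, the same string supports both readings, and nothing in the feed says which one the publisher meant. The helper name render_both_ways is hypothetical, and the item element is a made-up fragment for illustration.

```python
import html
import xml.etree.ElementTree as ET


def render_both_ways(description_text):
    """Return the HTML a consumer would emit under each interpretation."""
    return {
        # Plain text reading: escape it so the reader sees a literal <blink>.
        "if_plain_text": html.escape(description_text),
        # HTML reading: pass it through, so <blink> is treated as a tag.
        "if_html": description_text,
    }


item = ET.fromstring(
    "<item><description>History of the &lt;blink&gt; tag</description></item>"
)
text = item.findtext("description")  # the parser has already decoded the entities
views = render_both_ways(text)
```

Both values in views are plausible output, and they display differently in a browser.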
The other problem is that there is no way to know whether an
entry's description is being used as a summary or as full
content. Atom solves this problem by simply defining
separate summary and content elements, and
Atom feeds generally contain one or the other (although mine includes
both, which is also valid).
At the feed level, the summary is uncontroversial and relatively unabused. It is generally used as a short (plain text) description of the site, a tagline such as "All the news that's fit to print". Here is how to find it:
/atom:feed/atom:tagline (example 39)
/rdf:RDF/rss10:channel/rss10:description (example 40)
/rdf:RDF/rss10:channel/dc:description (example 41)
/rdf:RDF/rss09:channel/rss09:description (example 42)
/rss/channel/description (example 43)
/rss/channel/dc:description (example 44)
The entry level is where we run into the most problems.
To find the entry-level HTML summary:
/atom:feed/atom:entry/atom:summary[@type="text/html"] (example 45)
/atom:feed/atom:entry/atom:summary[@type="application/xhtml+xml"] (example 46)
/rss[@version="0.94"]/channel/item/description[not(@type)] (example 47)
/rss[@version="0.94"]/channel/item/description[@type="text/html"] (example 48)
/rss/channel/item/description (example 49)
To find the entry-level plaintext summary:
/atom:feed/atom:entry/atom:summary[not(@type)] (example 50)
/atom:feed/atom:entry/atom:summary[@type="text/plain"] (example 51)
/rdf:RDF/rss10:item/rss10:description (example 52)
/rdf:RDF/rss10:item/dc:description (example 53)
/rdf:RDF/rss10:item/dcterms:abstract (example 54)
/rdf:RDF/rss09:item/rss09:description (example 55)
/rss[@version="0.94"]/channel/item/description[@type="text/plain"] (example 56)
/rss/channel/item/description (example 57)
/rss/channel/item/dc:description (example 58)
/rss/channel/item/dcterms:abstract (example 59)
To find the entry-level HTML full content:
/atom:feed/atom:entry/atom:content[@type="text/html"] (example 60)
/atom:feed/atom:entry/atom:content[@type="application/xhtml+xml"] (example 61)
/rdf:RDF/rss10:item/content:encoded (example 62)
/rss[@version="2.0"]/channel/item/xhtml:body (example 63)
/rss[@version="2.0"]/channel/item/xhtml:div (example 64)
/rss[@version="2.0"]/channel/item/content:encoded (example 65)
/rss[@version="0.94"]/channel/item/description[not(@type)] (example 66)
/rss[@version="0.94"]/channel/item/description[@type="text/html"] (example 67)
/rss/channel/item/description (example 68)
Remember the Atom content model. atom:content
and atom:summary share the same content model
as atom:title (described above).
Note that several RSS paths are listed more than once, due to the
fact that it is impossible to tell either the role or the content type
of the entry's description element. This is a quick
summary of the RSS content model, which is large and contains
multitudes:
- In RSS 0.90, Netscape RSS 0.91, Userland RSS 0.91, and RSS 1.0, //item/description is always a plain text summary.
- In RSS 0.92, RSS 0.93, RSS 0.94, and RSS 2.0, //item/description is sometimes a summary and sometimes full entry content. There is no way to distinguish programmatically whether a description is a summary or full content. The existence of an additional content element in the same entry (such as content:encoded) is a good predictor that description is a summary, but it's not conclusive. And many feeds, such as the default feeds produced by Movable Type, have a summary in the description element but no full content anywhere.
- In RSS 0.92, RSS 0.93, and RSS 2.0, //item/description may be plain text or may include entity-encoded HTML markup. There is no way to tell, so you should probably treat it as HTML and accept that you will be wrong an unknown percentage of the time.
- In RSS 2.0, //item/xhtml:body and //item/xhtml:div always contain inline XHTML markup in the XHTML namespace.
- In RSS 1.0 and RSS 2.0, //item/content:encoded is always entity-encoded HTML markup.
- RSS 1.0 has an additional RDF-based model for rich content not listed above, described in the mod_content specification. It is beyond my ability to describe it in XPath, if indeed it is possible at all.
- In RSS 0.94, //item/description defaults to containing entity-encoded HTML markup, but it has an optional @type attribute that can specify "text/plain" instead.
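The rules in this list can be turned into a rough guessing function for feeds whose root element is rss. This is a sketch of the heuristics only, and it will be wrong an unknown percentage of the time, as promised. The name guess_description and the "summary"/"unknown" role labels are hypothetical. The RDF-based formats (RSS 0.90 and RSS 1.0) are left out because their descriptions are always plain text summaries.

```python
import xml.etree.ElementTree as ET

CONTENT_ENCODED = "{http://purl.org/rss/1.0/modules/content/}encoded"
XHTML_BODY = "{http://www.w3.org/1999/xhtml}body"
XHTML_DIV = "{http://www.w3.org/1999/xhtml}div"


def guess_description(rss_root, item):
    """Return (role, content_type) guesses for item/description, or None."""
    description = item.find("description")
    if description is None:
        return None
    version = rss_root.get("version", "")
    if version == "0.91":
        # Both flavors of RSS 0.91: always a plain text summary.
        return ("summary", "text/plain")
    if version == "0.94":
        # Defaults to HTML, but an optional type attribute can override it.
        content_type = description.get("type", "text/html")
    else:
        # RSS 0.92, 0.93, 2.0: no way to tell, so treat it as HTML.
        content_type = "text/html"
    # A separate full-content element is a good (not conclusive) sign
    # that description is being used as a summary.
    has_full_content = any(
        item.find(tag) is not None
        for tag in (CONTENT_ENCODED, XHTML_BODY, XHTML_DIV)
    )
    role = "summary" if has_full_content else "unknown"
    return (role, content_type)
```

When the role comes back as "unknown", the description may be a summary or the full content, and the consumer has to pick a policy.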
Further reading
If you haven't torn all of your hair out by now, here are some additional links that are required reading for anyone silly enough to want to write a syndication consumer:
- The myth of RSS compatibility
- Early history of RSS
- History of the RSS fork
- RSS+Atom. Apparently all of the above is not confusing enough, so some people want to embed Atom into RSS.
- RSS 0.90 specification
- Netscape RSS 0.91 specification
- Userland RSS 0.91 specification
- RSS 0.92 specification
- RSS 1.0 specification
- RSS 2.0 specification
- Atom 0.3 specification