What Is RSS
by Mark Pilgrim
Despite being RDF/XML, RSS 1.0 is structurally similar to previous versions of RSS -- similar enough that we can simply treat it as XML and write a single function to extract information out of either an RSS 0.91 or RSS 1.0 feed. However, there are some significant differences that our code will need to be aware of:
The root element is rdf:RDF instead of rss. We'll either need to handle both explicitly or just ignore the name of the root element altogether and blindly look for useful information inside it.
RSS 1.0 uses namespaces extensively. The RSS 1.0 namespace is http://purl.org/rss/1.0/, and it's defined as the default namespace. The feed also uses http://www.w3.org/1999/02/22-rdf-syntax-ns# for the RDF-specific elements (which we'll simply be ignoring for our purposes) and http://purl.org/dc/elements/1.1/ (Dublin Core) for the additional metadata of article authors and publishing dates.
We can go in one of two ways here: if we don't have a namespace-aware XML parser, we can blindly assume that the feed uses the standard prefixes and default namespace and look for item elements and dc:creator elements within them. This will actually work in a large number of real-world cases; most RSS feeds use the default namespace and the same prefixes for common modules like Dublin Core. This is a horrible hack, though. There's no guarantee that a feed won't use a different prefix for a namespace (which would be perfectly valid XML and RDF). If or when it does, we'll miss it.
If we have a namespace-aware XML parser at our disposal, we can construct a more elegant solution that handles both RSS 0.91 and 1.0 feeds. We can look for items in no namespace; if that fails, we can look for items in the RSS 1.0 namespace. (RSS 0.90 feeds, not shown here, also use a namespace, but not the same one as RSS 1.0. So what we really need is a list of namespaces to search.)
Less obvious but still important, the item elements are outside the channel element. (In RSS 0.91, the item elements were inside the channel. In RSS 0.90, they were outside; in RSS 2.0, they're inside. Whee.) So we can't be picky about where we look for items.
Finally, you'll notice there is an extra items element within the channel. It's only useful to RDF parsers, and we're going to ignore it and assume that the order of the items within the RSS feed is given by the order of the item elements.
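To make the prefix problem concrete, here is a minimal sketch, written for Python 3 as an assumption (the article's own listings target Python 2). The feed fragment is hypothetical: it binds Dublin Core to the prefix dublin rather than the usual dc, which is perfectly valid XML. A lookup by literal prefixed name misses the author, while a lookup by namespace URI finds it.

```python
from xml.dom import minidom

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical feed fragment: valid XML, but Dublin Core is bound to
# the prefix "dublin" instead of the conventional "dc".
FEED = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dublin="http://purl.org/dc/elements/1.1/"
    xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/svg.html">
    <title>SVG's Past and Promising Future</title>
    <dublin:creator>Antoine Quint</dublin:creator>
  </item>
</rdf:RDF>"""

doc = minidom.parseString(FEED)

# The prefix-based hack matches on the literal qualified name, so it
# finds nothing when the feed picks a different prefix.
by_prefix = doc.getElementsByTagName("dc:creator")

# The namespace-aware lookup matches on namespace URI plus local name.
by_namespace = doc.getElementsByTagNameNS(DC_NS, "creator")

print(len(by_prefix), len(by_namespace))
print(by_namespace[0].firstChild.data)
```

The prefix is only a local abbreviation; the namespace URI is what actually identifies the element, which is why the second lookup is the more robust of the two.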
But what about RSS 2.0? Luckily, once we've written code to handle RSS 0.91 and 1.0, RSS 2.0 is a piece of cake. Here's the RSS 2.0 version of the same feed:
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information
      and services for the XML community.</description>
    <language>en-us</language>
    <item>
      <title>Normalizing XML, Part 2</title>
      <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
      <description>In this second and final
        look at applying relational normalization techniques to W3C XML Schema data modeling,
        Will Provost discusses when not to normalize, the scope of uniqueness and the
        fourth and fifth normal forms.</description>
      <dc:creator>Will Provost</dc:creator>
      <dc:date>2002-12-04</dc:date>
    </item>
    <item>
      <title>The .NET Schema Object Model</title>
      <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
      <description>Priya Lakshminarayanan describes
        in detail the use of the .NET Schema Object Model for programmatic manipulation
        of W3C XML Schemas.</description>
      <dc:creator>Priya Lakshminarayanan</dc:creator>
      <dc:date>2002-12-04</dc:date>
    </item>
    <item>
      <title>SVG's Past and Promising Future</title>
      <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
      <description>In this month's SVG column,
        Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
      <dc:creator>Antoine Quint</dc:creator>
      <dc:date>2002-12-04</dc:date>
    </item>
  </channel>
</rss>
As this example shows, RSS 2.0 uses namespaces like RSS 1.0, but
it's not RDF. Like RSS 0.91, there is no default namespace and items
are back inside the channel. If our code is liberal
enough to handle the differences between RSS 0.91 and 1.0, RSS 2.0
should not present any additional wrinkles.
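The structural differences between the two formats can be checked directly from a parsed document. The following is a rough sketch in Python 3 (an assumption; the article's listings are Python 2), using abbreviated, hypothetical stand-ins for the sample feeds. The helper name describe is made up for illustration. It reports the root element's name, the namespace of the first item, and the name of the element that contains it.

```python
from xml.dom import minidom

RSS10_NS = "http://purl.org/rss/1.0/"

# Abbreviated, hypothetical stand-ins for the two sample feeds.
RSS_1_0 = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.xml.com/"><title>XML.com</title></channel>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/som.html">
    <title>The .NET Schema Object Model</title>
  </item>
</rdf:RDF>"""

RSS_2_0 = """<rss version="2.0">
  <channel>
    <title>XML.com</title>
    <item><title>The .NET Schema Object Model</title></item>
  </channel>
</rss>"""

def describe(xml_text, item_namespace):
    """Return (root name, item namespace URI, name of the item's parent)."""
    doc = minidom.parseString(xml_text)
    item = doc.getElementsByTagNameNS(item_namespace, "item")[0]
    return (doc.documentElement.tagName,
            item.namespaceURI,
            item.parentNode.tagName)

print(describe(RSS_1_0, RSS10_NS))
print(describe(RSS_2_0, None))
```

For the RSS 1.0 fragment the item sits in the RSS 1.0 namespace directly under rdf:RDF; for the RSS 2.0 fragment it has no namespace and sits inside channel, just as in RSS 0.91.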
How can I read RSS?
Now let's get down to actually reading these sample RSS feeds from Python. The first thing we'll need to do is download some RSS feeds. This is simple in Python; most distributions come with both a URL retrieval library and an XML parser. (Note to Mac OS X 10.2 users: your copy of Python does not come with an XML parser; you will need to install PyXML first.)
from xml.dom import minidom
import urllib

def load(rssURL):
    return minidom.parse(urllib.urlopen(rssURL))
This takes the URL of an RSS feed and returns a parsed representation of the DOM, as native Python objects.
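The listing above is Python 2, where urlopen lives in the urllib module. For readers on Python 3, a roughly equivalent sketch follows; it assumes only the standard library, where the same function is found in urllib.request. The parameter name rss_url is a cosmetic change, not part of the original.

```python
from urllib.request import urlopen
from xml.dom import minidom

def load(rss_url):
    """Fetch rss_url and return it parsed as a minidom Document."""
    # minidom.parse accepts any file-like object, including the
    # response returned by urlopen. The with block closes the
    # connection once parsing is done.
    with urlopen(rss_url) as response:
        return minidom.parse(response)
```

Because urlopen also accepts file:// URLs, the same function can be pointed at a saved copy of a feed while experimenting.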
The next bit is the tricky part. To compensate for the differences
in RSS formats, we'll need a function that searches for specific
elements in any number of namespaces. Python's XML library includes a
getElementsByTagNameNS method that takes a namespace and a tag
name, so we'll use that to make our code general enough to handle RSS
0.91-0.94 and 2.0 (which have no default namespace), RSS 1.0 and even RSS 0.90.
This function will find all elements with a given name,
anywhere within a node. That's a good thing; it means that we can
search for item elements within the root node and always
find them, whether they are inside or outside the channel
element.
DEFAULT_NAMESPACES = \
    (None, # RSS 0.91, 0.92, 0.93, 0.94, 2.0
     'http://purl.org/rss/1.0/', # RSS 1.0
     'http://my.netscape.com/rdf/simple/0.9/' # RSS 0.90
    )

def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    for namespace in possibleNamespaces:
        children = node.getElementsByTagNameNS(namespace, tagName)
        if len(children): return children
    return []
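As a quick check of the claim that this finds items wherever they live, here is a small self-contained sketch for Python 3 (an assumption; the function itself is unchanged apart from layout). The two feed fragments are hypothetical and heavily abbreviated: one places its item inside the channel with no namespace, the other places it outside the channel in the RSS 1.0 default namespace.

```python
from xml.dom import minidom

DEFAULT_NAMESPACES = (
    None,                                      # RSS 0.91-0.94, 2.0
    "http://purl.org/rss/1.0/",                # RSS 1.0
    "http://my.netscape.com/rdf/simple/0.9/",  # RSS 0.90
)

def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    # Try each namespace in turn and return the first non-empty match.
    for namespace in possibleNamespaces:
        children = node.getElementsByTagNameNS(namespace, tagName)
        if len(children):
            return children
    return []

# Hypothetical, abbreviated feeds used only for this illustration.
RSS_0_91 = """<rss version="0.91"><channel>
  <item><title>Inside the channel</title></item>
</channel></rss>"""

RSS_1_0 = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                      xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.xml.com/"/>
  <item rdf:about="http://www.xml.com/a"><title>Outside the channel</title></item>
</rdf:RDF>"""

titles = []
for feed in (RSS_0_91, RSS_1_0):
    doc = minidom.parseString(feed)
    for item in getElementsByTagName(doc, "item"):
        title = getElementsByTagName(item, "title")[0]
        titles.append(title.firstChild.data)

print(titles)
```

Both titles are collected by the same loop, with no special case for either format.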
Finally, we need two utility functions to make our lives easier.
First, our getElementsByTagName function will return a
list of elements, but most of the time we know there's only going to
be one. An item only has one title, one
link, one description, and so on. We'll
define a first function that returns the first element of
a given name (again, searching across several different namespaces).
Second, Python's XML libraries are great at parsing an XML document
into nodes, but not that helpful at putting the data back together
again. We'll define a textOf function that returns the
entire text of a particular XML element.
def first(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    children = getElementsByTagName(node, tagName, possibleNamespaces)
    return len(children) and children[0] or None

def textOf(node):
    return node and "".join([child.data for child in node.childNodes]) or ""
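One caveat worth noting: textOf reads the data attribute of every child, which works when an element contains only text. If an element ever contained a child element, that child would have no data attribute and the call would fail. The following sketch shows one possible, more tolerant variant for Python 3; the name text_of and the sample description are hypothetical, and this is not the article's own function.

```python
from xml.dom import Node, minidom

def text_of(node):
    """Join the text and CDATA children of node; return "" for None."""
    if node is None:
        return ""
    wanted = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    # Skip child elements, comments, and anything else without text data.
    return "".join(child.data for child in node.childNodes
                   if child.nodeType in wanted)

# Hypothetical description mixing text, a CDATA section, and an element.
doc = minidom.parseString(
    "<description>Plain, <![CDATA[<b>escaped</b>]]>, <i>nested</i> end"
    "</description>")

result = text_of(doc.documentElement)
print(result)
print(repr(text_of(None)))
```

The CDATA content is kept verbatim, while the text inside the nested i element is simply skipped. Whether skipping it is the right behavior depends on the feed, so treat this as a starting point only.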
That's it. The actual parsing is easy. We'll take a URL on the command line, download it, parse it, get the list of items, and then get some useful information from each item:
DUBLIN_CORE = ('http://purl.org/dc/elements/1.1/',)

if __name__ == '__main__':
    import sys
    rssDocument = load(sys.argv[1])
    for item in getElementsByTagName(rssDocument, 'item'):
        print 'title:', textOf(first(item, 'title'))
        print 'link:', textOf(first(item, 'link'))
        print 'description:', textOf(first(item, 'description'))
        print 'date:', textOf(first(item, 'date', DUBLIN_CORE))
        print 'author:', textOf(first(item, 'creator', DUBLIN_CORE))
        print
Running it with our sample RSS 0.91 feed prints only title, link, and description (since the feed didn't include any other information on dates or authors):
$ python rss1.py http://www.xml.com/2002/12/18/examples/rss091.xml.txt
title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization
techniques to W3C XML Schema data modeling, Will Provost discusses when not
to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date:
author:
title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET
Schema Object Model for programmatic manipulation of W3C XML Schemas.
date:
author:
title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's
journey through 2002 and looks forward to 2003.
date:
author:
For both the sample RSS 1.0 feed and
sample RSS 2.0 feed, we also get dates and
authors for each item. We reuse our custom first function (and
through it getElementsByTagName), but pass in the Dublin Core
namespace and the appropriate tag name. We could reuse this same
function to extract information from any of the basic RSS modules.
(There are a few advanced modules specific to RSS 1.0 that would
require a full RDF parser, but they are not widely deployed in public
RSS feeds.)
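As an illustration of reading another module the same way, here is a sketch for Python 3 (an assumption) that pulls two elements from the RSS 1.0 Syndication module. The feed fragment is hypothetical; the sample feeds in this article do not include syndication data. The helpers are repeated so the example stands alone, with first slightly simplified.

```python
from xml.dom import minidom

# Namespace of the RSS 1.0 Syndication module.
SYNDICATION = ("http://purl.org/rss/1.0/modules/syndication/",)

def first(node, tagName, possibleNamespaces):
    # Return the first matching element in any listed namespace, or None.
    for namespace in possibleNamespaces:
        children = node.getElementsByTagNameNS(namespace, tagName)
        if len(children):
            return children[0]
    return None

def textOf(node):
    return node and "".join([child.data for child in node.childNodes]) or ""

# Hypothetical feed fragment carrying Syndication module elements.
FEED = """<rss version="2.0"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/">
  <channel>
    <title>XML.com</title>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>2</sy:updateFrequency>
  </channel>
</rss>"""

doc = minidom.parseString(FEED)
period = textOf(first(doc, "updatePeriod", SYNDICATION))
frequency = textOf(first(doc, "updateFrequency", SYNDICATION))
missing = textOf(first(doc, "updateBase", SYNDICATION))

print(period, frequency, repr(missing))
```

Only the namespace tuple and the tag names change; an element the feed omits comes back as an empty string, as with the date and author fields in the RSS 0.91 run.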
Here's the output against our sample RSS 1.0 feed:
$ python rss1.py http://www.xml.com/2002/12/18/examples/rss10.xml.txt
title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization
techniques to W3C XML Schema data modeling, Will Provost discusses when not
to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date: 2002-12-04
author: Will Provost
title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET
Schema Object Model for programmatic manipulation of W3C XML Schemas.
date: 2002-12-04
author: Priya Lakshminarayanan
title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's
journey through 2002 and looks forward to 2003.
date: 2002-12-04
author: Antoine Quint
Running against our sample RSS 2.0 feed produces the same results.
This technique will handle about 90% of the RSS feeds out there; the rest are ill-formed in a variety of interesting ways, mostly caused by non-XML-aware publishing tools building feeds out of templates and not respecting basic XML well-formedness rules. Next month we'll tackle the thorny problem of how to handle RSS feeds that are almost, but not quite, well-formed XML.
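In the meantime, it helps to at least detect such feeds rather than crash on them. The sketch below, for Python 3 (an assumption), shows how minidom reports ill-formed input: it raises ExpatError from the underlying expat parser. The function name try_parse and both feed strings are hypothetical; the second contains an unescaped ampersand, a typical template-generated mistake.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def try_parse(xml_text):
    """Return a minidom Document, or None if xml_text is not well-formed."""
    try:
        return minidom.parseString(xml_text)
    except ExpatError:
        return None

well_formed = ("<rss version='0.91'><channel>"
               "<title>AT&amp;T News</title></channel></rss>")
# Same feed, but the ampersand is not escaped, so it is not XML.
ill_formed = ("<rss version='0.91'><channel>"
              "<title>AT&T News</title></channel></rss>")

print(try_parse(well_formed) is not None)
print(try_parse(ill_formed) is not None)
```

This only tells the two cases apart. It does nothing to recover data from the broken feed.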