Processing Atom 1.0
by Uche Ogbuji
Listing 2. MiniDOM Code to Print a Text Outline of an Atom Feed
from xml.dom import minidom
from xml.dom import EMPTY_NAMESPACE
ATOM_NS = 'http://www.w3.org/2005/Atom'
doc = minidom.parse('atomexample.xml')
#Ensure that all text nodes can be simply retrieved
doc.normalize()
def get_text_from_construct(element):
    '''
    Return the content of an Atom element declared with the
    atomTextConstruct pattern. Handle both plain text and XHTML
    forms. Return a UTF-8 encoded string.
    '''
    if element.getAttributeNS(EMPTY_NAMESPACE, u'type') == u'xhtml':
        #Grab the XML serialization of each child
        childtext = [ c.toxml('utf-8') for c in element.childNodes ]
        #And stitch it together
        content = ''.join(childtext).strip()
        return content
    else:
        return element.firstChild.data.encode('utf-8')
#process overall feed:
#First title element in doc order is the feed title
feedtitle = doc.getElementsByTagNameNS(ATOM_NS, u'title')[0]
#Feed titles are Atom text constructs: no markup
#So just print the text node content
print 'Feed title:', get_text_from_construct(feedtitle)
feedlink = doc.getElementsByTagNameNS(ATOM_NS, u'link')[0]
print 'Feed link:', feedlink.getAttributeNS(EMPTY_NAMESPACE, u'href')
print
print 'Entries:'
for entry in doc.getElementsByTagNameNS(ATOM_NS, u'entry'):
    #First title element in doc order within the entry is the title
    entrytitle = entry.getElementsByTagNameNS(ATOM_NS, u'title')[0]
    entrylink = entry.getElementsByTagNameNS(ATOM_NS, u'link')[0]
    etitletext = get_text_from_construct(entrytitle)
    elinktext = entrylink.getAttributeNS(EMPTY_NAMESPACE, u'href')
    print etitletext, '(', elinktext, ')'
The code to access XML is typical of DOM and, as such, it's rather clumsy
compared to most Python code. The normalization step near the beginning
of the listing merges adjacent text nodes, so that the text content of an
element can be read from a single node. Many Atom elements are defined using the atomTextConstruct pattern, which can be plain text, with no embedded markup. (HTML is allowed, if escaped,
and if you flag this case in the type attribute.) Such elements
can also contain well-formed XHTML fragments wrapped in a div.
The get_text_from_construct function handles both cases transparently,
which makes it a general-purpose utility routine for extracting content from compliant
Atom elements. In this listing, I use it to access the contents of the title element,
which is in XHTML form in one of the entries in Listing 1. Try running Listing
2 and you should get the following output.
$ python listing2.py
Feed title: Example Feed
Feed link: http://example.org/
Entries:
Atom-Powered Robots Run Amok ( http://example.org/2005/09/02/robots )
<xh:div>
The quick <xh:del>black</xh:del><xh:ins>brown</xh:ins> fox...
</xh:div> ( http://example.org/2005/09/01/fox )
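To see what the normalization step buys you, here is a minimal sketch that is not part of Listing 2. It builds a small document by hand (the title element and its text are made up for illustration), gives the element two adjacent text nodes, and then checks the child nodes before and after calling normalize().

```python
from xml.dom import minidom

# Build a tiny document by hand so we control the text nodes exactly.
# The element name and text are illustrative, not taken from Listing 1.
doc = minidom.parseString('<title/>')
title = doc.documentElement

# Two adjacent text nodes under one element
title.appendChild(doc.createTextNode('Atom-Powered '))
title.appendChild(doc.createTextNode('Robots'))

count_before = len(title.childNodes)
first_before = title.firstChild.data

# Merge adjacent text nodes throughout the document
doc.normalize()

count_after = len(title.childNodes)
first_after = title.firstChild.data
```

Before normalization the element has two text children, so firstChild.data holds only the first fragment. Afterwards there is a single text node carrying the whole string, which is what lets get_text_from_construct read firstChild.data safely.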
Handling Dates
Handling Atom dates in Python is a topic that deserves closer attention. Atom
dates are specified in the atomDateConstruct pattern, of which
the specification says:
A Date construct is an element whose content MUST conform to the "date-time" production in [RFC3339]. In addition, an uppercase "T" character MUST be used to separate date and time, and an uppercase "Z" character MUST be present in the absence of a numeric time zone offset.
The examples given are:
2003-12-13T18:30:02Z
2003-12-13T18:30:02.25Z
2003-12-13T18:30:02+01:00
2003-12-13T18:30:02.25+01:00
You may be surprised to find that Python's built-in support for parsing
such dates is rather limited. There are good reasons for this:
many aspects of date parsing are very hard and can depend a lot on application-specific
needs. Python 2.3 introduced the handy datetime data type, which
is the recommended way to store and exchange dates, but you have to do the
parsing into datetime objects yourself, and handle the complex task of time-zone processing
as well. Otherwise you have to use a third-party routine that does this for you. I
recommend that you complement Python's built-in facilities with Gustavo Niemeyer's DateUtil. (Unfortunately
that link uses HTTPS with an expired certificate, so you may have to click
through a bunch of warnings, but it's worth it.) In my case I downloaded the
1.0 tar.bz2 and installed it using python setup.py install.
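To give a sense of what doing the parsing yourself involves, here is a rough sketch that uses only the standard library. The names FixedOffset and parse_atom_date are hypothetical helpers made up for this illustration. The sketch accepts only the shape Atom requires (uppercase "T", and "Z" or a numeric offset), and it does not cover everything RFC 3339 allows, such as leap seconds.

```python
import re
from datetime import datetime, timedelta, tzinfo

class FixedOffset(tzinfo):
    '''Hypothetical helper: a time zone at a fixed offset from UTC.'''
    def __init__(self, minutes):
        self._offset = timedelta(minutes=minutes)
    def utcoffset(self, dt):
        return self._offset
    def dst(self, dt):
        return timedelta(0)
    def tzname(self, dt):
        return None

# Date, uppercase T, time, optional fraction, then Z or a numeric offset
_ATOM_DATE = re.compile(
    r'(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})'
    r'(?:\.(\d+))?(Z|[+-]\d{2}:\d{2})\Z')

def parse_atom_date(text):
    '''Hypothetical helper: return a time-zone-aware datetime.'''
    match = _ATOM_DATE.match(text)
    if match is None:
        raise ValueError('not an Atom date: %r' % text)
    fields = [int(g) for g in match.groups()[:6]]
    fraction, zone = match.group(7), match.group(8)
    # Pad or truncate the fractional part to microseconds
    micro = int((fraction or '0')[:6].ljust(6, '0'))
    if zone == 'Z':
        offset = 0
    else:
        sign = -1 if zone[0] == '-' else 1
        offset = sign * (int(zone[1:3]) * 60 + int(zone[4:6]))
    return datetime(*fields + [micro, FixedOffset(offset)])
```

Even this narrow case needs a tzinfo subclass and some care with fractional seconds, which is a fair argument for leaving the job to a library.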
Using DateUtil, the following snippet reads a date from an Atom element and parses it into a datetime object:
from dateutil.parser import parse
feedupdated = doc.getElementsByTagNameNS(ATOM_NS, u'updated')[0]
dt = parse(feedupdated.firstChild.data)
As an example of working with this datetime object, the following code reports how long ago an Atom feed was updated:
from datetime import datetime
from dateutil.tz import tzlocal
#howlongago is a timedelta object from present time to target time
howlongago = dt - datetime.now(tzlocal())
print "Time since feed was updated:", abs(howlongago)
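The default string form of a timedelta is fairly terse. If you want a friendlier report, a small helper along these lines may do. This is a sketch only: describe_age is a hypothetical name, and the wording of the output is an arbitrary choice.

```python
from datetime import timedelta

def describe_age(delta):
    '''Hypothetical helper: render a timedelta as days, hours, and minutes.'''
    # abs() so that a negative delta (target time in the past) reads naturally
    total = abs(delta)
    hours, remainder = divmod(total.seconds, 3600)
    minutes = remainder // 60
    return '%d days, %d hours, %d minutes' % (total.days, hours, minutes)
```

You could pass howlongago from the snippet above straight to describe_age, since the helper takes the absolute value itself.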