XML.com: XML From the Inside Out

Processing Atom 1.0
by Uche Ogbuji

Listing 2. MiniDOM Code to Print a Text Outline of an Atom Feed

from xml.dom import minidom
from xml.dom import EMPTY_NAMESPACE

ATOM_NS = 'http://www.w3.org/2005/Atom'

doc = minidom.parse('atomexample.xml')
#Ensure that all text nodes can be simply retrieved
doc.normalize()

def get_text_from_construct(element):
    '''
    Return the content of an Atom element declared with the
    atomTextConstruct pattern.  Handle both plain text and XHTML
    forms.  Return a UTF-8 encoded string.
    '''
    if element.getAttributeNS(EMPTY_NAMESPACE, u'type') == u'xhtml':
        #Grab the XML serialization of each child
        childtext = [ c.toxml('utf-8') for c in element.childNodes ]
        #And stitch it together
        content = ''.join(childtext).strip()
        return content
    else:
        return element.firstChild.data.encode('utf-8')

#process overall feed:

#First title element in doc order is the feed title
feedtitle = doc.getElementsByTagNameNS(ATOM_NS, u'title')[0]

#Feed titles are Atom text constructs: plain text or XHTML
#So let the helper function extract the content
print 'Feed title:', get_text_from_construct(feedtitle)

feedlink = doc.getElementsByTagNameNS(ATOM_NS, u'link')[0]
print 'Feed link:', feedlink.getAttributeNS(EMPTY_NAMESPACE, u'href')

print
print 'Entries:'

for entry in doc.getElementsByTagNameNS(ATOM_NS, u'entry'):
    #First title element in doc order within the entry is the title
    entrytitle = entry.getElementsByTagNameNS(ATOM_NS, u'title')[0]
    entrylink = entry.getElementsByTagNameNS(ATOM_NS, u'link')[0]
    etitletext = get_text_from_construct(entrytitle)
    elinktext = entrylink.getAttributeNS(EMPTY_NAMESPACE, u'href')
    print etitletext, '(', elinktext, ')'

The code to access XML is typical of DOM and, as such, it's rather clumsy compared to most Python code. The normalization step near the beginning of the listing merges adjacent text nodes, which removes some complexity when dealing with text content. Many Atom elements are defined using the atomTextConstruct pattern. Such an element can contain plain text with no embedded markup. (HTML is allowed, if escaped, and if you flag this case in the type attribute.) It can also contain a well-formed XHTML fragment wrapped in a div. The get_text_from_construct function handles both cases transparently, so it serves as a general utility routine for extracting content from compliant Atom elements. In this listing, I use it to access the contents of the title element, which is in XHTML form in one of the entries in Listing 1. Try running Listing 2 and you should get the following output.

$ python listing2.py
Feed title: Example Feed
Feed link: http://example.org/

Entries:
Atom-Powered Robots Run Amok ( http://example.org/2005/09/02/robots )
<xh:div>
The quick <xh:del>black</xh:del><xh:ins>brown</xh:ins> fox...
      </xh:div> ( http://example.org/2005/09/01/fox )
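
For readers on Python 3, here is a minimal sketch of the same idea. It is not the code from Listing 2: the sample feed is made up for illustration, the function name text_from_construct is hypothetical, and it returns str rather than UTF-8 bytes. It also treats a missing type attribute as plain text and returns an empty string for an empty element, which are my assumptions about reasonable behavior.

```python
from xml.dom import minidom, EMPTY_NAMESPACE

ATOM_NS = 'http://www.w3.org/2005/Atom'

# Hypothetical sample feed, invented for this sketch (not Listing 1)
SAMPLE = '''<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:xh="http://www.w3.org/1999/xhtml">
  <title>Example Feed</title>
  <entry>
    <title type="xhtml"><xh:div>The <xh:ins>brown</xh:ins> fox</xh:div></title>
  </entry>
</feed>'''


def text_from_construct(element):
    '''Return the content of an Atom text construct as a str.'''
    kind = element.getAttributeNS(EMPTY_NAMESPACE, 'type') or 'text'
    if kind == 'xhtml':
        # Serialize each child node and stitch the pieces together
        return ''.join(c.toxml() for c in element.childNodes).strip()
    # Plain text or escaped HTML: join the text nodes ('' if there are none)
    return ''.join(c.data for c in element.childNodes
                   if c.nodeType in (c.TEXT_NODE, c.CDATA_SECTION_NODE))


doc = minidom.parseString(SAMPLE)
titles = doc.getElementsByTagNameNS(ATOM_NS, 'title')
for title in titles:
    print(text_from_construct(title))
```

Joining the text nodes instead of reading firstChild.data means an empty title yields an empty string rather than an AttributeError, at the cost of a slightly longer function.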

Handling Dates

Handling Atom dates in Python is a topic that deserves closer attention. Atom dates are specified in the atomDateConstruct pattern, of which the specification says:

A Date construct is an element whose content MUST conform to the "date-time" production in [RFC3339]. In addition, an uppercase "T" character MUST be used to separate date and time, and an uppercase "Z" character MUST be present in the absence of a numeric time zone offset.

The examples given are:

  • 2003-12-13T18:30:02Z
  • 2003-12-13T18:30:02.25Z
  • 2003-12-13T18:30:02+01:00
  • 2003-12-13T18:30:02.25+01:00
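
The constraints quoted above can be approximated with a regular expression. The following Python 3 sketch is only a shape check under my own assumptions: the names ATOM_DATE and looks_like_atom_date are hypothetical, and it does not verify field ranges (a month of 13 would pass).

```python
import re

# Shape of an RFC 3339 date-time with Atom's extra rules: an uppercase 'T'
# separator and either an uppercase 'Z' or a numeric offset.
# Field ranges (month, hour and so on) are deliberately not checked here.
ATOM_DATE = re.compile(
    r'[0-9]{4}-[0-9]{2}-[0-9]{2}'     # full-date
    r'T[0-9]{2}:[0-9]{2}:[0-9]{2}'    # uppercase T, then hours:minutes:seconds
    r'(\.[0-9]+)?'                    # optional fractional seconds
    r'(Z|[+-][0-9]{2}:[0-9]{2})'      # uppercase Z or numeric offset
)


def looks_like_atom_date(text):
    '''Return True if text has the shape of an Atom date construct.'''
    return ATOM_DATE.fullmatch(text) is not None


for sample in ('2003-12-13T18:30:02Z', '2003-12-13t18:30:02Z'):
    print(sample, looks_like_atom_date(sample))
```

The second sample uses a lowercase "t", which RFC 3339 itself tolerates but Atom forbids, so the check rejects it.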

You may be surprised to find that Python is rather limited in the built-in means it provides for parsing such dates. There are good reasons for this: many aspects of date parsing are very hard and can depend a lot on application-specific needs. Python 2.3 introduced the handy datetime data type, which is the recommended way to store and exchange dates, but you have to do the parsing into datetime objects yourself, and handle the complex task of time-zone processing as well. Or you have to use a third-party routine that does this for you. I recommend that you complement Python's built-in facilities with Gustavo Niemeyer's DateUtil. (Unfortunately that link uses HTTPS with an expired certificate, so you may have to click through a bunch of warnings, but it's worth it.) In my case I downloaded the 1.0 tar.bz2 and installed using python setup.py install.
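
To give a sense of what doing the parsing yourself involves, here is a rough Python 3 sketch that uses only the standard library. It is not a replacement for DateUtil: the function name parse_atom_date is hypothetical, it accepts only the strict Atom form, and it truncates fractional seconds beyond microseconds.

```python
import re
from datetime import datetime, timedelta, timezone

_ATOM_DATE = re.compile(
    r'([0-9]{4})-([0-9]{2})-([0-9]{2})'
    r'T([0-9]{2}):([0-9]{2}):([0-9]{2})'
    r'(?:\.([0-9]+))?'
    r'(Z|[+-][0-9]{2}:[0-9]{2})'
)


def parse_atom_date(text):
    '''Parse an Atom date construct into a timezone-aware datetime.'''
    match = _ATOM_DATE.fullmatch(text)
    if match is None:
        raise ValueError('not an Atom date construct: %r' % text)
    year, month, day, hour, minute, second = (
        int(group) for group in match.groups()[:6])
    fraction, zone = match.group(7), match.group(8)
    # Pad or truncate the fractional part to exactly six digits
    microsecond = int((fraction or '0')[:6].ljust(6, '0'))
    if zone == 'Z':
        tz = timezone.utc
    else:
        sign = -1 if zone[0] == '-' else 1
        offset = timedelta(hours=int(zone[1:3]), minutes=int(zone[4:6]))
        tz = timezone(sign * offset)
    # datetime() itself rejects out-of-range fields with ValueError
    return datetime(year, month, day, hour, minute, second,
                    microsecond, tzinfo=tz)


print(parse_atom_date('2003-12-13T18:30:02.25+01:00'))
```

Even this small routine has to deal with fractions and offsets by hand, which is the kind of detail a third-party parser takes off your plate.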

Using DateUtil, the following snippet parses a date read from an Atom element into a datetime object:

from dateutil.parser import parse

feedupdated = doc.getElementsByTagNameNS(ATOM_NS, u'updated')[0]
dt = parse(feedupdated.firstChild.data)

And as an example of how you can work with this date-time object, you can use the following code to report how long ago an Atom feed was updated:

from datetime import datetime
from dateutil.tz import tzlocal

#howlongago is a timedelta object from present time to target time
howlongago = dt - datetime.now(tzlocal())
print "Time since feed was updated:", abs(howlongago)
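
The same subtraction works with any pair of timezone-aware datetime objects. As an illustration, this Python 3 sketch turns the resulting timedelta into a rough phrase. The function name describe_age and the sample timestamps are made up for the example, and a fixed "now" is passed in so the result is repeatable.

```python
from datetime import datetime, timedelta, timezone


def describe_age(updated, now):
    '''Roughly describe how long before now the feed was updated.

    Both arguments must be timezone-aware datetime objects.
    '''
    age = now - updated
    if age < timedelta(0):
        return 'in the future'
    if age.days:
        return '%d day(s) ago' % age.days
    hours, rest = divmod(age.seconds, 3600)
    if hours:
        return '%d hour(s) ago' % hours
    return '%d minute(s) ago' % (rest // 60)


# Hypothetical values: an update stamped at +01:00, checked later in UTC
updated = datetime(2003, 12, 13, 18, 30, 2,
                   tzinfo=timezone(timedelta(hours=1)))
now = datetime(2003, 12, 13, 20, 0, 2, tzinfo=timezone.utc)
print(describe_age(updated, now))  # prints "2 hour(s) ago"
```

Because both values carry their offsets, the arithmetic is done on the same time line. Mixing a naive datetime with an aware one in a subtraction raises TypeError instead.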
