XML.com: XML From the Inside Out

Processing Atom 1.0
by Uche Ogbuji

Using Amara Bindery

Because the DOM code above is so clumsy, I shall present similar code using a friendlier Python library, Amara Bindery, which I covered in an earlier article, Introducing the Amara XML Toolkit. Listing 3 does the same thing as listing 2.

Listing 3. Amara Bindery Code to Print a Text Outline of an Atom Feed

from amara import binderytools

doc = binderytools.bind_file('atomexample.xml')

def get_text_from_construct(element):
    '''
    Return the content of an Atom element declared with the
    atomTextConstruct pattern.  Handle both plain text and XHTML
    forms.  Return a UTF-8 encoded string.
    '''
    if hasattr(element, 'type') and element.type == u'xhtml':
        #Grab the XML serialization of each child, decoding the
        #UTF-8 bytes so everything joined below is unicode
        childtext = [ (not isinstance(c, unicode)
                       and c.xml(encoding=u'utf-8').decode('utf-8') or c)
                      for c in element.xml_children ]
        #And stitch it together
        content = u''.join(childtext).strip().encode('utf-8')
        return content
    else:
        return unicode(element).encode('utf-8')

print 'Feed title:', get_text_from_construct(doc.feed.title)
print 'Feed link:', doc.feed.link.href

print
print 'Entries:'

for entry in doc.feed.entry:
    etitletext = get_text_from_construct(entry.title)
    print etitletext, '(', entry.link.href, ')'
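
For readers who do not have Amara installed, the same text-construct handling can be roughed out with the standard library's ElementTree. This is only a sketch under stated assumptions: it is written for Python 3 rather than the Python 2 of the listings, the sample feed is made up to stand in for atomexample.xml, and the outline function is a name introduced here for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'
XHTML = '{http://www.w3.org/1999/xhtml}'

# Made-up sample feed, standing in for atomexample.xml
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <link href="http://example.org/"/>
  <entry>
    <title type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">Hello <em>Atom</em> world</div></title>
    <link href="http://example.org/1"/>
  </entry>
  <entry>
    <title>Plain entry</title>
    <link href="http://example.org/2"/>
  </entry>
</feed>"""


def get_text_from_construct(element):
    """Return the content of an Atom text construct as a str."""
    if element.get('type') == 'xhtml':
        # The XHTML form wraps its markup in a single div; serialize
        # the div's children rather than the div itself.
        div = element.find(XHTML + 'div')
        if div is None:
            return ''
        parts = [div.text or '']
        # tostring() includes each child's trailing text
        parts.extend(ET.tostring(child, encoding='unicode') for child in div)
        return ''.join(parts).strip()
    return (element.text or '').strip()


def outline(xml_text):
    """Build the text outline as a list of lines."""
    feed = ET.fromstring(xml_text)
    lines = [
        'Feed title: ' + get_text_from_construct(feed.find(ATOM + 'title')),
        'Feed link: ' + feed.find(ATOM + 'link').get('href'),
        '',
        'Entries:',
    ]
    for entry in feed.findall(ATOM + 'entry'):
        title = get_text_from_construct(entry.find(ATOM + 'title'))
        href = entry.find(ATOM + 'link').get('href')
        lines.append('%s ( %s )' % (title, href))
    return lines


print('\n'.join(outline(SAMPLE)))
```

ElementTree picks its own namespace prefixes when it serializes the XHTML children, so the markup in an xhtml title will not necessarily look the way it did in the source feed.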

Using Feedparser (Atom Processing for the Desperate Hacker)

A third approach to reading Atom is to let someone else handle the parsing and just deal with the resulting data structure. This is especially convenient if you have to deal with broken feeds and fixing them is not an option. It usually costs you some flexibility in how you interpret the data, although a really good library would be flexible enough for most users. Probably the best option is Mark Pilgrim's Universal Feed Parser, which parses almost every flavor of RSS and Atom. In my case, I downloaded the 3.3 zip package and installed it using python setup.py install. Listing 4 is code similar in function to that of listings 2 and 3.

Listing 4. Universal Feed Parser Code to Print a Text Outline of an Atom Feed

import feedparser

#A hack until Feed parser supports Atom 1.0 out of the box
#(Feedparser 3.3 does not)
from feedparser import _FeedParserMixin
_FeedParserMixin.namespaces["http://www.w3.org/2005/Atom"] = ""

feed_data = feedparser.parse('atomexample.xml')
channel, entries = feed_data.feed, feed_data.entries

print 'Feed title:', channel['title']
print 'Feed link:', channel['link']

print
print 'Entries:'

for entry in entries:
    print entry['title'], '(', entry['link'], ')'

Overall the code is shorter because we no longer have to worry about the different forms of Atom text construct; the library takes care of that for us. Of course, I'm pretty leery of how it does so, especially the fact that it strips namespaces in XHTML content. This is an example of the flexibility you lose when using a generic parser, especially one designed to be as liberal as Universal Feed Parser. That's the trade-off for the obvious gain in simplicity. Notice the hack near the top of listing 4. These two lines should be temporary and will no longer be needed once Mark Pilgrim updates his package to support Atom 1.0.

Wrapping up, on a Grand Scale

Atom 1.0 is pretty easy to parse and process. I may have serious trouble with some of the design decisions for the format, but I do applaud its overall cleanliness. I've presented several approaches to processing Atom in this article. If I needed to reliably process feeds retrieved from arbitrary locations on the Web, I would definitely go for Universal Feed Parser. Mark Pilgrim has dunked himself into the rancid mess of broken Web feeds so you don't have to. In a project where I controlled the environment, and I could fix broken feeds, I would parse them myself, for the greater flexibility. One trick I've used in the past is to use Universal Feed Parser as a proxy tool to convert arbitrary feeds to a single, valid format (RSS 1.0 in my past experience), so that I could use XML (or in that case RDF) tools to parse the feeds directly.
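
The proxy trick can be sketched roughly as follows. This is an illustration under assumptions, not the code from that past project: it targets Python 3, it emits Atom 1.0 rather than the RSS 1.0 mentioned above, and the to_atom and _fill names and the fallback_updated parameter are hypothetical. It takes plain mappings with the same 'title' and 'link' keys that Listing 4 reads from Universal Feed Parser's result, so the library itself is not imported here.

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'


def _fill(element, data, fallback_updated):
    """Add the child elements shared by atom:feed and atom:entry."""
    ET.SubElement(element, ATOM + 'title').text = data.get('title', '')
    link = data.get('link', '')
    ET.SubElement(element, ATOM + 'link', href=link)
    # Assumption: reuse the link as the id when the source has none
    ET.SubElement(element, ATOM + 'id').text = data.get('id', link)
    # Assumption: a caller-supplied placeholder when no date is known
    ET.SubElement(element, ATOM + 'updated').text = data.get(
        'updated', fallback_updated)


def to_atom(channel, entries, fallback_updated='1970-01-01T00:00:00Z'):
    """Serialize already-parsed feed data as one normalized Atom document."""
    feed = ET.Element(ATOM + 'feed')
    _fill(feed, channel, fallback_updated)
    for entry in entries:
        _fill(ET.SubElement(feed, ATOM + 'entry'), entry, fallback_updated)
    return ET.tostring(feed, encoding='unicode')


# With Universal Feed Parser installed, the inputs would come from
# feed_data = feedparser.parse(...), as feed_data.feed and feed_data.entries.
print(to_atom({'title': 'Example feed', 'link': 'http://example.org/'},
              [{'title': 'First entry', 'link': 'http://example.org/1'}]))
```

Downstream code can then read the normalized document with ordinary XML tools, whatever flavor the original feed was in.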

And with this month's exploration, the Python-XML column has come to an end. After discussions with my editor, I'll replace this column with one with a broader focus. It will cover the intersection of Agile Languages and Web 2.0 technologies. The primary language focus will still be Python, but there will sometimes be coverage of other languages such as Ruby and ECMAScript. I think many of the topics will continue to be of interest to readers of the present column. I look forward to continuing my relationship with the XML.com audience.

This brings me to the last hurrah of the monthly roundup of Python-XML community news. First, given the topic of this article, I wanted to mention Sylvain Hellegouarch's atomixlib, a module providing a simple API for generation of Atom 1.0, based on Amara Bindery. See his announcement. And relevant to recent articles in this column, Andrew Kuchling wrote up a Python Unicode HOWTO.

Julien Anguenot writes in XML Schema Support on Zope3:

I added a demo package to illustrate the zope3/xml schema integration. [Download the code here]

The goal of the demo is to get a new content object registered within Zope3, with an "add" and "edit" form driven by an XML Schema definition.

    

The article goes on to show a bunch of Python and XML code to work a sample W3C XML schema file into a Zope component.

Mark Nottingham announced sparta.py 0.8, a simple API for RDF.

Sparta is a Python API for RDF that is designed to help easily learn and navigate the Semantic Web programmatically. Unlike other RDF interfaces, which are generally triple-based, Sparta binds RDF nodes to Python objects and RDF arcs to attributes of those Python objects.

This makes using RDF very natural for people who understand (and sometimes think in terms of) objects. One way to think of it is as a databinding from RDF to Python objects.

See the announcement.
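
To make the idea of binding arcs to attributes concrete, here is a toy sketch in Python 3. It is not Sparta's API: the Node class, the tuple-based triple store, and the sample data are all invented for illustration.

```python
# Made-up triples: (subject, predicate, object)
TRIPLES = [
    ('alice', 'name', 'Alice'),
    ('alice', 'knows', 'bob'),
    ('bob', 'name', 'Bob'),
]


class Node:
    """View one subject of a triple list as a Python object."""

    def __init__(self, store, subject):
        self._store = store
        self._subject = subject

    def __getattr__(self, predicate):
        # Only reached for names not found normally, so treat the
        # name as a predicate and look up the matching arcs.
        if predicate.startswith('_'):
            raise AttributeError(predicate)
        values = [o for s, p, o in self._store
                  if s == self._subject and p == predicate]
        if not values:
            raise AttributeError(predicate)
        # Objects that are themselves subjects become Node wrappers,
        # so arcs can be followed with chained attribute access.
        subjects = {s for s, _, _ in self._store}
        return [Node(self._store, v) if v in subjects else v
                for v in values]


alice = Node(TRIPLES, 'alice')
print(alice.name)
print(alice.knows[0].name)
```

Each attribute returns a list here because a subject can have several arcs with the same predicate.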

Guido Wesdorp announced Templess 0.1.

Templess is an XML templating library for Python, which is very compact and simple, fast, and has a strict separation of logic and design. It is different from other templating languages because instead of "asking" for data from the template, you "tell" the template what content there is to render, and the template just provides placeholders. Instead of calling into your code from the template, all data for the template is prepared in the code before it is handed over to the templating engine to render. This makes Templess very suitable for programmers, since everything is done from the Python code layer rather than using some domain-specific language from the XML.
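
The "tell, don't ask" style described in the announcement can be illustrated with nothing more than the standard library's string.Template. This is a sketch of the general idea in Python 3, not Templess code: the PAGE template and the render function are hypothetical names.

```python
from html import escape
from string import Template

# The template only provides placeholders; it contains no logic.
PAGE = Template('<h1>$title</h1>\n<ul>\n$items\n</ul>')


def render(title, links):
    """Prepare all the data in Python, then hand it to the template."""
    items = '\n'.join(
        '<li><a href="%s">%s</a></li>' % (escape(href), escape(text))
        for text, href in links)
    return PAGE.substitute(title=escape(title), items=items)


print(render('Python & XML news',
             [('Sparta', 'http://example.org/sparta'),
              ('Templess', 'http://example.org/templess')]))
```

The loop over the links and the escaping both live in render, so the template never calls back into application code.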


