Processing Atom 1.0
In the fast-moving world of weblogs and Web-based marketing, the approval of the Atom Format 1.0 by the Internet Engineering Task Force (IETF) as a Proposed Standard is a significant and lasting development. Atom is a very carefully designed format for syndicating the contents of weblogs as they are updated, the usual territory of RSS, but its possible uses are far more general, as illustrated in the description on the home page:
"Atom is the name of an XML-based Web content and metadata syndication format, and an application-level protocol for publishing and editing Web resources belonging to periodically updated websites.
"All Atom feeds must be well-formed XML documents, and are identified with the application/atom+xml media type."
Atom is an important development in the XML and Web world. Atom technology is already deployed in many areas (though not all deployments are up to date with Atom 1.0), and parsing and processing Atom is quickly becoming an important task for Web developers. In this article, I will show several approaches to reading Atom 1.0 in Python. All the code is designed to work with Python 2.3 or later, and was tested with Python 2.4.1.
The example Atom document I'll be using is a modified version of the introductory example on the Atom home page, reproduced here in Listing 1.
Listing 1 (atomexample.xml). Atom Format 1.0 Example
<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en"
      xmlns="http://www.w3.org/2005/Atom"
      xmlns:xh="http://www.w3.org/1999/xhtml">
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
  <title>Example Feed</title>
  <updated>2005-09-02T18:30:02Z</updated>
  <link href="http://example.org/"/>
  <author>
    <name>John Doe</name>
  </author>
  <entry>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <title>Atom-Powered Robots Run Amok</title>
    <link href="http://example.org/2005/09/02/robots"/>
    <updated>2005-09-02T18:30:02Z</updated>
    <summary>Some text.</summary>
  </entry>
  <entry>
    <id>urn:uuid:8eb00d01-d632-40d4-8861-f2ed613f2c30</id>
    <title type="xhtml">
      <xh:div>
        The quick <xh:del>black</xh:del><xh:ins>brown</xh:ins> fox...
      </xh:div>
    </title>
    <link href="http://example.org/2005/09/01/fox"/>
    <updated>2005-09-01T12:15:00Z</updated>
    <summary>jumps over the lazy dog</summary>
  </entry>
</feed>
If you want to process Atom with no dependencies beyond Python itself, you can do so using MiniDOM. MiniDOM isn't the most efficient way to parse XML, but Atom files tend to be small and rarely reach the megabyte range that bogs MiniDOM down. If by some chance you are dealing with very large Atom files, you can use PullDOM, which suits Atom well because a feed can be processed in bite-sized chunks, one entry at a time. MiniDOM isn't the most convenient API available, either, but it is the most convenient one in the Python standard library. Listing 2 is MiniDOM code to produce an outline of an Atom feed, containing much of the information you would use if you were syndicating the feed.
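Before turning to the MiniDOM listing, here is a rough sketch of the PullDOM alternative mentioned above, for the case where a feed is too large to load comfortably in one piece. It is an illustration only, not one of the article's listings: the names ATOM_NS, text_of, iter_entries, and entry_titles are made up for this example, and it assumes you only want each entry's title text. The idea is to stream through the document and expand just one entry element into a small DOM subtree at a time.

```python
from xml.dom import pulldom

# Atom 1.0 namespace, as declared on the feed element in Listing 1
ATOM_NS = "http://www.w3.org/2005/Atom"

def text_of(node):
    """Concatenate all text found beneath a DOM node (hypothetical helper)."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(text_of(child))
    return "".join(parts)

def iter_entries(source):
    """Yield each atom:entry as a fully expanded MiniDOM element.

    source may be a file name or a file-like object.
    """
    events = pulldom.parse(source)
    for event, node in events:
        if (event == pulldom.START_ELEMENT
                and node.namespaceURI == ATOM_NS
                and node.localName == "entry"):
            # Pull in only this entry's subtree; the rest of the
            # feed is never built into a full DOM tree
            events.expandNode(node)
            yield node

def entry_titles(source):
    """Return the whitespace-normalized title text of every entry."""
    titles = []
    for entry in iter_entries(source):
        title = entry.getElementsByTagNameNS(ATOM_NS, "title")[0]
        titles.append(" ".join(text_of(title).split()))
    return titles

# Usage, assuming Listing 1 is saved as atomexample.xml:
#   entry_titles("atomexample.xml")
```

Because text_of simply gathers every text node, an XHTML title such as the second entry's will include the text of both the xh:del and xh:ins elements. A real application would probably want to treat type="xhtml" content more carefully.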