Implementing the Atom Publishing Protocol

July 19, 2006

Joe Gregorio

The Atom Publishing Protocol (APP) is nearing completion: many of the issues that I pointed out in a previous article have settled down, and there is work being done on implementations and interoperability. Although the interoperability work will go on for years to come, we can put together an implementation and discuss the requirements the APP puts on you, the gotchas, and the ways we can optimize the service. If you've been following along with the Restful Web columns at home, you won't be surprised that the implementation is in Python. In future articles we'll start building more complex services on top of this APP implementation.

Before we dive into the code, let's back up and take a high-level look at what we need to implement. For reference, we'll be implementing draft -08 of the Atom Publishing Protocol. At the conclusion of "How to Create a REST Protocol" the four primary questions about a REST protocol are answered and a table describes the protocol. Table 1 is just such a table for the Atom Publishing Protocol.

Table 1.

| Resource | HTTP Method | Representation | Description |
| --- | --- | --- | --- |
| Introspection | GET | Introspection Document | Enumerates a set of collections and lists their URIs and other information about the collections. |
| Collection | GET | Atom Feed | A list of members of the collection. Note that this may be a subset of all entries in the collection. |
| Collection | POST | Atom Entry | Create a new entry in the collection. |
| Member | GET | Atom Entry | Get the Atom Entry. |
| Member | PUT | Atom Entry | Update the Atom Entry. |
| Member | DELETE | N/A | Delete the Atom Entry from the collection. |

All of these operations are pretty obvious, except using GET on a collection. In that case, the response to a GET may not return all of the entries in a collection. Actually, if it's a large collection I really hope it doesn't return all the entries in the collection. So the initial GET returns what may be a subset of the entries in the collection, ordered in reverse chronological order of their atom:updated date. Note that this means the most recently updated entries are returned first in the feed. If the feed returned doesn't contain all the entries in the collection, then the feed will contain an atom:link element with a rel value of "next" that points to another feed with the next set of entries in the collection -- it will also be ordered in reverse atom:updated chronological order. Reverse atom:updated chronological order is quite a mouthful, and it doesn't even abbreviate very nicely; "RAUCO" sounds more like a Sopranos character than a technical term.
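To make the paging concrete, here is a minimal sketch of a client that follows the chain of feeds. It is written for Python 3, and it rests on several assumptions that are not part of the protocol: the two feed documents, the `/collection/?page=2` URI, and the `fetch()` helper are all made up for illustration, and the pointer to the next feed is assumed to be an atom:link whose rel attribute is "next".

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical two-page collection keyed by URI. A real client would
# fetch these documents over HTTP instead.
PAGES = {
    "/collection/": """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="next" href="/collection/?page=2"/>
  <entry><id>urn:example:3</id><updated>2006-07-03T00:00:00Z</updated></entry>
  <entry><id>urn:example:2</id><updated>2006-07-02T00:00:00Z</updated></entry>
</feed>""",
    "/collection/?page=2": """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>urn:example:1</id><updated>2006-07-01T00:00:00Z</updated></entry>
</feed>""",
}


def fetch(uri):
    """Stand-in for an HTTP GET; returns the feed document as a string."""
    return PAGES[uri]


def all_entry_ids(start_uri):
    """Walk the chain of feeds and collect every atom:id, newest first."""
    ids = []
    uri = start_uri
    while uri is not None:
        feed = ET.fromstring(fetch(uri))
        for entry in feed.findall(ATOM + "entry"):
            ids.append(entry.findtext(ATOM + "id"))
        # Look for a link with rel="next"; stop when there is none.
        uri = None
        for link in feed.findall(ATOM + "link"):
            if link.get("rel") == "next":
                uri = link.get("href")
                break
    return ids
```

The client never needs to know how many entries each feed holds; it just keeps following links until they run out.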

Table 1 is good, but if we are going to generate a concrete implementation, then we need to add more detailed information. We'll add another column for the URIs of each of the resources and drop the description column to save space (see Table 2).

Table 2.

| URI | Resource | HTTP Method | Representation |
| --- | --- | --- | --- |
| /collection/introspection/ | Introspection | GET | Introspection Document |
| /collection/ | Collection | GET | Atom Feed |
| /collection/ | Collection | POST | Atom Entry |
| /collection/member/{id} | Member | GET | Atom Entry |
| /collection/member/{id} | Member | PUT | Atom Entry |
| /collection/member/{id} | Member | DELETE | N/A |

Of course, those aren't really URIs in the URI column -- the last three rows have URI Templates instead of URIs. That is, you substitute the string {id} with some value, in this case some unique identifier for each entry in the collection, in order to obtain a full URI. The idea of URI Templates isn't new; I provided code for how to handle URI Template variable expansion in "Constructing or Traversing URIs?". We now need to take three short side trips to gather the pieces we need to put together our implementation of the APP.
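As a reminder of what that substitution looks like, here is a rough Python 3 sketch. The function name `expand_template` is made up for this illustration and is not the code from the earlier article; it only handles the simple `{name}` form used in Table 2 and percent-encodes each value so it is safe inside a path segment.

```python
import re
from urllib.parse import quote


def expand_template(template, values):
    """Replace each {name} in template with the percent-encoded values[name]."""
    def substitute(match):
        # safe="" makes quote() encode "/" as well, so a value cannot
        # accidentally add extra path segments.
        return quote(str(values[match.group(1)]), safe="")
    return re.sub(r"\{([^{}]+)\}", substitute, template)
```

For example, `expand_template('/collection/member/{id}', {'id': 42})` yields `/collection/member/42`.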

A Place for My Stuff

The first thing we will need is a place to store our Atom Entries. Let's look back at Table 2 and see about our requirements. Each entry needs to be retrieved, updated, and deleted based on a single id. Here is a sketch of the interface for the Python class:

class Store:

    def get(self, id):
        pass

    def delete(self, id):
        pass

    def put(self, id, entry):
        pass

We'll assume that the entry taken in put() and returned by get() will be in the form of a string. In addition, we need the ability to add new entries to the collection using just an entry document, and that creation process needs to report the id that was assigned to the new entry.

class Store:

    def get(self, id):
        pass

    def delete(self, id):
        pass

    def put(self, id, entry):
        pass

    def post(self, entry):
        pass

And, finally, we need to enumerate the members of a collection, which means we have to generate a linked chain of Atom feeds that enumerate the entries in the collection in reverse atom:updated order (aka el-RAUCO). If we assume that we will generate the feeds with a fixed number of entries per feed, then we need a way to query with a size and an offset from the beginning of the list. The index method will return the ids of at most size entries, starting offset entries back from the beginning of the list.

class Store:

    def get(self, id):
        pass

    def post(self, entry):
        pass

    def delete(self, id):
        pass

    def put(self, id, entry):
        pass

    def index(self, size, offset):
        pass
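To pin down what each method is expected to do, here is a toy in-memory version of the interface, written for Python 3. It is only a sketch under stated assumptions: the class name `MemoryStore` is made up, ids are simple counters, "most recently updated" is tracked by list position rather than by parsing atom:updated, and nothing here is safe for concurrent use.

```python
import itertools


class MemoryStore:
    """Toy stand-in for Store: entries are strings, newest update first."""

    def __init__(self):
        self._entries = {}              # id -> entry string
        self._order = []                # ids, most recently updated first
        self._ids = itertools.count(1)  # hypothetical id scheme: 1, 2, 3...

    def get(self, id):
        return self._entries[id]

    def post(self, entry):
        """Add a new entry and report the id that was assigned to it."""
        id = str(next(self._ids))
        self._entries[id] = entry
        self._order.insert(0, id)
        return id

    def put(self, id, entry):
        """Replace an existing entry and move it to the front of the list."""
        if id not in self._entries:
            raise KeyError(id)
        self._entries[id] = entry
        self._order.remove(id)
        self._order.insert(0, id)

    def delete(self, id):
        del self._entries[id]
        self._order.remove(id)

    def index(self, size, offset):
        """Return at most size ids, starting offset entries from the front."""
        return self._order[offset:offset + size]
```

With three entries posted in order, `index(2, 0)` returns the two newest ids and `index(2, 2)` returns the remaining one, which is exactly the slicing needed to build a chain of fixed-size feeds.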

Another requirement, which we have by virtue of building a web service, is that the underlying data store needs to be able to safely operate when being accessed simultaneously by multiple processes or threads. That requirement was met by building the underlying store on top of SQLite, which automatically handles accesses from multiple processes or threads. The Store also parses each incoming entry to ensure that it is at least well-formed and also updates fields that need to be updated. For example, on creation an entry will be assigned a unique id and that id will overwrite the value in the atom:id. There are also hooks for custom behaviors on updates. I will save the rest of the technical details of the Store class for another day. What I really want to concentrate on now is how implementing a RESTful protocol with the right tools is easy and the advantages you can get from using HTTP correctly.

A Word About WSGI

Please read PEP 333, a nicely written and detailed account of the Python Web Server Gateway Interface (WSGI). WSGI is an API for writing web services or components in Python. It also allows you to write applications in a platform-independent manner. Wsgiref is a library that will be making its Python core library debut in Python 2.5; it includes a reference implementation of a server, some middleware, and a WSGI application. We'll write our APP implementation as a WSGI application, which gives us more portability and opens up possibilities to write less code, which is always a good thing.

WSGI is simple and simply explained:

The WSGI interface has two sides: the "server" or "gateway" side, and the "application" or "framework" side. The server side invokes a callable object that is provided by the application side. The specifics of how that object is provided are up to the server or gateway. It is assumed that some servers or gateways will require an application's deployer to write a short script to create an instance of the server or gateway, and supply it with the application object. Other servers and gateways may use configuration files or other mechanisms to specify where an application object should be imported from, or otherwise obtained. [PEP 333]

So the application side is a callable object, and if you are familiar with Python you soon realize that you can start doing functional-style things with callable objects, like composing them. That observation leads to the concept of "middleware." Not to be confused with high-priced enterprisey solutions, this kind of middleware is made of Python objects that wrap themselves around application objects to provide enhanced behavior.

In addition to "pure" servers/gateways and applications/frameworks, it is also possible to create "middleware" components that implement both sides of this specification. Such components act as an application to their containing server, and as a server to a contained application, and can be used to provide extended APIs, content transformation, navigation, and other useful functions. [PEP 333]

Here is an example of a simple WSGI application, straight from PEP 333:

def simple_app(environ, start_response):
    """Simplest possible application object"""
    status = '200 OK'
    response_headers = [('Content-type','text/plain')]
    start_response(status, response_headers)
    return ['Hello world!\n']

I won't go into any further detail on WSGI here. PEP 333 does a very good job of describing it, and I heartily suggest you go read the PEP if you are at all curious.
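For a taste of the middleware idea, here is a small sketch of a component that wraps an application and adds one header to every response. It is written for Python 3, where WSGI response bodies are bytes. The `AddHeader` class, the `X-Wrapped` header, and the `call_app()` helper that plays the role of a server are all invented for this illustration.

```python
def simple_app(environ, start_response):
    """The PEP 333 example, with a bytes body as Python 3 requires."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello world!\n"]


class AddHeader:
    """Hypothetical middleware: appends one header to every response."""

    def __init__(self, app, name, value):
        self.app = app
        self.name = name
        self.value = value

    def __call__(self, environ, start_response):
        # Act as a server to the wrapped app by handing it our own
        # start_response, then pass the modified headers upstream.
        def wrapped_start_response(status, headers, exc_info=None):
            extra = [(self.name, self.value)]
            return start_response(status, headers + extra, exc_info)
        return self.app(environ, wrapped_start_response)


def call_app(app, environ):
    """Tiny stand-in for a server: returns (status, headers, body)."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


application = AddHeader(simple_app, "X-Wrapped", "yes")
```

Because `application` is itself a WSGI callable, it can be wrapped again by another piece of middleware, and the inner application never needs to know.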

Selector

Selector is a piece of WSGI middleware from Luke Arno that, "...provides WSGI middleware for 'RESTful' mapping of URL paths to WSGI applications." So if we know our URI structure and have built WSGI applications for each of the resources in our application, Selector lets us map all those pieces together in a completely natural way, by mapping from URI Templates and method names into WSGI applications. Let's take Table 2 from above and redo it one more time to drop the resource and representation columns and instead plug in our WSGI application names (see Table 3).

Table 3.

| URI | Method | WSGI Application |
| --- | --- | --- |
| /collection/introspection/ | GET | introspection |
| /collection/ | GET | enumerate_collection |
| /collection/ | POST | create_new_entry |
| /collection/member/{id} | GET | member_get |
| /collection/member/{id} | PUT | member_update |
| /collection/member/{id} | DELETE | member_delete |

Selector makes it easy to specify such a service. Assuming our applications are already defined, we can create a selector object that does the mapping:

import selector

s = selector.Selector()
s.add('/collection/introspection/', GET=introspection)
s.add('/collection/', POST=create_new_entry, GET=enumerate_collection)
s.add('/collection/member/{id}', GET=member_get, PUT=member_update, DELETE=member_delete)
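If the mapping feels like magic, a toy dispatcher shows the general idea. This is not how Selector is implemented; it is a Python 3 sketch with made-up names (`ToyDispatcher`, the `toy.vars` environ key, `show_id`) that turns each template into a regular expression and dispatches on the path and the request method.

```python
import re


class ToyDispatcher:
    """Illustrative only; not Selector's implementation."""

    def __init__(self):
        self.routes = []  # list of (compiled pattern, {method: app})

    def add(self, template, **methods):
        # '/collection/member/{id}' -> '^/collection/member/(?P<id>[^/]+)$'
        pattern = ""
        for i, part in enumerate(re.split(r"\{(\w+)\}", template)):
            if i % 2:  # odd positions hold the variable names
                pattern += "(?P<%s>[^/]+)" % part
            else:      # even positions hold literal text
                pattern += re.escape(part)
        self.routes.append((re.compile("^" + pattern + "$"), methods))

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        for pattern, methods in self.routes:
            match = pattern.match(path)
            if match is None:
                continue
            app = methods.get(environ.get("REQUEST_METHOD", "GET"))
            if app is None:
                allowed = ", ".join(sorted(methods))
                start_response("405 Method Not Allowed", [("Allow", allowed)])
                return []
            # Hypothetical environ key for the captured template variables.
            environ["toy.vars"] = match.groupdict()
            return app(environ, start_response)
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found."]


def show_id(environ, start_response):
    """Example application that echoes the id captured from the URI."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [environ["toy.vars"]["id"].encode("utf-8")]


dispatcher = ToyDispatcher()
dispatcher.add("/collection/member/{id}", GET=show_id)
```

A side benefit of dispatching on method as well as path is that a request with the wrong method can get a proper 405 response with an Allow header instead of a generic 404.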

If we want to run our service as a CGI application, we can use the wrapper provided in the wsgiref library.

from wsgiref.handlers import CGIHandler

CGIHandler().run(s)

So, all that's left is the individual applications themselves -- the ones that do the work and provide an interface into our Store class. Let's look at the implementation of the WSGI application to create a new entry, remembering that a WSGI application is just a function or callable object that implements the WSGI interface. In this case the application is implemented as a function.

Create an Entry

def create_new_entry(environ, start_response):
    # Assumes wsgiref.util and urljoin are imported, and that getstore()
    # and expand_uri_template() are defined elsewhere.

    # 1. Check for a good Content-Type: header.
    content_type = environ.get('CONTENT_TYPE', '')
    content_type = content_type.split(';')[0].strip()
    if content_type and content_type != 'application/atom+xml':
        start_response("400 Bad Request", [('Content-Type','text/plain')])
        return ["Wrong media type."]

    # 2. Read in the entry
    length = int(environ['CONTENT_LENGTH'])
    content = environ['wsgi.input'].read(length)

    # 3. Store the entry
    store = getstore(environ)
    id = store.post(content)

    # 4. Response includes a Location: header
    start_response("201 Created",
        [
          ('Location', urljoin(
               wsgiref.util.application_uri(environ),
               expand_uri_template('/collection/member/{id}', {'id': id}))
           )
        ]
    )
    return [store.get(id).encode('utf-8')]

First, we do some basic checks (1) to ensure we have been sent the right kind of data.

Then we read in the entry that was sent (2) and place it in the store (3). If successful, we send a 201 Created response (4) with a Location: header containing the URI of the newly created entry. From the HTTP spec (RFC 2616) we know that this URI must be absolute.

The call to wsgiref.util.application_uri() gets us our base URI and then we use expand_uri_template() to expand the URI Template with the id we just assigned the new entry. The expand_uri_template() function is described fully in the XML.com article, "Constructing or Traversing URIs?".

If the store has problems with the entry -- for example, it isn't well-formed -- it will throw an exception that our WSGI wrapper will catch and turn into an appropriate error response. Note that the default error response isn't very helpful and a future enhancement will be to add more informative status codes and error messages.
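A wrapper of that kind can be very small. The following Python 3 sketch is an assumption about what such a wrapper might look like, not the one used here: `StoreError` is a made-up exception type standing in for whatever the store raises on a bad entry, and `error_wrapper` is a made-up name.

```python
import sys


class StoreError(Exception):
    """Hypothetical exception a store might raise for a bad entry."""


def error_wrapper(app):
    """Wrap a WSGI app so that exceptions become error responses."""
    def wrapped(environ, start_response):
        headers = [("Content-Type", "text/plain")]
        try:
            return app(environ, start_response)
        except StoreError as e:
            # The client sent something the store could not accept.
            start_response("400 Bad Request", headers, sys.exc_info())
            return [str(e).encode("utf-8")]
        except Exception:
            # Anything else is treated as a server-side failure.
            start_response("500 Internal Server Error", headers, sys.exc_info())
            return [b"Internal error."]
    return wrapped
```

Note that this only catches exceptions raised while the application is being called; if an application returns a lazy iterator that fails partway through, a wrapper would also need to guard the iteration.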

Get an Entry

Let's look at the application that handles a GET on a member of the collection:

def member_get(environ, start_response):
    store = getstore(environ)
    # 1. Retrieve entry.
    body = store.get(environ['selector.vars']['id']).encode('utf-8')
    # 2. Send back to client.
    headers = [('Content-Type','application/atom+xml;charset=utf-8')]
    start_response("200 OK", headers)
    return [body]

This is rather simple since Selector pulls the id out of the request URI and places it in the environment as selector.vars. From the id we can retrieve the entry (1) from the Store and send it back to the client (2). Now I've talked in the past about using etags and the If-None-Match: header to speed up requests if a resource hasn't been updated since the last request. We will need to modify our application to calculate an etag, which in this case will just be an MD5 hash of the response body.

def member_get(environ, start_response):
    store = getstore(environ)
    body = store.get(environ['selector.vars']['id']).encode('utf-8')
    etag = md5.new(body).hexdigest()                      # 1
    incoming_etag = environ.get('HTTP_IF_NONE_MATCH', '') # 2
    if etag == incoming_etag:                             # 3
        start_response("304 Not Modified", [])
        return []
    else:
        headers = [('Content-Type','application/atom+xml;charset=utf-8'),
              ('ETag', etag)                              # 4
        ]
        start_response("200 OK", headers)
        return [body]

We calculate the etag (1) for the response and return it via the ETag header (4). If the client has sent an old etag via the If-None-Match: header, we get that etag (2) and compare it against the current etag (3). If they match, we return a status of 304 Not Modified and an empty response body; otherwise we return the entry as before. This means that if a client supports etags and the entry has not been updated since the last GET, then the only data that passes over the wire is the response headers.

In this case we have built the etag handling right into the application to show how easy etag handling can be, but that probably isn't the best way to handle it. A much better approach would be to have our application compute the etag and to wrap our applications in WSGI middleware that looks for the If-None-Match: header and handles the 304 response.
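Here is a rough Python 3 sketch of that split, under a few assumptions: the application always sets a header spelled exactly `ETag`, the comparison is the same simple string match used above (no lists of etags, no weak validators), and the names `conditional_get` and `entry_app` are made up for the example.

```python
import hashlib


def entry_app(environ, start_response):
    """Example application: computes its own etag but knows nothing of 304."""
    body = b"<entry xmlns='http://www.w3.org/2005/Atom'/>"
    etag = hashlib.md5(body).hexdigest()
    headers = [("Content-Type", "application/atom+xml"), ("ETag", etag)]
    start_response("200 OK", headers)
    return [body]


def conditional_get(app):
    """Hypothetical middleware: turns a matching 200 into a 304."""
    def wrapped(environ, start_response):
        captured = {}

        def capture(status, headers, exc_info=None):
            captured["status"] = status
            captured["headers"] = headers

        # Run the wrapped app fully so we can inspect its status and headers.
        result = app(environ, capture)
        try:
            body = list(result)
        finally:
            if hasattr(result, "close"):
                result.close()

        etag = dict(captured["headers"]).get("ETag")
        if (captured["status"].startswith("200")
                and etag is not None
                and environ.get("HTTP_IF_NONE_MATCH") == etag):
            start_response("304 Not Modified", [("ETag", etag)])
            return []
        start_response(captured["status"], captured["headers"])
        return body
    return wrapped


application = conditional_get(entry_app)
```

The wrapped application still does the work of building the response, so this saves bandwidth rather than server time; the gain is that every application behind the middleware gets 304 handling without repeating the logic.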

Etag handling isn't the only way to speed up your responses; the response can also be gzip'd. You have several choices when handling gzip. If you are running under Apache you can turn on mod_deflate and that will handle gzip'ing your content. An alternative is to add some WSGI middleware that handles it for you. Here is our startup code from earlier but with the addition of the gzipper middleware from Python Paste.

from wsgiref.handlers import CGIHandler
import paste.gzipper

s = paste.gzipper.middleware(s, None)
CGIHandler().run(s)

Note that we didn't have to change our applications at all; the functionality is completely orthogonal to the existing applications.

Delete an Entry

Deleting an entry is the mirror of GETting an entry -- we get the id of the entry from Selector's parsing of the request URI and we just pass the delete on down to the Store.

def member_delete(environ, start_response):
    store = getstore(environ)
    id = environ['selector.vars']['id']
    store.delete(id)
    start_response("200 OK", [])
    return []

Update an Entry

Updating an entry is equally simple in a naive implementation. We read in the sent entry (1) and after determining the id from the URI we put the entry into the store (2) at that location.

def member_update(environ, start_response):
    # 1. Read the entry
    length = int(environ['CONTENT_LENGTH'])
    content = environ['wsgi.input'].read(length)
    store = getstore(environ)
    id = environ['selector.vars']['id']
    # 2. Put the entry in the store.
    store.put(id, content)
    start_response("200 OK", [])
    return []

We can do better. One of the things we would like to protect against is lost updates. For example, two different clients request an entry at the same time (that's not a problem), both clients edit that entry (also not a problem); but then both clients PUT their modified entries back to the server -- now we have a problem! There will be a race condition and one client's edits will be lost. HTTP has a minimal set of capabilities that allows a server to detect a conflict and inform the client of that condition. The solution relies on etags, which we already used to optimize our GETs. In this case we rely on the GET to include an etag and then look for that etag in an If-Match: header on the PUT request. If the new and old etags match, then we let the PUT proceed; otherwise, we fail with a status code of 412 Precondition Failed.

def member_update(environ, start_response):
    length = int(environ['CONTENT_LENGTH'])
    content = environ['wsgi.input'].read(length)
    store = getstore(environ)
    id = environ['selector.vars']['id']
    body = store.get(id).encode('utf-8')                  # 1
    etag = md5.new(body).hexdigest()                      # 2
    incoming_etag = environ.get('HTTP_IF_MATCH', '*')     # 3
    if (etag == incoming_etag) or ('*' == incoming_etag): # 4
        store.put(id, content)
        start_response("200 OK", [])
        return []
    else:
        start_response("412 Precondition Failed", [])     # 5
        return []

We will need to determine the etag for the current entry (1)(2) and then compare that to the etag sent in via the If-Match: header (3). If the two are equal (4), or if the value of the etag sent is '*', then the PUT request goes through as before. A value of '*' for If-Match: means that the client wishes the request to go through regardless of the current resource's etag value, which gives the client a way to forcibly overwrite the server's current value. If the etags don't match (5) we reject the request with a 412 status code.

This code isn't optimal since we do a get() to retrieve the entry to calculate the etag just to check it against the incoming If-Match: header. A faster way would be to calculate and store the etag for each entry instead of recalculating it every time we need it.

There is also a bug in this code: a request could come in from another client between the call to store.get() and the call to store.put(). In reality we need to either have Store expose some sort of locking of the database or push the etag functionality down into Store.
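As a sketch of the second option, here is a toy store in Python 3 that keeps an etag alongside each entry and performs the compare and the write as one step. Everything about it is an assumption for illustration: the names `EtagStore`, `create`, and `PreconditionFailed` are made up, and the `threading.Lock` only protects threads within a single process, whereas a store shared between processes would need something like a database transaction instead.

```python
import hashlib
import threading


class PreconditionFailed(Exception):
    """Raised when the supplied etag does not match the stored one."""


class EtagStore:
    """Toy store that checks If-Match and writes under a single lock."""

    def __init__(self):
        self._entries = {}  # id -> (entry bytes, etag)
        self._lock = threading.Lock()

    @staticmethod
    def _etag(entry):
        return hashlib.md5(entry).hexdigest()

    def create(self, id, entry):
        with self._lock:
            self._entries[id] = (entry, self._etag(entry))

    def get(self, id):
        """Return (entry, etag); the etag is stored, not recomputed."""
        with self._lock:
            return self._entries[id]

    def put(self, id, entry, if_match="*"):
        """Replace the entry only if if_match is '*' or the current etag."""
        with self._lock:
            current_etag = self._entries[id][1]
            if if_match != "*" and if_match != current_etag:
                raise PreconditionFailed(id)
            new_etag = self._etag(entry)
            self._entries[id] = (entry, new_etag)
            return new_etag
```

With this shape the WSGI application only has to translate `PreconditionFailed` into a 412 response, and the window between the check and the write disappears because both happen while the lock is held.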

This isn't the only way to avoid the lost update problem. Google's GData implementation of the Atom Publishing Protocol gives a unique edit URI to each version of an entry. Every time the entry is updated the edit URI changes. If the client sends a PUT or DELETE to a stale edit URI, then the server returns a status code of 409 Conflict. There are advantages to both approaches. With the ETag approach the edit URI never changes, thus allowing local and intermediate caches to work better. In addition, the ETag approach gives a defined mechanism, If-Match: *, to forcibly overwrite an entry. The GData approach has the advantage that even naive clients will be protected from accidental overwrites. The ETag approach requires the client to preserve the etags it sees in GET responses and to send them in PUT requests back to the same URI, which is not required of clients of the GData implementation. On the other hand, both systems must be prepared to handle 4xx responses by doing a GET and applying the edits again, so on that account it's a wash.
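The "GET and apply the edits again" step looks much the same from the client's side whichever approach the server takes. The following Python 3 sketch uses a made-up `FakeServer` with a version counter as its etag, a made-up `PreconditionFailed` exception standing in for the 412 (or 409) response, and a hypothetical `update_with_retry()` helper; no real HTTP is involved.

```python
class PreconditionFailed(Exception):
    """Stands in for a 412 (or GData-style 409) response."""


class FakeServer:
    """Hypothetical in-memory resource; its version number is the etag."""

    def __init__(self, entry):
        self.entry = entry
        self.version = 1

    def get(self):
        return self.entry, str(self.version)

    def put(self, entry, if_match):
        if if_match != str(self.version):
            raise PreconditionFailed()
        self.entry = entry
        self.version += 1


def update_with_retry(server, edit, attempts=3):
    """GET, apply edit, PUT with If-Match; on conflict start over."""
    for _ in range(attempts):
        entry, etag = server.get()
        try:
            server.put(edit(entry), etag)
            return True
        except PreconditionFailed:
            continue  # someone else got there first; fetch and re-apply
    return False
```

The important detail is that the edit is re-applied to the freshly fetched entry on each attempt, so the other client's changes are kept rather than overwritten.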

A Cliff Hanger

Next time we will finish looking at the implementations for introspection and for enumerating the entries in a collection. That will require introducing a few more tools before we're done. After that we'll dig into the implementation of Store and start building some applications on top of our APP implementation. Now, you might be asking yourself how we are going to go straight into building applications when I've said nothing about the associated HTML pages for each entry in the collection. In a traditional weblog implementation of the APP, the collection is just an analogue of the web pages that make up the blog, but that doesn't mean those web pages have to exist, and our APP service can add plenty of value all on its own. For a flavor of how such a service can be used, you can read the ACM Queue article "A Conversation with Werner Vogels".