XML.com: XML From the Inside Out


An Atom-Powered Wiki

April 14, 2004

In my last article I covered the changes from version 7 to version 8 of the draft AtomAPI. The latest version of the AtomAPI is now version 9, which adds support for SOAP. This change, and its impact on API implementers, will be covered in a future article. In this article I'm going to build a simple implementation of the AtomAPI.

The first task at hand is to pick a viable candidate. I had a list of criteria which included working with a small code base, working in Python, and the target also being a slightly unconventional application of the AtomAPI. The reason I wanted a small code base in Python is that it's a language I'm familiar with, and small is good for the sake of exposition. The reason I picked an unconventional application of the AtomAPI is that I've found that to be a good technique for stretching a protocol, looking for strengths and weaknesses.

The application I've picked is PikiPiki, which is a wiki, a cooperative authoring system for the Web. It's written in Python, is GPL'd, has a small code base, and the code is easy to navigate. It also has a good lineage given that MoinMoin is based on PikiPiki. The source for both the client and the modified server described in this article can be downloaded from the EditableWebWiki.

To create an implementation of the AtomAPI there are a few operations we need to support. Each entry, which in the case of a wiki will be the content for a WikiWord, needs to have a unique URI, called the EditURI, that supports GET, PUT, and DELETE. In addition, we need to add a single PostURI that accepts POST to create new entries. Last, we'll add a FeedURI that supports GET to return a list of the entries. Supporting the listed operations on these URIs is all that's needed to have a fully functioning Atom server. (This of course ignores SOAP, which I'll cover later.)
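As a rough summary of the three kinds of URI, the methods each one must accept can be written down as a small table. This is only a sketch, written in current Python 3 rather than the Python 2 of the listings in this article; the names ALLOWED_METHODS and is_allowed are invented for illustration and are not part of PikiPiki or the AtomAPI.

```python
# Hypothetical summary table: which HTTP methods each kind of Atom URI accepts.
ALLOWED_METHODS = {
    "EditURI": {"GET", "PUT", "DELETE"},  # one per entry (one per WikiWord)
    "PostURI": {"POST"},                  # creates new entries
    "FeedURI": {"GET"},                   # lists the entries
}


def is_allowed(uri_kind, method):
    """Return True if the given kind of URI should accept the HTTP method."""
    return method.upper() in ALLOWED_METHODS.get(uri_kind, set())
```

A server built this way would answer anything outside the table with an error status rather than silently accepting it.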

Character Encoding

Character encoding is often overlooked, yet it's an important part of working with any XML format, and Atom is no exception. Before making any additions to PikiPiki we'll need to make a few small changes to ensure that all of our data is encoded correctly. For a good introduction to character encoding, consult the excellent tutorial by Jukka Korpela.

To make things easier we can encode all of PikiPiki's data as UTF-8. There are many encodings to choose from, all with different advantages and disadvantages, but UTF-8 has some special properties: it allows us to use any Unicode character, it lets us treat the data for the most part like regular "C" strings, and it is guaranteed to be supported by any conforming XML parser. Also, support for UTF-8 is one of the few things that most browsers do right.

Since this is a wiki, and for now all the data coming into it comes through a form, we need to ensure that all incoming data is encoded as UTF-8. The easiest way to do this is by specifying that the encoding for the form page is UTF-8; lacking any other indications, a browser will submit the data from a form using the same character encoding that the page is served in. While HTML forms can specify alternate character sets that the server will accept when data is submitted, via the accept-charset attribute, support for this is spotty (meaning it worked perfectly in Mozilla, and I failed to get it working in Microsoft's Internet Explorer). So our first change to PikiPiki is to add a meta tag to the generated HTML.

def send_title(text, link=None, msg=None, wikiword=None):
  print "<head><title>%s</title>" % text
  # Declare UTF-8 so that browsers submit form data in the same encoding.
  print ('<meta http-equiv="Content-Type" '
         'content="text/html; charset=utf-8">')

Now all of our web pages should submit UTF-8 encoded data, and since all of the web pages produced from the wiki are combinations of ASCII markup embedded in the Python program and the UTF-8 in the stored wiki entries, we can be sure our output is UTF-8.
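To see why that works, here is a minimal sketch, in current Python 3 rather than the Python 2 of the listings, of ASCII markup being combined with UTF-8 page text. The function name render_page and the sample text are invented for illustration. Because ASCII is a subset of UTF-8, the combined bytes are still valid UTF-8.

```python
def render_page(title, stored_bytes):
    """Wrap stored UTF-8 page text in ASCII-only markup and return bytes."""
    head = (
        "<head><title>%s</title>"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        "</head>" % title
    ).encode("ascii")  # fails loudly if the markup is not pure ASCII
    return head + b"<body>" + stored_bytes + b"</body>"


# Text as a browser would submit it from a UTF-8 page, then stored as bytes.
stored = "Caf\u00e9 \u2013 na\u00efve".encode("utf-8")
page = render_page("FrontPage", stored)

# The whole page decodes cleanly as UTF-8.
text = page.decode("utf-8")
```

The sketch does no HTML escaping of the page text; it only demonstrates the encoding argument.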

A Wiki revolves around WikiWords, mixed-case words that are the title for and unique identifiers of every page on the wiki. In the case of PikiPiki, the WikiWord is also the filename that the text of the page is stored in.

The next change is to move the configuration of PikiPiki into a separate file. We'll be creating two new CGI programs to handle the AtomAPI, and they both need access to some configuration information. The configuration section is just a set of global variables that we'll move into piki_conf.py:

from os import path
import cgi

data_dir = '/home/myuserpath/piki.bitworking.org/'
text_dir = path.join(data_dir, 'text')
editlog_name = path.join(data_dir, 'editlog')
cgi.logfile = path.join(data_dir, 'cgi_log')
logo_string = '<img src="/piki/pikipiki-logo.png" border=0 alt="pikipiki">'
changed_time_fmt = ' . . . . [%I:%M %p]'
date_fmt = '%a %d %b %Y'
datetime_fmt = '%a %d %b %Y %I:%M %p'
show_hosts = 0                   
css_url = '/piki/piki.css'       
nonexist_qm = 0                  


The next task at hand is to handle the functions of the EditURI. In the AtomAPI each entry has an associated EditURI, a URI you can dereference in order to retrieve the representation of the entry. You can also PUT an Atom entry to the EditURI to update the entry. In this case, each definition of a WikiWord in PikiPiki will act as a single entry. To handle the EditURI functions we'll create a Python script atom.cgi.

First let's map out the GET. We need to package up the UTF-8 encoded contents of a WikiWord and send it back, so we need to decide on the form of the URI we are going to use. In this case we are going to be calling a CGI program and need to pass in the WikiWord as a parameter. We could pass it in either as a query parameter or as a sort of path. For example, in the first case, if the WikiWord was "FrontPage", the EditURI could be atom.cgi?wikiword=FrontPage. In the second case, the EditURI might be atom.cgi/FrontPage. We'll choose the latter; the WikiWord will be passed in via the "PATH_INFO" environment variable.

def main(body):
  method = os.environ.get('REQUEST_METHOD', '')
  wikiword = os.environ.get('PATH_INFO', '/')
  wikiword = wikiword.split("/", 1)[1]      
  wikiword = wikiword.strip()

  word_anchored_re = re.compile(WIKIWORD_RE)

  if method == 'POST':
    ret = create_atom_entry(body)
  elif word_anchored_re.match(wikiword):
    if method in ['GET', 'HEAD']:
      ret = get_atom_entry(wikiword)
    elif method == 'PUT':
      ret = put_atom_entry(wikiword, body)
    elif method == 'DELETE':
      ret = delete_atom_entry(wikiword)
    else:
      # A valid WikiWord, but a method we don't support on the EditURI.
      ret = report_status(405,
        "Method not allowed", "")
  else:
    ret = report_status(400, "Not a valid WikiWord",
      "The WikiWord you referred to is invalid.")
  return ret[1]

Our CGI pulls the HTTP method from the environment variable "REQUEST_METHOD" and the WikiWord from the "PATH_INFO" environment variable. Based on those two pieces of information we dispatch to the correct function. When we process GET we are careful to respond to HEAD requests as well. This matters because the Apache web server will do the right thing with our response to a HEAD request: it generates the right headers and sends only the headers, discarding the body.
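The dispatch decision can be isolated from the CGI environment and tried out on its own. The following is a sketch in current Python 3, not the code from atom.cgi: the function names parse_wikiword and choose_action are made up, and the regular expression is only an assumed stand-in for PikiPiki's WIKIWORD_RE, which is not shown in this article.

```python
import re

# Assumed pattern: two or more capitalized word parts, e.g. "FrontPage".
# PikiPiki's real WIKIWORD_RE may differ.
WIKIWORD_RE = re.compile(r"^(?:[A-Z][a-z]+){2,}$")


def parse_wikiword(path_info):
    """Return the WikiWord from a PATH_INFO value such as '/FrontPage'."""
    # partition() also copes with an empty PATH_INFO, yielding ''.
    return path_info.partition("/")[2].strip()


def choose_action(method, path_info):
    """Return ('ok', action) or ('error', status) without touching any files."""
    if method == "POST":
        return ("ok", "create")
    wikiword = parse_wikiword(path_info)
    if not WIKIWORD_RE.match(wikiword):
        return ("error", 400)
    actions = {"GET": "get", "HEAD": "get", "PUT": "put", "DELETE": "delete"}
    if method in actions:
        return ("ok", actions[method])
    return ("error", 405)
```

For example, choose_action("HEAD", "/FrontPage") selects the same action as a GET, while an unknown method on a valid WikiWord yields a 405.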

def get_atom_entry(wikiword):
  filename = getpath(wikiword)
  base_uri = piki_conf.base_uri
  if path.exists(filename):
    issued = last_modified_iso(filename)
    content = open(filename, 'r').read()
  else:
    # No page yet: return a placeholder entry rather than a 404.
    issued = currentISOTime()
    content = "Create this page."
  # ENTRY_FORM picks up the local variables by name via vars().
  return (200, ENTRY_FORM % vars())
Where ENTRY_FORM is defined as:

"""Content-type: application/atom+xml; charset=utf-8
Status: 200 Ok

<?xml version="1.0" encoding='utf-8'?>
<entry xmlns="http://purl.org/atom/ns#">
    <link rel="alternate" type="text/html" 
        href="%(base_uri)s/%(wikiword)s" />
    <issued>%(issued)s</issued>
    <content type="text/plain">%(content)s</content>
</entry>"""

There are two important points to note about this code. The first is what we do if the desired WikiWord does not exist. If we were writing this for a typical CMS, a GET for an entry that didn't exist would normally return a status code of 404. Wikis, in contrast, when dealing with the HTML content, present what appears to be an infinite URI space. That is, you can request any URI at a wiki and, as long as you specify a validly formed WikiWord, you won't get a 404. Instead you will get a web page that prompts you to enter the content for that WikiWord. Go ahead and try it on the PikiPiki wiki that is set up for testing this implementation of the AtomAPI. This WikiWord currently doesn't have a definition: http://piki.bitworking.org/piki.cgi/SomeWikiWordThatDoesntExist. To keep parity with the HTML interface, the AtomAPI interface works the same way.

The second point is character encoding. Note that we state character encoding in two places in the response, both in the HTTP header Content-type: and in the XML Declaration.
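The two declarations can be checked by building a response and parsing it back. This is a sketch in current Python 3 with a cut-down, hypothetical template (HEADERS, BODY, and build_response are not names from atom.cgi). It also escapes the page text before interpolating it, which is an addition of this sketch: the listing above inserts the content as-is, so text containing "<" or "&" would need this kind of treatment to stay well-formed.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical, cut-down template: only the content element of the entry.
HEADERS = "Content-type: application/atom+xml; charset=utf-8\n\n"
BODY = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<entry xmlns="http://purl.org/atom/ns#">\n'
    '    <content type="text/plain">%(content)s</content>\n'
    "</entry>\n"
)


def build_response(content):
    """Return the CGI response as UTF-8 bytes, with the content XML-escaped."""
    body = BODY % {"content": escape(content)}
    return (HEADERS + body).encode("utf-8")


response = build_response("Tom & Jerry <caf\u00e9>")

# Split headers from body and parse the body back as XML.
head, _, body = response.partition(b"\n\n")
round_trip = ET.fromstring(body).find("{http://purl.org/atom/ns#}content").text
```

The bytes on the wire are UTF-8, and both the Content-type header and the XML declaration say so, so a client has no reason to guess.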

There are two more HTTP methods to handle for the EditURI, DELETE and PUT. PUT is used to update the content for a WikiWord, replacing the existing content with that delivered by the PUT. DELETE is used to remove an entry; it's easy to implement: just delete the associated file.

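def delete_atom_entry(wikiword):
  ret = report_status(200, "OK", "Delete successful.")
  if wikiwordExists(wikiword):
    try:
      # Deleting an entry just means removing the file that stores it.
      os.unlink(getpath(wikiword))
    except OSError:
      ret = report_status(500, "Internal Server Error",
        "Can't remove the file associated with that word.")
  return ret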

Note that unless something really bad happens, we return with a status code of 200 OK. That is, if the entry doesn't exist then we still return 200. You might be scratching your head if you remember we just talked about our implementation always returning an entry for every valid WikiWord, whether or not it actually had filled in content. That is, if you come right back and do a GET on the URI we just DELETE'd, it will not give you a 404, but instead will return the default filled in entry, "Create this page". Is this a problem? No. It may seem a bit odd, but it's not a problem at all. DELETE and GET are two different, orthogonal requests. There is no guarantee that some other agent, or some process on the server itself, didn't come along and recreate that URI between the DELETE and the GET.
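The "DELETE always succeeds" behavior is easy to demonstrate against a temporary directory. The sketch below is in current Python 3 and uses a made-up helper, delete_entry, that returns a bare status code instead of calling the article's report_status; it is not the code from atom.cgi.

```python
import os
import tempfile


def delete_entry(text_dir, wikiword):
    """Delete the file for wikiword; return 200 unless removal fails."""
    filename = os.path.join(text_dir, wikiword)
    if os.path.exists(filename):
        try:
            os.remove(filename)
        except OSError:
            return 500
    return 200


with tempfile.TemporaryDirectory() as text_dir:
    page = os.path.join(text_dir, "FrontPage")
    with open(page, "w", encoding="utf-8") as f:
        f.write("Welcome")
    first = delete_entry(text_dir, "FrontPage")   # the file existed
    second = delete_entry(text_dir, "FrontPage")  # the file is already gone
    still_there = os.path.exists(page)
```

Both calls report success, so a client that repeats a DELETE after a timeout gets the same answer as the first time.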

Supporting PUT allows us to change the content of a WikiWord. To make the handling of XML easier I've used the Python wrapper for libxml2, an excellent tool for handling XML, in particular because it lets you use XPath expressions to query XML documents. In this case we're using them to pull out the content element.

def put_atom_entry(wikiword, content):
  ret = report_status(200, "OK",
    "Entry successfully updated.")
  doc = libxml2.parseDoc(content)
  ctxt = doc.xpathNewContext()
  ctxt.xpathRegisterNs('atom', 'http://purl.org/atom/ns#')
  text_plain_content_nodes = ctxt.xpathEval(
    '/atom:entry/atom:content[@type="text/plain" or not(@type)]')
  all_content_nodes = ctxt.xpathEval('/atom:entry/atom:content')

  content = ""
  if len(text_plain_content_nodes) > 0:
    content = text_plain_content_nodes[0].content

  if len(text_plain_content_nodes) > 0 or len(all_content_nodes) == 0:
    writeWordDef(wikiword, content)
    append_editlog(wikiword, os.environ.get('REMOTE_ADDR', ''))
  else:
    # If there are 'content' elements but of some unknown type
    ret = report_status(415, "Unsupported Media Type",
      "This wiki only supports plain text")

  return ret

The detail to notice in the implementation is the XPath used to pick out the content element. Content elements may have a 'type' attribute, but if it is not present then it defaults to 'text/plain'. Since 'text/plain' is the only type of content we can support in a wiki, it's the only type of content we'll look for.
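The same selection rule can be tried without libxml2. The sketch below swaps in the standard library's xml.etree.ElementTree, in current Python 3, purely so it runs anywhere; it is not what atom.cgi does. ElementTree's XPath subset has no "or not(@type)", so the default is applied in Python instead. The function name extract_plain_text is invented, and it returns a (status, text) pair rather than writing to the wiki.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://purl.org/atom/ns#}"


def extract_plain_text(entry_xml):
    """Return (status, text) for an Atom entry given as a string or bytes."""
    root = ET.fromstring(entry_xml)
    # Only content elements directly under a root atom:entry count.
    contents = root.findall(ATOM_NS + "content") if root.tag == ATOM_NS + "entry" else []
    # A missing type attribute defaults to text/plain.
    plain = [c for c in contents if c.get("type", "text/plain") == "text/plain"]
    if plain:
        return (200, "".join(plain[0].itertext()))
    if not contents:
        return (200, "")  # no content element at all: store empty text
    return (415, "")      # content present, but of a type we cannot store
```

An entry whose only content is typed text/html comes back as 415, while one with no type attribute is accepted as plain text.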

That takes care of the EditURI; we just have the PostURI and FeedURI to go.
