Dreaming of an Atom Store: A Database for the Web

September 21, 2005

Joe Gregorio

After a year of work -- two if you count the work done before the AtomPub WG was formed -- the Atom Publishing Protocol (APP) is moving closer to being done. Some parts are still unspecified, and others are under debate even now. Even so, some of the protocol is very stable, unchanged from the earliest drafts. Despite the incomplete nature of the APP, plenty of people are excited about it and are beginning to imagine all sorts of uses for it beyond a weblog publishing API, which was its original target. Over the last couple of months several of those ideas have collided often enough that we're starting to see a stable fusion. The ideas that are colliding are the APP and Amazon's OpenSearch. Let's take a moment for a quick recap:

Atom Publishing Protocol

The Atom Publishing Protocol leverages the work done on the Atom Syndication Format and the basics of HTTP to form a simple, yet powerful, publishing protocol. One of the things that doesn't get noticed at first about the Atom Syndication Format is that not only are feeds first class documents, but so are entries. This, for example, is a valid Atom document:

<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Atom draft-07 snapshot</title>
  <link rel="alternate" type="text/html"
   href="http://example.org/2005/04/02/atom"/>
  <link rel="enclosure" type="audio/mpeg" length="1337"
   href="http://example.org/audio/ph34r_my_podcast.mp3"/>
  <id>tag:example.org,2003:3.2397</id>
  <updated>2005-07-10T12:29:29Z</updated>
  <published>2003-12-13T08:29:29-04:00</published>
  <author>
    <name>Mark Pilgrim</name>
    <uri>http://example.org/</uri>
    <email>f8dy@example.com</email>
  </author>
  <contributor>
    <name>Sam Ruby</name>
  </contributor>
  <contributor>
    <name>Joe Gregorio</name>
  </contributor>
  <content type="xhtml" xml:lang="en"
   xml:base="http://diveintomark.org/">
    <div xmlns="http://www.w3.org/1999/xhtml">
      <p><i>[Update: The Atom draft is finished.]</i></p>
    </div>
  </content>
</entry>

The Atom Publishing Protocol is all about pushing around Atom Entries. For now, we'll assume that the APP is just used for editing a weblog, and that a weblog is made up of entries. Note that that is a small "e" entry. For each small "e" entry there is a big "E" Entry that represents it, and each of those big "E" Entries lives at its own URI. Do an HTTP GET on that URI to get the Entry; PUT a new Entry to the URI to update the Entry, and the corresponding small "e" entry gets updated too. Do an HTTP DELETE on that URI and the small "e" entry is deleted. The Entries that represent the entries in the weblog are grouped together in a Collection. That, too, is a resource and has its own URI. To add a new entry to your weblog you POST an Entry to the Collection, which in turn creates the small "e" entry.
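
To make the verb-to-operation mapping concrete, here is a minimal in-memory sketch in Python. It is not an APP server and does not speak HTTP; the `Collection` class, the `make_entry` helper, the URI scheme, and the exact status codes returned are all assumptions made for illustration.

```python
import itertools
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


class Collection:
    """Toy stand-in for an APP Collection (hypothetical, in-memory only)."""

    def __init__(self, base_uri):
        self.base_uri = base_uri.rstrip("/")
        self._entries = {}              # member URI -> Entry XML string
        self._ids = itertools.count(1)  # source of new member URIs

    def post(self, entry_xml):
        """POST an Entry to the Collection: create a member, return (status, URI)."""
        ET.fromstring(entry_xml)        # raises ParseError if not well-formed XML
        uri = f"{self.base_uri}/{next(self._ids)}"
        self._entries[uri] = entry_xml
        return 201, uri

    def get(self, uri):
        """GET a member URI: return (status, Entry XML or None)."""
        if uri not in self._entries:
            return 404, None
        return 200, self._entries[uri]

    def put(self, uri, entry_xml):
        """PUT a new Entry to an existing member URI: replace it."""
        if uri not in self._entries:
            return 404
        ET.fromstring(entry_xml)
        self._entries[uri] = entry_xml
        return 200

    def delete(self, uri):
        """DELETE a member URI: remove the Entry."""
        if self._entries.pop(uri, None) is None:
            return 404
        return 200


def make_entry(title):
    """Build a bare-bones Atom Entry (title only) as an XML string."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    return ET.tostring(entry, encoding="unicode")


blog = Collection("http://example.org/entries")
status, uri = blog.post(make_entry("First post"))   # create
blog.put(uri, make_entry("First post, revised"))    # update
blog.delete(uri)                                    # remove
```

The point of the sketch is only that four uniform operations on two kinds of resource, the Collection and its member Entries, cover the whole editing cycle.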

OpenSearch

Amazon's A9 service launched two years ago to research and build innovative technologies to improve the search experience for e-commerce applications. One of those technologies is OpenSearch, a CC-licensed specification for search using Atom and RSS. In other words, OpenSearch defines a RESTful web service for searching, including a format for advertising what kind of search your site supports, and specifying how to return your search results in Atom or RSS.
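
As a rough illustration of the "advertising" half, a site publishes a URL template and a client fills it in to run a search. The sketch below shows only that substitution step in Python. The placeholder names `{searchTerms}` and `{count}` follow my recollection of the OpenSearch drafts and should be treated as assumptions, as should the `fill_template` helper and the example.org URL.

```python
import re
from urllib.parse import quote


def fill_template(template, **params):
    """Replace each {name} placeholder with its URL-encoded value.

    Raises KeyError if the template names a parameter that was not supplied.
    """
    def substitute(match):
        name = match.group(1)
        return quote(str(params[name]), safe="")

    return re.sub(r"\{(\w+)\}", substitute, template)


# Hypothetical template of the kind a description document might advertise.
TEMPLATE = "http://example.org/search?q={searchTerms}&n={count}"

url = fill_template(TEMPLATE, searchTerms="atom store", count=10)
```

A client would then do an HTTP GET on `url` and expect an Atom or RSS feed of results in return.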

Two Great Tastes That Taste Great Together

Now it's these two ideas, the APP and OpenSearch, that have started to show up together. Imagine enhancing the APP method of editing your weblog with OpenSearch. You could search across all your entries and find the right ones you want to edit or delete. You already have the capability to return an Atom Entry for each entry on your site if you implement the APP, so returning a bunch of them in the form of a feed in response to an OpenSearch request isn't such a great leap.

Here is where the idea itself starts to break loose from its beginnings and take on a life of its own. Imagine that there isn't a weblog associated with all those entries. Imagine that you just have a huge glob of storage that you can store Atom Entries in, and which you can edit using the APP, and then search over using OpenSearch. That idea, that big blob of Atom Entries, all editable and searchable, is an Atom Store.
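
Here is a minimal Python sketch of the search half of that idea: given a pile of Atom Entries, return the matches wrapped in a feed. The `entry` and `search` helpers are hypothetical, matching is a naive substring test, and the feed omits elements a valid Atom feed needs (such as `id` and `updated`) to keep the example short.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def entry(title, body):
    """Build a bare-bones Atom Entry element (title and content only)."""
    e = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(e, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(e, f"{{{ATOM_NS}}}content").text = body
    return e


def search(entries, term):
    """Return a feed element holding every entry whose text mentions term."""
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = f"Search results for {term!r}"
    needle = term.lower()
    for e in entries:
        # itertext() walks all text inside the entry, whatever the element.
        text = " ".join(e.itertext()).lower()
        if needle in text:
            feed.append(e)
    return feed


store = [
    entry("Atom draft-07 snapshot", "The Atom draft is finished."),
    entry("Lunch", "Had a sandwich."),
]
results = search(store, "atom")
titles = [e.findtext(f"{{{ATOM_NS}}}title")
          for e in results.findall(f"{{{ATOM_NS}}}entry")]
```

Because the results are ordinary Entries, each one can be handed straight back to the editing side of the store.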

An Atom Store

The idea of an Atom Store has been bouncing around the blogosphere for a bit now, though not always called by that name. Jesse Andrews points out a few of the sources of inspiration, and as far as I know he was the first person to use the term "Atom Store":

  • Mark Pilgrim's magicline and monkey do could use it to store data

  • Rohit Khare & Ben Sittler at Commerce.net have been working on requirements of an Atom Store.

  • Joe Gregorios[sp], author of Atom Publishing Protocol, is researching it.

  • You can even hear Google's Adam Bosworth request it on IT Conversations, hoping MySQL folks don't become Oracle as Oracle doesn't scale the way an Atom Store could scale.

The range of applications that are being talked about here is breathtaking. The monkeydo and magicline usage of an Atom Store would be a remote persistence mechanism for a Greasemonkey script. Contrast that to the ideas that Adam Bosworth is talking about, databases that scale like Google's GFS does today.

It's All About the REST

That's a huge range of applications, but I think such a thing could happen. There are several forces driving it. First, you and I have lots of data, and it's stored in lots of places. I have my weblog, my email, my subscriptions to all my syndication feeds, maybe a del.icio.us and flickr account, and so on, and so on. You are not going to combine those all into one big, happy service. Ever.

I want my choices, and even if you are a big company and end up being able to provide all those services under one brand, I doubt I would trust all that data in one place. Instead of consolidating services, what syndication over the past five years suggests is that I can now aggregate feeds from all those places into a single dashboard that lets me view the status of my far-flung data empire in a single view. Now if all those sources of data not only supplied a feed, but also supported the interface of an Atom Store, that passive view changes into a real dashboard -- not only are those entries viewable, but they're editable from one spot.

Yes, I know that some aggregators support search, and some even support some of the current blogging APIs, but that's very different from every source being searchable and editable. An aggregator is only going to be able to search across entries that have appeared since it started subscribing to that feed, and not any earlier ones.

The other advantage of an Atom Store is that it's built on top of RESTful services. That means that we get the advantages of REST -- caching, uniform interfaces, and hypermedia as the engine of application state. For both OpenSearch and the APP there is an XML document that describes the capabilities of each endpoint; they are self-describing. That allows another service to come along and wrap several Atom Stores together by reading those description documents and then presenting itself as an Atom Store, an aggregate of all the stores it uses. That aggregate store could be a melange of your disparate data: your weblog, your email, etc. On the other hand, it could be a uniform series of servers, each with a subset of a huge store: now you're building a monster database.
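
The wrapping idea can be sketched in a few lines of Python. This is a toy under stated assumptions: `MemoryStore` and `AggregateStore` are hypothetical names, stores are in-memory dictionaries rather than remote services, and the step of reading description documents is skipped entirely. What it does show is that an aggregate exposing the same interface as its members can itself be a member of another aggregate.

```python
class MemoryStore:
    """Toy Atom Store: maps entry URIs to titles and supports search."""

    def __init__(self, entries):
        self.entries = dict(entries)  # URI -> title

    def search(self, term):
        """Return (URI, title) pairs whose title contains term."""
        needle = term.lower()
        return [(uri, title) for uri, title in self.entries.items()
                if needle in title.lower()]


class AggregateStore:
    """Presents several stores behind the same search interface."""

    def __init__(self, stores):
        self.stores = list(stores)

    def search(self, term):
        """Fan the query out to every member store and merge the results."""
        results = []
        for store in self.stores:
            results.extend(store.search(term))
        return results


weblog = MemoryStore({"http://example.org/blog/1": "Dreaming of an Atom Store"})
email = MemoryStore({"http://example.org/mail/7": "Re: atom store requirements"})
photos = MemoryStore({"http://example.org/photos/3": "Sunset"})

personal = AggregateStore([weblog, email])
everything = AggregateStore([personal, photos])  # aggregates nest
hits = everything.search("atom store")
```

Search fans out cleanly; creating a new entry through an aggregate is harder, since something has to decide which member store receives it.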

"Just" Use a Database

Aren't these just the same promises made in the early days of SQL? Sure they are, but I think an Atom Store has a better chance of living up to the hype, for several reasons. The first is that the data model is not wide open like SQL; the format is pretty restricted as far as the core elements of Atom are concerned. Second, the query and update operations are not nearly as comprehensive as SQL's. If you want to point to SQL as the only reasonable way to query over gigabytes of data, I'll just point to Google or Yahoo as counterexamples.

It's Not All Puppies and Roses

Now that I've got you all worked into a lather over how great the world will be with Atom Stores on every street corner, let me splash a little cold water in your direction. I've kind of glossed over some areas that need work. Some of the open questions are:

  • Indexing: Does indexing have to be immediate for the idea to be beneficial?

  • Annotating: How do you know where to POST to for creating new entries vs. annotations?

  • Creation: If I POST a new Entry to an aggregate of a bunch of Atom Stores, which of those Atom Stores should it be created in? How should I route that POST?

  • Foreign Markup: Let's say I wanted to use an Atom Store for storing all the customer transactions in my e-commerce store. To do that effectively I may have to add some extra information to an Atom Entry to fully represent a transaction. How and where is that information stored and indexed? Do I start creating microformats for all of that data or do I stuff it in the Entry as foreign markup? How much indexing of foreign markup is useful? Do we need specialized indexing and search terms for that?
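
For readers who have not met the term, the Python sketch below shows what foreign markup in an Entry might look like and how a store could at least detect it. The `shop` namespace, the `shop:total` element, and the `foreign_children` helper are invented for illustration; none of them comes from any specification, and the sketch does not answer the indexing questions above.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# An Entry carrying one element from a made-up, non-Atom namespace.
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:shop="http://example.org/ns/shop">
  <title>Order 1001</title>
  <id>tag:example.org,2005:order-1001</id>
  <updated>2005-09-21T12:00:00Z</updated>
  <shop:total currency="USD">19.95</shop:total>
</entry>"""


def foreign_children(entry_xml):
    """Return (tag, text) for each child of the entry outside the Atom namespace."""
    root = ET.fromstring(entry_xml)
    atom_prefix = f"{{{ATOM_NS}}}"
    return [(child.tag, child.text)
            for child in root
            if not child.tag.startswith(atom_prefix)]


extras = foreign_children(ENTRY)
```

Detecting the extra elements is the easy part; deciding which of them deserve an index, and how a query would name them, is the open question.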

As you can see there's plenty of work to be done. Let's roll up our sleeves and make it happen.