XML.com: XML From the Inside Out

Weblogs, Publish-Subscribe, and Web Collections: A REST Analysis

December 01, 2004

I've been exploring publish-subscribe and web-based notifications for quite some time. A few years ago, I started a project called SearchAlert in order to learn directly what it might take to do notifications at the scale of the web. Although the service hasn't evolved into anything I would call large scale, it has given me a chance to write software to experiment with web-based notifications. As a weekend project, I didn't expect SearchAlert to become a company, and when Google started to offer free email notifications of new search results, that pretty much put an end to any notifications-as-business idea. At that time, SearchAlert was primarily providing email notifications of new search results, and I knew that my little experiment wasn't going to go much further along those lines.

Recently, I've started looking at what PubSub is doing--its recent performance numbers sound intriguing (something like 2.4 million "matches" per second). There are other great blog-based services such as Feedster, Technorati, and, of course, Blogdex. To understand what these services have in common, I've applied REST architectural concepts--specifically, resource modeling--to PubSub, SearchAlert, Feedster, and Google alerts; I then try to predict where these will go next and who else might get into the game. There are other great subscription and notification services out there, and I chose to focus on these for no particular reason. The first section of this article deals with user notifications, and I describe blog notifications later in the article.

First, PubSub provides subscription and notification technology for notifying users about weblogs, newsgroups, and EDGAR filings. Next, SearchAlert provides subscription and notification technology for notifying users about web search results. Feedster provides subscription and notification technology for notifying users about weblogs. And, last, Google provides subscription and notification technology for notifying users about web search results. Even though many sites and companies talk about publish-subscribe technology with respect to blogs, web pages, and other sources, there usually isn't any "publish" in the technology. This isn't a bad thing, but it's more accurate to describe these as subscription and notification services. I'll say more below about why I think subscription and notification services are currently enjoying more success than generalized publish-subscribe services.

Let's break things down a bit more. In the area of subscribing, PubSub supports subscriptions over several sets of data: weblogs, newsgroups, EDGAR financial filings, airport delays, and press releases. Feedster supports subscriptions only over weblogs, which it calls "dynamic feeds." Both SearchAlert and Google support subscriptions over two sets of data: general web search results and web search results focused on "news."

Subscription management varies across these four providers: PubSub will list your subscriptions and give you edit and delete capabilities, and SearchAlert will also list your subscriptions and provide edit and delete capabilities. Feedster is a little hard to figure out; it doesn't appear to have subscription management other than deleting dynamic feeds. Google recently added a way of managing and modifying subscriptions, and you can cancel a specific subscription from a link within the email notification generated from that subscription.

For notifications, the PubSub service supports Jabber-based IM and an RSS or Atom feed hosted on its site. The SearchAlert service supports daily or immediate notifications via email as well as proactive web notifications, that is, sending XML to a URI of your choice (this includes Atom-formatted XML to your blog, Weblogger API calls, etc.). Also, SearchAlert has search results in RSS, but it isn't a generally advertised feature, and trying to scale the load would cost too much at this point. Feedster supports an RSS feed hosted on its site as well as email notifications. Google supports daily or immediate email notifications.

I was initially hesitant to call an Atom or RSS feed a true notification mechanism because these are accessed by polling and not proactively sent by the service. However, nothing prevents the content of these feeds from being sent the moment the service is aware of a change. The SearchAlert service does exactly this via HTTP or SMTP, and PubSub does it via the XMPP protocol used by Jabber-based instant messaging systems.

So how do REST and resource modeling fit into all this, and how RESTful are these different approaches? In each of these systems, a user provides search terms, and the system sends notifications about the search results, which is essentially a "saved search." There are two resources: the search and the collection of search results. It's the collection of search results that is key. The notification system pays attention to this resource. In other words, items added to this collection will generate a notification. Theoretically, items removed or resequenced could also generate notifications. These systems are basically a large search index where the results are available as XML.
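As a rough illustration of "items added to this collection will generate a notification," here is a minimal sketch in Python. It assumes each search result can be identified by its URI; the function name `new_items` and the example.org URIs are hypothetical and do not come from any of the services discussed.

```python
from typing import Iterable, List


def new_items(previous: Iterable[str], current: Iterable[str]) -> List[str]:
    """Return the items in `current` that were not in `previous`, in order."""
    seen = set(previous)
    added = []
    for item in current:
        if item not in seen:
            added.append(item)
            seen.add(item)  # also guards against duplicates within `current`
    return added


# Two snapshots of the same "collection of search results" resource.
previous_results = ["http://example.org/a", "http://example.org/b"]
current_results = [
    "http://example.org/c",
    "http://example.org/a",
    "http://example.org/b",
]

# Only the newly added item would trigger a notification.
to_notify = new_items(previous_results, current_results)
print(to_notify)
```

A service could run the same comparison the moment its index changes (push) or each time a client polls; the resource model is the same either way.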

One aspect of the REST architectural style is that a resource does not have to be explicitly created before it is accessed; it merely needs to be identified. This allows for a really large number of resources with no prior interactions between clients and servers, and the only up-front coordination is to determine how the identifier is formed. Another aspect of REST is that resources have representations, sometimes multiple representations for the same resource. Currently, collections of search results have different representations between different providers. Google's first choice was to provide search results in an HTML representation, which is, after all, how it became rich and famous. For PubSub, it's Atom-flavored XML. From a REST point of view, these are equivalent and interchangeable.
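To make the two points above concrete, the following Python sketch forms an identifier for a search-results resource directly from the search terms and then prepares requests for two representations of it. The host `search.example.org`, the `/search` path, and the `q` parameter are assumptions for illustration, not any provider's actual URI scheme. No request is sent.

```python
from urllib.parse import urlencode, urlunsplit
from urllib.request import Request


def search_results_uri(host: str, terms: str, path: str = "/search") -> str:
    """Form the identifier for a search-results resource from its terms."""
    query = urlencode({"q": terms})  # hypothetical parameter name
    return urlunsplit(("http", host, path, query, ""))


def request_for(uri: str, media_type: str) -> Request:
    """Prepare (but do not send) a GET asking for one representation."""
    return Request(uri, headers={"Accept": media_type})


# The only up-front coordination is knowing how the URI is formed.
uri = search_results_uri("search.example.org", "mars rover")

# Same resource, two representations.
html_request = request_for(uri, "text/html")
atom_request = request_for(uri, "application/atom+xml")

print(uri)
print(atom_request.get_header("Accept"))
```

Whether a server actually honors the `Accept` header or instead exposes each representation at its own URI (as a "view as XML" link would) is a design choice; the sketch only shows that the resource can be named without creating it first.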

The search results resource is the most interesting because it is very similar to a weblog, which is simply a collection of items. An RSS or Atom feed is a format for a list of items. (Simple HTML with unordered-lists and list-items is too, but where's the fun in that?) Blogs are generated manually by an author or editor, while feeds are (typically) generated automatically. Search results are also just lists of items--that's what PubSub does for blogs and Google does for the web and for news. And, it's what Amazon does with popular products. All of these are just web resources that are search results generated from a (conceivably very large) search index.

I think PubSub is in for some serious competition because creating, hosting, and serving up very large search indices are the bases for Google's and Amazon's core applications. There are companies such as Technorati that do this just for blogs. It would be straightforward for Technorati or Google to provide search results in RSS and Atom. Presto: instant competition. While doing this investigation, I was struck by the popularity of application-specific subscription and notification services such as technorati.com, PubSub, Google alerts, and others, especially by way of comparison to generalized publish-subscribe services like mod-pubsub.org. I was unable to find a good technical reason for this. The particular notification techniques and protocols didn't seem to matter--some use proactive HTTP notifications, others poll via HTTP, some use SMTP, and others use XMPP. I've come to the conclusion that, in this case as in so many others, content is king. The value of this emerging network is the available selection, and neither developers nor customers particularly care how they get it, just as long as they can get it.

So How RESTful Are These Approaches?

Here's what I've found so far:

  • Google supports interacting with a resource for search results in one step: merely identify the resource by putting the search terms in the URI. Both RESTful and useful. Suggestion: add a "view as XML" link on the search results page.
  • Google's subscription management pages allow you to switch notification formats between text and HTML by visiting a link. Not RESTful. A utility that fetches web pages based on links would modify your configuration.
  • Google's subscription management pages allow you to delete a subscription by visiting a link, although they attempt to protect it via an onClick JavaScript handler. Not safe and not RESTful. A utility that fetches web pages based on links might delete all of your subscriptions, but I'm not sure how that utility would learn about this URI. Suggestion: change the "delete" link on the subscription management page to use an HTML form.
  • Google's email notifications have a URI to cancel your subscription, and merely visiting the page cancels the subscription. Not RESTful and double plus ungood. A utility that automatically pre-fetches pages referenced in your email would silently cancel these subscriptions. I wonder if the Google Desktop application that indexes local email would trigger this undesired behavior? Suggestion: have the "cancel subscription" link point to a confirmation page that uses an HTML form.
  • Feedster supports interacting with a resource for search results in one step; again, put the search terms in the URI. Both RESTful and useful. Bonus points for having auto-discovery of an RSS rendition of the search results.
  • Technorati supports interacting with a resource for search results in one step: put the search terms in the URI. Both RESTful and useful. Suggestion: for Technorati, add auto-discovery and a "view as XML" link on the search results page similar to what Feedster does.
  • Technorati supports creating a watch list but only for certain types of search results (for example, "marsrover.com" but not "mars rover"), and a separate URI is created for these search results. RESTful but not useful. Suggestion: omit the extra step of creating a separate identifier for this resource.
  • Both SearchAlert and PubSub require a two-step process to access search results: submit the search terms and a magic URI is created. RESTful but not the most useful. Suggestion: omit the extra step of creating a separate identifier for this resource.
  • PubSub provides an XML representation of search results, and it also uses client-side style sheets to display the results as HTML in a modern browser. Nicely RESTful and extra credit for using client-side processing in a standards-compliant way.
  • Amazon.com provides RSS for popular products. RESTful but not useful--I'm not interested in what other people are interested in; I want my search results. Suggestion: support RSS for recommendations and general search results via auto-discovery and a "view as XML" link on search results pages.
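The recurring complaint in the list above is that a plain GET changes state. The sketch below shows the suggested alternative in Python: GET returns a confirmation form, and only POST cancels. The `handle` function, the `/subscriptions/42` path, and the in-memory store are hypothetical; this is not how any of the services above are implemented.

```python
from typing import Dict, Tuple


def handle(method: str, path: str, store: Dict[str, str]) -> Tuple[int, str]:
    """Return (status, body) for a tiny subscription resource."""
    sub_id = path.rsplit("/", 1)[-1]
    if sub_id not in store:
        return 404, "no such subscription"
    if method == "GET":
        # Safe: nothing changes, so crawlers and pre-fetchers are harmless.
        form = f'<form method="post" action="{path}"><button>Cancel</button></form>'
        return 200, form
    if method == "POST":
        # Unsafe by design: only an explicit form submission gets here.
        del store[sub_id]
        return 200, "subscription cancelled"
    return 405, "method not allowed"


demo_store = {"42": "mars rover"}

# A link-following utility only issues GETs, so the subscription survives.
status, body = handle("GET", "/subscriptions/42", demo_store)
print(status, "42" in demo_store)
```

With this split, the "cancel subscription" link in an email can safely point at the GET page, since following it does nothing until the user submits the form.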

Weblog Notifications

Several years ago, key developers in the blogging community created two notification approaches called TrackBack and Pingback. These provide ways for blog authors to notify other blogs that there are new links to them; and if other blogs do the same, it allows an author to be notified of new links to her or his blog. The notification can also be sent to a completely different service, for example, one that collects notifications from all blogs. This notification aggregation service can use these messages to maintain a search index of blogs that is more up-to-date than one built purely from crawling all blogs.

The term used in the blogging community for notifications is "ping," which unfortunately for me has always meant "get current status" rather than "send current status." I guess I'll just have to adapt. There are different types of weblog notifications, but each one is similar in an important way: they are requests to update the state of various types of web resources. The types of resources used by blog notifications may be described as follows:

  • Add to the collection of comments for a blog entry.
  • Add to a collection of related items for a blog entry.
  • Add to a collection of subject-specific blog entries.

Blog notifications are very similar in concept to the original message that updated the author's own blog. The general concept of blog management and blog notification is one of updating the state of a web resource that is a collection of items. A very useful part of the blog notifications system built by the community is auto-discovery. This allows client applications to learn where to route messages about a blog entry. Although it would be possible to send special messages to a blog entry directly, developers realized that it would be simpler, more general, and more flexible to have separate but related resources for this purpose. This allows developers to create resources for a collection of comments, a list of referring sites, etc. One aspect of REST is to use hypertext as the engine of application state, and the capability of auto-discovery is a great example of this. The tags within the retrieved content inform the client of available resources and help set expectations about the available actions.
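Auto-discovery can be sketched in a few lines of Python using only the standard library. The class name `LinkDiscovery`, the helper `discover`, and the sample page with its blog.example.org URLs are made up for illustration; real pages may advertise other relations, and RDF-based discovery is not covered here.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LinkDiscovery(HTMLParser):
    """Collect (rel, type, href) from <link> tags in a retrieved page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, Optional[str], str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rel, href = a.get("rel"), a.get("href")
        if rel and href:
            self.links.append((rel.lower(), a.get("type"), href))


def discover(html: str) -> List[Tuple[str, Optional[str], str]]:
    parser = LinkDiscovery()
    parser.feed(html)
    return parser.links


page = """<html><head>
<link rel="alternate" type="application/rss+xml" href="http://blog.example.org/index.rss">
<link rel="pingback" href="http://blog.example.org/xmlrpc">
</head><body>An entry.</body></html>"""

for rel, media_type, href in discover(page):
    print(rel, media_type, href)
```

The client starts with nothing but the entry's URI; the tags in the retrieved content tell it which related resources exist and where to send messages about the entry.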

How RESTful Are These Approaches?

  • TrackBack sends notifications with pure HTTP. RESTful.
  • Pingback sends notifications with XML-RPC. Not very RESTful, but arguably useful for some developers.
  • TrackBack supports auto-discovery via embedding RDF within an entry or feed. RESTful.
  • Pingback uses a link tag within an HTML page or an HTTP response header. Very RESTful. An additional benefit of using the more general approach of an HTTP response header is the ability to apply notifications to non-HTML resources such as JPEG resources. A Pingback-compatible server could expose and collect comments about pictures or PDF files, etc.
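The difference between the two message styles can be seen by building (not sending) each one in Python. As I understand the two specifications, a TrackBack ping is a form-encoded HTTP POST whose required field is `url`, and a Pingback is an XML-RPC call to `pingback.ping` with a source and a target URI; the helper names and the example.org URIs below are hypothetical.

```python
import xmlrpc.client
from urllib.parse import urlencode


def trackback_body(url: str, title: str = "", excerpt: str = "",
                   blog_name: str = "") -> str:
    """Form-encoded body for a TrackBack ping; empty fields are omitted."""
    fields = {"url": url, "title": title, "excerpt": excerpt,
              "blog_name": blog_name}
    return urlencode({k: v for k, v in fields.items() if v})


def pingback_body(source_uri: str, target_uri: str) -> str:
    """XML-RPC request body for a Pingback notification."""
    return xmlrpc.client.dumps((source_uri, target_uri),
                               methodname="pingback.ping")


source = "http://mine.example.org/entry/1"
target = "http://theirs.example.org/entry/9"

tb = trackback_body(source, title="My reply")
pb = pingback_body(source, target)

print(tb)   # plain name=value pairs, POSTed to the entry's TrackBack URI
print(pb)   # an XML envelope naming a method, POSTed to one shared endpoint
```

The TrackBack body is addressed to a URI that stands for the entry's collection of pings, while the Pingback body carries the target inside the message and goes to a single RPC endpoint, which is the reason for the "not very RESTful" verdict above.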

