
Weblogs, Publish-Subscribe, and Web Collections: A REST Analysis

by Mike Dierken
December 01, 2004

I've been exploring publish-subscribe and web-based notifications for quite some time. A few years ago, I started a project called SearchAlert in order to learn directly what it might take to do notifications at the scale of the web. Although the service hasn't evolved into anything I would call large scale, it has given me a chance to write software to experiment with web-based notifications. As a weekend project, I didn't expect SearchAlert to become a company, and when Google started to offer free email notifications of new search results, that pretty much put an end to any notifications-as-business idea. At that time, SearchAlert was primarily providing email notifications of new search results, and I knew that my little experiment wasn't going to go much further along those lines.

Recently, I've started looking at what PubSub is doing--its recent performance numbers sound intriguing (something like 2.4 million "matches" per second). There are other great blog-based services such as Feedster, Technorati, and, of course, Blogdex. To understand what these services have in common, I've applied REST architectural concepts--specifically, resource modeling--to PubSub, SearchAlert, Feedster, and Google alerts, and then tried to predict where these will go next and who else might get into the game. There are other great subscription and notification services out there; I chose to focus on these for no particular reason. The first section of this article deals with user notifications, and I describe blog notifications later in the article.


First, PubSub provides subscription and notification technology for notifying users about weblogs, newsgroups, and EDGAR filings. SearchAlert does the same for web search results, Feedster for weblogs, and Google for web search results. Even though many sites and companies talk about publish-subscribe technology with respect to blogs, web pages, and other sources, there usually isn't any "publish" in the technology. This isn't a bad thing, but it's more accurate to describe these as subscription and notification services. I'll say more below about why I think subscription and notification services are currently enjoying more success than generalized publish-subscribe services.

Let's break things down a bit more. In the area of subscribing, PubSub supports subscriptions over several sets of data: weblogs, newsgroups, EDGAR financial filings, airport delays, and press releases. Feedster supports subscriptions only over weblogs, which it calls "dynamic feeds." Both SearchAlert and Google support subscriptions over two sets of data: general web search results and web search results focused on "news."

Subscription management varies across these four providers. PubSub and SearchAlert both list your subscriptions and give you edit and delete capabilities. Feedster is a little hard to figure out; it doesn't appear to have subscription management other than deleting dynamic feeds. Google recently added a way of managing and modifying subscriptions, and you can cancel a specific subscription from a link within the email notification generated from that subscription.

For notifications, the PubSub service supports Jabber-based IM and an RSS or Atom feed hosted on its site. The SearchAlert service supports daily or immediate notifications via email as well as proactive web notifications, that is, sending XML to a URI of your choice (this includes Atom-formatted XML to your blog, Weblogger API calls, etc.). SearchAlert also has search results in RSS, but it isn't a generally advertised feature, and scaling to handle the load would cost too much at this point. Feedster supports an RSS feed hosted on its site as well as email notifications. Google supports daily or immediate email notifications.

I was initially hesitant to call an Atom or RSS feed a true notification mechanism because these are accessed by polling and not proactively sent by the service. However, nothing prevents the content of these feeds from being sent the moment the service is aware of a change. The SearchAlert service does exactly this via HTTP or SMTP, and PubSub does it via the XMPP protocol used by Jabber-based instant messaging systems.

So how do REST and resource modeling fit into all this, and how RESTful are these different approaches? In each of these systems, a user provides search terms, and the system sends notifications about the search results; the subscription is essentially a "saved search." There are two resources: the search and the collection of search results. It's the collection of search results that is key, because this is the resource the notification system pays attention to. In other words, items added to this collection will generate a notification. Theoretically, items removed or resequenced could also generate notifications. These systems are basically a large search index where the results are available as XML.
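
To make the "items added to this collection generate a notification" idea concrete, here is a minimal sketch of how a service might compare two snapshots of a search-results collection. It is not how any of the services above are known to work; the function names and the use of plain string identifiers for result items are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Set


def new_items(previous_ids: Set[str], current_ids: Iterable[str]) -> List[str]:
    """Return ids in the current results that were not seen before,
    keeping the order in which the index returned them."""
    fresh = []
    seen = set(previous_ids)
    for item_id in current_ids:
        if item_id not in seen:
            fresh.append(item_id)
            seen.add(item_id)  # also guards against duplicates within one result set
    return fresh


def poll_once(previous_ids: Set[str],
              current_ids: Iterable[str],
              notify: Callable[[str], None]) -> Set[str]:
    """Send one notification per newly added item and return the updated
    set of seen ids. `notify` is a hypothetical delivery hook (email,
    HTTP POST, XMPP, and so on)."""
    added = new_items(previous_ids, current_ids)
    for item_id in added:
        notify(item_id)
    return set(previous_ids) | set(added)
```

Removed or resequenced items could be detected the same way by comparing in the other direction, or by comparing positions, but this sketch only covers additions.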

One aspect of the REST architectural style is that a resource does not have to be explicitly created before it is accessed; it merely needs to be identified. This allows for a really large number of resources with no prior interactions between clients and servers, and the only up-front coordination is to determine how the identifier is formed. Another aspect of REST is that resources have representations, sometimes multiple representations for the same resource. Currently, collections of search results have different representations between different providers. Google's first choice was to provide search results in an HTML representation, which is, after all, how it became rich and famous. For PubSub, it's Atom-flavored XML. From a REST point of view, these are equivalent and interchangeable.
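
As a rough illustration of "identified, not created," the sketch below forms the identifier for a search-results resource directly from the search terms, with an optional hint for which representation to return. The host name, the `q` parameter, and the `format` parameter are hypothetical and are not taken from any of the services discussed.

```python
from typing import Iterable, Optional
from urllib.parse import urlencode

# Hypothetical base URI; the only up-front coordination a client needs
# is knowing how identifiers are formed.
BASE = "http://search.example.org/results"


def search_results_uri(terms: Iterable[str], fmt: Optional[str] = None) -> str:
    """Build the URI of a search-results collection from its search terms.

    Nothing is created on the server beforehand; the URI alone identifies
    the resource. `fmt` (for example "atom" or "rss") selects one of
    several representations of the same resource.
    """
    query = {"q": " ".join(terms)}
    if fmt:
        query["format"] = fmt
    return BASE + "?" + urlencode(query)
```

A "view as XML" link on an HTML results page could simply point at the same identifier with a different representation requested.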

The search results resource is the most interesting because it is very similar to a weblog, which is simply a collection of items. An RSS or Atom feed is a format for a list of items. (Simple HTML with unordered-lists and list-items is too, but where's the fun in that?) Blogs are generated manually by an author or editor, while feeds are (typically) generated automatically. Search results are also just lists of items--that's what PubSub does for blogs and Google does for the web and for news. And, it's what Amazon does with popular products. All of these are just web resources that are search results generated from a (conceivably very large) search index.

I think PubSub is in for some serious competition because creating, hosting, and serving up very large search indices are the bases for Google's and Amazon's core applications. There are companies such as Technorati that do this just for blogs. It would be straightforward for Technorati or Google to provide search results in RSS and Atom. Presto: instant competition.

While doing this investigation, I was struck by the popularity of application-specific subscription and notification services such as technorati.com, PubSub, Google alerts, and others, especially by way of comparison to generalized publish-subscribe services like mod-pubsub.org. I was unable to find a good technical reason for this. The particular notification techniques and protocols didn't seem to matter--some use proactive HTTP notifications, others poll via HTTP, some use SMTP, and others use XMPP. I've come to the conclusion that, in this case as in so many others, content is king. The value of this emerging network is the available selection, and neither developers nor customers particularly care how they get it, just as long as they can get it.

So How RESTful Are These Approaches?

Here's what I've found so far:

  • Google supports interacting with a resource for search results in one step: merely identify the resource by putting the search terms in the URI. Both RESTful and useful. Suggestion: add a "view as XML" link on the search results page.
  • Google's subscription management pages allow you to switch notification formats between text and HTML by visiting a link. Not RESTful. A utility that fetches web pages based on links would modify your configuration.
  • Google's subscription management pages allow you to delete a subscription by visiting a link, although they attempt to protect it via an onClick JavaScript handler. Not safe and not RESTful. A utility that fetches web pages based on links might delete all of your subscriptions, but I'm not sure how that utility would learn about this URI. Suggestion: change the "delete" link on the subscription management page to use an HTML form.
  • Google's email notifications have a URI to cancel your subscription, and merely visiting the page cancels the subscription. Not RESTful and double plus ungood. A utility that automatically pre-fetches pages referenced in your email would silently cancel these subscriptions. I wonder if the Google Desktop application that indexes local email would trigger this undesired behavior? Suggestion: have the "cancel subscription" link point to a confirmation page that uses an HTML form.
  • Feedster supports interacting with a resource for search results in one step; again, put the search terms in the URI. Both RESTful and useful. Bonus points for having auto-discovery of an RSS rendition of the search results.
  • Technorati supports interacting with a resource for search results in one step: put the search terms in the URI. Both RESTful and useful. Suggestion: for Technorati, add auto-discovery and a "view as XML" link on the search results page similar to what Feedster does.
  • Technorati supports creating a watch list but only for certain types of search results (for example, "marsrover.com" but not "mars rover"), and a separate URI is created for these search results. RESTful but not useful. Suggestion: omit the extra step of creating a separate identifier for this resource.
  • Both SearchAlert and PubSub require a two-step process to access search results: submit the search terms and a magic URI is created. RESTful but not the most useful. Suggestion: omit the extra step of creating a separate identifier for this resource.
  • PubSub provides an XML representation of search results, and it also uses client-side style sheets to display the results as HTML in a modern browser. Nicely RESTful and extra credit for using client-side processing in a standards-compliant way.
  • Amazon.com provides RSS for popular products. RESTful but not useful--I'm not interested in what other people are interested in; I want my search results. Suggestion: support RSS for recommendations and general search results via auto-discovery and a "view as XML" link on search results pages.
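
The suggestions above about "cancel" and "delete" links come down to keeping GET safe: following a link should never change state. The sketch below shows one way a handler could behave, under the assumption that subscriptions live in a simple in-memory dictionary; the function name, the path layout, and the status codes chosen are illustrative only and do not describe any of the services reviewed.

```python
import html
from typing import Dict, Tuple
from urllib.parse import quote


def handle_cancel(method: str, sub_id: str,
                  subscriptions: Dict[str, str]) -> Tuple[int, str]:
    """Handle a request to a hypothetical /subscriptions/<id>/cancel URI.

    GET only renders a confirmation form, so link-following utilities
    and pre-fetchers cannot cancel anything. Only POST changes state.
    """
    if sub_id not in subscriptions:
        return 404, "No such subscription."
    if method == "GET":
        # Safe: state is untouched; the form's submit button issues the POST.
        query = html.escape(subscriptions[sub_id])
        form = (
            f'<form method="post" action="/subscriptions/{quote(sub_id)}/cancel">'
            f"<p>Cancel alerts for {query}?</p>"
            '<input type="submit" value="Cancel subscription">'
            "</form>"
        )
        return 200, form
    if method == "POST":
        del subscriptions[sub_id]
        return 200, "Subscription cancelled."
    return 405, "Method not allowed."
```

With this shape, the link in an email notification can still be a plain URI; it just leads to a page with a form rather than performing the cancellation itself.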

Weblog Notifications

Several years ago, key developers in the blogging community created two notification approaches called TrackBack and Pingback. These provide ways for blog authors to notify other blogs that there are new links to them; and if other blogs do the same, it allows an author to be notified of new links to her or his blog. The notification can also be sent to a completely different service, for example, one that collects notifications from all blogs. This notification aggregation service can use these messages to maintain a search index of blogs that is more up-to-date than one built purely from crawling all blogs.

The term used in the blogging community for notifications is "ping," which, unfortunately for me, has always meant "get current status" rather than "send current status." I guess I'll just have to adapt. There are different types of weblog notifications, but each one is similar in an important way: they are requests to update the state of various types of web resources. The updates requested by blog notifications may be described as follows:

  • Add to the collection of comments for a blog entry.
  • Add to a collection of related items for a blog entry.
  • Add to a collection of subject-specific blog entries.

Blog notifications are very similar in concept to the original message that updated the author's own blog. The general concept of blog management and blog notification is one of updating the state of a web resource that is a collection of items. A very useful part of the blog notifications system built by the community is auto-discovery. This allows client applications to learn where to route messages about a blog entry. Although it would be possible to send special messages to a blog entry directly, developers realized that it would be simpler, more general, and more flexible to have separate but related resources for this purpose. This allows developers to create resources for a collection of comments, a list of referring sites, etc. One aspect of REST is to use hypertext as the engine of application state, and the capability of auto-discovery is a great example of this. The tags within the retrieved content inform the client of available resources and help set expectations about the available actions.
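
Here is a small sketch of what link-based auto-discovery can look like from the client side: the client retrieves a page, reads the link tags in it, and learns which related resources exist. The class and function names are my own, and the sample URIs in the usage note are made up.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LinkCollector(HTMLParser):
    """Collect (rel, type, href) from every <link> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, Optional[str], str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = attributes.get("rel")
        href = attributes.get("href")
        if rel and href:
            self.links.append((rel.lower(), attributes.get("type"), href))


def discover(page: str, rel: str) -> List[str]:
    """Return the href of every <link> whose rel attribute contains `rel`."""
    collector = LinkCollector()
    collector.feed(page)
    return [href for link_rel, _type, href in collector.links
            if rel in link_rel.split()]
```

For a page whose head contains `<link rel="pingback" href="http://blog.example.org/xmlrpc">`, calling `discover(page, "pingback")` returns that one href, and `discover(page, "alternate")` would find an advertised feed in the same way. The client never needs out-of-band knowledge of where to send a message about the entry.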

How RESTful Are These Approaches?

  • TrackBack sends notifications with pure HTTP. RESTful.
  • Pingback sends notifications with XML-RPC. Not very RESTful, but arguably useful for some developers.
  • TrackBack supports auto-discovery via embedding RDF within an entry or feed. RESTful.
  • Pingback uses a link tag within an HTML page or an HTTP response header. Very RESTful. An additional benefit of using the more general approach of an HTTP response header is the ability to apply notifications to non-HTML resources such as JPEG resources. A Pingback-compatible server could expose and collect comments about pictures or PDF files, etc.
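
To show the difference in wire format between the two approaches, the sketch below builds the request body each one sends, without doing any network I/O. It reflects my reading of the two specifications: a TrackBack ping is a form-encoded HTTP POST whose fields include `url`, `title`, `excerpt`, and `blog_name`, while a Pingback is an XML-RPC call to `pingback.ping` with the source and target URIs, and the server may advertise its endpoint in an `X-Pingback` response header. The function names and example URIs are hypothetical.

```python
import xmlrpc.client
from typing import Mapping, Optional
from urllib.parse import urlencode


def trackback_body(url: str, title: str = "", excerpt: str = "",
                   blog_name: str = "") -> str:
    """Form-encoded body for a TrackBack ping (sent as an HTTP POST with
    Content-Type application/x-www-form-urlencoded). Empty fields are omitted."""
    fields = {"url": url, "title": title,
              "excerpt": excerpt, "blog_name": blog_name}
    return urlencode({name: value for name, value in fields.items() if value})


def pingback_body(source_uri: str, target_uri: str) -> str:
    """XML-RPC request body for a Pingback notification."""
    return xmlrpc.client.dumps((source_uri, target_uri),
                               methodname="pingback.ping")


def pingback_endpoint(headers: Mapping[str, str]) -> Optional[str]:
    """Find the Pingback endpoint in HTTP response headers, if advertised.
    Because this needs no HTML, it also works for images, PDFs, and so on."""
    for name, value in headers.items():
        if name.lower() == "x-pingback":
            return value
    return None
```

The TrackBack body is ordinary form data that any HTTP client can produce, whereas the Pingback body wraps the same kind of information in an XML-RPC envelope.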

Comment on this Article


  • HTTP Subscription
    2005-09-09 08:35:21 fuzzyBSc

    Hello,


    I've been working on trying to standardise an approach for RESTful subscription based on HTTP streaming at RestWiki. Perhaps it will be of some interest to you.


    [1] http://rest.blueoxen.net/cgi-bin/wiki.pl?HttpSubscription


    Benjamin.

  • Some corrections to claims about PubSub
    2004-12-02 00:08:12 Bob Wyman

    Thanks for taking the time to study PubSub.com and provide your comments. I would, however, like to point out a few inaccuracies in your note:


    1. You understate PubSub's throughput capacity by three orders of magnitude! We benchmark at 3 billion matches per second -- not the mere 2.4 million/second that you claim! If PubSub could only handle a few million matches per second, PubSub wouldn't be a very useful system. An Internet Scale system must be able to handle millions of complex subscriptions that are each being matched against at least hundreds of messages per second. Matching rates of billions per second are a *minimal* requirement for Internet Scale matching engines.


    2. In addition to providing notifications via "Atom over XMPP/Jabber", RSS and Atom files, PubSub also supports email notification for SEC Edgar alerts, press releases, and Airport Alerts. (We don't support email for Weblog or Newsgroup updates simply because volumes could too easily become excessive for users.)


    3. PubSub does support "REST" notifications -- i.e. Atom entries which are POSTed using pure HTTP. However, we have found no interest in the community for using these things (primarily because they can't punch through firewalls) and thus have given up on publicizing our REST support. We continue to support those few people who have been using these REST alerts, but don't anticipate finding many new users. Note: We also implemented SOAP based notification but couldn't find anyone interested in even testing it. Notifications are most valuable when they reach to the desktop (not just to servers) but the desktop is typically shielded by firewalls.


    4. It is odd that you suggest that PubSub is not a "generalized" pubsub system. Actually, that is precisely what we have built. However, at this time, we only choose to expose the specific applications that we have built using our very generalized system. The money is in the applications -- not tools or technologies. We'll provide access to the general system later but need to focus on the applications for now.


    5. Your suggestion that PubSub eliminate the "two step" subscription process of 1) Specify subscription query and 2) receive result URI, ignores a number of important elements of our service. First, doing what you suggest would require that no subscription could be more complex than what can be packed into a limited-size URI. This is a significant restriction. We support general boolean queries with potentially dozens of terms or predicates. A subscription can easily become much larger than what can be stored in a URI. Also, to rely on a user provided URI would require that user-specific information be included in the URI. That would make it impossible for users to "share" the URIs for the subscriptions they generate.


    Thanks again for taking the time to review what we've done. If you have any more ideas on how we might improve our service, please let me know.


    bob wyman
    CTO, PubSub.com

    • Some corrections to claims about PubSub
      2004-12-02 22:07:45 Mike Dierken

      1. I apologize for the incorrect metrics of pubsub.com matching rates - I thought I got it right off of your weblog at http://bobwyman.pubsub.com/main/2004/06/hyperbole_numbe.html. (Given 225B/day I still get 2.6M per second).


      3. You are right, I didn't realize pubsub.com supported HTTP POST. I would be interested in using this capability, since the mod-pubsub.org system has the ability to receive POSTs and to forward messages to clients and desktops. It would be interesting to see the two systems hooked up together.


      4. Sorry for giving the impression that I thought pubsub.com was not a generalized system - I only meant to describe the data sources currently supported. And I definitely agree that the money is in the applications rather than the technologies.


      5. My impression is that most queries are simple enough to express in a URI. More complex ones are edge cases that don't have to follow the one-step creation process. Also, I don't know what user-specific data would be placed in the query that would wind up in the URI, but it definitely would hinder sharing URIs.


      All in all, I think pubsub.com is a great system and I have the utmost respect for you and your team - keep up the good work of bringing publish/subscribe technologies to the Web.

      • Some corrections to claims about PubSub
        2004-12-03 01:25:09 Bob Wyman

        Mike,
        1. Yes, you are right. The most we've ever needed to do in production is a few million matches per second. The 3 billion/second number is what we get in testing.


        3. re: mod-pubsub support. We used a slightly modified version of our REST support to feed messages to KnowNow LiveServers. I believe the mod-pubsub interfaces are very similar to KnowNow's. The KnowNow/mod-pubsub technology is very useful since it allows us to establish a light-weight, persistent, firewall-piercing connection to the desktop. Our focus is on the matching problem and we're pleased to be able to leverage existing solutions to actually do delivery of messages. Let's talk offline in email about working with mod-pubsub.


        5. It would take too long to explain in a comment, however, let me just say that a number of methods for computing "relevance" of matches rely on examining the history of messages that have been delivered to a particular subscription over time as well as on user feedback. Thus, even if two subscriptions have identical queries, they may deliver different results based on when they were created and the user's history of interaction with the results. If we don't provide a binding between a user and a subscription, we will be severely limited in our ability to implement a whole class of improved methods for determining the "relevance" of a matched item. This would not be a good thing. The "single step" solution that you propose works very well with retrospective searches (i.e. what Google, Feedster, etc. do) where the entire result set is available each time the query is re-evaluated. However, this solution is much less useful in a "prospective" system like we implement since in such a system, the result set accumulates over time and we can benefit from user interaction and hinting over time.


        bob wyman