
All Consuming Web Services

by Erik Benson
May 27, 2003

By taking small steps, first consuming information from multiple web services, and then exposing newly processed information via your own web services, we can begin to build complex applications in our spare time, with very few resources required up front. All Consuming is one such application that's built on top of free services offered by weblogs.com, blo.gs, Google, and Amazon; it offers an interesting slice of the book life that exists on the Web and in the world. None of these services were built with All Consuming in mind, and yet each one plays a crucial role in supporting All Consuming, and benefits from doing so.

How to Build Interesting Book Lists

All Consuming, inspired by Paul Bausch's work with Book Watch, is a site dedicated to providing interesting book lists. I wanted to know what people on the Web were reading without explicitly asking them. I didn't want to know what people were buying, necessarily, nor what editors thought they should be reading, but what people were actually reading, actually talking about, and actually engaged by. To that end I propose, for comparison, the following admittedly ridiculous solution: we could install sensitive microphones and video cameras in every public place to record all conversations and all written correspondence between individuals.

We could send all this data to a centralized location to view it, sort it, and analyze it quickly and easily. To do that, we'd need help, so I propose that we rent an old warehouse and in it gather a team of a thousand employees to dutifully review all of these video and audio files for each and every mention of a book twenty-four hours a day, seven days a week. Whenever they heard or saw mention of a book, the employees could write the title down on a slip of paper and place that slip of paper in a box. At the end of the day, they'd consolidate all the slips of paper into piles, one for each book, and see which pile was the tallest, the second tallest, the third tallest, etc. Voila, a list of books that are currently being talked about in our community.

Luckily, there's a better way to do this, though the end result is the same. With technologies like SOAP and XML, and the abundance of free software and cheap hardware, there are solutions that a single person with a simple web hosting account can implement in their spare time over a couple of weekends. The Most Mentioned Books list on the homepage of All Consuming is essentially what the process described above would produce: a list of books that people are writing about and discussing on the Web. Instead of my having to tape and record what everyone is doing at every moment, though, hundreds of thousands of people around the world are already recording all the information I need, in a way that is accessible to me, via their weblogs. It just so happens that the standards and tools are available so that I can tell (most of the time) when they mention a book, and I can also easily retrieve product information about those books. Best of all, this can be done with a Perl script that's only a couple hundred lines long, without the additional help of any people (i.e., nobody has to find and enter book titles, or compare piles of paper slips, or print up the list and nail it to the nearest pole).

When people who aren't familiar with web services ask me what All Consuming is, they zone out before I can say XML. At its core, though, it's a response to a very simple phenomenon: the data is out there, the data is free, and the data is extremely interesting. I happen to be looking at book data, but the same is true of all kinds of data. Until recently, this wasn't the case. If you wanted to find out what a large group of people were doing, you had to find a proxy for that information, some general indicator of what the group was doing, such as book sales or Nielsen ratings or random polls. With a lot of the things happening on the Web, though, this is changing. Enormous amounts of information are being placed online, universally accessible and machine-readable.

All Consuming is an extremely thin application surviving on very limited resources that lives on top of a foundation of wonderfully accessible web services, making it possible to combine and recombine enormous amounts of information in novel, often insightful ways. I've chosen to focus on book data because it's of particular interest to me, but others have taken similar approaches to looking at link data (Blogdex, Technorati) and news (Google News, Daypop), and there are still almost endless opportunities to look at a million different slices of the data.

Finding the Data That's Out There

Here's how All Consuming works. Every hour a Perl script checks Weblogs.com's Recently Updated list for the weblogs that were updated during the last hour. Since the Recently Updated list is available in XML, the script can easily separate the information it needs from the information it doesn't. The information it needs is simply a list of URLs, each of which the script then visits and reads. On each site it visits, it looks for text matching a pattern that signifies a link to Amazon, Book Sense, Barnes & Noble, or even All Consuming. It doesn't matter to the script which site you link to, since they all use ISBNs (International Standard Book Numbers) as book IDs in their URLs. Upon finding a link it recognizes, it saves the ISBN along with an excerpt of the paragraph or so of text that the link appeared in. Here's an example of the XML that might be saved during an hour in which it found only one book:

<opt>
  <header 
	lastBuildDate="Sat Mar 15 13:30:02 2003" 
	title="All Consuming" 
	language="en-us" 
	description="Most recent books being talked about by webloggers." 
	link="http://allconsuming.net/" 
	number_updated="172" 
  />
  <asins 
	asin="0465045669" 
	title="Metamagical Themas: Questing for the Essence of Mind and Pattern" 
	author="Douglas R. Hofstadter" 
	url="http://www.erikbenson.com/"
	image="http://images.amazon.com/images/P/0465045669.01.THUMBZZZ.jpg" 
	excerpt="Douglas Hofstadter's lesser-known book, Metamagical Themas, 
	  has a great chapter or two on self-referential sentences like 'This 
	  sentence was in the past tense.'." 
	amazon_url="http://amazon.com/exec/obidos/ASIN/0465045669/"
	allconsuming_url="http://allconsuming.net/item.cgi?id=0465045669"
  />
</opt>
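
For anyone who wants to reuse a feed shaped like the one above, here is a minimal sketch of reading it. It is written in Python rather than the Perl the site itself uses, and it works on a trimmed copy of the sample; the function name `read_mentions` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample feed above; attribute names match the example.
SAMPLE = """<opt>
  <header title="All Consuming" number_updated="172" />
  <asins asin="0465045669"
         title="Metamagical Themas: Questing for the Essence of Mind and Pattern"
         author="Douglas R. Hofstadter"
         url="http://www.erikbenson.com/" />
</opt>"""

def read_mentions(xml_text):
    """Return one dict per <asins> element, holding that element's attributes."""
    root = ET.fromstring(xml_text)
    return [dict(item.attrib) for item in root.findall("asins")]

mentions = read_mentions(SAMPLE)
for m in mentions:
    print(m["asin"], "-", m["title"], "by", m["author"])
```

Because every field is an attribute, each book mention maps directly onto a flat dictionary with no nested elements to walk.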

In a typical hour the script searches between two hundred and a thousand newly updated sites and finds up to a thousand unique ISBN/URL pairs. For each book mention, it writes to a file all the information we need to know about it, discarding the rest. The script takes between ten minutes and half an hour to run, depending on the number of weblogs that have updated and the level of traffic on my server.
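
The link-spotting step might look something like the sketch below. It is Python, not the actual Perl script, and it leaves out fetching the pages. The regular expression is an assumption: it covers the two URL shapes that appear in the sample XML (`ASIN/` and `id=`) plus a guessed `isbn=` query parameter for the other booksellers. The ISBN-10 checksum test is an extra safeguard added here to weed out ten-digit numbers that are not ISBNs; the article does not say whether the real script does this.

```python
import re

# Assumed link shapes: the ISBN follows "ASIN/", "isbn=" or "id=" inside an href.
ISBN_LINK = re.compile(r'href="[^"]*?(?:ASIN/|isbn=|id=)(\d{9}[\dXx])\b', re.IGNORECASE)

def is_valid_isbn10(isbn):
    """ISBN-10 checksum: the weighted sum (weights 10 down to 1) is divisible by 11.

    Only called on regex matches, so an 'X' can appear only in the last position.
    """
    total = 0
    for position, char in enumerate(isbn.upper()):
        value = 10 if char == "X" else int(char)
        total += (10 - position) * value
    return total % 11 == 0

def find_book_mentions(page_url, html):
    """Return the set of (isbn, page_url) pairs for recognized book links."""
    return {
        (isbn.upper(), page_url)
        for isbn in ISBN_LINK.findall(html)
        if is_valid_isbn10(isbn)
    }

html = '''<p>Reading <a href="http://amazon.com/exec/obidos/ASIN/0465045669/">Metamagical Themas</a>
and <a href="http://example.com/page?id=1234567890">something unrelated</a>.</p>'''
print(find_book_mentions("http://www.erikbenson.com/", html))
```

Returning a set means a weblog that links to the same book several times on one page still counts as a single ISBN/URL pair.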

After it's done, another script starts up, reads the file that the first script wrote, and loads (from an RDBMS) all of the sites that I know to have mentioned those books in the past. Because most of these sites have several days' worth of entries on them, when they update on the most recent day I'll still find books that I know they mentioned earlier in the week. I don't care about those as much, so I remove from the new list any mentions that are already in the database, and I'm left with a short list of books that were mentioned for the first time at any given URL during the last hour. Of the original thousand or so book mentions, between five and fifteen might be brand new. I store those five to fifteen ISBNs in the database and create an XML and RSS file (as shown above) so that others can reuse this data in their own applications. Here's a link to the current XML and RSS files at All Consuming for new books mentioned this last hour: XML and RSS.
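
The duplicate-removal step boils down to a set difference. Here is a minimal Python sketch of the idea, with a plain set standing in for the database rows; the function name and the second ISBN and URL are illustrative only.

```python
def new_mentions(found_this_hour, already_known):
    """Return the (isbn, url) pairs seen this hour that were never recorded before."""
    return set(found_this_hour) - set(already_known)

found = [
    ("0465045669", "http://www.erikbenson.com/"),
    ("0465045669", "http://www.erikbenson.com/"),  # same link found twice on the page
    ("0765304368", "http://example.org/blog/"),    # illustrative second mention
]
# Stand-in for the rows loaded from the RDBMS.
known = {("0765304368", "http://example.org/blog/")}

fresh = new_mentions(found, known)
print(fresh)
```

Keying on the ISBN/URL pair rather than the ISBN alone is what lets the same book count again when a different weblog mentions it.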

Finally, a third script loads from the database every book mention from the last week and gives each one a score. Book mentions from the current day get a score of seven, book mentions from yesterday a score of six, and so on until book mentions from six days ago get a score of one. I sort all of these books by score, and display the top thirty books mentioned on weblogs within the last week. It's weighted to favor books that were mentioned recently, so the really fresh stuff floats to the top. All of that is also saved as XML and RSS in case anyone wants to use that data and build something else with it.
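
The weighting can be sketched in a few lines of Python. One assumption is made loudly here: a book's score is taken to be the sum of the scores of its individual mentions, which the description implies but does not state outright. The function name and the placeholder book IDs are made up for the example.

```python
from collections import Counter
from datetime import date

def rank_books(mentions, today, top_n=30):
    """mentions: iterable of (isbn, url, mention_date). Returns [(isbn, score), ...].

    Assumption: a book's score is the sum of its per-mention scores.
    """
    scores = Counter()
    for isbn, _url, mentioned_on in mentions:
        age_in_days = (today - mentioned_on).days
        if 0 <= age_in_days <= 6:
            scores[isbn] += 7 - age_in_days  # today scores 7, six days ago scores 1
    return scores.most_common(top_n)

today = date(2003, 3, 15)
mentions = [
    ("book-A", "http://site-one.example/", date(2003, 3, 15)),  # today: 7
    ("book-A", "http://site-two.example/", date(2003, 3, 14)),  # yesterday: 6
    ("book-B", "http://site-one.example/", date(2003, 3, 9)),   # six days ago: 1
    ("book-C", "http://site-two.example/", date(2003, 3, 7)),   # too old, ignored
]
ranking = rank_books(mentions, today)
print(ranking)
```

With this data, book-A ends up with 13 points and book-B with 1, while book-C drops out because its only mention is more than six days old.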

Incidentally, the front page of All Consuming relies on this same XML feed that's available to everyone else to display the books. I do this so that I'm the first one to know when it breaks. With that list of book ISBNs, I can use Amazon's web services to find out each book's author, title, cover image, average customer review, and price. Getting this information from Amazon is as simple as generating a URL with the ISBN and your free developer's token (so Amazon can track your usage and make sure you're not abusing the service), and then parsing the XML that it returns. Amazon has a SOAP interface, too, but at the moment I think it's a bit of overkill. I prefer just to work with the straight XML.
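
The "build a URL, parse the XML" pattern looks roughly like the Python sketch below. Everything specific in it is hypothetical: the endpoint, the parameter names, and the shape of the response are placeholders chosen for illustration, not Amazon's actual interface, so check Amazon's own developer documentation for the real ones. No network request is made; the sketch parses a canned response instead.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "http://xml.example.com/lookup"

def build_lookup_url(isbn, developer_token):
    """Build a lookup URL carrying the ISBN and the developer's token."""
    query = urlencode({"isbn": isbn, "token": developer_token, "format": "xml"})
    return BASE_URL + "?" + query

# Hypothetical response shape; the real element names may differ.
SAMPLE_RESPONSE = """<ProductInfo>
  <Details>
    <Asin>0465045669</Asin>
    <ProductName>Metamagical Themas: Questing for the Essence of Mind and Pattern</ProductName>
    <Author>Douglas R. Hofstadter</Author>
  </Details>
</ProductInfo>"""

def parse_details(xml_text):
    """Flatten the first <Details> element into a tag -> text dictionary."""
    details = ET.fromstring(xml_text).find("Details")
    return {child.tag: child.text for child in details}

print(build_lookup_url("0465045669", "YOUR-TOKEN"))
print(parse_details(SAMPLE_RESPONSE)["ProductName"])
```

The appeal of the plain-XML route is visible here: one string-building function and one parsing function, with no SOAP envelope or client library in between.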

Making the Data Interesting

What I've described so far can be likened to the first step in any creative process, namely, gathering material. We've used automated scripts and web services to collect a very specific type of data from a very large pool of information, like collecting blackberries from a forest of various types of trees and shrubbery. Now it's possible to take that data, those berries, work with them in a context outside of the shrubbery and dirt and chaos, and make a delicious pie.

Book-Centric Pages

We know what books people are talking about the most, but what if you just want to know what people are saying about a particular book? That's just a different slice of the data that we already have. All Consuming has a page for each book in Amazon's catalog (basically every book in print in America and the UK), and it lists every weblog that we know to have mentioned it since the site first launched last August. Instantly you can get a sense of how much any book (take the currently popular book by weblogger Cory Doctorow, Down and Out in the Magic Kingdom) has been talked about, when it was being talked about, and by whom.

Other information relevant to the book at hand can be hung from this page. For example, we also display books similar to the current one (with an indicator of how many weblogs have mentioned them), user comments (which I'll get to soon), and any other information that users want to attribute to the book (first sentences, links to related sites, number of pages, etc.).

Weblog-Centric Pages

There are two different types of users at All Consuming, and one is a subset of the other. First, there are all the weblogs out there that I've discovered through the automated hourly script. Within that set is the group of all people who have visited All Consuming and have chosen to create a user account. Part of the registration process is to have them let me know where their weblog is, if they happen to have one.

I realized that we could slice the data another way by looking at it from the perspective of a particular weblog (here's my weblog's page on All Consuming for erikbenson.com). Every weblog that we know about has a page dedicated to it and a list of books that it has mentioned.
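
Both slices, by book and by weblog, come from the same stored ISBN/URL pairs. Here is a minimal Python sketch of building the two views in memory. The real site presumably does this with database queries, so treat the function name and data structures as illustrative; the second ISBN and URL are example values.

```python
from collections import defaultdict

def index_mentions(mentions):
    """Group (isbn, url) pairs two ways: weblogs per book and books per weblog."""
    by_book = defaultdict(set)
    by_weblog = defaultdict(set)
    for isbn, url in mentions:
        by_book[isbn].add(url)
        by_weblog[url].add(isbn)
    return by_book, by_weblog

mentions = [
    ("0465045669", "http://www.erikbenson.com/"),
    ("0765304368", "http://www.erikbenson.com/"),
    ("0765304368", "http://example.org/blog/"),
]
by_book, by_weblog = index_mentions(mentions)

print(sorted(by_book["0765304368"]))                    # weblogs that mentioned this book
print(sorted(by_weblog["http://www.erikbenson.com/"]))  # books this weblog mentioned
```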

People who've explicitly signed up at All Consuming can create their own booklists directly on the site. There are lists for what they're currently reading, what they're reading next, what they've completed, and what their favorite books of all time are. User accounts allow people to attach more specific motivations and intentions to their book talk than I could otherwise glean from just scraping their site.
