All Consuming Web Services
May 27, 2003
By taking small steps, first consuming information from multiple web services, and then exposing newly processed information via your own web services, we can begin to build complex applications in our spare time, with very few resources required up front. All Consuming is one such application that's built on top of free services offered by weblogs.com, blo.gs, Google, and Amazon; it offers an interesting slice of the book life that exists on the Web and in the world. None of these services were built with All Consuming in mind, and yet each one plays a crucial role in supporting All Consuming, and benefits from doing so.
How to Build Interesting Book Lists
All Consuming, inspired by Paul Bausch's work with Book Watch, is a site dedicated to providing interesting book lists. I wanted to know what people on the Web were reading without explicitly asking them. I didn't want to know what people were buying, necessarily, nor what editors thought they should be reading, but what people were actually reading, actually talking about, and actually engaged by. To that end I propose, for comparison, the following admittedly ridiculous solution: we could install sensitive microphones and video cameras in every public place to record all conversations and all written correspondence between individuals.
We could send all this data to a centralized location to view it, sort it, and analyze it quickly and easily. To do that, we'd need help, so I propose that we rent an old warehouse and in it gather a team of a thousand employees to dutifully review all of these video and audio files for each and every mention of a book twenty-four hours a day, seven days a week. Whenever they heard or saw mention of a book, the employees could write the title down on a slip of paper and place that slip of paper in a box. At the end of the day, they'd consolidate all the slips of paper into piles, one for each book, and see which pile was the tallest, the second tallest, the third tallest, etc. Voila, a list of books that are currently being talked about in our community.
Luckily, there's a better way to do this, though the end result is the same. With technologies like SOAP and XML, and the abundance of free software and cheap hardware, there are solutions that a single person with a simple web hosting account can implement in their spare time over a couple of weekends. The Most Mentioned Books list on the homepage of All Consuming is essentially the same as would result from the previously mentioned process -- it's a list of books that people are writing about and discussing on the Web. However, instead of my having to tape and record what everyone is doing at every moment, hundreds of thousands of people around the world are recording all the information I need, in a way that is accessible to me, via their weblogs. It just so happens that the standards and tools are available so that I can tell (most of the time) when they mention a book, and I can also easily retrieve product information about those books. And, best of all, this can be done with a Perl script that's only a couple hundred lines long and without the additional help of any people (i.e., nobody has to find and enter book titles, or compare piles of paper slips, or print up the list and nail it to the nearest pole).
When people who aren't familiar with web services ask me what All Consuming is, they zone out before I can say XML. At its core, though, it's a response to a very simple phenomenon: the data is out there, the data is free, and the data is extremely interesting. I happen to be looking at book data, but it's true for all kinds. Until recently, this hasn't been the case. If you wanted to find out what a large group of people were doing, you would have to find a proxy to that information, some general indicator of what the group was doing via book sales or Nielsen ratings or random polls. With a lot of the things happening on the Web, though, this is changing. Enormous amounts of information are being placed online, universally accessible and machine-readable.
All Consuming is an extremely thin application surviving on very limited resources that lives on top of a foundation of wonderfully accessible web services, making it possible to combine and recombine enormous amounts of information in novel, often insightful ways. I've chosen to focus on book data because it's of particular interest to me, but others have taken similar approaches to looking at link data (Blogdex, Technorati) and news (Google News, Daypop), and there are still almost endless opportunities to look at a million different slices of the data.
Finding the Data That's Out There
Here's how All Consuming works. Every hour a Perl script checks Weblogs.com's Recently Updated list for the weblogs that were updated during the last hour. Since the Recently Updated list is available in XML, the script is able to separate the information that it needs from the information it doesn't need very easily. The information it needs is simply a list of URLs, each of which the script then visits and reads. When reading each site that it visits, it looks for text that matches a certain pattern that signifies a link to Amazon, Book Sense, Barnes & Noble, or even All Consuming. It doesn't matter to the script which site you link to, since they all use ISBNs (International Standard Book Numbers) as book IDs in their URLs. Upon finding a link it recognizes, it saves the ISBN along with an excerpt of the paragraph or so of text that the link appeared in. Here's an example of the XML that might be saved during an hour that it only found one book:
<opt>
  <header
    lastBuildDate="Sat Mar 15 13:30:02 2003"
    title="All Consuming"
    language="en-us"
    description="Most recent books being talked about by webloggers."
    link="http://allconsuming.net/"
    number_updated="172" />
  <asins
    asin="0465045669"
    title="Metamagical Themas: Questing for the Essence of Mind and Pattern"
    author="Douglas R. Hofstadter"
    url="http://www.erikbenson.com/"
    image="http://images.amazon.com/images/P/0465045669.01.THUMBZZZ.jpg"
    excerpt="Douglas Hoftstadter's lesser-known book, Metamagical Themas, has a great chapter or two on self-referential sentences like 'This sentence was in the past tense.'."
    amazon_url="http://amazon.com/exec/obidos/ASIN/0465045669/"
    allconsuming_url="http://allconsuming.net/item.cgi?id=0465045669" />
</opt>
In a typical hour the script might search between two hundred and a thousand newly updated sites, finding up to a thousand unique ISBN/URL pairs. For each book mention it writes to a file all the information we need to know about it, discarding the rest. The script takes between ten minutes and half an hour to run, depending on the number of weblogs that have updated and the level of traffic on my server.
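The link-matching step can be sketched roughly as follows. This is a minimal illustration, not the script All Consuming actually runs: the two patterns are inferred from the amazon_url and allconsuming_url values in the sample XML, the Book Sense and Barnes & Noble patterns are left out because their URL formats aren't shown here, and the function names are hypothetical. The ISBN-10 checksum test is an extra safeguard I've assumed, to weed out ten-digit strings that aren't really ISBNs.

```perl
use strict;
use warnings;

# Patterns inferred from the sample feed; the real script's patterns
# (including Book Sense and Barnes & Noble) are not known and are omitted.
my @LINK_PATTERNS = (
    qr{amazon\.com/exec/obidos/ASIN/([0-9]{9}[0-9Xx])}i,
    qr{allconsuming\.net/item\.cgi\?id=([0-9]{9}[0-9Xx])}i,
);

# Return 1 if the ten-character string passes the ISBN-10 checksum:
# the digits weighted 10 down to 1 must sum to a multiple of 11,
# with a final 'X' standing for the value 10.
sub is_valid_isbn10 {
    my ($isbn) = @_;
    return 0 unless $isbn =~ /\A[0-9]{9}[0-9Xx]\z/;
    my @chars = split //, uc $isbn;
    my $sum = 0;
    for my $i (0 .. 9) {
        my $value = $chars[$i] eq 'X' ? 10 : $chars[$i];
        $sum += $value * (10 - $i);
    }
    return $sum % 11 == 0 ? 1 : 0;
}

# Return the unique, checksum-valid ISBNs linked from a page,
# grouped by pattern and then by position in the text.
sub extract_isbns {
    my ($html) = @_;
    my (%seen, @isbns);
    for my $pattern (@LINK_PATTERNS) {
        while ($html =~ /$pattern/g) {
            my $isbn = uc $1;
            next if $seen{$isbn}++ || !is_valid_isbn10($isbn);
            push @isbns, $isbn;
        }
    }
    return @isbns;
}
```

Because every supported bookseller puts the ISBN in its URL, supporting another site should only mean adding one more pattern to the list.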
After it's done, another script starts up, reads the file that the first script wrote, and loads (from an RDBMS) all of the sites that I know to have mentioned those books in the past. Because most of these sites have several days' worth of entries on them, when they update on the most recent day I'll still find books that I know they mentioned earlier in the week. I don't care about those as much, so I drop from the new list every mention that is already in the known list, and I'm left with a short list of books that were mentioned for the first time at any given URL during the last hour. Of the original thousand or so book mentions, between five and fifteen might be brand new. I store those new ISBNs in the database and create an XML and RSS file (as shown above) so that others can reuse this data in their own applications. Here's a link to the current XML and RSS files at All Consuming for new books mentioned this last hour: XML and RSS.
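The filtering step in this second script amounts to a set difference on ISBN/URL pairs. Here is a minimal sketch of that idea, assuming each mention is a hash with isbn and url keys; the record layout and the function names are hypothetical, and in the real script the known mentions come from the database rather than from an in-memory list.

```perl
use strict;
use warnings;

# Build a lookup key for one mention (hypothetical record layout:
# a hash ref with 'isbn' and 'url' keys).
sub pair_key {
    my ($mention) = @_;
    return join "\t", $mention->{isbn}, $mention->{url};
}

# Return the mentions in $found whose ISBN/URL pair is not in $known.
# Repeats within $found itself are also collapsed to one entry.
sub new_mentions {
    my ($found, $known) = @_;
    my %seen;
    $seen{ pair_key($_) } = 1 for @$known;
    return grep { !$seen{ pair_key($_) }++ } @$found;
}
```

Keying on the pair rather than on the ISBN alone matters here: a book that one weblog mentioned last week still counts as new when a different weblog mentions it this hour.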
Finally, a third script loads from the database every book mention from the last week, giving every book that has been mentioned on every site a score. Book mentions from the current day get a score of seven, book mentions from yesterday a score of six, and so on until book mentions from six days ago get a score of one. I sort all of these books by score, and display the top thirty books mentioned on weblogs within the last week. It's weighted to favor books that were mentioned recently, so the really fresh stuff floats to the top. All of that is also saved as XML and RSS in case anyone wants to use that data and build something else with it.
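The weighting described above can be sketched in a few lines. This is an illustration under assumptions rather than the production script: mentions are represented as [isbn, days_ago] pairs instead of database rows, a book's total is taken to be the sum of its mention scores, ties are broken by ISBN, and the function name weekly_top_books is hypothetical.

```perl
use strict;
use warnings;

# Score a week's worth of mentions and return the top books.
# Each mention is [isbn, days_ago], where days_ago is 0 for today.
# Returns a list of [isbn, score] pairs, highest score first.
sub weekly_top_books {
    my ($mentions, $limit) = @_;
    $limit = 30 unless defined $limit;

    my %score;
    for my $mention (@$mentions) {
        my ($isbn, $days_ago) = @$mention;
        next if $days_ago < 0 || $days_ago > 6;   # outside the one-week window
        $score{$isbn} += 7 - $days_ago;           # today = 7 ... six days ago = 1
    }

    # Sort by score, descending; break ties by ISBN (an assumed tie-break).
    my @ranked = sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
    splice @ranked, $limit if @ranked > $limit;
    return map { [ $_, $score{$_} ] } @ranked;
}
```

With this weighting, one mention today (7 points) outranks three mentions from five days ago (2 points each), which is what lets fresh books float to the top.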
Incidentally, the front page of All Consuming relies on this same XML feed that's available to everyone else to display the books. I do this so that I'm the first one to know when it breaks. With that list of book ISBNs, I can use Amazon's web services to find out each book's author, title, cover image, average customer review, and price. Getting this information from Amazon is as simple as generating a URL with the ISBN and your free developer's token (so Amazon can track your usage and make sure you're not abusing the service), and then parsing the XML that it returns. Amazon has a SOAP interface, too, but at the moment I think it's a bit of overkill. I prefer just to work with the straight XML.
Making the Data Interesting
What I've described so far can be likened to the first step in any creative process, namely, gathering material. We've used automated scripts and web services to collect a very specific type of data from a very large pool of information, like collecting blackberries from a forest of various types of trees and shrubbery. Now it's possible to take that data, those berries, work with them in a context outside of the shrubbery and dirt and chaos, and make a delicious pie.
We know what books people are talking about the most, but what if you just want to know what people are saying about a particular book? That's just a different slice of the data that we already have. All Consuming has a page for each book in Amazon's catalog (basically every book in print in America and the UK), and it lists every weblog that we know to have mentioned it since the site first launched last August. Instantly you can get a sense for how much any book (take the currently popular book by weblogger Cory Doctorow, Down And Out In the Magic Kingdom) has been talked about, when it was being talked about, and by whom.
Other information relevant to the book at hand can be hung from this page. For example, we also display books similar to the current one (with an indicator of how many weblogs have mentioned them), user comments (which I'll get to soon), and any other information that users want to attach to the book (first sentences, links to related sites, number of pages, etc.).
There are two different types of users at All Consuming, and one is a subset of the other. First, there are all the weblogs out there that I've discovered through the automated hourly script. Within that set is the group of all people who have visited All Consuming and have chosen to create a user account. Part of the registration process is to have them let me know where their weblog is, if they happen to have one.
I realized that we could slice the data another way by looking at it from the perspective of a particular weblog (here's my weblog's page on All Consuming for erikbenson.com). Every weblog that we know about has a page dedicated to it and a list of books that it has mentioned.
People who've explicitly signed up at All Consuming can create their own booklists directly on the site. There are lists for what they're currently reading, what they're reading next, what they've completed, and what their favorite books of all time are. User accounts allow people to attach more specific motivations and intentions to their book talk than I could otherwise glean from just scraping their site.
The result looks something like this:
SOAP and XML
Working with web services is addictive. You use them to gather data; but, after aggregating it and analyzing it and shaping the data into a couple new forms, it only makes sense to allow people to get that data from you. It's like an animal: the data goes in one end, it gets digested and filtered, and the data comes out the other end. The only difference is that the data on the tail end is more highly organized, more valuable, and just generally more nutritious than the data that originally went into the beast.
For the more technically savvy, All Consuming has a SOAP interface that can be used to retrieve the hourly and weekly lists, user reading lists, and several other types of information.
Here's an example of how to use the SOAP interface to All Consuming:
use SOAP::Lite +autodispatch =>
    uri   => 'http://www.allconsuming.net/AllConsumingAPI',
    proxy => 'http://www.allconsuming.net/soap.cgi';

# The date and time of the hourly list to retrieve, passed positionally.
my ($hour,$day,$month,$year) = qw(12 3 15 2003);
my $AllConsumingObject = AllConsumingAPI->new($hour,$day,$month,$year);

# Fetch the list of books first mentioned during that hour.
my $data = $AllConsumingObject->GetHourlyList;
The public methods available for use include:
GetArchiveList, GetCurrentlyReadingList, GetFavoriteBooksList, GetCompletedBooksList, GetPurchasedBooksList, GetRereadingBooksList, GetNeverFinishedBooksList, and GetRecommendations. At the moment, most of these methods only require that you pass in a URL or user name; more detailed instructions on how to call these methods are supplied here, and code samples are given here.
For those who prefer XML, there are also several directories full of interesting slices of data, available here, for whoever wants to play with it.
The way I've been developing the SOAP and XML interfaces has been organic and iterative. Rather than implement every possible method I can think of, I implement a bit, see how people use that little bit, and then fill it out with their suggestions and recommendations. I will continue to fill out the feature set of the SOAP and XML interfaces as long as people are interested in using them, so if you don't see something that you think I have that you could use, let me know and I'll try to make it for you. Some interesting applications that people have built on top of All Consuming's web services include DJ Adams's booktalk script (explanation here) and Kellan Elliott-McCrea's custom script for displaying his currently reading bucket on his weblog.
The Network of Friends and Recommendations
The original motivation for All Consuming was not actually limited to creating interesting book lists. It was also intended to explore the possibilities of friend networks, trust networks, and reputation systems, and this is where I see the site developing in the future. Aggregation tools like Blogdex, Daypop, and Technorati are extremely interesting in and of themselves, providing views into the world that we have not really been able to see before. But I believe that as the sphere of the Web continues to expand and it becomes ever easier to see what the entire world is reading, linking to, and discussing, there will be a desire to personalize that experience a bit and see what the people you know and trust are reading, linking to, and discussing.
But how to define that? Start with small steps, of course. At All Consuming, like many other places, you can create a list of friends that you'd like to keep an eye on. You don't have to create it from scratch either. Given that we know your weblog, we can ask Google for similar sites to your own, Blo.gs for sites that you've marked as friends, and even see who on All Consuming is reading the same books as you. All Consuming presents all these suggestions to you, and you can select the weblogs from that list that you like or enter weblogs that are not in the list. Either way, the end result is a list of friends.
With that list, you can subscribe to an email so that whenever All Consuming sees that one of your friends is reading something new (either because the automatic script picked it up or because they manually entered it into the site) you will know about it right away. Also, at any point you can get a list of book recommendations that displays the books that all of your friends are reading. Those books that have been mentioned recently or by more than one of your friends will be at the top of the list. Today my recommendations look like this:
In the coming months, I'd like to take this a couple of steps further. I'd like to be able to take a look at the set of all books that my friends, and my friends' friends, and my friends' friends' friends are reading, sorted so that the books that are being read and talked about the most are at the top. I can see this as being a practical implementation of the work that's already been done with FOAF, since one of my major problems with FOAF is that it leaves too much room for interpretation as to what a friend is. If I'm making a list of people whom I'd like to have recommend books to me, though, and that list existed in a network of people who declared similar connections between people, that would actually be useful and meaningful.
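A three-generation scan like this is a breadth-first walk of the friends graph with a depth limit. The sketch below shows the idea under stated assumptions: the friend lists and reading lists are plain in-memory hashes (hypothetical stand-ins for whatever the site stores), each person is counted once no matter how many paths lead to them, and a book's rank is simply the number of people in the network reading it. This feature does not exist on the site yet, so nothing here reflects actual All Consuming code.

```perl
use strict;
use warnings;

# Walk the friends graph breadth-first from $start, up to $max_depth
# generations out, and count how many people are reading each book.
#   $friends_of : hash ref, person => array ref of that person's friends
#   $reading    : hash ref, person => array ref of ISBNs being read
# Returns a list of [isbn, reader_count] pairs, most-read first.
# The starting person's own books are not counted.
sub network_reading_list {
    my ($start, $friends_of, $reading, $max_depth) = @_;
    $max_depth = 3 unless defined $max_depth;

    my %depth = ($start => 0);   # person => generations away from $start
    my @queue = ($start);
    my %readers;                 # isbn => number of people reading it

    while (defined(my $person = shift @queue)) {
        next if $depth{$person} >= $max_depth;    # don't expand past the limit
        for my $friend (@{ $friends_of->{$person} || [] }) {
            next if exists $depth{$friend};       # already visited
            $depth{$friend} = $depth{$person} + 1;
            push @queue, $friend;
            $readers{$_}++ for @{ $reading->{$friend} || [] };
        }
    }

    # Most readers first; ties broken by ISBN (an assumed tie-break).
    return map  { [ $_, $readers{$_} ] }
           sort { $readers{$b} <=> $readers{$a} || $a cmp $b }
           keys %readers;
}
```

The number of people visited can grow multiplicatively with each generation, which is why the depth limit is the main knob for controlling how expensive the scan gets.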
With web services that understand relationships between sites and that are able to glean book information from them, this ultimately becomes just a question of the server's computational power (with my current setup, it would be prohibitively expensive for me to scan three generations out for every user at All Consuming). However, I'm confident that I'll be able to find a way to scale this, whether it's a feature that only paying members can access or something else, because the data would just be so interesting. Once we find one good application of passing information back and forth through web services on a friends network, I believe that web services like XML and SOAP will open up into an even broader landscape of innovations than they're currently in.
The data is out there, stored digitally on weblogs and other sites all over the Internet, just waiting to be looked at. The data is accessible via standard HTML markup and web services like SOAP and XML, making it easily processed and interpreted by simple machines and scripts that anyone can write. Finally, the data is interesting: it gives us a glimpse of the patterns and trends that emerge out of the collective activity of the entire group, bypassing the traditional necessity of trusting a few voices to represent the many. It is truly a model for distributed idea generation and interpretation which is only beginning to be tapped. All Consuming is a tiny filter on top of this vast collection of group activity, aimed solely at finding connections between weblogs and books, but I look forward to the day when hundreds of other views of the data are available to consume and build upon.