November 10, 2004
What is del.icio.us?
A social bookmarks manager, del.icio.us lives on the web. You submit your links to a website, adding some descriptive text and keywords, and del.icio.us aggregates your post with everyone else's submissions--letting you slice and dice the information any way you like. Posts with the same keywords are clumped together, and if enough people link to a URL, a loose classification emerges. In this article, we'll explore del.icio.us through code and take a look at user behavior in posting and tagging links.
The del.icio.us-space has three major axes: users, tags, and URLs. These axes are reflected in the clean URL design used by the site:

- http://del.icio.us/tag/{tag} - URLs with the tag {tag}
- http://del.icio.us/url/{md5} - references to the URL with that md5 digest (8b7fec48fcb35763c9f8e1a8061eb124 is the md5 digest of http://www.xml.com/)
- http://del.icio.us/mattb - mattb's URLs

Some of these axes can be combined in a single URL:

- http://del.icio.us/mattb/{tag} - mattb's URLs with the tag {tag}
- http://del.icio.us/tag/{tag1}+{tag2} - URLs with the tags {tag1} and {tag2}
- http://del.icio.us/mattb/{tag1}+{tag2} - mattb's URLs with the tags {tag1} and {tag2}
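Those /url/ pages are keyed on the md5 digest of the URL string, which we can compute for ourselves. A minimal sketch in today's Python (the function name is mine, and any normalization del.icio.us applies to the URL before hashing is an assumption):

```python
import hashlib

def delicious_url_page(url):
    # del.icio.us keys its /url/ pages on the md5 hex digest of the URL
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return "http://del.icio.us/url/" + digest

print(delicious_url_page("http://www.xml.com/"))
```

Given a bookmarked URL, this lets a client jump straight to the page listing everyone who posted it.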
Taking It to Code
In order to build tools to better explore the information in del.icio.us, I decided to
create a Python wrapper. The classes provided by the code define a Post as an
Href and zero or more Tags chosen by a User. Each of these classes is
capable of finding further Posts, forming a graph of information that can be followed
in client code. For example, you could start with an Href, find all the Users
who posted it, then see what other Hrefs were posted with the same Tags that
they used. Adding a bit of Python magic using iterators, factory objects, and
__getattr__, we can make clean-looking Python code such as:

- for post in delicious.users.mattb: print post.title
  prints the titles of all of mattb's posts
- for post in delicious.tags.xml: print post.href
  prints the URLs of all posts tagged as xml
- for post in delicious.users.mattb(delicious.tags.xml): print post.title
  prints the titles of all of mattb's posts that are tagged as xml
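A minimal sketch of how that __getattr__ magic can produce such attribute-style lookups (the class names here are illustrative, not the actual library's):

```python
class Tag:
    """A placeholder tag object that would know how to fetch its posts."""
    def __init__(self, name):
        self.name = name

class TagFactory:
    """Turns attribute access like tags.xml into a Tag('xml') lookup."""
    def __getattr__(self, name):
        # Called only for attributes that don't otherwise exist
        return Tag(name)

class Delicious:
    def __init__(self):
        self.tags = TagFactory()

delicious = Delicious()
print(delicious.tags.xml.name)  # -> xml
```

The same trick, applied to a users factory, gives the delicious.users.mattb form shown above.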
Your Site is Your API
You're putting some effort into contributing information to del.icio.us, and it repays you by exposing its data and functionality in several ways so that you can get it back again without asking any favors or getting special access. It has an HTTP API for posting and managing your account. Nearly every useful page has a related RSS feed, exposing not just the human-readable titles and links but also neatly structured machine-readable information on tags and authorship. Finally, for the odd bit of stuff not exposed through an API or RSS feed, the HTML is free of presentational fluff and cleanly marked up with useful CSS classes.
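The posting half of that HTTP API can be driven with nothing more than urllib. Here's a sketch in today's Python 3 idiom that builds (but doesn't send) an authenticated request; the endpoint follows the /api/... style seen elsewhere in this article, but the exact parameter set is an assumption, so check the API documentation before relying on it:

```python
import base64
import urllib.parse
import urllib.request

def make_post_request(user, password, url, description):
    # Build a del.icio.us "add a post" request: /api/posts/add?url=...&description=...
    query = urllib.parse.urlencode({"url": url, "description": description})
    req = urllib.request.Request("http://del.icio.us/api/posts/add?" + query)
    # The API uses HTTP basic auth and asks clients to identify themselves
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    req.add_header("User-Agent", "delicious-example/0.1")
    return req

# urllib.request.urlopen(make_post_request(...)) would actually send it
```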
Tying this together, we'll need a three-pronged approach to Python-enabling this HTTP information space:
- Reading XML using libxml2
- Following the information graph represented in an RSS 1.0 feed using the Redland RDF toolkit
- Scraping data straight from HTML with libxml2's HTML tagsoup parser
Recently, Paul Ford used libxml2 to screenscrape the U.S. Senate. He used XSLT to drive it; we'll be using the Python libxml2 wrappers to access XML and HTML parsers and perform XPath queries on the results. Here's the kind of operation we'll perform:
# for HTML:
html = libxml2.htmlParseDoc(get_url_contents('http://del.icio.us/popular'), 'ISO-8859-1')
posts = html.xpathEval("//div[@class='delPost']")

# for XML:
xml = libxml2.parseDoc(get_url_contents('http://del.icio.us/api/tags/get'))
tags = xml.xpathEval("/tags/tag")
As you can see, after you've run the parser, it doesn't matter whether the input was XML or HTML. You can make XPath queries on any parsed document.
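If libxml2's Python bindings aren't to hand, the same parse-then-query pattern can be sketched with the standard library's ElementTree, which supports a limited XPath subset. The sample XML below is an assumed shape for the tags API response, included only so the snippet is self-contained:

```python
import xml.etree.ElementTree as ET

# Assumed shape of an /api/tags/get response, for illustration only
sample = '<tags><tag count="3" tag="python"/><tag count="1" tag="rdf"/></tags>'
doc = ET.fromstring(sample)

# ElementTree's limited XPath: "./tag" matches what /tags/tag would in libxml2
names = [t.get("tag") for t in doc.findall("./tag")]
print(names)  # ['python', 'rdf']
```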
To read RSS with Redland, the code works like this:
from RDF import *

model = Model()
parser = Parser()
url = "http://del.icio.us/rss/mattb"
parser.parse_string_into_model(model, get_url_contents(url), Uri("http://del.icio.us/"))
items = model.find_statements(Statement(
    None,
    Uri("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"),
    Uri("http://purl.org/rss/1.0/item")))
As with any web client, our code needs to be a respectful and polite citizen: it shouldn't accidentally mount a denial-of-service attack on the site. The API documentation lists some reasonable restrictions: identifying yourself via the user-agent string, waiting a bit between queries, and watching out for HTTP error code 503, which indicates that you're being too eager and should back off before retrying.
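The 503 back-off rule is easy to wrap up once and reuse. Here is a minimal sketch; the exception and function names are my own, not part of any del.icio.us library, and a real fetcher built on urllib would raise ServiceUnavailable when it sees status 503:

```python
import time

class ServiceUnavailable(Exception):
    """Raised by a fetcher when the server answers HTTP 503."""

def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0):
    # Retry on 503, doubling the pause each time, as the API docs suggest
    for attempt in range(retries):
        try:
            return fetch(url)
        except ServiceUnavailable:
            if attempt == retries - 1:
                raise  # give up after the last allowed attempt
            time.sleep(base_delay * 2 ** attempt)
```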
We can do that and a bit more too. Because everything we do is going over good old HTTP, there's plenty of off-the-shelf code to follow web best practices. Joe Gregorio's httpcache.py is a simple alternative to Python's builtin urlopen() feature that does all the work for you in managing ETags, caching, and gzip compression where available. Instead of writing
import urllib

url = "http://del.icio.us/mattb"
content = urllib.urlopen(url).read()
you simply write
import httpcache

url = "http://del.icio.us/mattb"
content = httpcache.HTTPCache(url).content()
Let's have a few examples of how the del.icio.us Python library can be used.
How Did This URL Get Tagged, and by Whom?
First, we'll aggregate the usernames and tags associated with the URL by stepping through the posts found with an Href object:

from sets import Set

tags = Set()
users = Set()
for post in delicious.Href("http://www.xml.com/"):
    tags = tags.union(post.tags)
    users.add(post.user)
print "Users:", users
print "Tags:", tags
Starting with a URL, Find Other URLs with the Same Tags
Again, we'll start with an Href object and find all the postings of that URL.
After collecting all the tags in a Python Set, to ensure that there are no duplicates,
we step through and find each post on each tag.
unique_tags_used = Set()
for post in delicious.Href("http://www.xml.com/").posts():
    unique_tags_used = unique_tags_used.union(post.tags)

print "* URLs that share a tag with http://www.xml.com/"
for tag in unique_tags_used:
    print tag, ":",
    for post in tag.posts():
        print post.href,
    print
A "related tags" sidebar on tag listing pages was recently introduced by del.icio.us. This is a neat way of branching out from a tag to explore further topics. Here's an example of how we can use this to see which programming languages are related to another technology:
for language in ['python', 'java', 'perl']:
    if delicious.tags.rdf in delicious.Tag(language).related():
        print language, "is related to RDF"
Running this, we get (at the time of writing):
python is related to RDF
Alternatively, we can find out what the tags have in common:
common = None
for language in ['python', 'java', 'perl']:
    if common is None:
        common = delicious.Tag(language).related()
    else:
        common = common.intersection(delicious.Tag(language).related())
print common
Set([web, programming, tools])
Looking at Tagging Behavior
After writing this Python code, I did a couple of experiments. I contribute to a group weblog called The Daily Chump, which is created directly by a bot from an IRC channel. The site is all XML based, so it was a snap to filter out only the links I post each day and re-post them to del.icio.us with a regularly-scheduled script. This produces a site-within-a-site: slice del.icio.us along the tag axis by 'dailychump', and you not only see my chumpings but also a handy summary of Daily Chump keywords used on those posts and links to other people's postings of the same items ('... and 9 other people', etc).
In a similar vein, I re-posted photos from my photo website, picdiary. Every photo on that site is annotated using RDF statements about people and things depicted. I transformed the RDF URIs into simple tags and ended up with http://del.icio.us/picdiary, which arguably has better cross-site navigation than my own site by virtue of the 'all tags' right-hand column.
After playing about a bit with these toy applications, I wanted to study the del.icio.us community in action. There's a lot of enthusiasm for the site; tons of URLs get linked on it every day, and any popular link gets a great collection of tags slapped on it. People are using it to build parts of their blogs, creating ridiculously named add-ons, and generally adopting it as part of their own infrastructure. I hoped that if I wrote an interesting blog post about my experiments, it would get linked a lot, and I could see both the information that was gathered around it on del.icio.us and the effect on my site traffic.
The Community and the Blog Post
The post was quite a success; at the time of writing, it's been tagged by 175 del.icio.us users. Most users used just one, two, or three tags.
The choice of tag follows something resembling the Zipf or power law curve often seen in web-related traffic. Just six tags (python, delicious/del.icio.us, programming, hacks, tools, and web) account for 80% of all the tags chosen, and a long tail of 58 other tags make up the remaining 20%, with most occurring just once or twice.
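That head-and-tail split is easy to measure with a frequency count. The sketch below uses made-up tagging counts shaped like the distribution described above (the real per-tag numbers aren't in this article):

```python
from collections import Counter

# Illustrative, invented tagging counts: a short head and a long tail of one-offs
taggings = (["python"] * 30 + ["del.icio.us"] * 20 + ["programming"] * 15 +
            ["hacks"] * 10 + ["tools"] * 8 + ["web"] * 7 +
            ["tag%d" % i for i in range(22)])

counts = Counter(taggings).most_common()
total = sum(n for _, n in counts)
head = sum(n for _, n in counts[:6])  # share claimed by the six most popular tags
print("top 6 tags account for %.0f%% of taggings" % (100.0 * head / total))
```

Run over a real set of posts from the Python wrapper, the same few lines would reproduce the 80/20 figure quoted above.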
Now let's look at attention over time. As you'd expect, the post got the most links immediately after it was discovered by the community. Half of the links were made in the first 72 hours.
After that, however, it didn't just tail off smoothly. Little ripples of attention recurred from time to time. Unlike another well-known source of web traffic, a slashdotting, attention from del.icio.us communities is not simply driven from a single source with a short attention span. The del.icio.us site has several mechanisms that can re-amplify a link several days after it has first occurred.
In the del.icio.us community, the rich get richer and the poor stay poor via http://del.icio.us/popular. Links noted by enough users within a short space of time get listed here, and many del.icio.us users use it to keep up with the zeitgeist. Once my post hit "/popular" (within the first 24 hours), it got quite a boost.
Using their inbox, del.icio.us users can keep up with the latest links appearing against any tag or user that they choose to subscribe to. I initially posted the link under the tags python, del.icio.us, and hacks (three of the most popular tags used by others linking the post later, incidentally). Anyone who didn't catch the link on the del.icio.us homepage, which churns by pretty quickly, was likely to find it by subscribing to my username (mattb) or to one of those tags. As the Python subcommunity grew bored with the link, interest waned until either a highly-respected user with many inbox subscribers re-posted it, or someone used a previously-unused tag that put it in front of fresh users.
In this article, we've just scratched the surface of the ways we could traverse and map what the community is building inside del.icio.us. Clearly, there are any number of visualizations that could be built from this data. To track how memes can spread, we could enhance the Python tool to read users' inbox subscriptions. Knowing who reads whom and who tracks what, we could reverse engineer the attention network. Alternatively, we could compare and contrast the use of tagging in del.icio.us with that in other sites that use tagging such as flickr: is there a difference in user behavior when considering visual subjects rather than web pages?
If you want to play with the Python code used in this article, you can get it from my site. I'll enjoy analyzing your impact on my stats and any tags you put on this article.