November 10, 2004

Matt Biddulph

What is del.icio.us?

del.icio.us is a social bookmarks manager that lives on the web. You submit your links to the website, adding some descriptive text and keywords, and it aggregates your post with everyone else's submissions, letting you slice and dice the information any way you like. Posts with the same keywords are clumped together, and if enough people link to a URL, a loose classification emerges. In this article, we'll explore del.icio.us through code and take a look at user behavior in posting and tagging links.

del.icio.us has three major axes: users, tags, and URLs. These axes are reflected in the clean URL design used by the site:
http://del.icio.us/mattb: mattb's URLs.
http://del.icio.us/tag/xml: URLs with the tag xml.
http://del.icio.us/url/8b7fec48fcb35763c9f8e1a8061eb124: references to a URL (8b7fec48fcb35763c9f8e1a8061eb124 is the md5sum digest of the URL).

Some of these axes can be combined in a single URL:
http://del.icio.us/mattb/xml: mattb's URLs with the tag xml.
http://del.icio.us/tag/xml+rdf: URLs with the tags xml and rdf.
http://del.icio.us/mattb/xml+rdf: mattb's URLs with the tags xml and rdf.
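The long hex string in the URL-reference pages is just an MD5 digest of the bookmarked URL. In modern Python, computing that kind of key is a one-liner; this is a sketch, and whether the site normalized URLs before hashing is left open here:

```python
import hashlib

def delicious_url_digest(url):
    """MD5 hex digest of a URL string, like the keys in the /url/ pages."""
    # Assumption: the URL string is hashed as-is; any normalization
    # the site applied before hashing is not covered here.
    return hashlib.md5(url.encode("utf-8")).hexdigest()

# Any URL maps to a fixed 32-character hex key:
digest = delicious_url_digest("http://www.example.com/")
```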

Taking It to Code

To build tools to better explore the information in del.icio.us, I decided to create a Python wrapper. The classes provided by the code define a Post as an Href and zero or more Tags chosen by a User. Each of these classes is capable of finding further Posts, forming a graph of information that can be followed in client code. For example, you could start with an Href, find all the Users who posted it, then see what other Hrefs were posted with the same Tags that they used. Adding a bit of Python magic using iterators, factory objects, and __getattr__, we can make clean-looking Python code that corresponds to URLs:

for post in delicious.users.mattb: print post.title
print the titles of all mattb's posts.
for post in delicious.tags.xml: print post.href
print the URLs of all posts tagged as xml.
for post in delicious.users.mattb(delicious.tags.xml): print post.title
print the titles of all mattb's posts that are tagged as xml.
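The attribute chaining above can be sketched in a few lines of __getattr__ magic. This toy PathProxy class (a hypothetical name, and a drastic simplification of the real wrapper, which also fetches and parses posts) shows how each attribute access can accumulate one more path segment:

```python
class PathProxy(object):
    """Toy sketch: attribute access accumulates URL path segments."""

    def __init__(self, base, segments=()):
        self.base = base
        self.segments = segments

    def __getattr__(self, name):
        # Any unknown attribute becomes one more path segment
        # on a fresh proxy object, so chains like .users.mattb work.
        return PathProxy(self.base, self.segments + (name,))

    @property
    def url(self):
        return self.base + "/".join(self.segments)

delicious = PathProxy("http://del.icio.us/")
```

With this in place, `delicious.users.mattb.url` evaluates to `http://del.icio.us/users/mattb`; the real wrapper goes further and makes the final object iterable over Post objects.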

Your Site is Your API

You're putting some effort into contributing information to del.icio.us, and it repays you by exposing its data and functionality in several ways so that you can get it back again without asking any favors or getting special access. It has an HTTP API for posting and managing your account. Nearly every useful page has a related RSS feed, exposing not just the human-readable titles and links but also neatly structured machine-readable information on tags and authorship. Finally, for the odd bit of stuff not exposed through an API or RSS feed, the HTML is free of presentational fluff and cleanly marked up with useful CSS classes.

Tying this together, we'll need a three-pronged approach to Python-enabling this HTTP information space:

  • Reading XML using libxml2
  • Following the information graph represented in an RSS 1.0 feed using the Redland RDF toolkit
  • Scraping data straight from HTML with libxml2's HTML tagsoup parser

Recently, Paul Ford used libxml2 to screenscrape the U.S. Senate. He used XSLT to drive it; we'll be using the Python libxml2 wrappers to access XML and HTML parsers and perform XPath queries on the results. Here's the kind of operation we'll perform:

# for HTML:
html = libxml2.htmlParseDoc(get_url_contents(''), 'ISO-8859-1')
posts = html.xpathEval("//div[@class='delPost']")

# for XML:
xml = libxml2.parseDoc(get_url_contents(''))
tags = xml.xpathEval("/tags/tag")



As you can see, after you've run the parser, it doesn't matter whether the input was XML or HTML. You can make XPath queries on any parsed document.
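The same parse-then-query pattern can be tried without libxml2: Python's standard-library ElementTree supports a small subset of XPath, which is enough for class-based scraping. A self-contained sketch on invented markup (real del.icio.us pages looked different):

```python
import xml.etree.ElementTree as ET

# Invented stand-in for a fetched bookmarks page.
page = """<div>
  <div class="delPost"><a href="http://example.com/a">A</a></div>
  <div class="delPost"><a href="http://example.com/b">B</a></div>
</div>"""

doc = ET.fromstring(page)
# ElementTree's limited XPath handles the attribute-predicate query:
posts = doc.findall(".//div[@class='delPost']")
links = [div.find("a").get("href") for div in posts]
```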

To read RSS with Redland, the code works like this:

from RDF import *

model = Model()
parser = Parser()
url = ""
parser.parse_into_model(model, url)

# query the graph for every title in the RSS 1.0 vocabulary
rss = NS("http://purl.org/rss/1.0/")
items = model.find_statements(Statement(None, rss.title, None))
for statement in items:
    print statement.subject, statement.object

As with any web client, our code needs to be a respectful and polite citizen: it shouldn't accidentally mount a denial-of-service attack on the site. The API documentation lists some reasonable restrictions: identifying yourself via the user-agent string, waiting a bit between queries, and watching out for HTTP error code 503, which indicates that you're being too eager and should back off before retrying.
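That back-off policy is easy to encode. A minimal sketch, assuming a hypothetical `fetch` callable that returns an HTTP status code and a body:

```python
import time

def fetch_politely(fetch, url, max_tries=4, base_delay=1.0):
    """Retry a request, backing off exponentially on HTTP 503.

    `fetch` is a hypothetical callable returning (status, body).
    """
    for attempt in range(max_tries):
        status, body = fetch(url)
        if status != 503:
            return body
        # 503 means we're being too eager: wait 1s, 2s, 4s, ... then retry.
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("gave up after repeated 503 responses")
```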

We can do that and a bit more too. Because everything we do is going over good old HTTP, there's plenty of off-the-shelf code to follow web best practices. Joe Gregorio's httpcache is a simple alternative to Python's built-in urlopen() function that does all the work for you in managing ETags, caching, and gzip compression where available. Instead of writing this:

import urllib

url = ""
content = urllib.urlopen(url).read()

you simply write:

import httpcache

url = ""
content = httpcache.HTTPCache(url).content()
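Under the hood, a cache like this remembers each URL's ETag, sends it back as an If-None-Match header, and reuses the stored body when the server answers 304 Not Modified. A toy sketch of that logic (a hypothetical `transport` callable, not httpcache's real internals):

```python
class MiniCache(object):
    """Toy ETag cache illustrating conditional GETs.

    `transport` is a hypothetical callable:
    (url, headers) -> (status, etag, body).
    """

    def __init__(self, transport):
        self.transport = transport
        self.store = {}  # url -> (etag, body)

    def content(self, url):
        headers = {}
        if url in self.store:
            # Revalidate instead of refetching the whole body.
            headers["If-None-Match"] = self.store[url][0]
        status, etag, body = self.transport(url, headers)
        if status == 304:
            return self.store[url][1]  # unchanged: serve the cached copy
        self.store[url] = (etag, body)
        return body
```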


Code Examples

Let's have a few examples of how the Python library can be used.

How Did This URL Get Tagged, and by Whom?

First, we'll aggregate the usernames and tags associated with a URL by stepping through the posts found with an Href object.

from sets import Set

tags = Set()
users = Set()
for post in delicious.Href(""):
    tags = tags.union(post.tags)
    users.add(post.user)

print "Users:", users
print "Tags:", tags

Starting with a URL, Find Other URLs with the Same Tags

Again, we'll start with an Href object and find all the postings of that URL. After collecting all the tags in a Python Set, to ensure that there are no duplicates, we'll step through and find each post on each tag.

unique_tags_used = Set()
for post in delicious.Href("").posts():
    unique_tags_used = unique_tags_used.union(post.tags)

print "* URLs that share a tag with"
for tag in unique_tags_used:
    print tag, ":",
    for post in tag.posts():
        print post.href,
    print


Related Tags

A "related tags" sidebar on tag listing pages was recently introduced by del.icio.us. This is a neat way of branching out from a tag to explore further topics. Here's an example of how we can use it to see which programming languages are related to another technology:

for language in ['python', 'java', 'perl']:
    if delicious.tags.rdf in delicious.Tag(language).related():
        print language, "is related to RDF"


Running this, we get (at the time of writing):

python is related to RDF

Alternatively, we can find out what the tags have in common:

common = None
for language in ['python', 'java', 'perl']:
    if common is None:
        common = delicious.Tag(language).related()
    else:
        common = common.intersection(delicious.Tag(language).related())

print common

giving us:

Set([web, programming, tools])

Looking at Tagging Behavior

After writing this Python code, I did a couple of experiments. I contribute to a group weblog called The Daily Chump, which is created directly by a bot from an IRC channel. The site is all XML-based, so it was a snap to filter out just the links I post each day and re-post them to del.icio.us with a regularly scheduled script. This produces a site-within-a-site: slice along the tag axis by 'dailychump', and you not only see my chumpings but also get a handy summary of Daily Chump keywords used on those posts, plus links to other people's postings of the same items ('... and 9 other people', etc.).

In a similar vein, I re-posted photos from my photo website, picdiary. Every photo on that site is annotated using RDF statements about the people and things depicted. I transformed the RDF URIs into simple tags and ended up with another site-within-a-site, one that arguably has better cross-site navigation than my own site by virtue of the 'all tags' right-hand column.

After playing about a bit with these toy applications, I wanted to study the community in action. There's a lot of enthusiasm for the site; tons of URLs get linked on it every day, and any popular link gets a great collection of tags slapped on it. People are using it to build parts of their blogs, creating ridiculously named add-ons, and generally adopting it as part of their own infrastructure. I hoped that if I wrote an interesting blog post about my experiments, it would get linked a lot, and I could see both the information that gathered around it on del.icio.us and the effect on my site traffic.

The Community and the Blog Post

The post was quite a success; at the time of writing, it's been tagged by 175 users. Most users used just one, two, or three tags:

distribution of number of tags used

The choice of tag follows something resembling the Zipf or power-law curve often seen in web-related traffic. Just six tags (python, delicious, programming, hacks, tools, and web) account for 80% of all the tags chosen, and a long tail of 58 other tags makes up the remaining 20%, with most occurring just once or twice.
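That head-versus-tail split is easy to quantify: sort the tag counts and compute the fraction of all taggings that the top few tags account for. A sketch with invented counts (not the article's actual data):

```python
from collections import Counter

def head_share(tag_counts, top_n):
    """Fraction of all taggings contributed by the top_n most-used tags."""
    counts = sorted(tag_counts.values(), reverse=True)
    return sum(counts[:top_n]) / sum(counts)

# Invented counts, for illustration only:
tags = Counter({"python": 120, "delicious": 60, "programming": 45,
                "hacks": 40, "tools": 30, "web": 25,
                "rdf": 3, "xml": 2, "cool": 1})
share = head_share(tags, 6)  # how top-heavy is the distribution?
```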

Now let's look at attention over time. As you'd expect, the post got the most links immediately after it was discovered by the community. Half of the links were made in the first 72 hours:

distribution of links made over time

After that, however, it didn't just tail off smoothly. Little ripples of attention recurred from time to time. Unlike another well-known source of web traffic, the slashdotting, attention from the del.icio.us community is not simply driven from a single source with a short attention span. The site has several mechanisms that can re-amplify a link several days after it first appears.

In the del.icio.us community, the rich get richer and the poor stay poor via the /popular page. Links noted by enough users within a short space of time get listed there, and many users rely on it to keep up with the zeitgeist. Once my post hit "/popular" (within the first 24 hours), it got quite a boost.

Using their del.icio.us inbox, users can keep up with the latest links appearing against any tag or user that they choose to subscribe to. I initially posted the link under the tags python, delicious, and hacks (three of the most popular tags used by others linking the post later, incidentally). Anyone who didn't catch the link on the homepage, which churns by pretty quickly, was likely to find it by subscribing to my username (mattb) or to one of those tags. As the Python subcommunity grew bored with the link, interest waned until either it was re-posted by a highly respected user with many inbox subscribers, or someone used a previously unused tag that put it in front of a fresh set of users.

Future Directions

In this article, we've just scratched the surface of the ways we could traverse and map what the community is building inside del.icio.us. Clearly, any number of visualizations could be built from this data. To track how memes spread, we could enhance the Python tool to read users' inbox subscriptions. Knowing who reads whom and who tracks what, we could reverse-engineer the attention network. Alternatively, we could compare and contrast the use of tagging in del.icio.us with that in other tagging sites such as Flickr: is there a difference in user behavior when the subjects are visual rather than web pages?

If you want to play with the Python code used in this article, you can get it from my site. I'll enjoy analyzing your impact on my stats and any tags you put on this article.