Putting RDF to Work
Over recent months, members of the www-rdf-interest mailing list have been working on practical applications of RDF technology. Notable among these efforts are Dan Connolly's work on using XSLT to generate RDF from web pages, and R.V. Guha's lightweight RDF database project.
RDF has always had the appeal of a Grand Unification Theory of the Internet, promising to create an information backbone into which many diverse information sources can be connected. With every source representing information in the same way, the prospect is that structured queries over the whole Web become possible.
That's the promise, anyway. The reality has been somewhat more frustrating. RDF has been ostracized by many for a complex and confusing syntax, which more often than not obscures the real value of the platform. One also gets the feeling, RDF being the inaugural application of namespaces, that there's a certain contingent who will never forgive it for that!
As an RDF-advocate, I am dismayed when some emerging Web metadata applications reject RDF -- the reason given usually being "it's too hard." I tend to think that a rather weak reason, especially as many of the same people are attempting deployment of XML schemas! However, I can't dispute that the current RDF syntax isn't the best, and as long as there is metadata on the Web that can be converted to RDF by means of a simple transform, we retain our hope of a "semantic web" of information.
Some of the most important factors in XML's success have been the ready availability of tools (in particular, parsers) and ubiquitous APIs (SAX, DOM). RDF has not matched that level of support, with the consequence that it has felt a lot more like a research project than an immediately applicable technology. However, some folks have been committing their time to developing tools for manipulating RDF, and also moving toward standardized APIs.
One tool in particular, R.V. Guha's RDFDB, caught my attention. It's an RDF database server built on top of the Sleepycat Berkeley Database. The source code is in C but, more importantly, it supports interrogation via TCP/IP sockets, meaning integration is possible with any programming language. For me, this is an advantage over previous RDF libraries in Java and Perl, neither of which is my platform of choice.
Breaking Out of Hierarchies
After spending a little while with RDFDB, I began to see that it offers what I'm looking for in an RDF store. Let me explain a little further about my criteria. My personal RDF dream centers around the integration of all my information. I want to be able to traverse the relationships between my surfing, e-mail, schedule, and document data. The hierarchy of e-mail folders and the file system just doesn't reflect the way I work.
The reality of this dawned on me as I found myself using the locate command to connect and cross-reference documents and e-mail. The way I was traversing my data was task-centric. If I'm working on a particular topic, I want to see all previous correspondence on that issue. If I visit someone's web page, I might want to see all the mail that person has sent me recently.
So began my dream of integrating all my metadata. Somewhere there would be a large database into which my e-mail, web browser, file system, and so on would enter metadata. I'd then be able to, with relative ease, query the database to make connections between data items on my computer. On top of that database, graphical clients could be written to maintain and annotate it, and hooks written back into the browser, file manager, and e-mail client to allow the use of this extra information.
RDFDB appears to be the first stage of this plan: a database tuned for storing and querying descriptions of resources. (Note that there are existing approaches to RDF storage using a relational database, but RDFDB takes a specialist approach to storing RDF data.)
What RDFDB Offers
Although an early-stage project, RDFDB offers enough functionality to do useful work immediately. Guha has tried to keep the interface similar to that of SQL, in order to make the learning curve easier. With RDFDB, you can create a database, insert data into it, and query it:
create database testdb </>
insert into testdb (editor http://xml.com/ http://edd.oreillynet.com/) </>
select ?p from testdb where (editor ?p http://edd.oreillynet.com/) </>
(Note the </> line terminator.) Facilities also exist for loading entire modules of data in from an RDF file, and for assigning prefixes to namespaces.
Both binaries for Linux and source code are available. You'll need Sleepycat's Berkeley DB3.1 installed in order to compile RDFDB. The RDFDB server itself runs as a TCP/IP server, and just sits there waiting for connections. You can use telnet as a trivial command-line interface to the server -- this is one of Guha's design goals, that RDFDB access should be as easy as HTTP access.
Once a couple of environment variables are set and the server is running, it's easy to start working with RDFDB by inserting simple relationships into the database and querying them. RDFDB also offers a facility to perform batch import of RDF data, via the load ... file command.
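Because the server speaks plain text over a socket, a client in any language only has to write commands and read back replies. The following Python sketch shows one way that might look. It assumes each command is sent as a single line ending in the </> terminator; the article does not describe how replies are framed, so the sketch simply reads until the server goes quiet. The function names and the timeout are illustrative, not part of RDFDB.

```python
import socket

TERMINATOR = "</>"  # line terminator shown in the RDFDB examples


def format_command(command: str) -> bytes:
    """Join a command onto one line and make sure it ends with </>."""
    parts = [line.strip() for line in command.splitlines() if line.strip()]
    one_line = " ".join(parts)
    if not one_line.endswith(TERMINATOR):
        one_line = f"{one_line} {TERMINATOR}"
    return (one_line + "\n").encode("utf-8")


def send_command(sock: socket.socket, command: str, timeout: float = 2.0) -> str:
    """Send one command and return whatever text arrives before the timeout.

    Assumption: reply framing is not documented in the article, so this
    reads until the server stops sending or closes the connection.
    """
    sock.sendall(format_command(command))
    sock.settimeout(timeout)
    chunks = []
    try:
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    except socket.timeout:
        pass  # no more data within the timeout; treat the reply as complete
    return b"".join(chunks).decode("utf-8", errors="replace")
```

To try it, open a connection with socket.create_connection((host, port)), using whatever host and port your RDFDB server listens on, and pass the resulting socket to send_command along with a command such as "create database testdb".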
A good source of example data can always be found in one's mailbox. The first step toward my dream of integration is to invent a vocabulary for describing the data. This is where the prototype nature of my project becomes apparent: a well-designed vocabulary is probably 80 percent of the work in an effort like this. Particularly when integration with disparate sources is required, a common vocabulary is essential, and using standards such as Dublin Core becomes a very good idea.
Here are the properties I settled on:
- realName: a person's name
- author: the author of a message
- subject: the subject of a message
- timestamp: the timestamp of a message
A basic use of these properties would be simply to scrape all the names and addresses from my in-box in order to create an address book. With a small bit of Python, I generated a document looking something like this:
<?xml version='1.0'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:m='http://edd.oreillynet.com/mailbox/'>
  <rdf:Description about='mailto:firstname.lastname@example.org'>
    <m:realName>Edd Dumbill</m:realName>
  </rdf:Description>
  <rdf:Description about='mailto:email@example.com'>
    <m:realName>Liora Alschuler</m:realName>
  </rdf:Description>
  <rdf:Description about='mailto:firstname.lastname@example.org'>
    <m:realName>Lisa Rein</m:realName>
  </rdf:Description>
</rdf:RDF>
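The script itself is not reproduced here, but a minimal sketch of the idea might look like the following. It takes the From headers of a set of messages, keeps the first real name seen for each address, and writes the result in the same layout as the document above. The function names are hypothetical, and the unqualified about attribute simply mirrors the example.

```python
from email.utils import getaddresses
from xml.sax.saxutils import escape, quoteattr

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MAILBOX_NS = "http://edd.oreillynet.com/mailbox/"


def collect_senders(from_headers):
    """Map each sender address to the first real name seen for it."""
    people = {}
    for real_name, address in getaddresses(from_headers):
        if address and real_name:  # skip bare addresses with no name
            people.setdefault(address.lower(), real_name)
    return people


def addressbook_rdf(people):
    """Serialize {address: real_name} as RDF/XML in the article's layout."""
    lines = [
        "<?xml version='1.0'?>",
        f"<rdf:RDF xmlns:rdf={quoteattr(RDF_NS)}",
        f"         xmlns:m={quoteattr(MAILBOX_NS)}>",
    ]
    for address, real_name in sorted(people.items()):
        about = quoteattr("mailto:" + address)
        lines.append(f"  <rdf:Description about={about}>")
        lines.append(f"    <m:realName>{escape(real_name)}</m:realName>")
        lines.append("  </rdf:Description>")
    lines.append("</rdf:RDF>")
    return "\n".join(lines) + "\n"
```

The From headers could come from Python's standard mailbox module, for example by iterating over mailbox.mbox(path) and collecting each message's From field. Escaping the names and addresses matters here, since real mailboxes contain ampersands and quotes that would otherwise break the XML.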
The file was placed in a suitable place on my machine and imported into a database:
create database mailstore </>
load XML_RDF file http://localhost/addrs.rdf into mailstore </>
A few simple queries (each query is written on a single line ending in </>, and the server's reply follows it):
select ?x from mailstore where (http://edd.oreillynet.com/mailbox/|realName ?x 'Lisa Rein') </>
?x = mailto:email@example.com
select ?x from mailstore where (http://edd.oreillynet.com/mailbox/|realName mailto:firstname.lastname@example.org ?x) </>
?x = Simon St.Laurent
select ?x from mailstore where (?x mailto:email@example.com 'Edd Dumbill') </>
?x = http://edd.oreillynet.com/mailbox/|realName
You can see that the property, subject, and object (the components of an RDF description) can all be queried by the database server. Although trivial in this example, querying properties could have some very useful purposes, such as determining the relationship between two people.
Writing out the qualified names of the properties each time is a little cumbersome, so RDFDB allows you to do this instead:
enter namespace xmlns:m http://edd.oreillynet.com/mailbox/ </>
select ?x from mailstore where (m:realName mailto:firstname.lastname@example.org ?x) </>
Now let's take things a little further and include some more data from the mailbox. I wrote a simple Python script to parse my mailbox and extract some message data. In addition to the address book entries above, I also included a description for each message:
<rdf:Description about='mid:3990A5D9.C13AACDE@finetuning.com'>
  <m:subject>Re: Submitting an article to xml.com</m:subject>
  <m:timestamp>Tue, 08 Aug 2000 17:29:13 -0700</m:timestamp>
  <m:author rdf:resource='mailto:email@example.com' />
</rdf:Description>
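As a rough illustration of how such a description could be produced, the sketch below parses one raw message with Python's standard email package and emits a fragment in the same shape. The function name is hypothetical, and several simplifications are assumed: the Message-ID is used as-is without percent-encoding, encoded-word subjects are not decoded, and the Date header is copied verbatim.

```python
from email import message_from_string
from email.utils import parseaddr
from xml.sax.saxutils import escape, quoteattr


def message_description(raw_message: str) -> str:
    """Build one rdf:Description fragment for a raw RFC 822 message."""
    msg = message_from_string(raw_message)
    # Message-ID values are wrapped in angle brackets; the mid: URI omits them.
    message_id = msg.get("Message-ID", "").strip().strip("<>")
    # parseaddr returns (real name, address); only the address is needed here.
    _, author = parseaddr(msg.get("From", ""))
    about = quoteattr("mid:" + message_id)
    resource = quoteattr("mailto:" + author)
    return "\n".join([
        f"<rdf:Description about={about}>",
        f"  <m:subject>{escape(msg.get('Subject', ''))}</m:subject>",
        f"  <m:timestamp>{escape(msg.get('Date', ''))}</m:timestamp>",
        f"  <m:author rdf:resource={resource} />",
        "</rdf:Description>",
    ])
```

The fragment relies on the rdf and m prefixes, so it has to be placed inside an rdf:RDF element that declares both namespaces, as in the address book document, before it can be loaded.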
Note that to identify the e-mail message itself, I'm using the mid: URI scheme. Having imported the RDF again into my database, I can now answer questions like "Which e-mail messages were written by Simon St.Laurent?":
select ?a from mailstore where (m:realName ?a 'Simon St.Laurent') </>
?a = mailto:firstname.lastname@example.org
select ?i from mailstore where (m:author ?i mailto:email@example.com) </>
?i = mid:200008022049.QAA05551@hesketh.net
?i = mid:200008030436.AAA29373@hesketh.net
?i = mid:200008030444.AAA29581@hesketh.net
?i = mid:200008061638.MAA09311@hesketh.net
?i = mid:200008081347.JAA04357@hesketh.net
?i = mid:200008090020.UAA06745@hesketh.net
I tried a few more exotic queries, using conjunctions, but RDFDB currently seems a little flaky in its processing of these. Using a conjunction, the two queries above could be reduced to one:
select ?i from mailstore where (m:realName ?a 'Simon St.Laurent') (m:author ?i ?a) </>
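To make the meaning of such a conjunction concrete, here is a small in-memory sketch in Python. It is not how RDFDB works internally; it only illustrates the semantics: each pattern is a (property, subject, object) triple in the same order as the queries above, terms beginning with ? are variables, and a variable shared between patterns must take the same value in both. The sample data and addresses are invented for illustration.

```python
# Invented sample data, in (property, subject, object) order.
TRIPLES = [
    ("m:realName", "mailto:simon@example.org", "Simon St.Laurent"),
    ("m:realName", "mailto:lisa@example.org", "Lisa Rein"),
    ("m:author", "mid:1@example.org", "mailto:simon@example.org"),
    ("m:author", "mid:2@example.org", "mailto:lisa@example.org"),
    ("m:author", "mid:3@example.org", "mailto:simon@example.org"),
]


def match(pattern, triple, bindings):
    """Extend bindings so pattern equals triple, or return None on mismatch."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable binds on first use and must agree on later uses.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def select(patterns, triples):
    """Return every set of bindings that satisfies all patterns together."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for bindings in solutions:
            for triple in triples:
                candidate = match(pattern, triple, bindings)
                if candidate is not None:
                    extended.append(candidate)
        solutions = extended
    return solutions
```

Calling select with the two patterns ("m:realName", "?a", "Simon St.Laurent") and ("m:author", "?i", "?a") over TRIPLES binds ?i to the two messages whose author is the address bound to ?a. One limitation of treating every term that starts with ? as a variable is that a literal value beginning with a question mark cannot be matched.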
RDFDB offers a great backbone -- storage and query facilities -- for integrating diverse information sources. In its early stages now, it's a project that deserves to get more mindshare. The SQL-like syntax brings a familiarity to querying that other, more Prolog-like, mechanisms don't.
Architecturally, I find the implementation of RDFDB as a database server a great advantage. It immediately makes multiple data sources and clients a reality, and makes cross-platform implementation easy (writing a language client to RDFDB is pretty trivial; I managed a workable first cut in 10 lines of Perl).
RDF is slowly getting more use in the field, but it needs more ubiquitous, easy-to-use technology and APIs to be an obvious first stop for metadata and resource discovery applications. RDFDB can make an important contribution in this area.