Distributed XML
The role played by XML in the next-generation Web
Introduction
It's not cool to be different, at least not where Internet computing is concerned. Despite widespread agreement about certain subsections of Internet technology -- SMTP for email, for instance -- many services and sources of data remain desperately unconnected. The same is equally true of desktop computing. Although office suites provide some degree of integration, exchanging data between applications from different vendors is very frustrating. Add the Web and email into the mix and the problem gets worse.
The more I use and rely on computers, the more I realize I ought to stand up for my rights. I'm abused and tormented by a patchwork of programs that hardly work together, that trap my data in places I don't want, and make me adopt unnatural working styles. The busier I get, the more I get buried in information overload, the more I realize this is happening, and the more I want it fixed.
XML offers hope for escape from the current situation of fragmentation and disarray. In this talk I will focus on two technologies that look as though they'll have a big impact in this area: SOAP and RDF. I'll also talk about the shift in architecture, from centralized to decentralized, that we'll need to embrace as the world of Internet computing continues to grow.
The Dream
The dream that drives the integrated vision of the future is of a universal homogeneous view of information. No special cases or peculiar formats but a universally accessible "data bus" over the realm of the Internet, and by extension, all your private data sources. Let's look at the components of such a system. Fundamentally, they are a universal addressing scheme and a universal data format. In many ways these represent the essential components of a universal computer.
The addressing scheme, uniform resource identifiers (URIs), has been in operation on the Web for a long time now. The data format, XML, has been around for nearly three years, and it's clearly providing many benefits in reducing the translation overheads of communication within and between organizations.
We're now close to the conditions in which computing can be performed over the whole span of the Internet. However, every computer requires instructions and a language in which to program. These are the problems we need to work on now in order to realize the greater promise.
One of the luxuries of being able to address a conference full of XML developers is the chance to urge you to check out some new ideas. I hope that some of this dream will catch in your minds.
The Universal Computer
I've already mentioned that URIs are pretty much set in place as an addressing scheme, and it looks increasingly like HTTP is becoming the transport for information and instructions in our "universal computer". Yet this still leaves a couple of problems unsolved:
How do we encode the data? Plain old XML won't do by itself -- how do we say something is an integer, something represents a "Person", etc.?
How do we encode instructions? If we want to cause another computer somewhere else on the web to perform a function, how do we express that?
These questions bring me to two technologies that go some way toward answering them: RDF and SOAP.
Characterizing RDF
RDF, the Resource Description Framework, is a technology invented at the W3C. It was one of the earliest XML applications, and definitely the first to use XML namespaces in earnest. For various reasons, it has not enjoyed the rapid rise of XML itself or, more recently, XSLT; its growth has instead been slow but steady. One important factor is that its immediate user community has not been e-commerce.
What is RDF for, and what are its qualities? Well, it does what it says on the package: RDF is a language for describing things. Well, one might object, you can describe things in XML anyway, so why do you need more? The answer is that XML is too flexible, and you need conventions. As an example, there are many ways to indicate the color of something in XML. Try to describe a "red car":
<car color="red" />
<car><color>red</color></car>
<car color="#cc" /><color id="cc" shade="red" />
I just came up with these three on the spur of the moment; there are lots of other ways you could write that fact down. What RDF does is to invent a standard way of interpreting XML-encoded descriptions of things, or "resources", which turns out to be very useful.
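To see why the lack of a convention hurts, here is a minimal Python sketch that reads the color back out of the three encodings above. The `car_color` function and the wrapping `<doc>` root element are my own additions for illustration; the point is only that each encoding needs its own branch of extraction logic.

```python
import xml.etree.ElementTree as ET

# The three encodings from the text, each wrapped in a <doc> root
# (an assumption made here so that every variant parses as one document).
VARIANTS = [
    '<doc><car color="red" /></doc>',
    '<doc><car><color>red</color></car></doc>',
    '<doc><car color="#cc" /><color id="cc" shade="red" /></doc>',
]

def car_color(xml_text):
    """Return the car's color, trying each convention in turn."""
    root = ET.fromstring(xml_text)
    car = root.find("car")
    attr = car.get("color")
    if attr is None:
        # Second form: the color is a child element of <car>.
        return car.findtext("color")
    if attr.startswith("#"):
        # Third form: the attribute refers to a separate <color> element.
        for color in root.findall("color"):
            if color.get("id") == attr[1:]:
                return color.get("shade")
        return None
    # First form: the attribute holds the value directly.
    return attr

colors = [car_color(v) for v in VARIANTS]
print(colors)
```

Every new way of writing the fact down would need another branch, which is the cost a shared interpretation removes.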
Further, RDF employs URIs as a naming scheme. This means that there's one naming convention, which has the property of being able to generate globally unique names for your resources. This is another important feature RDF needs to be able to model the real world. (Incidentally, it also shines the spotlight on the fact that naming anything is very hard indeed!) One consequence of using URIs is that it enables RDF to be used in a decentralized fashion: unique names need no context to qualify them, so anyone anywhere can write descriptions that involve anything anywhere.
Another key feature of RDF is that it's openly extensible. Unlike plain old XML, there's no sense of constraining what the document can describe by a DTD or schema. This means that if you get a description of something from someone, and want to add your own observations to it, you can do so without having to agree to a change in the schema. For certain classes of application -- particularly annotation and metadata applications -- this is a great advantage. It also means that systems using RDF run less risk of getting stuck in a legacy file format situation: the ability to cope in a forward-compatible manner usually comes hand-in-hand with using RDF.
Essential RDF
Let's take a look at some of the essentials of RDF. At its most basic, RDF is a way of modeling things. A lot of XML technologies tend to start at the syntax and go from there (following the lead of XML itself). To get a good understanding of RDF, it's better to start with the model.
Everything in RDF can be represented by a graph with nodes and arcs. Each node is a resource, and each arc represents a property. Both properties and resources are named with URIs. What does this mean? It means that the whole Web and beyond (in short, anything which you can name) is within the scope of an RDF description. In effect, RDF graphs boil down into a "soup" of logical assertions.
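One way to picture this model in code is as a set of (subject, property, value) triples, where each triple is one arc in the graph. The sketch below is illustrative only: the `Triple` type and the `http://example.org/` URIs are made up for the example and are not part of RDF itself.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str    # URI of the resource the arc leaves
    predicate: str  # URI naming the property (the arc itself)
    obj: str        # URI of another resource, or a literal value

# Hypothetical URIs, chosen only for illustration.
CAR = "http://example.org/cars/1"
COLOR = "http://example.org/terms#color"
OWNER = "http://example.org/terms#owner"

graph = {
    Triple(CAR, COLOR, "red"),                      # arc ending in a literal
    Triple(CAR, OWNER, "mailto:fred@example.org"),  # arc to another resource
}

# Each triple is one assertion; the graph is simply the set of them.
nodes = {t.subject for t in graph} | {t.obj for t in graph}  # resources and literals
arcs = {t.predicate for t in graph}
```

Because the graph is a plain set, adding an assertion never disturbs the ones already there, which is the "soup" quality described above.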
To make this more concrete, let's have a look at an RDF description of part of an email inbox.
[Figure: Graph of Email Inbox]

The email resource itself is named by the mid:339.C2@foo.com URI (derived from the Message-Id header field), and it has various properties that are either literal values or resources themselves. In particular, we've decided to name the author of the email by their email address and give them a "real name" property.
RDF, as written in XML, is a syntactical formulation of these graphs.
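As a rough sketch of that relationship, the Python below turns a few triples about the email into RDF/XML using `rdf:Description`, `rdf:about`, and `rdf:resource`. Only the `mid:339.C2@foo.com` name comes from the example above; the `http://example.org/email#` vocabulary, the property names, the author's address, and the literal values are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/email#"  # hypothetical vocabulary, for illustration

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("ex", EX_NS)

# (subject, property, value, value_is_resource) -- values are invented examples
triples = [
    ("mid:339.C2@foo.com", "subject", "Lunch on Friday", False),
    ("mid:339.C2@foo.com", "author", "mailto:fred@foo.com", True),
    ("mailto:fred@foo.com", "realName", "Fred Smith", False),
]

root = ET.Element(f"{{{RDF_NS}}}RDF")
descriptions = {}
for subject, prop, value, is_resource in triples:
    # One rdf:Description element per distinct subject node.
    desc = descriptions.get(subject)
    if desc is None:
        desc = ET.SubElement(root, f"{{{RDF_NS}}}Description",
                             {f"{{{RDF_NS}}}about": subject})
        descriptions[subject] = desc
    arc = ET.SubElement(desc, f"{{{EX_NS}}}{prop}")
    if is_resource:
        arc.set(f"{{{RDF_NS}}}resource", value)  # arc points at another node
    else:
        arc.text = value                         # arc ends in a literal

rdf_xml = ET.tostring(root, encoding="unicode")
print(rdf_xml)
```

The same graph could be serialized in other equivalent ways; the XML is just one rendering of the underlying nodes and arcs.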
So far, this is fairly straightforward. If it's so neat, what can it do?
I can ask questions about my email:
Who sent me mail on a particular topic?
Get me all the mail from Fred Smith
Who were the people I exchanged mail with on Friday?
I can join up my email graphs with other ones:
Address books
Home pages
Browser history
Organizational affiliations
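Both ideas, asking questions and joining graphs, can be sketched in a few lines of Python. This is a toy pattern matcher, not a real RDF query engine; the `match` function, the two `http://example.org/` vocabularies, the second message, and the address book entry are all hypothetical.

```python
def match(graph, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

EX = "http://example.org/email#"        # hypothetical vocabulary
AB = "http://example.org/addressbook#"  # hypothetical vocabulary

inbox = {
    ("mid:339.C2@foo.com", EX + "author", "mailto:fred@foo.com"),
    ("mid:339.C2@foo.com", EX + "subject", "Lunch on Friday"),
    ("mid:340.C2@foo.com", EX + "author", "mailto:jane@bar.org"),
    ("mailto:fred@foo.com", EX + "realName", "Fred Smith"),
}
address_book = {
    ("mailto:fred@foo.com", AB + "worksFor", "http://foo.com/"),
}

# Joining graphs is a set union: shared URIs line the nodes up.
graph = inbox | address_book

# "Get me all the mail from Fred Smith"
fred = [s for s, _, _ in match(graph, p=EX + "realName", o="Fred Smith")]
mail_from_fred = [s for person in fred
                  for s, _, _ in match(graph, p=EX + "author", o=person)]

# A question spanning both sources: which organizations have mailed me?
orgs = {o for _, _, author in match(graph, p=EX + "author")
        for _, _, o in match(graph, s=author, p=AB + "worksFor")}
```

Neither source had to know about the other: the address book only needed to use the same `mailto:` name for Fred that the inbox did.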
Of course, RDF is just a way of representing data, and you need query engines to give you answers to these questions. This is an area that's growing at the moment, and there are several great open source projects available. Some of them are based on Prolog, which is great if you're inclined that way, but others are written in C or Java and so fit more naturally with mainstream programming languages.
The real power is being able to join together graphs from multiple data sources. The use of URIs for names enables them to act as connectors over a potentially infinite data source. An RDF processor could then chase down bits of information by following these links. Imagine layering an RDF graph over Amazon.com for instance. You could construct the "is-similar-author-to" property over all book authors by using the detail from the "customers who bought this book also bought..." information they present.
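The Amazon.com idea amounts to deriving new arcs from existing ones. Here is a minimal Python sketch under made-up data: the `http://example.org/` URIs, the `alsoBought` and `isSimilarAuthorTo` property names, and the three books are all hypothetical, and nothing here reflects how Amazon.com actually exposes its data.

```python
EX = "http://example.org/books#"  # hypothetical vocabulary
BOOK = "http://example.org/book/"
AUTHOR = "http://example.org/author/"

# Hypothetical triples standing in for what a bookseller's pages present.
graph = {
    (BOOK + "1", EX + "author", AUTHOR + "a"),
    (BOOK + "2", EX + "author", AUTHOR + "b"),
    (BOOK + "3", EX + "author", AUTHOR + "a"),
    (BOOK + "1", EX + "alsoBought", BOOK + "2"),
    (BOOK + "1", EX + "alsoBought", BOOK + "3"),
}

def authors_of(book):
    """All authors asserted for the given book URI."""
    return {o for s, p, o in graph if s == book and p == EX + "author"}

# Derive new arcs: authors of books that were bought together are "similar".
# Only the direction present in the data is derived; no symmetry is assumed.
similar = set()
for s, p, o in graph:
    if p == EX + "alsoBought":
        for a in authors_of(s):
            for b in authors_of(o):
                if a != b:
                    similar.add((a, EX + "isSimilarAuthorTo", b))
```

The derived triples can be added back into the graph and queried like any others.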
If we take a step back from this connected chain of information for a second, it might sound familiar: a standard way of writing information, connected together via universally unique names. In many ways, RDF does the same thing for computers that HTML does for humans. The HTML-web has enabled humans to chase down various bits of information through links and query engines: widespread RDF on the web will enable computers to do that. What matters most of all is the linking through universal names.
RDF Processing Models
What kinds of computation can be done with RDF? How will this web of information actually work? This is where we definitely walk into the world of prototypes and experimentation. The basic method of processing generally involves aggregation as a first step. Here, RDF sources are mined for their descriptions, and these are bundled into a local store of some sort. From there, queries can be performed on the data.
A slightly more sophisticated architecture may involve some kind of dynamic description generation or querying. Let's use Amazon.com as our example again. If I wanted to run a query on Harry Potter books, in order to see which books are in a similar genre, I do not want to import Amazon.com's entire catalog. Furthermore, Amazon.com doesn't want me to import their entire catalog. Instead, they may just give me access to a virtual graph, which I can query without having to construct.
In general, when processing RDF, the logic tends to be performed in one place, while the source data can be widely distributed.
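The two processing models described above can be contrasted in a short Python sketch. Everything here is hypothetical: `LocalStore`, `VirtualGraph`, `remote_service`, and the catalog data are names I've invented, and the "remote" service is an ordinary function standing in for a query endpoint on someone else's site.

```python
def fits(triple, s, p, o):
    """True if the triple matches the pattern; None is a wildcard."""
    return all(want is None or want == got
               for want, got in zip((s, p, o), triple))

EX = "http://example.org/books#"  # hypothetical vocabulary

# Hypothetical catalog held by the remote site.
CATALOG = {
    ("http://example.org/book/1", EX + "genre", "fantasy"),
    ("http://example.org/book/2", EX + "genre", "fantasy"),
    ("http://example.org/book/3", EX + "genre", "cookery"),
}

class LocalStore:
    """Aggregation: copy every source's triples locally, then query the copy."""
    def __init__(self, sources):
        self.triples = set()
        for source in sources:
            self.triples |= set(source)

    def match(self, s=None, p=None, o=None):
        return {t for t in self.triples if fits(t, s, p, o)}

class VirtualGraph:
    """Virtual graph: nothing is imported; each pattern is answered on demand."""
    def __init__(self, answer):
        self.answer = answer  # callable standing in for a remote query service

    def match(self, s=None, p=None, o=None):
        return set(self.answer(s, p, o))

def remote_service(s, p, o):
    # Stand-in for the site's own query endpoint; only matches leave the site.
    return (t for t in CATALOG if fits(t, s, p, o))

local = LocalStore([CATALOG])
virtual = VirtualGraph(remote_service)

pattern = dict(p=EX + "genre", o="fantasy")
same_answer = local.match(**pattern) == virtual.match(**pattern)
```

Both objects answer the same query the same way; the difference is that `LocalStore` had to copy the whole catalog first, while `VirtualGraph` only ever sees the matching triples.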
RDF Infrastructure Requirements
As with most technologies, RDF requires other things to be in place to support its widespread use. These include:
Vocabularies: there is little use being able to talk to each other unless we understand what we mean when we use a particular phrase. If I can retrieve the fact that the "car is red", I don't really have any useful information unless I know precisely what "car", "red", and "is" mean.
Query languages: once we have constructed our database of information, we need query languages and standard APIs to use the data. This is an area of active development for RDF, but we are still some way from having something as mature as SQL.
Data stores: developers constructing systems using RDF shouldn't have to worry about how they will store their data; storage needs to be a "drop-in" component. Like querying and APIs, this is an area under active development, and projects like R.V. Guha's rdfdb and Dave Beckett's Redland show promise.
Characterization: this is a very interesting issue concerning query and inference-based systems that use RDF. Where are the bounds of what I know about? How do I find out what other people know about, and how can I express those bounds? Issues like this become important when attempting to link up multiple sources of information.