Distributed XML

September 6, 2000

The role played by XML in the next-generation Web

Introduction

It's not cool to be different, at least not where Internet computing is concerned. Despite widespread agreement about certain subsections of Internet technology -- SMTP for email, for instance -- many services and sources of data remain desperately unconnected. The same is equally true of desktop computing. Although office suites provide some degree of integration, exchanging data between applications from different vendors is very frustrating. Add the Web and email into the mix and the problem gets worse.

The more I use and rely on computers, the more I realize I ought to stand up for my rights. I'm abused and tormented by a patchwork of programs that hardly work together, that trap my data in places I don't want, and make me adopt unnatural working styles. The busier I get, the more I get buried in information overload, the more I realize this is happening, and the more I want it fixed.

XML offers hope for escape from the current situation of fragmentation and disarray. In this talk I will focus on two technologies that look as though they'll have a big impact in this area: SOAP and RDF. I'll also talk about the shift in architecture, from centralized to decentralized, that we'll need to embrace as the world of Internet computing continues to grow.

The Dream

The dream that drives the integrated vision of the future is of a universal homogeneous view of information. No special cases or peculiar formats but a universally accessible "data bus" over the realm of the Internet, and by extension, all your private data sources. Let's look at the components of such a system. Fundamentally, they are a universal addressing scheme and a universal data format. In many ways these represent the essential components of a universal computer.

The addressing scheme, universal resource identifiers (URIs), has been in operation over the web for a long time now. The data format, XML, has been around for nearly three years, and it's clearly providing many benefits in reducing the translation overheads of communication within and between organizations.

We're now close to the conditions in which computing can be performed over the whole span of the Internet. However, every computer requires instructions and a language in which to program. These are the problems we need to work on now in order to realize the greater promise.

One of the luxuries of being able to address a conference full of XML developers is the chance to urge you to check out some new ideas. I hope that some of this dream will catch in your minds.

The Universal Computer

I've already mentioned that URIs are pretty much set in place as an addressing scheme, and it looks increasingly like HTTP is becoming the transport for information and instructions in our "universal computer". Yet this still leaves a couple of problems unsolved:

How do we encode the data? Plain old XML won't do by itself -- how do we say something is an integer, something represents a "Person", etc.?
How do we encode instructions? If we want to cause another computer somewhere else on the web to perform a function, how do we express that?

These questions lead me to two technologies that go some way toward answering them: RDF and SOAP.

Characterizing RDF

RDF, the Resource Description Framework, is a technology invented at the W3C. It was one of the earliest XML applications, and definitely the first to use XML namespaces in earnest. For various reasons, its rise has not been the up-and-up that XML itself, and more recently XSLT, has achieved, but, rather, a slow but steady expansion. Its immediate user community has not been e-commerce, which has also been an important factor.

What is RDF for, and what are its qualities? Well, it does what it says on the package: RDF is a language for describing things. Well, one might object, you can describe things in XML anyway, so why do you need more? The answer is that XML is too flexible, and you need conventions. As an example, there are many ways to indicate the color of something in XML. Try to describe a "red car":

<car color="red" />

<car><color>red</color></car>

<car color="#cc" /><color id="cc" shade="red" />

I just came up with these three on the spur of the moment; there are lots of other ways you could write that fact down. What RDF does is to invent a standard way of interpreting XML-encoded descriptions of things, or "resources", which turns out to be very useful.
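To see why a convention helps, here is a minimal Python sketch (not part of RDF itself) that reads the three encodings above. Each one needs its own extraction logic before it yields the same (subject, property, value) statement. The wrapper element doc and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The three ad hoc encodings of "red car" from the text. The third is wrapped
# in a hypothetical <doc> element so that it parses as a single document.
ATTRIBUTE_FORM = '<car color="red" />'
CHILD_FORM = '<car><color>red</color></car>'
REFERENCE_FORM = '<doc><car color="#cc" /><color id="cc" shade="red" /></doc>'


def from_attribute(xml_text):
    """The color is an attribute of <car>."""
    car = ET.fromstring(xml_text)
    return ("car", "color", car.get("color"))


def from_child(xml_text):
    """The color is the text of a child element."""
    car = ET.fromstring(xml_text)
    return ("car", "color", car.findtext("color"))


def from_reference(xml_text):
    """The color is a "#id" reference to a separate <color> element."""
    doc = ET.fromstring(xml_text)
    ref = doc.find("car").get("color").lstrip("#")
    for color in doc.findall("color"):
        if color.get("id") == ref:
            return ("car", "color", color.get("shade"))
    raise ValueError("unresolved color reference: " + ref)


# Three different readers, one resulting statement.
statements = {
    from_attribute(ATTRIBUTE_FORM),
    from_child(CHILD_FORM),
    from_reference(REFERENCE_FORM),
}
print(statements)
```

All three readers collapse to one statement, but only because each was hand-written for its format. A shared interpretation rule removes the need for that per-format code.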

Further, RDF employs URIs as a naming scheme. This means that there's one naming convention, which has the property of being able to generate globally unique names for your resources. This is another important feature RDF needs to be able to model the real world. (Incidentally, it also shines the spotlight on the fact that naming anything is very hard indeed!) One consequence of using URIs is that it enables RDF to be used in a decentralized fashion: unique names need no context to qualify them, so anyone anywhere can write descriptions that involve anything anywhere.

Another key feature of RDF is that it's openly extensible. Unlike plain old XML, there's no sense of constraining what the document can describe by a DTD or schema. This means that if you get a description of something from someone, and want to add your own observations to it, you can do so without having to agree to a change in the schema. For certain classes of application -- particularly annotation and metadata applications -- this is a great advantage. It also means that systems using RDF run less risk of getting stuck in a legacy file format situation: the ability to cope in a forward-compatible manner usually comes hand-in-hand with using RDF.

Essential RDF

Let's take a look at some of the essentials of RDF. At its most basic, RDF is a way of modeling things. A lot of XML technologies tend to start at the syntax and go from there (following the lead of XML itself). To get a good understanding of RDF, it's better to start with the model.

Everything in RDF can be represented by a graph with nodes and arcs. Each node is a resource, and each arc represents a property. Both properties and resources are named with URIs. What does this mean? It means that the whole Web and beyond (in short, anything which you can name) is within the scope of an RDF description. In effect, RDF graphs boil down into a "soup" of logical assertions.

To make this more concrete, let's have a look at an RDF description of part of an email inbox.

[Figure: Model of email inbox, drawn as a graph]

The email resource itself is named by the mid:339.C2@foo.com URI (derived from the Message-Id header field), and it has various properties that are either literal values or resources themselves. In particular, we've decided to name the author of the email by their email address and give them a "real name" property.

RDF, as written in XML, is a syntactical formulation of these graphs.
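As a rough illustration of the model rather than the syntax, the graph can be held as a set of (subject, property, value) triples, one per arc. This is a sketch under assumptions: the vocabulary namespace, the property names, the sender's address, and the subject line are all hypothetical, and only the mid: URI comes from the text.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """A plain value at the end of an arc, as opposed to a named resource."""
    value: str


EX = "http://example.org/mail#"      # hypothetical vocabulary namespace
MESSAGE = "mid:339.C2@foo.com"       # message URI taken from the text
AUTHOR = "mailto:fred@example.org"   # hypothetical author address

# Each triple is one arc: (resource, property URI, resource or literal).
graph = {
    (MESSAGE, EX + "subject", Literal("Lunch on Friday")),
    (MESSAGE, EX + "author", AUTHOR),
    (AUTHOR, EX + "realName", Literal("Fred Smith")),
}


def objects(graph, subject, prop):
    """Follow every arc named prop that leaves subject."""
    return [o for (s, p, o) in graph if s == subject and p == prop]


# Walk two arcs: message -> author resource -> real name literal.
author = objects(graph, MESSAGE, EX + "author")[0]
real_name = objects(graph, author, EX + "realName")[0]
print(real_name.value)
```

Because the author is a resource with its own URI rather than a string, other descriptions can attach further properties to the same node.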

So far, this is fairly straightforward. If it's so neat, what can it do?

I can ask questions about my email:
- Who sent me mail on a particular topic?
- Get me all the mail from Fred Smith.
- Who were the people I exchanged mail with on Friday?
I can join up my email graphs with other ones:
- Address books
- Home pages
- Browser history
- Organizational affiliations
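The two ideas above, asking questions and joining graphs, can be sketched with plain Python sets of triples. This is not a real RDF query engine, and the vocabulary, message identifiers, and addresses are invented for the example. The point is that "all the mail from Fred Smith" only becomes answerable once the inbox graph is merged with an address book that shares the same URIs.

```python
EX = "http://example.org/terms#"  # hypothetical vocabulary namespace

inbox = {
    ("mid:1@example.org", EX + "from", "mailto:fred@example.org"),
    ("mid:1@example.org", EX + "subject", "RDF query engines"),
    ("mid:2@example.org", EX + "from", "mailto:ann@example.org"),
    ("mid:2@example.org", EX + "subject", "Lunch"),
    ("mid:3@example.org", EX + "from", "mailto:fred@example.org"),
    ("mid:3@example.org", EX + "subject", "SOAP endpoints"),
}

address_book = {
    ("mailto:fred@example.org", EX + "realName", "Fred Smith"),
    ("mailto:ann@example.org", EX + "realName", "Ann Jones"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def mail_from(graph, real_name):
    """Find messages whose sender resource carries the given real name."""
    senders = {s for (s, _, _) in match(graph, p=EX + "realName", o=real_name)}
    return sorted(
        msg for (msg, _, sender) in match(graph, p=EX + "from")
        if sender in senders
    )


# Joining two graphs is just a set union, because the names are global URIs.
merged = inbox | address_book
print(mail_from(merged, "Fred Smith"))
```

Run against the inbox alone, the same query finds nothing, since the inbox knows addresses but not names.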

Of course, RDF is just a way of representing data, and you need query engines to give you answers to these questions. This is an area that's growing at the moment, and there are several great open source projects available. Some of them are based on Prolog, which is great if you're inclined that way, but others are C- and Java-based, more oriented toward popular programming languages.

The real power is being able to join together graphs from multiple data sources. The use of URIs for names enables them to act as connectors over a potentially infinite data source. An RDF processor could then chase down bits of information by following these links. Imagine layering an RDF graph over Amazon.com for instance. You could construct the "is-similar-author-to" property over all book authors by using the detail from the "customers who bought this book also bought..." information they present.
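A toy version of that derivation might look like the following. It assumes a hypothetical vocabulary with author and alsoBought properties and invented book and author URIs; nothing here reflects how any real retailer exposes its data.

```python
EX = "http://example.org/shop#"  # hypothetical vocabulary namespace
AUTHOR = EX + "author"
ALSO_BOUGHT = EX + "alsoBought"
SIMILAR = EX + "is-similar-author-to"

BOOK = "http://example.org/book/"      # invented resource names
PERSON = "http://example.org/author/"

catalog = {
    (BOOK + "1", AUTHOR, PERSON + "a"),
    (BOOK + "2", AUTHOR, PERSON + "b"),
    (BOOK + "3", AUTHOR, PERSON + "a"),
    (BOOK + "1", ALSO_BOUGHT, BOOK + "2"),
    (BOOK + "1", ALSO_BOUGHT, BOOK + "3"),
}


def authors_of(graph, book):
    """Collect every author resource attached to a book."""
    return {o for (s, p, o) in graph if s == book and p == AUTHOR}


def similar_authors(graph):
    """Derive author-to-author arcs from book-to-book co-purchase arcs."""
    derived = set()
    for (book, prop, other) in graph:
        if prop != ALSO_BOUGHT:
            continue
        for a in authors_of(graph, book):
            for b in authors_of(graph, other):
                if a != b:  # an author is not "similar" to themselves
                    derived.add((a, SIMILAR, b))
    return derived


derived = similar_authors(catalog)
print(derived)
```

The derived arcs are ordinary triples, so they can be merged back into the graph and queried like any other property.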

If we take a step back from this connected chain of information for a second, it might sound familiar: a standard way of writing information, connected together via universally unique names. In many ways, RDF does the same thing for computers that HTML does for humans. The HTML-web has enabled humans to chase down various bits of information through links and query engines: widespread RDF on the web will enable computers to do that. What matters most of all is the linking through universal names.

RDF Processing Models

What kinds of computation can be done with RDF? How will this web of information actually work? This is where we definitely walk into the world of prototypes and experimentation. The basic method of processing generally involves aggregation as a first step. Here, RDF sources are mined for their descriptions, and these are bundled into a local store of some sort. From there, queries can be performed on the data.

A slightly more sophisticated architecture may involve some kind of dynamic description generation or querying. Let's use Amazon.com as our example again. If I wanted to run a query on Harry Potter books, in order to see which books are in a similar genre, I do not want to import Amazon.com's entire catalog. Furthermore, Amazon.com doesn't want me to import their entire catalog. Instead, they may just give me access to a virtual graph, which I can query without having to construct.

In general, when processing RDF, the logic tends to be performed in one place, while the source data can be widely distributed.

RDF Infrastructure Requirements

As with most technologies, RDF requires other things to be in place to support its widespread use. These include:

Vocabularies: there is little use being able to talk to each other unless we understand what we mean when we use a particular phrase. If I can retrieve the fact that the "car is red", I don't really have any useful information unless I know precisely what "car", "red", and "is" mean.
Query languages: once we have constructed our database of information, we need query languages and standard APIs to use the data. This is an area in which active development is being pursued in RDF, but we are still some way from having something as mature as SQL.
Data stores: developers constructing systems using RDF shouldn't have to worry about how they will store their data; storage needs to be a "drop-in" component. Like querying and APIs, this is an area under active development, and projects like R.V. Guha's rdfdb and Dave Beckett's Redland show promise.
Characterization: this is a very interesting issue concerning query and inference-based systems that use RDF. Where are the bounds of what I know about? How do I find out what other people know about, and how can I express those bounds? Issues like this become important when attempting to link up multiple sources of information.

Characterizing SOAP

If RDF is the Prolog of XML, then SOAP is its Java. While RDF's heritage is the declarative disciplines of knowledge representation and logic programming, SOAP's heritage is in imperative, "conventional", object-oriented culture.

At its most basic level, SOAP is a set of rules for representing data in XML. Given a data structure, SOAP prescribes an agreed-upon serialization of it. That may sound incredibly similar to the basic explanation of RDF that I gave: indeed for any given graph, the RDF and SOAP representations are more or less identical. (Henrik Frystyk Nielsen gave a presentation on this at the 9th International World Wide Web conference in Amsterdam earlier this year.)

Beyond serialization of data, SOAP was created with messaging as its target application, providing an over-the-wire representation for messages. In contrast to RDF documents, which tend just to "sit there" until a processing application comes along, SOAP documents are actively passed in between computers (the destinations known as "endpoints"). SOAP further provides a mapping for those message exchanges to implement a remote procedure call mechanism.

Because of the defined semantics of what happens when a computer receives a SOAP message, SOAP servers will also publish contracts about what they will and will not accept.

Essential SOAP

SOAP is a protocol for serializing data and wrapping it in an envelope so it can be transported between endpoints. Like RDF, it attempts to bring some order to things that could be done in many ways. The scenario it addresses is how to perform machine-to-machine communication using XML.

This scenario breaks down into two sub-problems:

Encoding: each machine must use the same way of representing data types and wrapping up the message
Protocol: each machine must use the same rules of choreography for message exchange

Let's look at a simple example of a SOAP message:

Encoding example


<x:PurchaseOrder>
  <x:CustomerName>Henry Ford</x:CustomerName>
  <x:ShipTo>
    <a:Street>5th Ave</a:Street>
    <a:City>New York</a:City>
    <a:State>NY</a:State>
    <a:Zip>10010</a:Zip>
  </x:ShipTo>
  <x:PurchaseLineItems>
    <x:Order>
      <x:Product>Apple</x:Product>
      <x:Price>1.56</x:Price>
    </x:Order>
    <x:Order>
      <x:Product>Peach</x:Product>
      <x:Price>1.48</x:Price>
    </x:Order>
  </x:PurchaseLineItems>
</x:PurchaseOrder>

(The namespace prefixes x and a are assumed to be bound to some meaningful namespace URI. Note the use of namespaces here: as in RDF, they allow global specification of the semantics of a particular element.)
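The fragment above shows only the encoded data. As a minimal sketch of the "wrapping" half, the following Python builds a cut-down purchase order, places it in a SOAP 1.1 Envelope and Body, serializes it, and reads it back. The two namespace URIs bound to x and a are hypothetical, as are the helper function names; the envelope namespace is the one defined by SOAP 1.1.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
X = "http://example.org/purchasing"  # hypothetical binding for prefix x
A = "http://example.org/address"     # hypothetical binding for prefix a


def build_order(customer, city):
    """Build a reduced version of the purchase order shown in the text."""
    order = ET.Element(f"{{{X}}}PurchaseOrder")
    ET.SubElement(order, f"{{{X}}}CustomerName").text = customer
    ship_to = ET.SubElement(order, f"{{{X}}}ShipTo")
    ET.SubElement(ship_to, f"{{{A}}}City").text = city
    return order


def wrap(payload):
    """Place the payload inside Envelope/Body, ready to send."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    body.append(payload)
    return envelope


def unwrap(xml_bytes):
    """Parse a received message and return the first element in the Body."""
    envelope = ET.fromstring(xml_bytes)
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    return list(body)[0]


message = ET.tostring(wrap(build_order("Henry Ford", "New York")))
received = unwrap(message)
print(received.tag)
print(received.findtext(f"{{{X}}}ShipTo/{{{A}}}City"))
```

The receiver identifies the payload by its namespace-qualified name, not by the prefix the sender happened to choose, which is what lets both ends agree on meaning.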

SOAP Processing Models

SOAP's typical deployment scenarios are different from RDF's. SOAP is most definitely in the "enterprise" buzzword camp, and is being pushed in relation to e-business services. Aside from carrying business messages between servers, one aspect of SOAP receiving much attention is its ability to perform RPC-over-HTTP. This feature has received frowns from many departments, especially those conscious of network security.

The worries center around the naïve deployment of a SOAP server: default bindings might expose all manner of internal services to the outside world via port 80. The incentive for the developer is that deploying SOAP services is a lot simpler than CORBA or DCOM, along with none of that pesky wrangling with the network administrator. It has been pointed out, however, that firewall technology will simply get enhanced to sniff the contents of normal HTTP traffic in order to be assured that only allowed SOAP requests are passed through. This still doesn't solve the case of SOAP endpoints exposed over SSL -- there's a whole can of worms that SOAP opens, which security experts are still peering into.
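The difference between a naïve binding and a deliberate one can be sketched as an explicit allow-list at the point where a message is turned into a function call. This is a simplified illustration, not the SOAP RPC encoding itself: the service namespace, the method names, the untyped arg elements, and the request helper (which does no XML escaping) are all assumptions, and no network is involved.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SVC = "http://example.org/quotes"  # hypothetical service namespace


class Rejected(Exception):
    """Raised when a message does not name an explicitly exposed method."""


def get_price(product):
    """A function we are happy to expose to outside callers."""
    prices = {"Apple": "1.56", "Peach": "1.48"}
    return prices[product]


def delete_all_orders():
    """An internal function that must never be reachable from outside."""
    raise RuntimeError("internal use only")


# Only methods listed here can be invoked, whatever else the server contains.
ALLOWED = {f"{{{SVC}}}GetPrice": get_price}


def dispatch(xml_text):
    """Map the single Body element to an allowed function and call it."""
    envelope = ET.fromstring(xml_text)
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    if body is None or len(body) != 1:
        raise Rejected("expected exactly one element in the Body")
    call = body[0]
    handler = ALLOWED.get(call.tag)
    if handler is None:
        raise Rejected("method not allowed: " + call.tag)
    return handler(*[child.text for child in call])


def request(method, *args):
    """Build a request message for the hypothetical service (no escaping)."""
    params = "".join(f"<arg>{a}</arg>" for a in args)
    return (
        f'<e:Envelope xmlns:e="{SOAP_ENV}"><e:Body>'
        f'<m:{method} xmlns:m="{SVC}">{params}</m:{method}>'
        f"</e:Body></e:Envelope>"
    )


print(dispatch(request("GetPrice", "Apple")))
```

A request naming DeleteAllOrders is refused with Rejected before any internal code runs. A content-inspecting firewall of the kind described above would apply the same sort of check one hop earlier.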

RPC with SOAP, maligned or otherwise, fits in well with the changing styles of programming accompanying the increase in web applications. Use of interpreted scripting languages (like Perl, JavaScript, Python) is on the rise, with a tendency towards a number of small programs with well-defined responsibilities. The platform independence and lack of prerequisite machinery make SOAP an attractive option for interprogram communication in this scenario. However, one shouldn't underestimate what is needed for a fully-fledged distributed object system, such as CORBA, and these complications will surely come.

Comparing SOAP's processing model to RDF's, we can see that with SOAP it's the documents that do the walking, and the computation is distributed over multiple computers. Functionality, rather than data, is what gets aggregated in the SOAP model. With that come some hard problems to solve, too, mainly in the areas of latency and reliability, which is what leads me to suspect that SOAP will find its immediate home most comfortably in predictable network situations.

SOAP Infrastructure Requirements

Like RDF, SOAP also depends on known vocabularies in order to communicate with predictable semantics. Depending on the scenarios of use, vocabularies used with SOAP can have the following scopes:

private: in a pre-arranged one-off situation, a custom vocabulary can be agreed among the parties involved
industry-wide: vocabularies that are specific to particular industries, often maintained by an industry standards body
global: reusable vocabularies that cut across all industries and spheres of use

Because SOAP documents are intended to travel, they require additional infrastructure, including servers which route messages around and deliver them to the correct software components. The facilities offered by these software components also need a description language (of which there are currently two competing specifications, one each from IBM and Microsoft), and also a means of discovery for these software interfaces.

Incidentally, interface description and discovery are great applications for RDF technology.

SOAP/RDF Contrasts Revisited

Now we've seen the capabilities of both SOAP and RDF, we'll compare them once again, and see how they complement each other.

At the serialization level, SOAP and RDF are practically identical. They both deal with XML-encoded data.
In fact, one useful application of SOAP would be to carry RDF descriptions around in between RDF databases, supporting the RDF aggregation process.
SOAP and RDF start to diverge when you look at their components outside of basic serialization:
- The use of URIs: RDF insists on URIs for all names. SOAP can use namespaces when needed. As a consequence, RDF has only one scope, the global one, whereas SOAP documents may live in different scopes, relying on the context of operation for the semantic interpretation of names.
- RPC: the ability to be used for RPC is a unique facet of SOAP. Although "SOAPists" now downplay this, there's no doubt that RPC-over-HTTP is a major attraction of the technology, and it is one with practical uses.
- Marketing: the two technologies are marketed very differently and aimed at different spheres, although they are not worlds apart either in philosophy or in terms of the people who worked on creating both technologies.

Their Place in the Future

In sum, SOAP provides a web-aware alternative to current object protocols like CORBA. It has a low cost of deployment and is supported by software right now. It still has issues to face in terms of interoperability, security, and description/discovery infrastructure.

RDF implements a computer-readable alternative to current web knowledge representation applications (i.e., HTML). It faces some immediate challenges in terms of intelligibility, and its immediate business uses are less than certain. In the long run, though, it presents the opportunity to transform the way the web is used.

Looking at the big picture, one can envisage SOAP and RDF operating in a complementary manner in the Web of the future. RDF-based technology can provide directory information to describe and locate SOAP services. SOAP could carry RDF graphs in between RDF aggregation services, or provide a "virtual graph" service from a provider like Amazon.com.

Both SOAP and RDF have a part to play in my dream of a totally integrated future. However, they also point to the need for some very significant work, only just getting started, on agreeing upon XML vocabularies and semantics. That is a hard problem, one which I expect will never be totally solved, and may cause us to develop the best "nearly-there" solutions we can, to continue getting the most out of the Web.