XML.com: XML From the Inside Out

Building the Semantic Web

March 07, 2001

This article is adapted from the closing keynote I delivered at Knowledge Technologies 2001.

Introduction

The people working under the broad umbrella of the Semantic Web come from many diverse communities, from the Web-focused to experienced researchers in the fields of artificial intelligence and knowledge representation. Ultimately the skills of all those involved will be required, and it's definitely beyond the scope of any one group to provide the expertise necessary to build the ultimate Semantic Web.

For me, the key thing about the Semantic Web is the word "Web". It's our essential starting point, and the Web at large is the ecology in which the primordial Semantic Web must grow. I spend most of my time working with the Web, as a developer and a writer, and I'm also involved with the community of developers and publishers who use the Web.

So, as I approach the Semantic Web (or "SW" from here on), I'm always asking the question "how do we get this started?" There are many interesting and exciting possibilities in the realms of logic and proofs, but getting them running on the Web must be preceded by getting more basic machine processible content out there. The evolving form of the SW has to crawl before it can run.

In this article I introduce the SW vision and explore the practical steps that we need to be taking to build it.

What is the Semantic Web?

The essential aim of the SW vision is to make Web information practically processible by a computer. Underlying this is the goal of making the Web more effective for its users. This increase in effectiveness comes from automating or enabling things that are currently difficult to do: locating content, collating and cross-relating content, and drawing conclusions from information found in two or more separate sources.

In the software world we can often get so enthusiastic about the systems that we're creating that we stray from a focus on the user's requirements. One of the great things about the Web is that it's unforgiving when we ignore the user. Create a site that's hard to use and nobody will come. Create a technology for page markup that's difficult to grasp and nobody will use it. In fact, you might see the creation and implementation of the SW as a near impossible task: it's still difficult to get people to use as little metadata as the <title> tag in their web pages.

Clearly, to get off the starting blocks, the SW has to offer enough in reward to make it worth people's time to learn new skills and to more carefully deploy their content on the Web.

So, that's the vision. A Web that machines can understand to make our lives easier. If you accept that the end purpose of the SW is to make your life easier, then the use cases spring from your frustrations. Some of the common problems we want to solve on the Web revolve around interoperability of data. Synchronize your Palm Pilot's schedule with a web page, have some kind of universal view over your email, documents, and web browsing history. These problems are currently unsolved because of the fragmentation of our data due to custom and proprietary data formats. Providing an integration of these is an obvious use case.

As well as meeting some obvious use cases, there's a degree of serendipity in the SW work. There's a feeling that says, "if only we got all these sources of information tied together, then exciting things would happen!" Building the SW is a research and development project, not a manufacturing process. There'll be some dead ends, and there'll be some discoveries of exciting and unforeseen proportions.

Speaking personally, I have a fundamental excitement at being able to recover and integrate my data from disparate sources and proprietary formats. This springs from constraints on my time, the difficulty of finding information, and the redundancy of having my data scattered across multiple devices. In what follows I give an explanation of each layer in Tim Berners-Lee's vision of the SW: each layer gives progressively more value; each is exciting in its own right. My current aims for the SW result purely from the implementation of some of the lower layers.

Overview of the Semantic Web

The World Wide Web Consortium has recently started a specific Activity to address SW development. Under the leadership of Eric Miller, its remit is threefold: to develop and address issues with RDF and RDF Schema; to coordinate with other W3C groups using RDF; and to undertake and encourage "advanced development" of SW software.

This last aim is the thing I find most exciting. "Advanced development" entails the W3C working with developers in an open fashion to encourage SW-related projects and to give them a focus. Early projects that might cluster around this mandate include some work inside the W3C, such as RDF wrappers for CVS repositories, and potentially some existing community-based projects could have a home there. Essentially, "advanced development" is a recognition of what has happened to the RDF world in the last year. While RDF languished for a while at the W3C in terms of formal activity, a community has grown up, with some very encouraging results.

The W3C has put forward a very clear architecture for the SW, described by Berners-Lee at XML 2000 in Washington last year. This architecture is cleanly layered, starting with the foundation of URIs and Unicode. On top of that sits syntactic interoperability in the form of XML, which in turn underlies what I like to think of as the data interoperability layer, RDF and RDF schemas. Those layers sum up most of the SW that's presently available in implementation form. And without looking further up the SW stack, an extraordinary amount of utility can and has been obtained from just those layers.

You'll notice that digital signatures run right up the side of the stack, emphasizing their widespread utility. At each stage they allow content from a layer to be labeled with an assured provenance. Digital signatures are critical to both the SW and the growing use of XML in other message exchanges. From the basic act of signing some RDF assertion ("I said this!") to signing proofs, they add a level of assurance to the Web that hasn't existed thus far.
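
As a rough illustration of the "I said this!" idea, the sketch below attaches a keyed digest to a single assertion and checks it later. It is only a stand-in: it uses an HMAC with a shared secret from Python's standard library rather than a true public-key digital signature, and the triple, the key, and the function names are all invented for the example.

```python
import hashlib
import hmac

# Hypothetical assertion, written as a (subject, predicate, object) triple.
ASSERTION = (
    "http://example.org/articles/semantic-web",
    "http://purl.org/dc/elements/1.1/creator",
    "A. N. Author",
)


def serialize(triple):
    """Turn a triple into bytes in a fixed, repeatable form."""
    return "\n".join(triple).encode("utf-8")


def sign(triple, key):
    """Return a hex digest binding the triple to the holder of `key`."""
    return hmac.new(key, serialize(triple), hashlib.sha256).hexdigest()


def verify(triple, key, signature):
    """Check that the triple is unchanged since it was signed."""
    return hmac.compare_digest(sign(triple, key), signature)


key = b"demo-secret"  # assumption: a shared secret, for illustration only
signature = sign(ASSERTION, key)
tampered = (ASSERTION[0], ASSERTION[1], "Someone Else")

print(verify(ASSERTION, key, signature))  # True
print(verify(tampered, key, signature))   # False
```

A real deployment would sign with a private key so that anyone holding the public key could check provenance. The shared secret here only keeps the sketch within the standard library.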

On top of RDF lie ontologies, which allow the further description of objects and their interrelations, past the basic class-property descriptions enabled by RDF Schema. The W3C in conjunction with DARPA and the European Union is pursuing the development of languages in this area right now. Ontologies provide the ability to say "my world is like this" and are the foundation that will enable programs to reason about different worlds and environments and make connections between them.

The logic layer will provide an interoperable language for describing the sets of deductions one can make from a collection of data -- how, given the world we've now neatly described, we can make connections and derive new facts about it. The proof language will provide a way of describing the steps taken to reach a conclusion from the facts. These proofs can then be passed around and verified, providing short cuts to new facts in the system without having each node conduct the deductions themselves.
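
To make "deriving new facts" concrete, here is a toy sketch of the idea, not the logic language itself. It applies a single hand-written rule (an instance of a class is also an instance of that class's superclass) repeatedly until nothing new appears. The `rdf:type` and `rdfs:subClassOf` URIs are the standard ones. The `example.org` vocabulary and the `infer` function are assumptions made up for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/terms#"  # hypothetical vocabulary

# Facts as (subject, predicate, object) triples.
facts = {
    (EX + "rex", RDF_TYPE, EX + "Dog"),
    (EX + "Dog", RDFS_SUBCLASS, EX + "Mammal"),
    (EX + "Mammal", RDFS_SUBCLASS, EX + "Animal"),
}


def infer(triples):
    """Apply one rule until no new facts appear (forward chaining)."""
    known = set(triples)
    while True:
        new = {
            (s, RDF_TYPE, parent)
            for (s, p, cls) in known if p == RDF_TYPE
            for (child, q, parent) in known
            if q == RDFS_SUBCLASS and child == cls
        }
        if new <= known:  # nothing left to add
            return known
        known |= new


derived = infer(facts) - facts
for triple in sorted(derived):
    print(triple)
```

Running it yields two facts that were never stated directly: that `rex` is a `Mammal` and that `rex` is an `Animal`. A proof, in this picture, would be the recorded chain of rule applications that led to each one.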

The SW vision is that once all these layers are in place, we will have a system in which we can place trust that the data we are seeing, the deductions we are making, and the claims we are receiving have some value. That's the goal: to make a user's life easier by the aggregation and creation of new, trusted information over the Web.

Goals for Building the SW

Now that we've seen the plan, let's look at how it's going to be built. Obviously, the technology needs to be invented. But technology without adoption is dead. What do SW advocates need to do to reach the critical points along the road to adoption?

Eric Miller, SW Activity Lead, certainly has his job cut out. While there are encouraging signs of a groundswell in support for RDF, it mostly has a bad name and reputation at the moment. Take this along with the confusion that XML namespaces, an underlying layer, generates (and never mind that many US programs can't even work with European Latin character sets, much less Unicode) and there are some steep slopes to climb.

So one of the first aims of SW advocates must be to promote understanding of what they're doing, at both low and high levels. RDF is more than an obscure or verbose way to write what you could do easily in XML. There are reasons for using it. Naming everything with URIs is in fact very powerful, but the confusion about the use of the http: prefix for unretrievable resources needs to be cleared up.

But it would be a mistake to focus on getting all developers (much less users) to understand fundamentally every layer of this stack. The fact is that most developers use prepared modules to do their construction work; only a few are extreme enough to bake their own bricks. An aid and impetus to getting understanding is to get implementation. It's very reasonable for people to ask, "what does this do for me?" about a new technology. Implementations can speak louder than a thousand specifications.

Implementations fall into two categories: (1) deployment of SW technologies in a vocabulary or framework and (2) software tools. The growth in basic RDF tools over the last year has been very pleasing. These tools are starting to reach the level of maturity at which I would consider basing an application on one or two of them. Likewise the deployment of RDF in vocabularies like PRISM and RSS is encouraging and has reaped particular benefits that straight XML serializations often miss.

We should be careful not to restrict SW technologies to just those explicit layers in Berners-Lee's idealized diagram. There's obviously a difference between what is on the Web, and what is in the diagram (HTML is not mentioned, for instance). The beauty of XML is that it's in the perfect place to act as a bridge. HTML (or more properly XHTML) can be semantically decorated by means of things like the class attribute, and XSLT can be used to extract RDF. Likewise, there are other semantic applications, such as Topic Maps, that are pure XML applications. Are these to be excluded from the SW? No, XML provides a bridge.
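
A minimal sketch of that bridge follows, with Python's standard XML parser standing in for the XSLT step. The mapping from class values to Dublin Core properties is an invented convention, and the page URI and author name are placeholders. Only the Dublin Core namespace is real.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical convention: a class value maps to a Dublin Core property.
CLASS_TO_PROPERTY = {"title": DC + "title", "creator": DC + "creator"}

PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1 class="title">Building the Semantic Web</h1>
    <p>By <span class="creator">A. N. Author</span></p>
    <p class="intro">Unmapped classes are ignored.</p>
  </body>
</html>"""


def extract_triples(xhtml, page_uri):
    """Return (subject, predicate, object) triples found via class attributes."""
    root = ET.fromstring(xhtml)
    triples = []
    for element in root.iter():
        for name in element.get("class", "").split():
            if name in CLASS_TO_PROPERTY:
                text = "".join(element.itertext()).strip()
                triples.append((page_uri, CLASS_TO_PROPERTY[name], text))
    return triples


for triple in extract_triples(PAGE, "http://example.org/page"):
    print(triple)
```

The page stays ordinary XHTML for browsers, while a processor that knows the convention can lift two statements about it out of the markup.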

Picture RDF as providing an interoperable data bus for the SW. Some data sources may need a converter to connect, but it doesn't stop them connecting. And once they're patched in, there's a lot of potential in the resulting integration.
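
The data bus picture can be sketched in a few lines. Below, two made-up sources (an address book in CSV and a calendar held as a list of records) each get a small converter that emits triples. Once both are on the "bus", a single query can span them. Every name, URI, and record is hypothetical, and the set of tuples is just the simplest possible stand-in for an RDF store.

```python
import csv
import io

EX = "http://example.org/terms#"  # hypothetical vocabulary

ADDRESS_BOOK_CSV = "email,name\nalice@example.org,Alice Example\n"
CALENDAR = [
    {"who": "alice@example.org", "what": "Project review", "when": "2001-03-07"},
]


def from_address_book(text):
    """Converter 1: CSV rows become triples about a person URI."""
    for row in csv.DictReader(io.StringIO(text)):
        yield ("mailto:" + row["email"], EX + "name", row["name"])


def from_calendar(entries):
    """Converter 2: calendar records become triples about an event URI."""
    for number, entry in enumerate(entries):
        event = "http://example.org/events/%d" % number
        yield (event, EX + "attendee", "mailto:" + entry["who"])
        yield (event, EX + "summary", entry["what"])
        yield (event, EX + "date", entry["when"])


# The "bus": one pool of triples, whatever their origin.
bus = set(from_address_book(ADDRESS_BOOK_CSV)) | set(from_calendar(CALENDAR))


def meetings_with(name):
    """Join across both sources: name -> person URI -> events -> summaries."""
    people = [s for (s, p, o) in bus if p == EX + "name" and o == name]
    events = [s for (s, p, o) in bus if p == EX + "attendee" and o in people]
    return sorted(
        o for (s, p, o) in bus if s in events and p == EX + "summary"
    )


print(meetings_with("Alice Example"))  # ['Project review']
```

The join works only because both converters name the person with the same URI, which is the practical payoff of naming things with URIs.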

So the W3C has to promote understanding and implementation among the community. What about money? Surely you can't reach critical mass without there being money in it?

Yes, there has to be commercial value somewhere down the line; business is after all about providing services to users. But we ought to be wary of the effect of premature intense commercial interest. On the one hand, look at the W3C's greatest successes: the Web itself was built while nobody in particular was looking. XML 1.0 developed similarly: "fast, low and under the radar," as Tim Bray likes to say. On the other hand, the effect of large-scale corporate interest on XML Schema has been significant, leaving the end result late and obviously overcomplicated, a product of design by committee.

Recipes for Success

What does getting the SW right entail? There's a lot we can learn from the existing Web itself, which has been outrageously successful. As the SW is to be built on top of the Web, many of its characteristics are there as a base and should be continued. The Web provides the ecology in which the SW must thrive, and which it must not destroy. So what are these characteristics?

  • Simple protocols, concepts and syntax: the easier the component parts of the SW are to learn, the quicker they will spread in adoption. Of course there is a tension here, but on the Web widespread adoption is something that can be set against complexity. There is ultimately more power in a simple technology universally adopted than a more powerful one with patchy or little adoption.
  • Low barrier to access: the SW should be something which normal users have easy access to, in the same way that it's very easy to read the Web, and relatively easy to set up and publish a web page. We run into tool-dependencies here, but that's not a blocker, as most non-HTML savvy folk use an authoring tool to publish. The point is that SW technology must become commoditized.
  • Tangible utility: this may seem obvious, but the Web actually does something people want. There's a danger with the SW, as with any technology, that its developers get carried away with ideas that end up being clever but hardly useful. The use cases for the SW must begin at home and describe practical problems.

So is this a private W3C party? Judging by the way the new SW Activity is set up, the W3C has recognized it's not and wishes open community involvement in the effort. The importance of this community should not be underestimated. Over the last year there have been at least two community-driven efforts already building the SW that've caught my attention. Their use cases in each instance described practical problems that the developers had to solve to help in their work.

RDDL, covered recently in XML.com, allows developers to place a machine-readable description document at the end of a namespace URI to allow processors to discover resources related to a namespace. RSS 1.0 is a web content metadata distribution format. Its extensibility allows it to be used in many situations far beyond its original use cases. Both these projects fill in a little bit of the picture for the SW and represent chunks of what is to come. In the context of success for the SW, they're notable because they solved direct needs and extensibly allow reuse and expansion into areas that the designers didn't foresee -- a direct reflection of the development of the Web itself.
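
To give a feel for RSS 1.0 and its extensibility, the sketch below reads a tiny hand-written feed with Python's standard XML parser. The RDF, RSS 1.0, and Dublin Core namespace URIs are the real ones. The feed contents, the `example.org` addresses, and the `read_items` helper are made up for illustration. Note that this treats the feed as plain XML and does not attempt full RDF parsing.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A made-up feed; dc:creator comes from an extension vocabulary.
FEED = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel rdf:about="http://example.org/news.rss">
    <title>Example News</title>
    <link>http://example.org/</link>
    <description>A made-up channel.</description>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://example.org/story1"/>
      </rdf:Seq>
    </items>
  </channel>
  <item rdf:about="http://example.org/story1">
    <title>First story</title>
    <link>http://example.org/story1</link>
    <dc:creator>A. N. Author</dc:creator>
  </item>
</rdf:RDF>"""


def read_items(feed_xml):
    """Return one dict per item, including the extension field if present."""
    root = ET.fromstring(feed_xml)
    about = "{%s}about" % NS["rdf"]
    items = []
    for item in root.findall("rss:item", NS):
        items.append({
            "uri": item.get(about),
            "title": item.findtext("rss:title", namespaces=NS),
            "creator": item.findtext("dc:creator", namespaces=NS),
        })
    return items


print(read_items(FEED))
```

The `dc:creator` element belongs to a separate namespace, so a reader that knows nothing about Dublin Core can skip it and still process the item. That is the kind of extension the format allows without changes to the core.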

Hitting That Fabled 80/20 Point

To conclude, it's important that the builders of the SW keep their feet on the ground. The next generation of the Web will be built cooperatively and in a distributed manner. Rather than pondering grand unification theories, we should concentrate on doing small things well and solving achievable and well-defined problems. Good and open implementation in addition to good design is key. Furthermore, the longer development can stay "fast, low and under the radar," the better.

The SW represents an enormous opportunity not just to solve our problems with information management, but also to solve them in an interoperable environment, so we can all share solutions and enjoy the network effect. But the goal should always be to make the Web more effective for the user, and it is by that standard that the SW will be judged.