XML.com: XML From the Inside Out

RDF and Metadata

June 09, 1998

This article has now been updated to incorporate changes in the RDF spec and the growth of the RDF community. You can find a newer version here: "What is RDF?"

The Right Way to Find Things

RDF stands for Resource Description Framework. RDF is built for the Web, but let's leave Web-land behind for a few minutes and think about how we find things in the real world.

Scenario 1: The Library

You're in a library to find books on raising donkeys as pets. In most libraries these days you'd use the computer lookup system, basically an electronic version of the old card file. This system allows you to list books by author, title, call-number, and subject. The list includes the date, author, title, and lots of other useful information, including (most important of all) where each book is.

Scenario 2: The Video Store

You're in a video store and you want a movie by John Huston. A large modern video store offers a lookup facility that's similar to the library's. Of course, the things you can search on are different (director, actors, and so on) but the results are more or less the same.

Scenario 3: The Phone Book

You're working late at a customer's office in South Denver, and it seems that a pizza is essential if work is to continue. Fortunately, every office comes equipped with a set of Yellow Pages that, properly used, can lead to quick pizza delivery.

The Common Thread

What do all these situations have in common, and what differences lie behind the scenes? First of all, each of these systems is based on metadata, or information about information. In each case, you need a piece of information (the book's location, the video's name, the pizza joint's phone number), and in each case you use metadata to get it.

We're all used to this stuff; the usual setup is that metadata comes in named chunks (subject, director, business category) that associate lookup information ("donkeys", "John Huston", "Pizza, South Side") with the real info that you're after.

Here's a subtle but important point: in theory, metadata is not really necessary. In principle, you could go through the library one book at a time looking for donkey books; or through the video store shelves until you found your movie; or call all the numbers in your area code until you find pizza delivery. But that would be very wasteful -- in fact, downright stupid. Metadata is the way to go.
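The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration: the catalogue records, shelf codes, and function names below are invented for the example and do not come from any real library system.

```python
from collections import defaultdict

# Hypothetical catalogue: each record holds the item's full text, one
# named chunk of metadata ("subject"), and the thing we actually want
# (the shelf where the item lives). All values are made up.
catalogue = [
    {"shelf": "A3", "subject": "donkeys", "text": "Feeding and housing a pet donkey"},
    {"shelf": "C1", "subject": "pizza", "text": "Dough, sauce, and oven temperatures"},
    {"shelf": "B7", "subject": "donkeys", "text": "Training a young donkey to lead"},
]


def brute_force(records, word):
    """Read every item's text looking for a word (no metadata used)."""
    return [r["shelf"] for r in records if word in r["text"]]


def build_index(records, field):
    """Group shelf locations by the value of one metadata field."""
    index = defaultdict(list)
    for r in records:
        index[r[field]].append(r["shelf"])
    return index


by_subject = build_index(catalogue, "subject")

print(brute_force(catalogue, "donkey"))  # touches every record
print(by_subject["donkeys"])             # one dictionary lookup
```

Both calls find the same two shelves here, but the first one has to read everything, while the second goes straight to the answer through the subject field.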

It's All Different Behind the Scenes

In each of our scenarios, we used metadata, and used it in a remarkably similar way. Does this mean that the library, the video store, and the phone company all use the same metadata setup? Of course not -- to start with, every library has a choice among at least two systems for organizing its books, and among many vendors who will sell it software to do the looking-up. The same is true, obviously, for video stores and phone companies.

In fact, most such products define their own system of metadata and their own facilities for storing and managing it; they typically do not offer any facilities for sharing or interchanging it. This doesn't cause too much of a problem, assuming they do a decent job with the user interface. We are comfortable enough with the general process we call "looking things up" (really, searching via metadata) that we are able to adapt and use all these different systems.

Not Just For Searching

The most common day-to-day use of metadata is to help us find things. But there are lots of other uses going on behind the scenes. The library and video store both keep other metadata that you don't see, concerning how often the books and videos are used, how much they cost to buy, and where to go for a replacement; running a library or a video store would be unthinkable without metadata. Similarly, the phone company uses its metadata most obviously to print the Yellow Pages, but also for many internal management and administration tasks.

What About the Web?

The Web is a lot like a really REALLY big library, in that there are millions of things out there, and if you know the URL (in effect an electronic "call number") you can get them. Since the Web has books, movies, and pizza joints, the set of keys you might need to look things up by includes everything a library uses, plus everything the video store uses, plus everything the Yellow Pages use, and lots more.

The problem at the moment is that there is hardly any metadata on the Web. So how do we find things? Mostly, using dumb brute-force techniques. The dumb brute force is supplied by the Web robots of search engine sites like Altavista, Infoseek, and Excite. These sites do the equivalent of going through the library, reading every book, and allowing us to look things up based on the words in the text. It's not surprising that people complain about search results, or that the robots are always way behind the growth and change of the Web.

In fact, there is one metadata-based general purpose lookup facility: Yahoo!, which is the most visited Web site of all. Yahoo! doesn't use a robot. When you search through Yahoo!, you're searching through human-generated subject categories and site labels. Compared to the amount of metadata that a library maintains for its books, Yahoo! is pitiful; but its popularity is clear evidence of the power of (even limited) metadata.

Divine Metadata for the Web

People who have thought about these problems, including many of the world's librarians and webmasters, generally agree that the Web urgently needs metadata. What would it look like? If the Web had an all-powerful Grand Organizing Directorate (at www.GOD.org), they would think up a set of lookup fields such as Author, Title, Date, Subject, and so on. The Directorate, being, after all, GOD, would simply decree that all Web pages start using this divine Metadata, and that would be that. Of course there would be some details such as how the Web sites ought to package up and interchange the metadata, and we all know that the Devil is in the details, but GOD can lick the Devil any day.

In fact, there is no www.GOD.org. For this reason, there is no chance that everyone will agree to start using the same metadata facilities. If libraries, which have been in existence for thousands of years, can't agree on a single standard, there's not much chance that the Web will.

Does this mean that there is no chance for metadata? That everyone is going to have to build their own lookup keys and values and software, and that we're going to be stuck using dumb brute-force robots forever?

No -- because as we observed with our three search scenarios, metadata operations have an awful lot in common, even when the metadata is different. RDF is an effort to identify these common threads and provide a way for Web architects to use them to provide useful Web metadata without divine intervention.

Introducing RDF

Resource Description Framework, as its name implies, is a framework for describing and interchanging metadata. It is built on the following rules:

  1. A Resource is anything that can have a URI; this includes all the world's Web pages, as well as individual elements of an XML document. An example of a resource is a draft of the document you are now reading; its URL is http://www.textuality.com/RDF/Why.html
  2. A PropertyType is a Resource that has a name and can be used as a property, for example Author or Title. In many cases, all we really care about is the name; but a PropertyType needs to be a resource so that it can have its own properties.
  3. A Property is the combination of a Resource, a PropertyType, and a value. An example would be: "The Author of http://www.textuality.com/RDF/Why.html is Tim Bray." The Value can just be a string, for example "Tim Bray" in the previous example, or it can be another resource, for example "The Home-Page of http://www.textuality.com/RDF/Why.html is http://www.textuality.com."
  4. There is a straightforward method for expressing these abstract Properties in XML, for example:
<RDF:Description href='http://www.textuality.com/RDF/Why.html'>
<Author>Tim Bray</Author>
<Home-Page RDF:href='http://www.textuality.com' />
</RDF:Description>
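
The three-part structure in rule 3 maps naturally onto a small record type. What follows is a minimal Python sketch, not anything taken from the RDF specification: the names Resource, Property, and to_xml are invented here, and the output only imitates the style of the XML fragment above (with double-quoted attributes).

```python
from dataclasses import dataclass
from typing import Sequence, Union
from xml.sax.saxutils import escape, quoteattr


@dataclass(frozen=True)
class Resource:
    """Anything that can have a URI (rule 1)."""
    uri: str


@dataclass(frozen=True)
class Property:
    """A (Resource, PropertyType, value) triple (rule 3).

    For brevity the PropertyType is held only as a name; in RDF it is
    itself a Resource (rule 2).
    """
    resource: Resource
    property_type: str
    value: Union[str, Resource]  # a plain string or another Resource


def to_xml(resource: Resource, properties: Sequence[Property]) -> str:
    """Render the properties of one resource in the style of rule 4."""
    lines = [f"<RDF:Description href={quoteattr(resource.uri)}>"]
    for p in properties:
        if p.resource != resource:
            continue  # this property describes some other resource
        if isinstance(p.value, Resource):
            # Resource-valued property: point at the other resource.
            lines.append(f"<{p.property_type} RDF:href={quoteattr(p.value.uri)} />")
        else:
            # String-valued property: the value is element content.
            lines.append(f"<{p.property_type}>{escape(p.value)}</{p.property_type}>")
    lines.append("</RDF:Description>")
    return "\n".join(lines)


page = Resource("http://www.textuality.com/RDF/Why.html")
props = [
    Property(page, "Author", "Tim Bray"),
    Property(page, "Home-Page", Resource("http://www.textuality.com")),
]
print(to_xml(page, props))
```

The point of the sketch is that the model (a list of triples) exists independently of the XML; the XML is just one way of writing it down for interchange.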

RDF is carefully designed to have the following characteristics:

Independence
Since a PropertyType is a resource, any independent organization (or even person) can invent them. I can invent one called Author, and you can invent one called Director (which would only apply to resources that are associated with movies), and someone else can invent one called Restaurant-Category. This is necessary since we don't have www.GOD.org to take care of it for us.
Interchange
Since RDF Properties can be converted into XML, they are easy for us to interchange. This would probably be necessary even if we did have www.GOD.org.
Scalability
RDF properties are simple three-part records (Resource, PropertyType, Value), so they are easy to handle and look things up by, even in large numbers. The Web is already big and getting bigger, and we are probably going to have (literally) billions of these floating around (millions even for a big Intranet), so this is important.
PropertyTypes are Resources
This means that they can have their own properties and can be found and manipulated like any other Resource. This is important because there are going to be lots of them; too many to look at one by one. For example, I might want to know if anyone out there has defined a PropertyType that describes the genre of a movie, with values like Comedy, Horror, Romance, and Thriller. I'll need metadata to help with that.
Values Can Be Resources
For example, most Web pages will have a property named Home-Page which points at the home page of their site. So the values of properties, which obviously have to include things like title and author's name, also have to include Resources.
Properties Can Be Resources
So they can have properties too. Since there's no www.GOD.org to provide useful assertions for all the resources, and since the Web is way too big for us to provide our own, we're going to need to do lookups based on other people's metadata (as we do today with Yahoo!). This means that we'll want, given any Property such as "The Subject of this Page is Donkeys", to be able to ask "Who said so? And When?" One useful way to do this would be with metadata; so Properties will need to have Properties.
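
Two of these characteristics, Scalability and "Properties Can Be Resources", can be sketched together in Python. This is one possible arrangement under stated assumptions, not how any RDF software is known to work: the TripleStore class, the "#statement-N" identifiers, the Said-By and Said-On property names, and the example.org URLs are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    resource: str
    property_type: str
    value: str


class TripleStore:
    """Holds three-part records and indexes them for lookup."""

    def __init__(self):
        self.triples = []
        self.by_resource = defaultdict(list)    # resource -> its triples
        self.by_type_value = defaultdict(list)  # (type, value) -> resources

    def add(self, resource, property_type, value):
        """Store a triple and return an identifier for the statement itself.

        The identifier (a made-up "#statement-N" string) lets the stored
        property be used as the resource of further properties.
        """
        statement_id = f"#statement-{len(self.triples)}"
        triple = Triple(resource, property_type, value)
        self.triples.append(triple)
        self.by_resource[resource].append(triple)
        self.by_type_value[(property_type, value)].append(resource)
        return statement_id

    def find(self, property_type, value):
        """Resources for which property_type has the given value."""
        return list(self.by_type_value.get((property_type, value), []))

    def describe(self, resource):
        """All properties of one resource, as a name -> value mapping."""
        return {t.property_type: t.value for t in self.by_resource.get(resource, [])}


store = TripleStore()
claim = store.add("http://example.org/donkeys.html", "Subject", "Donkeys")

# "Who said so? And when?" are answered by properties of the property.
store.add(claim, "Said-By", "http://example.org/librarian")
store.add(claim, "Said-On", "1998-06-09")

print(store.find("Subject", "Donkeys"))
print(store.describe(claim))
```

Because every record has the same three-part shape, the same two indexes serve ordinary properties and properties about properties alike.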

Why Not Just Use XML?

XML allows you to invent tags, and for the tags to contain both text data and other tags. Also, XML has a built-in distinction between element types, for example the IMG element type in HTML, and elements, for example an individual <IMG SRC='Madonna.jpg'>; this corresponds naturally to the distinction between PropertyTypes and Properties. So it seems as though XML documents should be a natural vehicle for exchanging general purpose metadata.

XML, however, falls apart on the Scalability design goal. There are two problems:

  1. The order in which elements appear in an XML document is significant and often very meaningful. This seems highly unnatural in the metadata world. Who cares whether a movie's Director or Title is listed first, as long as both are available for lookups? Furthermore, maintaining the correct order of millions of data items is expensive and difficult, in practice.
  2. XML allows constructions like this:
<Description>The value of this property contains some
text, mixed up with child properties such as its temperature
(<Temp>48</Temp>) and longitude 
(<Longt>101</Longt>). [&Disclaimer;]</Description>
    When you represent general XML documents in computer memory, you get weird data structures that mix trees, graphs, and character strings. In general, these are hard to handle in even moderate amounts, let alone by the billion.
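
Both problems are easy to see with an ordinary XML parser. The following Python sketch uses the standard library's xml.etree.ElementTree; the Movie documents are invented for the example, and the Description fragment is a shortened version of the one above with the entity reference left out, since an undeclared entity would not parse.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents with the same content in a different order.
doc_a = "<Movie><Director>John Huston</Director><Title>The Maltese Falcon</Title></Movie>"
doc_b = "<Movie><Title>The Maltese Falcon</Title><Director>John Huston</Director></Movie>"


def child_sequence(xml_text):
    """Children in document order, as an XML processor sees them."""
    return [(child.tag, child.text) for child in ET.fromstring(xml_text)]


def as_properties(xml_text):
    """The same children treated as an unordered set of (type, value) pairs."""
    return set(child_sequence(xml_text))


print(child_sequence(doc_a) == child_sequence(doc_b))  # False: order is significant
print(as_properties(doc_a) == as_properties(doc_b))    # True: as metadata they agree

# Mixed content: the text is split between the element's own text and the
# "tail" that follows each child, so no single field holds the whole value.
mixed = ET.fromstring(
    "<Description>temperature (<Temp>48</Temp>) and longitude "
    "(<Longt>101</Longt>).</Description>"
)
pieces = [mixed.text] + [(c.tag, c.text, c.tail) for c in mixed]
print(pieces)
```

The final list interleaves loose strings with child records, which is a small-scale version of the mixed structure described in the second problem.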

On the other hand, something like XML is an absolutely necessary part of the solution to RDF's Interchange design goal. XML is unequalled as an exchange format on the Web; but by itself, it doesn't provide what you need in a metadata framework.

The Devil is in the Details

The four general rules given above define the central ideas of RDF. It turns out that it takes quite a lot of abstract terminology and XML syntax to define them precisely enough that people can write computer programs to process them. In particular, turning Properties into Resources is quite tricky. Also, it turns out that in a (very) few cases, you do need to order your properties, and this requires quite a bit of syntax.

This article is not going to try to explain all these details; there are a variety of excellent resources to be found at http://www.w3.org/RDF that are designed to do just that.

Vocabularies

RDF, as we've seen, provides a model for metadata, and a syntax so that independent parties can exchange it and use it. What it doesn't provide, though, is any PropertyTypes of its own. That is to say, RDF doesn't define Author or Title or Director or Business-Category. That would be a job for www.GOD.org, if there were one. Since there isn't, it's a job for everyone.

A single PropertyType standing by itself is unlikely to be very useful. It is expected that these will come in packages: for example, a set of basic bibliographic PropertyTypes like Author, Title, Date, and so on, then a more elaborate set from OCLC, and a competing one from the Library of Congress. These packages are called Vocabularies; it's easy to imagine PropertyType vocabularies describing books, videos, pizza joints, fine wines, mutual funds, and many other species of Web wildlife.
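
One way to picture a vocabulary is as a named package of PropertyTypes published at some address. The Python sketch below is an assumption-laden illustration: the Vocabulary class, the example.org URIs, and the convention of joining a vocabulary's URI and a PropertyType name with "#" are all invented here and are not claimed to be what RDF prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vocabulary:
    """A named package of PropertyTypes published by one party."""
    uri: str
    property_types: frozenset

    def qualify(self, name):
        """Full identifier for one PropertyType in this vocabulary."""
        if name not in self.property_types:
            raise KeyError(f"{name!r} is not defined by {self.uri}")
        return f"{self.uri}#{name}"


# Both URIs are placeholders, not real vocabulary addresses.
biblio = Vocabulary("http://example.org/vocab/biblio",
                    frozenset({"Author", "Title", "Date"}))
video = Vocabulary("http://example.org/vocab/video",
                   frozenset({"Director", "Title"}))

# The same short name stays distinct because each vocabulary qualifies it.
print(biblio.qualify("Title"))
print(video.qualify("Title"))
print(biblio.qualify("Title") == video.qualify("Title"))  # False
```

Since each PropertyType ends up with its own identifier, two independently invented vocabularies can both define Title without either party having to coordinate with the other.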

What RDF Might Mean

The Web is too big for any one person to stay on top of. In fact, it contains information about a huge number of subjects, and for most of those subjects (such as fine wines, home improvement, and cancer therapy), the Web has too much information for any one person to stay on top of and also have a real job.

This means that opinions, pointers, indexes, and anything that helps people "look things up" are going to be commodities of very high value. That is to say, vocabularies. Nobody thinks that everyone will use the same vocabulary (nor should they), but with RDF we can have a marketplace in vocabularies. Anyone can invent them, advertise them, and sell them. The good (or best-marketed) ones will survive and prosper. Probably, most niches of information will come to be dominated by a small number of vocabularies, the way that library catalogues are today.

And even among people who are sharing the use of metadata vocabularies, there's no need to share the same software. RDF makes it possible to use multiple different pieces of software to process the same metadata, and to use a single piece of software to process (at least in part) many different metadata vocabularies.

With any luck, this should make the Web more like a library, or a video store, or a phone book, than it is today.
