What Is RDF

July 26, 2006

Editor's Note: "What Is RDF" was originally written by Tim Bray in 1998 and updated by Dan Brickley in 2001. Recently it seemed like time for another update, particularly to relate RDF and the Semantic Web to the cutting edge of web development. We've republished the original in a new location and offer the following update. I'll leave to you, dear reader, the task of deciding how well Joshua Tauberer has accomplished the task of updating a classic. -- Kendall Grant Clark

Building the Semantic Web

On the Semantic Web (SemWeb), computers do the browsing (and searching, and querying, and...) for us. The SemWeb enables computers to seek out knowledge distributed throughout the Web, mesh it, and then take action based on it. Take an analogy: the current web is a decentralized platform for distributed presentations, while the SemWeb is a decentralized platform for distributed knowledge. Resource Description Framework (RDF) is the W3C standard for encoding knowledge.

There is, of course, knowledge on the current web, but it's off limits to computers. Consider a Wikipedia page: it might convey a lot of information to a human reader, but all the computer displaying the page sees is presentation markup. To the extent that computers make sense of HTML, images, Flash, etc., it's almost always for the purpose of creating a presentation for the end user. The real content, the knowledge the files are conveying to the human, is opaque to the computer.

What is meant by "semantic" in Semantic Web is not that computers are going to understand the meaning of anything, but that the logical pieces of meaning can be mechanically manipulated by a machine to useful human ends.

So, now imagine a new web where the real content can be manipulated by computers. For now, picture it as a web of databases. One "semantic" website publishes a database about a product line, with products and descriptions, while another publishes a database of product reviews. A third site for a retailer publishes a database of products in stock. What standards would make it easier to write an application to mesh distributed databases together, so that a computer could use the three data sources together to help an end user make better purchasing decisions?

There's nothing stopping anyone from writing a program now to do those sorts of things, in just the same way that nothing stopped anyone from exchanging data before we had XML. But standards facilitate building applications, especially in a decentralized system. Here are some of the things we would want a standard about distributed knowledge to consider:

1. Files on the Semantic Web need to be able to express information flexibly. Life can't be neatly packed into tables, as in relational databases, or into hierarchies, as in XML. The information about movies and TV shows shown below is really best expressed as a graph (see Figure 1):

Figure 1. Knowledge as a graph

Of course, we can't be drawing our way through the Semantic Web, so instead we will need a tabular notation for these graphs that looks a bit like this:

Start Node	Edge Label	End Node
vincent_donofrio	starred_in	law_&_order_ci
law_&_order_ci	is_a	tv_show
the_thirteenth_floor	similar_plot_as	the_matrix
...

Each row of the table specifies an edge from one node in the graph to another. More on this later.

2. Files on the Semantic Web need to be able to relate to each other. A file about product prices posted by a vendor and a file with product reviews posted independently by a consumer need to have a way of indicating that they are talking about the same products. Just using product names isn't enough. Two products might exist in the world both called "The Super Duper 3000," and we want to eliminate ambiguity from the SemWeb so that computers can process the information with certainty. The SemWeb needs globally unique identifiers that can be assigned in a decentralized way.

3. We will use vocabularies for making assertions about things, but these vocabularies must be able to be mixed together. A vocabulary about TV shows developed by TV aficionados and a vocabulary about movies independently developed by movie connoisseurs must be able to be used together in the same file, to talk about the same things (e.g., to assert that an actor has appeared in both TV shows and movies).

These are the requirements that RDF provides a standard for, as we'll see in the next section. Before getting too abstract, here are actual RDF examples of the information from the graph above, first in the Notation 3 format, which closely follows the tabular encoding of the underlying graph:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex: <http://www.example.org/> .

ex:vincent_donofrio ex:starred_in ex:law_and_order_ci .
ex:law_and_order_ci rdf:type ex:tv_show .
ex:the_thirteenth_floor ex:similar_plot_as ex:the_matrix .

And in the standard RDF/XML format, which may have a more intuitive feel but tends to obscure the underlying graph:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://www.example.org/">
    <rdf:Description rdf:about="http://www.example.org/vincent_donofrio">
        <ex:starred_in>
            <ex:tv_show rdf:about="http://www.example.org/law_and_order_ci" />
        </ex:starred_in>
    </rdf:Description>
    <rdf:Description rdf:about="http://www.example.org/the_thirteenth_floor">
        <ex:similar_plot_as rdf:resource="http://www.example.org/the_matrix" />
    </rdf:Description>
</rdf:RDF>

RDF was originally created in 1999 as a standard on top of XML for encoding metadata--literally, data about data. Metadata is, of course, things like who authored a web page, what date a blog entry was published, etc., information that is in some sense secondary to some other content already on the regular web. Since then, and perhaps especially after the updated RDF spec in 2004, the scope of RDF has really evolved into something greater. The most exciting uses of RDF aren't in encoding information about web resources, but information about and relations between things in the real world: people, places, concepts, etc.

Triples for Knowledge

RDF provides a general, flexible method to decompose any knowledge into small pieces, called triples, with some rules about the semantics (meaning) of those pieces.

The foundation is breaking knowledge down into a labeled, directed graph. Each edge in the graph represents a fact, or a relation between two things. The edge in the example from the node vincent_donofrio labeled starred_in to the node the_thirteenth_floor represents the fact that actor Vincent D'Onofrio starred in the movie "The Thirteenth Floor." A fact represented this way has three parts: a subject, a predicate (i.e., verb), and an object. The subject is what's at the start of the edge, the predicate is the type of edge (its label), and the object is what's at the end of the edge. (Technically RDF can express some things that a graph can't, but I won't get into that here.)

The six documents composing the RDF specification tell us two things. First, they outline the abstract model, i.e., how to use triples to represent knowledge about the world. Second, they describe how to encode those triples in XML.

Most of the abstract model of RDF comes down to four simple rules:

1. A fact is expressed as a Subject-Predicate-Object triple, also known as a statement. It's like a little English sentence.
2. Subjects, predicates, and objects are given as names for entities, also called resources (dating back to RDF's application to metadata for web resources) or nodes (from graph terminology). An entity represents something: a person, a website, or something more abstract, like a state or a relation.
3. Names are URIs, which are global in scope, always referring to the same entity in any RDF document in which they appear.
4. Objects can also be given as text values, called literal values, which may or may not be typed using XML Schema datatypes.

You've seen statements already. Each row in the triples table above, or in the example N3 file, was a fact. This satisfies our need for being able to represent knowledge as a graph.
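To make the graph-as-triples idea concrete, here is a minimal Python sketch. It is only an illustration: the node names are the plain labels from the example table (written as in the N3 file), not real URIs, and the helper functions `objects` and `nodes` are hypothetical names invented for this example.

```python
# Each fact is a (subject, predicate, object) tuple. The names are the
# illustrative labels from the example table, not real URIs.
triples = {
    ("vincent_donofrio", "starred_in", "law_and_order_ci"),
    ("law_and_order_ci", "is_a", "tv_show"),
    ("the_thirteenth_floor", "similar_plot_as", "the_matrix"),
}


def objects(graph, subject, predicate):
    """Return every object reached from `subject` along `predicate` edges."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def nodes(graph):
    """Return every node that appears at the start or end of an edge."""
    return sorted({s for s, _, _ in graph} | {o for _, _, o in graph})


print(objects(triples, "vincent_donofrio", "starred_in"))
print(nodes(triples))
```

Because the graph is just a set of tuples, adding a fact is adding a tuple, and no schema has to be declared up front.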

Entities are named by Uniform Resource Identifiers (URIs), and this provides the globally unique, distributed naming system we need for distributed knowledge. URIs can have the same syntax or format as website addresses (URLs), so you will see RDF files that contain URIs, such as http://www.w3.org/1999/02/22-rdf-syntax-ns#type. The fact that it looks like a web address is totally incidental. There may or may not be an actual website at that address, and it doesn't matter for RDF--it is just a very verbose identifier. (Although sometimes there is something useful at the address.) There are also other types of URIs besides http: URIs, such as URNs and TAGs, which you'll see below. URIs are used as global names because they provide a way to break down the space of all possible names into units that have obvious owners. URIs that start with http://www.rdfabout.com/ are implicitly controlled by me because I own and control the domain, "rdfabout.com."

Since URIs can be quite long, in RDF notations they're usually abbreviated using the concept of namespaces from XML.
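The abbreviation is purely mechanical: a prefix stands for a namespace URI, and the full name is the namespace with the local part appended. The Python sketch below illustrates this under simple assumptions. The `expand` and `abbreviate` helpers are hypothetical, and real RDF toolkits handle many more cases (relative URIs, escaping, default namespaces).

```python
# Prefix-to-namespace table, using the two namespaces from the earlier example.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://www.example.org/",
}


def expand(name, prefixes=PREFIXES):
    """Expand an abbreviated name such as 'rdf:type' into a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {name!r}")
    # Plain string concatenation: the namespace must end where the
    # local name begins (note the trailing '#' or '/').
    return prefixes[prefix] + local


def abbreviate(uri, prefixes=PREFIXES):
    """Abbreviate a full URI with the longest matching namespace, if any."""
    matches = [(ns, p) for p, ns in prefixes.items() if uri.startswith(ns)]
    if not matches:
        return f"<{uri}>"  # no known namespace: write the URI in brackets
    ns, prefix = max(matches, key=lambda m: len(m[0]))
    return f"{prefix}:{uri[len(ns):]}"


print(expand("rdf:type"))
print(abbreviate("http://www.example.org/the_matrix"))
```

Since expansion is simple concatenation, a namespace URI that lacks its trailing "/" or "#" silently produces a different, unintended name.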

Literal values, like "computer science," allow text to be included in RDF. This is used heavily when RDF is used for metadata--its original purpose. In fact, literal values are primarily what tie RDF to the real world, since URIs are just arbitrary strings.

These concepts form most of the abstract RDF model for encoding knowledge. It's analogous to the common API that most XML libraries provide. If it weren't for us curious humans always peeking into files, the actual format of XML wouldn't matter so much as long as we had our appendChild, setAttribute, etc. Of course, we do need a common file format for exchanging data, and in fact there are two for RDF, which we look at in the next section.

Serialization Syntaxes: XML and Notation 3

In the previous section we covered the abstract RDF model. Now we turn to how to actually write RDF in two formats. The W3C specifications define an XML format to encode RDF. Here's an example:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
    xmlns:edu="http://www.example.org/">
    <rdf:Description rdf:about="http://www.princeton.edu">
        <geo:lat>40.35</geo:lat>
        <geo:long>-74.66</geo:long>
        <edu:hasDept rdf:resource="http://www.cs.princeton.edu"
            dc:title="Department of Computer Science"/>
    </rdf:Description>
</rdf:RDF>

In an RDF/XML document there are two types of nodes: resource nodes and property nodes. Resource nodes are the subjects and objects of statements, and they usually have an rdf:about attribute on them giving the URI of the resource they represent. In this example, the rdf:Description node is the only resource node.

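Resource nodes contain (only) property nodes, which represent statements. There are three property nodes in this example, with the predicates geo:lat, geo:long, and edu:hasDept, and each yields a statement with the subject <http://www.princeton.edu>. The dc:title attribute on the edu:hasDept node adds a fourth statement, whose subject is the department it references, <http://www.cs.princeton.edu>.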

Property nodes, in turn, contain literal values, like "40.35" and "-74.66," or a reference to an object resource using the rdf:resource attribute, or they may contain a full resource node as their object.

The specification tells us how to turn the XML document above into this table of statements:

Subject                        Predicate    Object
-----------------------------  -----------  --------------------------------
<http://www.princeton.edu>     edu:hasDept  <http://www.cs.princeton.edu>
<http://www.princeton.edu>     geo:lat      "40.35"
<http://www.princeton.edu>     geo:long     "-74.66"
<http://www.cs.princeton.edu>  dc:title     "Department of Computer Science"

These triples are the bread and butter of RDF. When applications use RDF in XML format, they see the triples. Note that the hierarchical structure of the XML and the order of the nodes is lost in the table of triples, which means that, like whitespace, it was not a part of the information meant to be encoded in the RDF.
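As a rough illustration of how an application gets from the XML to the triples, here is a Python sketch using only the standard library. It is emphatically not a conforming RDF/XML parser. It assumes just the constructs used in the example above: resource nodes with rdf:about, property nodes holding either literal text or an rdf:resource reference, and extra attributes on a referencing property node. The function names `uri_of` and `triples_from` are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
    xmlns:edu="http://www.example.org/">
    <rdf:Description rdf:about="http://www.princeton.edu">
        <geo:lat>40.35</geo:lat>
        <geo:long>-74.66</geo:long>
        <edu:hasDept rdf:resource="http://www.cs.princeton.edu"
            dc:title="Department of Computer Science"/>
    </rdf:Description>
</rdf:RDF>"""


def uri_of(clark_name):
    """Turn ElementTree's '{namespace}local' name into a plain URI."""
    namespace, _, local = clark_name[1:].partition("}")
    return namespace + local


def triples_from(xml_text):
    """Extract triples from the small RDF/XML subset described above."""
    triples = []
    root = ET.fromstring(xml_text)
    for resource in root:  # resource nodes
        subject = resource.get(f"{{{RDF_NS}}}about")
        for prop in resource:  # property nodes
            predicate = uri_of(prop.tag)
            target = prop.get(f"{{{RDF_NS}}}resource")
            if target is None:
                # Literal object: the text content of the property node.
                triples.append((subject, predicate, ("literal", prop.text)))
                continue
            triples.append((subject, predicate, ("uri", target)))
            # Non-rdf attributes on a property node that has rdf:resource
            # describe the referenced resource, not the original subject.
            for attr, value in prop.attrib.items():
                if not attr.startswith(f"{{{RDF_NS}}}"):
                    triples.append((target, uri_of(attr), ("literal", value)))
    return triples


for triple in triples_from(DOC):
    print(triple)
```

The result is a flat list of triples with full URIs as names. The nesting of the XML survives only as the facts it encoded.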

Notation 3 (N3), along with its simpler subset Turtle, is another system for writing out RDF. Since it works under the same abstract model, the difference between it and RDF/XML is superficial--readability.

The same information in the RDF/XML file written in N3 looks like this:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix edu: <http://www.example.org/> .

<http://www.princeton.edu> geo:lat "40.35" ; geo:long "-74.66" .
<http://www.cs.princeton.edu> dc:title "Department of Computer Science" .
<http://www.princeton.edu> edu:hasDept <http://www.cs.princeton.edu> .

In N3 and Turtle, statements are just written out as the subject URI (in brackets or abbreviated with namespaces), followed by the predicate URI, followed by the object URI or literal value, followed by a period. But, to save on typing, multiple statements with the same subject can be grouped together by using a semicolon and omitting the subject a second time. The semicolon on the first line indicates <http://www.princeton.edu> is the subject of both the geo:lat and geo:long predicates.
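The grouping rule can be sketched as a tiny serializer. This Python example is a simplified illustration, not a real N3 writer: it assumes each term is already formatted as it should appear (a bracketed URI, an abbreviated name, or a quoted literal), it omits the @prefix declarations, and `to_n3` is a hypothetical helper name.

```python
from itertools import groupby


def to_n3(triples):
    """Write (subject, predicate, object) strings as N3/Turtle-style text.

    Terms are assumed to be pre-formatted (e.g. '<http://...>', 'geo:lat',
    or '"40.35"'); no escaping or prefix handling is attempted.
    """
    lines = []
    # Sorting puts statements with the same subject next to each other.
    for subject, group in groupby(sorted(triples), key=lambda t: t[0]):
        pairs = [f"{p} {o}" for _, p, o in group]
        # One subject, then its predicate-object pairs joined by semicolons.
        lines.append(f"{subject} " + " ; ".join(pairs) + " .")
    return "\n".join(lines)


statements = [
    ("<http://www.princeton.edu>", "geo:lat", '"40.35"'),
    ("<http://www.princeton.edu>", "geo:long", '"-74.66"'),
    ("<http://www.cs.princeton.edu>", "dc:title", '"Department of Computer Science"'),
    ("<http://www.princeton.edu>", "edu:hasDept", "<http://www.cs.princeton.edu>"),
]
print(to_n3(statements))
```

The output order differs from the hand-written file, which is harmless: the order of statements carries no meaning in RDF.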

Distributed Knowledge

One can use RDF to model any type of knowledge without having to use any centrally approved notions. If no one has coined a URI for something you want to describe, you can create your own URI for it. This goes for not just subjects and objects but predicates as well. The trouble is that if I make up all of my own URIs, my RDF document has no meaning to anyone else unless I explain what each URI is intended to denote or mean. On the flip side, two documents that have some URIs in common are talking about some of the same things--necessarily because URIs always refer to the same thing in any RDF document.

And this is where the emergent aspect of the Semantic Web appears. Without much planning, RDF documents created by different people for different purposes can come together. Let's take a concrete example. Say one person--call this person Developer A--encodes the geographic and population statistics from the United States Census in N3 RDF like this:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix usgovt: <http://www.example1.org/> .
@prefix census: <http://www.example2.org/> .

<tag:www.example.org,2005:us/ny>
    rdf:type usgovt:State ;
    dc:title "New York" ;
    census:population "18976457" ;
    census:landArea "122283145776 m^2" .
...
(repeated for other states)

When it comes time to write RDF, one faces a modeling issue. What entities and predicates will be used to represent the information to be encoded? It's a problem similar to program design: deciding what classes a program will need and what relations there will be among them. When you decide on the classes, each class has some purpose, or some meaning, such as "the Customer class represents a person." You don't tell the computer this. Rather, you document it for other programmers.

The same thing happens in RDF. To represent the relation between a state and its population, a modeling decision was made to use a predicate between an entity denoting the state and the literal value containing the numeric value. One doesn't tell the computer that census:population represents the relation between a state and its population--that's impossible of course. Instead, one tells other people that that's how census:population should be used. Then they use the URI accordingly in their RDF documents and applications.

Now, separately, Developer B might be publishing RDF/XML files for members of the U.S. Congress:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:usgov="http://www.example1.org/" xmlns:pol="http://www.example4.org/">
    <pol:Politician rdf:about="tag:govshare.info,2005:data/us/congress/people/S000148">
        <foaf:name>Charles Schumer</foaf:name>
        <usgov:party>Democrat</usgov:party>
    </pol:Politician>
    ...
    (repeated for other senators)
</rdf:RDF>

Charles Schumer is a New York senator. Developer B wants to indicate this in the RDF/XML file. If he's seen Developer A's N3 file and knows how Developer A intended tag:www.example.org,2005:us/ny to be used, he could use this:

    <pol:Politician rdf:about="tag:govshare.info,2005:data/us/congress/people/S000148">
        <usgov:represents rdf:resource="tag:www.example.org,2005:us/ny"/>
    </pol:Politician>

By reusing the URI that Developer A coined for New York, Developer B has created a bridge between their two RDF files. Because URIs are globally unique, anyone looking at the two files knows they are both referring to the very same New York (i.e., the state, not the city). And because RDF vocabularies can be mixed together, Developer C, a third party who wants to create a table of senators and the populations of the states they represent, can take the two RDF files and merge them, simply by concatenating the triples of each file:

Subject                    Predicate          Object
-------------------------  -----------------  -------------------------
<tag:govshare.../S000148>  foaf:name          "Charles Schumer"
<tag:govshare.../S000148>  usgov:party        "Democrat"
<tag:govshare.../S000148>  usgov:represents   <tag:www.example...us/ny>
...
<tag:www.example...us/ny>  dc:title           "New York"
<tag:www.example...us/ny>  census:population  "18976457"
...
(repeated for other senators)

Developer C can then write a program, or use an existing RDF query language, to trace paths through the graph starting with senators, through usgov:represents predicates or edges, then through census:population predicates to the corresponding population.
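Here is a hedged Python sketch of what Developer C's program might look like. To stay self-contained it represents each file as a set of tuples and keeps predicates as the abbreviated names used in the table. A real application would load the files with an RDF toolkit and compare full URIs. The function name `senators_with_population` is hypothetical.

```python
# Developer A's census statements (only the two needed here).
file_a = {
    ("tag:www.example.org,2005:us/ny", "dc:title", "New York"),
    ("tag:www.example.org,2005:us/ny", "census:population", "18976457"),
}

# Developer B's statements about one senator.
SCHUMER = "tag:govshare.info,2005:data/us/congress/people/S000148"
file_b = {
    (SCHUMER, "foaf:name", "Charles Schumer"),
    (SCHUMER, "usgov:party", "Democrat"),
    (SCHUMER, "usgov:represents", "tag:www.example.org,2005:us/ny"),
}

# Merging two RDF graphs is just taking the union of their triples.
merged = file_a | file_b


def senators_with_population(graph):
    """Follow senator -> usgov:represents -> state -> census:population."""
    rows = []
    for senator, predicate, state in graph:
        if predicate != "usgov:represents":
            continue
        names = [o for s, p, o in graph if s == senator and p == "foaf:name"]
        pops = [o for s, p, o in graph if s == state and p == "census:population"]
        for name in names:
            for pop in pops:
                rows.append((name, int(pop)))
    return sorted(rows)


print(senators_with_population(merged))
```

Neither input file alone can answer the question. The join works only because both files use the same URI for New York.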

Looking Forward

The simplicity and flexibility of the triple, in combination with the use of URIs for globally unique names, make RDF unique and very powerful. It's a specification that fills a very particular niche for decentralized, distributed knowledge and provides a framework to enable computer applications to answer questions we wouldn't dream of asking computers today.

This article barely scratches the surface of RDF. Here is a quick list of things to look at from here:

For more on the Semantic Web, see the W3C and its Semantic Web mailing list.
The W3C specifications for RDF form the definitive guide, but other sources are a better bet for learning RDF: this tutorial, my site rdf:about, and the book Practical RDF.
There are many toolkits available in a variety of programming languages for working with RDF. This page lists most of the toolkits.
RDFS and OWL are W3C vocabularies used to create schemas or ontologies. SchemaWeb maintains a list of other vocabularies.
SPARQL is the new query language for RDF. There is an article on SPARQL right here on xml.com.
Automated inferencing is one main line of interest for RDF. Cwm and Euler are two tools for inferencing over RDF, using the RDFS and OWL schemas and other general rules. Pellet is an OWL-DL reasoning engine.
Some existing RDF datasets are listed here and on rdfdata.org.