XML.com: XML From the Inside Out


Make Your XML RDF-Friendly

October 30, 2002

Suppose you're designing an XML application or maybe just writing a DTD or schema. You've followed various best practices about element and attribute names, when to use elements versus attributes, and other design issues, because you want your XML to be useful in the widest variety of situations.

As interest in RDF and the development of RDF applications grow, there's an increasing payoff in keeping RDF concerns in mind along with the other best practices as you design document types. Your documents store information, and small tweaks to their structure can allow an RDF processor to see that information as subject-predicate-object triples, which it can make good use of. (For an introduction to RDF, see Tim Bray's article "What is RDF?") Making your documents more "RDF-friendly" -- that is, more easily digestible by RDF applications -- broadens the range of applications that can use your documents, thereby increasing their value.

Many RDF/XML documents look as if they were designed purely for RDF applications, but that's not always the case. The verbosity of RDF/XML, which often intimidates RDF beginners, is a by-product of the flexibility that makes RDF easy to incorporate into your existing XML. By observing eight guidelines when designing a DTD or schema, you can use this flexibility to help your documents work with RDF applications as well as non-RDF applications. Some of the guidelines are easy, while some involve making choices based on trade-offs. But knowing what the issues are gives you a better perspective on the best ways to model your data.

1. Make sure that every element comes from a specific namespace.

This doesn't mean that all your elements need a namespace prefix. For convenience, many documents declare the most frequently used namespace as the default one so that elements from that namespace need no prefix. For example, the article, body, title, and para elements in the following belong to the http://www.snee.com/ns/dummy namespace because the article element's first xmlns attribute declares that as the default namespace. None of those elements need a namespace prefix, and an RDF processor will have no problem with them. (The RDF namespace, http://www.w3.org/1999/02/22-rdf-syntax-ns#, must obviously be declared if an RDF parser is going to find the RDF elements and know what each is for.)

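<article xmlns="http://www.snee.com/ns/dummy"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         rdf:ID="a1003">
  <body>
    <rdf:Description rdf:about="#a1003">
      <dc:creator>Herman Melville</dc:creator>
    </rdf:Description>
    <title>Moby Dick</title>
    <para>Call me Ishmael.</para>
    <para>Just <emph>don't</emph> call me late for supper.</para>
  </body>
</article>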
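
To see how a namespace-aware parser treats unprefixed elements under a default namespace declaration, you can try a minimal sketch like the following. It uses Python's standard xml.etree.ElementTree module on a shortened copy of the article document; the helper name element_namespaces is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the article document, with the default namespace declared.
DOC = """\
<article xmlns="http://www.snee.com/ns/dummy"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         rdf:ID="a1003">
  <body>
    <rdf:Description rdf:about="#a1003">
      <dc:creator>Herman Melville</dc:creator>
    </rdf:Description>
    <title>Moby Dick</title>
    <para>Call me Ishmael.</para>
  </body>
</article>
"""

def element_namespaces(xml_text):
    """Map each element's local name to the namespace URI it belongs to."""
    result = {}
    for elem in ET.fromstring(xml_text).iter():
        # ElementTree reports qualified names in "{namespace-uri}local" form;
        # an element in no namespace has no leading brace.
        if elem.tag.startswith("{"):
            uri, local = elem.tag[1:].split("}", 1)
        else:
            uri, local = None, elem.tag
        result[local] = uri
    return result

namespaces = element_namespaces(DOC)
for local, uri in namespaces.items():
    print(f"{local}: {uri}")
```

Every element comes back with a namespace URI, prefixed or not, which is all this guideline asks for.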

2. Use rdf:ID attributes instead of ID attributes.

When you want an RDF processor to know a property of something in a document -- for example, that the article element in the example above has a dc:creator value of "Herman Melville" -- you need a way to identify the subject that has the property. XML DTDs let you declare that a particular attribute is used as an ID value, but RDF doesn't care about DTDs. The only way to be sure that an RDF processor can find the thing you're referring to is to give it a unique value in an rdf:ID attribute.

You're certainly not limited to using the rdf:ID value in RDF applications. A unique ID value is a unique ID value, and useful in all kinds of applications. In fact, if you declare this attribute in a DTD as having a type of ID, you'll get the benefit of both RDF applications and XML 1.0 applications treating rdf:ID as an ID value that is unique within each document.
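
It may help to see why rdf:ID="a1003" on one element and rdf:about="#a1003" on another point at the same resource: both are resolved against the document's base URI. The following is a rough sketch using Python's standard urllib.parse module; the base URI and the two function names are assumptions made for this illustration.

```python
from urllib.parse import urljoin

# Hypothetical location of the document; identifiers are resolved against it.
BASE_URI = "http://www.example.org/articles/moby.xml"

def uri_from_rdf_id(base, rdf_id):
    """An rdf:ID value names the resource at base + '#' + id."""
    return urljoin(base, "#" + rdf_id)

def uri_from_rdf_about(base, about):
    """An rdf:about value is a URI reference resolved against the base."""
    return urljoin(base, about)

subject_by_id = uri_from_rdf_id(BASE_URI, "a1003")
subject_by_about = uri_from_rdf_about(BASE_URI, "#a1003")

print(subject_by_id)
print(subject_by_id == subject_by_about)
```

Because the two forms resolve to one URI, a statement written with rdf:about="#a1003" describes the element that carries rdf:ID="a1003".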

3. When describing a resource that has an existing URI, put the URI in an rdf:about attribute.

While rdf:ID identifies a resource in your document, which you can then describe with an RDF statement, rdf:about lets you create an RDF statement about anything that can be referenced with a URI, whether it's in your document or not. The name of the element with the rdf:about attribute identifies the type of the subject. For example, the following tells us this fact "about" Bridget Fonda: that her father is Peter Fonda. The rdf:about attribute's presence in an Entertainer element tells us that Bridget Fonda is a resource of the type "Entertainer."

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

  <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Bridget">
      <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Peter"/>
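
The way an element name supplies the subject's type can be sketched in a few lines of Python with the standard xml.etree.ElementTree module. The two example.org namespaces and the fam:father property name below are placeholders assumed for illustration, not the vocabularies of the example above, and the helper handles only the nested pattern shown here rather than RDF/XML in general.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The two example.org namespaces are placeholders invented for this sketch.
DOC = f"""\
<rdf:RDF xmlns:rdf="{RDF}"
         xmlns="http://example.org/people#"
         xmlns:fam="http://example.org/family#">
  <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Bridget">
    <fam:father>
      <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Peter"/>
    </fam:father>
  </Entertainer>
</rdf:RDF>
"""

def to_uri(tag):
    """Turn ElementTree's '{ns}local' name into the concatenated URI."""
    ns, local = tag[1:].split("}", 1)
    return ns + local

def node_triples(node):
    """Yield triples for a typed node element identified by rdf:about."""
    subject = node.get(f"{{{RDF}}}about")
    # The element name supplies the subject's type.
    yield (subject, RDF + "type", to_uri(node.tag))
    for prop in node:
        for child in prop:
            # A nested node element is the object of the enclosing property.
            yield (subject, to_uri(prop.tag), child.get(f"{{{RDF}}}about"))
            yield from node_triples(child)

triples = [t for node in ET.fromstring(DOC) for t in node_triples(node)]
for t in triples:
    print(t)
```

The sketch produces one type statement for each Entertainer element plus the statement linking daughter to father.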


4. When referencing something by its URI, put the URI in an rdf:resource attribute in an empty element.

In our first example, the creator of the article Moby Dick -- or, more correctly, the creator of the work identified as "#a1003" -- is named with the string "Herman Melville." If the statement instead identified the author with a URI in an rdf:resource attribute, the RDF assertion about who created resource a1003 would have more value, because it could then link to other RDF statements that use the same URI.

For example, no RDF statement that tells you that Herman Melville was born in New York City would refer to the author using the string "Herman Melville," because an RDF statement's subject must be a URI. Instead, it might say that the subject http://www.online-literature.com/melville has the property bornIn with a value of "New York City." An inference engine could combine that assertion with the following revision of the RDF statement from the first example above and tell you that the creator of a1003 was born in New York City.

<rdf:Description rdf:about="#a1003">
  <dc:creator rdf:resource="http://www.online-literature.com/melville"/>
</rdf:Description>

While this element with the rdf:resource attribute isn't absolutely required to be empty, any content that it has must follow certain rules, so it's simplest to make it an empty element whose rdf:resource attribute names a URI as the value of the property named by the element -- in this case, dc:creator.
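
The kind of join an inference engine performs here can be sketched in plain Python, with triples held as tuples in a set. The bornIn property URI, the article's base URI, and the function name creator_birthplaces are assumptions invented for this sketch; a real engine would do far more.

```python
MELVILLE = "http://www.online-literature.com/melville"
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
# Hypothetical property URI and document location, invented for this sketch.
BORN_IN = "http://example.org/terms#bornIn"
ARTICLE = "http://example.org/articles/moby.xml#a1003"

article_triples = {(ARTICLE, DC_CREATOR, MELVILLE)}
biography_triples = {(MELVILLE, BORN_IN, "New York City")}

def creator_birthplaces(triples):
    """Pair each work with its creator's birthplace by joining on the creator URI."""
    results = []
    for work, pred, creator in triples:
        if pred != DC_CREATOR:
            continue
        for subj, pred2, place in triples:
            if subj == creator and pred2 == BORN_IN:
                results.append((work, place))
    return results

# With a URI as the creator value, the two sets of statements link up.
merged = article_triples | biography_triples
print(creator_birthplaces(merged))

# With a plain string as the creator value, there is nothing to join on.
literal_version = {(ARTICLE, DC_CREATOR, "Herman Melville")} | biography_triples
print(creator_birthplaces(literal_version))
```

The second call finds nothing, because the string "Herman Melville" never appears as the subject of another statement.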

5. If existing ontologies cover any of your element names, use those instead of making up your own URIs.

Most of the power of RDF comes from the network effect of combining RDF triples that reference the same resources. If one set of triples says something about a particular resource and another set says more about the same resource, they can be combined, making the combined collection more valuable than either set alone. For example, guideline 4 above described two RDF statements that could be linked this way; one used the URI http://www.online-literature.com/melville to represent Herman Melville as the creator of article a1003, and the other used the same URI to show where the author was born.

To be honest, http://www.online-literature.com/melville was just the result of some brief web searching. The odds that two different people creating RDF about Melville will both use this URI are pretty small. It's not really an ontology name, but just a URL for a brief biography of Melville at a literary dot-com.

But what is an ontology? In software development, as distinct from its meaning in philosophy, it generally means a set of terms with defined relationships. There are plenty of real ontologies out there, but in a pinch, you can use a recognized URL for a well-known web page that identifies your resource -- as we saw above, any URI is better than a simple string.

The better known an ontology is, the more likely others are to use it, and the more useful your RDF statements will be when combined with theirs. For example, the Dublin Core ontology used for the dc:creator and dc:date elements in the "Moby Dick" example is one of the most widely used ontologies.
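
One way to see the payoff of a shared vocabulary is to compare the property URIs that different element names expand to. In this small Python sketch, the second document's dublin prefix and the third document's home-made namespace are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Two documents using Dublin Core under different prefixes, and one
# using a home-made vocabulary (a placeholder namespace for this sketch).
DOC_A = ('<dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">'
         'Herman Melville</dc:creator>')
DOC_B = ('<dublin:creator xmlns:dublin="http://purl.org/dc/elements/1.1/">'
         'Herman Melville</dublin:creator>')
DOC_C = ('<author xmlns="http://example.org/my-terms#">'
         'Herman Melville</author>')

def property_uri(xml_text):
    """Expand the root element's qualified name to namespace URI + local name."""
    ns, local = ET.fromstring(xml_text).tag[1:].split("}", 1)
    return ns + local

for doc in (DOC_A, DOC_B, DOC_C):
    print(property_uri(doc))
```

The first two documents name the same property despite their different prefixes, so statements from both can be combined; the third stands alone until someone maps its vocabulary to the others.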

The DAML Ontology Library is a good place to start looking for ontologies. It's where I found the GEDCOM and CYC ontologies used in the example about the Fondas. The people who created the Internet Movie Database never considered their work to be an ontology, but because it lets you refer to specific actors with URIs, it passes the first test for use in RDF statements.
