XML.com: XML From the Inside Out

Introducing RDFa

February 14, 2007

For a long time now, RDF has shown great promise as a flexible format for storing, aggregating, and using metadata. Maybe for too long—its most well-known syntax, RDF/XML, is messy enough to have scared many people away from RDF. The W3C is developing a new, simpler syntax called RDFa (originally called "RDF/a") that is easy enough to create and to use in applications that it may win back a lot of the people who were first scared off by the verbosity, striping, container complications, and other complexity issues that made RDF/XML look so ugly.

RDF/XML doesn't have to be ugly, but even simple RDF/XML doesn't fit well into XHTML, because browsers and other applications designed around HTML choke on it. So, while the general plan for RDFa is to make it something that can be embedded into any XML dialect, the main effort has gone into making it easy to embed into XHTML. This gives it an important potential role in the grand plan for the Semantic Web, in which web page data is readable not only by human eyes but also by automated processes that can aggregate data and associated metadata and then perform tasks much more sophisticated than what typical screen-scraping applications can do now. In fact, the relationship between RDFa metadata and existing web page content has been an important motivation behind most of the use cases driving RDFa's progress.

Plenty of software is already available to pull RDFa triples from XHTML documents and use them, which means that even though the specification isn't quite done, there's plenty to play with.

The "a" in "RDFa"

RDF often uses a subject, predicate, object combination called a triple to specify an attribute name/value pair about a particular resource. (That's "attribute" in the object-oriented sense, not the XML sense; for example, a triple could specify that the resource with ID http://example.com/artwork#fountain has an author value of "Richard Mutt.") To allow you to add metadata to a web page without affecting a browser's display of that page, RDFa uses some existing XHTML 1 attributes and a few new XHTML 2 attributes to store the subjects, predicates, and objects of these RDF triples. (The objects may also be existing PCDATA in your web pages, with subject and predicate attributes letting this text play a dual role of human-readable displayed content and machine-readable metadata.)

RDFa uses the existing XHTML 1 attributes href, content, rel, rev, and datatype, and it uses the new about, role and property attributes from XHTML 2's Metainformation Attributes module. While the following chart of their use covers only a subset of ways to store RDFa metadata in an XHTML file, it's enough to get you pretty far. The RDFa Primer and RDF/A Syntax W3C documents (and Part 2 of this article) describe more sophisticated ways to add RDFa metadata to your XHTML documents.

There are two basic cases: triples that have a literal string as their object and triples that have a URI as their object. (When possible, it's better to have a URI as an object, because it lets the same value serve as the object of some triples and the subject of others. This makes it easier to connect triples and find new information through inferencing.)
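To make the point about URI objects concrete, here is a minimal Python sketch of how a URI that is the object of one triple can be followed to triples in which it is the subject. The artwork URI and the name "Richard Mutt" come from the earlier example; the person URI, the title "Fountain", and the prefixed predicate names are invented purely for illustration.

```python
# Hypothetical triples for illustration only.
ARTWORK = "http://example.com/artwork#fountain"
PERSON = "http://example.com/people#mutt"  # invented URI, not from the article

triples = [
    (ARTWORK, "dc:creator", PERSON),        # URI as object
    (PERSON, "foaf:name", "Richard Mutt"),  # the same URI as a subject
    (ARTWORK, "dc:title", "Fountain"),      # literal as object: nothing to follow
]


def follow(subject, triples):
    """Return triples about `subject`, plus triples about any URI objects it points to."""
    subjects = {s for s, _, _ in triples}
    found = []
    for s, p, o in triples:
        if s == subject:
            found.append((s, p, o))
            if o in subjects:
                # The object is itself described elsewhere, so pull that in too.
                found.extend(t for t in triples if t[0] == o)
    return found


result = follow(ARTWORK, triples)
for t in result:
    print(t)
```

Had the creator been stored as the literal string "Richard Mutt", there would be no second hop to take, which is the practical cost of a literal object.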

                            subject   predicate   object
literal string as object    about     property    content attribute or PCDATA
URI as object               about     rel         href

The RDFa syntax document tells us that "it should be possible to represent a [triple] using only one XML element." Let's look at three examples:

    <span about="http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html"
      property="dc:title" content="Generating a Single Globally Unique ID"/>

    <span about="http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html"
      property="dc:title">Generating a Single Globally Unique ID</span>

    <span about="http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html"
      rel="dc:subject" href="http://www.snee.com/bobdc.blog/neat_tricks/"/>

These triples make the following statements (assuming that the dc prefix is assigned to the standard Dublin Core URI):

  1. The resource at http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html has a Dublin Core title value of "Generating a Single Globally Unique ID."

  2. The resource at http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html has a Dublin Core title value of "Generating a Single Globally Unique ID."

  3. The resource at http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html has a Dublin Core subject value of http://www.snee.com/bobdc.blog/neat_tricks/.

If the first two triples say the same thing, why would you prefer one over the other? Assuming that your document has its title in the document's content before you begin adding RDFa markup, adding the second span element above means adding a little less text to your document; you just wrap the existing title in the span start- and end-tags shown, a technique that fits in well with the Semantic Web vision of turning existing web content into machine-readable content. If your title was already part of your document, the content attribute value of the first triple would add redundant information to your document, and if your document's title changes, you would need to change it in two places. On the other hand, when adding information that is not already part of the content of your document (for example, workflow information or attribution rights about components of the document) the first span element above provides a good model.

The third span element above uses slightly different attributes to specify a triple that has a URI as an object value.
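As a rough illustration of how software might read these three elements, the following Python sketch uses the standard library's ElementTree to turn each span into a (subject, predicate, object) tuple, following the chart above. It is not a conforming RDFa processor: the wrapping div, the function names, and the prefix table are assumptions made for this example, with dc mapped to the Dublin Core elements namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: the dc prefix maps to the Dublin Core elements namespace.
PREFIXES = {"dc": "http://purl.org/dc/elements/1.1/"}

POST = "http://www.snee.com/bobdc.blog/2006/12/generating_a_single_globally_u.html"

# The three span elements from the text, wrapped in a div so they parse as one document.
XHTML = f"""<div>
  <span about="{POST}" property="dc:title" content="Generating a Single Globally Unique ID"/>
  <span about="{POST}" property="dc:title">Generating a Single Globally Unique ID</span>
  <span about="{POST}" rel="dc:subject" href="http://www.snee.com/bobdc.blog/neat_tricks/"/>
</div>"""


def expand(name):
    """Expand a prefixed name such as dc:title into a full URI."""
    prefix, local = name.split(":", 1)
    return PREFIXES[prefix] + local


def element_to_triple(el):
    """Return (subject, predicate, object) for one element, or None."""
    subject = el.get("about")
    if subject is None:
        return None
    if el.get("property") is not None:
        # Literal object: use the content attribute, else the element's text.
        obj = el.get("content")
        if obj is None:
            obj = (el.text or "").strip()
        return (subject, expand(el.get("property")), obj)
    if el.get("rel") is not None and el.get("href") is not None:
        # URI object: rel names the predicate, href holds the object.
        return (subject, expand(el.get("rel")), el.get("href"))
    return None


triples = [t for t in map(element_to_triple, ET.fromstring(XHTML)) if t]
for t in triples:
    print(t)
```

The first two tuples come out identical, which matches the observation that the first two span elements say the same thing.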

RDFa Elements

All three of the elements above are span elements. While these are popular for RDFa because you can insert them anywhere in the body of an HTML document, you can add the same RDFa attributes to any elements you like. link and meta elements are popular for inserting triples into the head of an HTML document. This is part of the beauty of RDFa—these elements have been used to add metadata to the head element for years (for example, to indicate the URL of a web page's CSS stylesheet), and now RDFa-aware software can pull useful metadata from them with only minor modifications to these elements. (Modifications are necessary because the triple pulled from an XHTML 1 link element that points to a CSS stylesheet would not be completely legal RDF, because a rel value of "stylesheet" is not a URI and therefore not a proper RDF predicate.)
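The point about rel="stylesheet" can be sketched in a few lines of Python. This is a simplified illustration, not the behavior of any particular RDFa tool: it treats a rel value as a usable predicate only when it carries a prefix from a known table. The sample head fragment, the style.css file name, and the prefix table are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed prefix mapping for this example.
PREFIXES = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical head fragment: one ordinary stylesheet link, one metadata link.
HEAD = """<head>
  <link rel="stylesheet" href="style.css"/>
  <link rel="dc:subject" href="http://www.snee.com/bobdc.blog/neat_tricks/"/>
</head>"""


def predicate_uri(rel):
    """Return the full URI for a prefixed rel value, or None if it has no known prefix."""
    prefix, sep, local = rel.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return None


usable = []
for link in ET.fromstring(HEAD):
    uri = predicate_uri(link.get("rel", ""))
    if uri is not None:
        # No about attribute here, so an empty string stands in for the containing document.
        usable.append(("", uri, link.get("href")))

print(usable)
```

Only the dc:subject link yields a triple; the stylesheet link is skipped because its rel value cannot be expanded into a URI.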

The a linking element is also popular for storing RDFa metadata, because it always expresses a relationship between one resource (the document where it's stored) and another (the resource it links to). The a element's rel attribute—which has actually been around as long as HTML itself, despite its lack of use before Google's nofollow trick came along—adds information about the relationship, and this information serves as the predicate of a triple stored in an a element.

More Triples, Fewer Subjects

If a document has 100 triples of metadata, the triples probably won't have 100 different subjects. The subject of many will probably be the document itself, as they specify its title, author, and perhaps workflow data about how the document got into its current state. Another group of triples might describe an image's photographer, date taken, and rights re-use information.

Building on existing XHTML syntax, RDFa lets you build multiple triples from the same subject without cluttering up your document too much. An RDFa processor that finds no about attribute assumes that the about attribute on the nearest ancestor element is the subject. (As we'll see in Part 2 of this article, the presence of an id attribute can provide an alternative to this behavior.) For example, the following stores three metadata statements about the resource at http://www.snee.com/img/myfile.jpg, because although the three span elements have no about attribute, their parent img does:

<img src="http://www.snee.com/img/myfile.jpg"
     about="http://www.snee.com/img/myfile.jpg">
  <span property="dc:subject" content="Niagara Falls"/>
  <span property="dc:creator" content="Richard Mutt"/>
  <span property="dc:format" content="image/jpeg"/>
</img>

If the RDFa processor searches through all the ancestors of the element with a metadata statement's predicate and object, and doesn't find an about attribute, then the subject is an empty string. According to the RDF/A Syntax specification, this "effectively indicates the current document."
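The subject-resolution rule described above can be sketched in Python as a recursive walk that passes the nearest about value down to descendants and falls back to an empty string. This is a minimal illustration under stated assumptions: it handles only property triples, ignores the id attribute alternative mentioned earlier, leaves predicates as unexpanded prefixed names, and the sample body and function name are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one title with no about ancestor, one image with its own about.
XHTML = """<body>
  <h1><span property="dc:title">My Story</span></h1>
  <img src="http://www.snee.com/img/myfile.jpg"
       about="http://www.snee.com/img/myfile.jpg">
    <span property="dc:creator" content="Richard Mutt"/>
  </img>
</body>"""


def walk(el, inherited=""):
    """Yield (subject, property, literal) tuples, inheriting subjects downward."""
    # An element's own about attribute overrides the inherited subject;
    # the empty string stands for the current document.
    subject = el.get("about", inherited)
    prop = el.get("property")
    if prop is not None:
        literal = el.get("content")
        if literal is None:
            literal = (el.text or "").strip()
        yield (subject, prop, literal)
    for child in el:
        yield from walk(child, subject)


triples = list(walk(ET.fromstring(XHTML)))
for t in triples:
    print(t)
```

Passing the subject down the tree gives the same result as searching upward through ancestors, and it avoids the need for parent pointers, which ElementTree does not provide.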

This is handy, because plenty of a document's metadata is typically about the document itself. For example, your document's main title could have this span element to indicate that its content is the Dublin Core title of the work (assuming that no ancestor of the sample h1 element has an about attribute):

  <h1><span property="dc:title">My Story</span></h1>

Metadata about the document with no displayable content can be stored in the head element of the document:

<html>
  <head>
    <meta property="dc:date" content="2007-03-15T10:35:42"/>

Now that we've seen a scattershot tour of what RDFa can do and how it does it, it would be easier to appreciate its potential uses if we step back and look at three categories of use cases:

  • Inline metadata about document components

  • Metadata about the containing document

  • Out-of-line metadata

