Introducing RDFa, Part Two

April 4, 2007

In part 1 of this article, we saw that RDFa, a new syntax for representing RDF triples, can be embedded into arbitrary XML documents more easily than RDF/XML. RDFa is particularly good for embedding these triples into XHTML 2, which has a few new attributes that make it easier to use RDFa. Part 1 of this article showed several roles that RDFa metadata can play, describing metadata about the containing document and metadata about individual elements within the document. We also saw how RDFa can represent triples that use existing web page content as their subject and triples that specify new objects, which are useful for adding workflow metadata about a document or for specifying normalized values such as "2007-04-23" as metadata associated with a date displayed on a web page as "April 23, 2007". This article shows how to use RDFa to express additional, richer metadata, and we'll explore some ideas to automate the generation of RDFa markup.

Data Typing

One classic bit of metadata to add to a piece of data is an indication of that data's type. RDF lets you use datatypes from XML Schema Part 2, a spec that offers choices for most of the typical types you'll find in a programming language or database package. To add a datatype to the kind of RDFa markup that we saw in part 1 of this article, you simply add a datatype attribute.

For example, let's say you want to identify the types of the values in the following HTML table:

Shipment ID	Date	Amount	Anodized
x432	2007-04-23	34	Yes
x921	2007-04-25	41	No
x0731	2007-04-28	17	No

Because each row of the table is about a particular shipment of widgets, the first step when adding RDFa triples that describe the shipments is the addition of an about attribute to each row to name the subject of the triples for that shipment. A span element around each value in the row can include a property attribute to show what property that value indicates for that row's shipment, as shown in the source below.

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:fb="http://www.foobarco.com/ns/vocab#"
      xmlns:fbi="http://www.foobarco.com/ns/ID#"
      xmlns:xs="http://www.w3.org/2001/XMLSchema#">

<!-- head element, start of body and table...  -->

      <tr about="[fbi:x432]" >
        <td><span property="fb:shipmentID">x432</span></td>
        <td><span property="fb:date"
                  datatype="xs:date">2007-04-23</span></td>
        <td><span property="fb:amount"
                  datatype="xs:integer">34</span></td>
        <td><span property="fb:anodized"
                  datatype="xs:boolean" content="true">yes</span></td>
      </tr>

<!-- remaining rows of table... -->

Nearly all of this should be familiar from part 1 of this article. The only new bit of syntax in the RDFa-enhanced HTML of the table is the datatype attribute, which identifies the type of the value inside the span element. Now, an RDFa extraction routine can get these values and pass them along to an application that can do more with typed data than it can with a collection of strings. Also note that, to build on the use of the content attribute described in part 1, each row's last td element includes one of these attributes, with a value of "true" or "false" instead of "yes" or "no", which are not valid Boolean values. This way, the RDFa extractor will see proper Boolean values for the triples describing whether each widget shipment is anodized.
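
As a rough sketch of what an application might do with typed values like these, the following Python reads the row shown above and converts each value according to its datatype attribute. This is not a real RDFa extractor: the CONVERTERS table and the typed_values function are names made up for this illustration, and the code assumes well-formed XHTML and handles only the three datatypes used in the example.

```python
import xml.etree.ElementTree as ET
from datetime import date

XHTML = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:fb="http://www.foobarco.com/ns/vocab#"
      xmlns:fbi="http://www.foobarco.com/ns/ID#"
      xmlns:xs="http://www.w3.org/2001/XMLSchema#">
<body><table>
  <tr about="[fbi:x432]">
    <td><span property="fb:shipmentID">x432</span></td>
    <td><span property="fb:date" datatype="xs:date">2007-04-23</span></td>
    <td><span property="fb:amount" datatype="xs:integer">34</span></td>
    <td><span property="fb:anodized" datatype="xs:boolean" content="true">yes</span></td>
  </tr>
</table></body></html>"""

XH = "{http://www.w3.org/1999/xhtml}"

# Hypothetical mapping from datatype CURIEs to Python converters.
CONVERTERS = {
    "xs:date": date.fromisoformat,
    "xs:integer": int,
    "xs:boolean": lambda s: s in ("true", "1"),
}

def typed_values(xhtml):
    """Return {subject: {property: converted value}} for each row with an about attribute."""
    root = ET.fromstring(xhtml)
    result = {}
    for row in root.iter(XH + "tr"):
        subject = row.get("about")
        if subject is None:
            continue
        props = {}
        for span in row.iter(XH + "span"):
            prop = span.get("property")
            if prop is None:
                continue
            # The content attribute, when present, overrides the displayed text.
            lexical = span.get("content", span.text or "")
            # Values with no datatype attribute stay plain strings.
            convert = CONVERTERS.get(span.get("datatype"), str)
            props[prop] = convert(lexical)
        result[subject] = props
    return result

shipments = typed_values(XHTML)
```

With this in place, shipments["[fbi:x432]"]["fb:amount"] is a Python integer and the fb:anodized value is a real Boolean taken from the content attribute, not the string "yes".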

With all the RDFa markup added, it may look verbose, but this kind of tabular representation of data is usually automatically generated from backend relational databases anyway. It wouldn't be much trouble to have the HTML generation routines add these extra attributes, thereby making the data more valuable to other applications. Below, we'll find out about some other applications that, because they generate HTML from templates, are excellent platforms for generating lots of useful RDFa markup with minimal trouble.

The rev Attribute

In part 1 of this article, we saw that RDFa can use the a element's venerable but little-used rel attribute to indicate a resource's relationship to another resource—or, in RDF terms, to serve as the predicate of a triple, with the about attribute naming the subject and the href attribute naming the object. The a element's even less-used rev attribute expresses the opposite: a triple in which the href attribute names the subject and the about attribute of the element (or of the nearest ancestor with one) names the object.

Both rel and rev can go in the same element to describe two different relationships, such as the following one showing that the Supreme Court case Brown versus Board of Education overturned Plessy versus Ferguson:

<span about="http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=US&amp;vol=347&amp;invol=483"
      rel="fb:overturns" rev="fb:overturnedBy"
      href="http://caselaw.lp.findlaw.com/cgi-bin/getcase.pl?court=us&amp;vol=163&amp;invol=537"/>

(In a system using OWL, you wouldn't really have these rel and rev attributes in the same element. You'd just have one, and a separate rule declaring that fb:overturns and fb:overturnedBy are inverse properties. This way, either could be inferred from the other.) If an ontology doesn't include the relationship that you want to specify but does offer its inverse, the difference between rel and rev gives you flexibility that can be especially valuable if you're representing a relationship between a resource you can edit and one you can't. For example, if your ontology has fb:overturnedBy but not fb:overturns, you could add the following metadata to the document at http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=US&vol=347&invol=483:

<span rev="fb:overturnedBy"
      href="http://caselaw.lp.findlaw.com/cgi-bin/getcase.pl?court=us&amp;vol=163&amp;invol=537"/>

The lack of an about attribute (assuming that no ancestor element has one either) indicates that the document itself is the object of the triple: Supreme Court case 347 US 483 overturned 163 US 537.
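
To make the direction of each triple concrete, here is a minimal Python sketch of the rule described above: rel yields a triple from the about value to the href value, and rev yields one in the opposite direction. The function name link_triples is my own invention for this illustration, and the caller is assumed to have already worked out the about value (the nearest ancestor's about, or the document's own URI).

```python
def link_triples(about, href, rel=None, rev=None):
    """Build (subject, predicate, object) triples for one linking element."""
    triples = []
    if rel:
        # rel: the about value is the subject, the href value is the object.
        triples.append((about, rel, href))
    if rev:
        # rev: the href value is the subject, the about value is the object.
        triples.append((href, rev, about))
    return triples

BROWN = "http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=US&vol=347&invol=483"
PLESSY = "http://caselaw.lp.findlaw.com/cgi-bin/getcase.pl?court=us&vol=163&invol=537"

# Both attributes on one element, as in the first example.
both = link_triples(BROWN, PLESSY, rel="fb:overturns", rev="fb:overturnedBy")

# Only rev, with the containing document (Brown) as the implied about value.
rev_only = link_triples(BROWN, PLESSY, rev="fb:overturnedBy")
```

In rev_only, the single triple has the Plessy URL as its subject and the Brown URL as its object, which is how a document that you can edit can make a statement whose subject is a document that you can't.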

CURIEs

We use URIs to qualify names in order to make their context absolutely clear: for example, to show that one use of the word "title" comes from the Dublin Core namespace, and therefore refers to a published work, while another might come from a real estate document namespace and therefore refer to a deed to property. Writing out full URIs with each name (for example, http://purl.org/dc/elements/1.1/title) can make things pretty verbose, so a namespace declaration such as xmlns:dc="http://purl.org/dc/elements/1.1/" lets us use a prefix that will stand in for the URI that identifies a name's namespace. This lets us use shorter versions of our names while still being clear where they came from. We call a name such as dc:title a qualified name, or qname.

Qnames used in attribute values can lead to problems, because not all processing programs know that they should compare the prefix with the namespace declarations to see which namespace the name really comes from. It has worked out fine for XSLT use, because XSLT processors all know that qnames represent elements in source documents, but this has led to a problem in RDF use, because RDF uses URIs to identify the namespace of values as well as namespaces of elements and attributes.

For example, if I can set the standards for URI patterns at FooBar Company, and I want to represent employee number 4942 as http://www.foobarco.com/ns/empID#4942, there's no problem so far. If I say xmlns:fb="http://www.foobarco.com/ns/empID#", there's still no problem, but there is a problem if I represent the employee as "fb:4942", because it doesn't conform to the qname spec. Qnames were designed around XML names, or the names that you're allowed to make up for elements and attributes, and the part after the colon must begin with a letter or an underscore. So, to keep the use of namespace prefixes instead of full URIs legal with the existing specs, we can't use them with values that begin with a numeric digit.

Lots of important values begin with numeric digits. Besides employee IDs and other ID numbers, the CURIE Working Draft points out that International Press Telecommunications Council metadata often begins with a digit. To address this, we now have a new URI abbreviation syntax known as the Compact URI, or CURIE, syntax. CURIEs are pretty much like qnames with looser rules for what comes after the colon: you can use any character that can be in a URI. (One handy corollary of this is that qnames are valid CURIEs.) Just about the only bit of new syntax to learn for using CURIEs is the square brackets that go around a CURIE value when used where URIs are also allowed, such as in an about or href attribute:

<tr about="[fbi:x432]" >
  <td><span property="fb:shipmentID">x432</span></td>
  <td><span property="fb:date"
            datatype="xs:date">2007-04-23</span></td>
  <td><span property="fb:amount"
            datatype="xs:integer">34</span></td>
  <td><span property="fb:anodized"
            datatype="xs:boolean" content="true">yes</span></td>
</tr>

The square brackets seem to be a nod to the syntax used to represent links in wikis. The example above would work exactly the same without the square brackets. But now, if you ever see square brackets, you'll know why they're there.
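
Expanding a CURIE into a full URI amounts to looking up the prefix and appending whatever follows the colon. The Python below is a minimal sketch of that idea, not an implementation of the CURIE Working Draft: expand_curie and the PREFIXES table are illustrative names, the emp prefix is a hypothetical stand-in for the employee ID namespace discussed above, and the function simply strips the square brackets when they are present.

```python
# Prefix bindings, as they would come from xmlns declarations.
PREFIXES = {
    "fb": "http://www.foobarco.com/ns/vocab#",
    "fbi": "http://www.foobarco.com/ns/ID#",
    "emp": "http://www.foobarco.com/ns/empID#",  # hypothetical prefix for this sketch
}

def expand_curie(value, prefixes):
    """Turn 'prefix:reference' or '[prefix:reference]' into a full URI string."""
    if value.startswith("[") and value.endswith("]"):
        value = value[1:-1]
    prefix, colon, reference = value.partition(":")
    if not colon or prefix not in prefixes:
        raise ValueError("no declared prefix in %r" % value)
    # Unlike a qname, the reference may begin with a digit.
    return prefixes[prefix] + reference

shipment_uri = expand_curie("[fbi:x432]", PREFIXES)
employee_uri = expand_curie("emp:4942", PREFIXES)
```

The second call is the case that qnames can't handle: 4942 is not a legal XML name, but as the reference part of a CURIE it expands without complaint to http://www.foobarco.com/ns/empID#4942.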

Reification (Sort of)

Reification is the assignment of metadata to metadata. This sounds pretty abstract, but if you consider that metadata is data to track, just like any other, it's easier to see the value of reification. For example, if a document has an RDF triple saying, "this document was created by Richard Mutt," another triple saying that the triple about the document's creator was created on 2007-04-19 would be metadata about that metadata.

RDFa's designers had reification on the original list of RDF features that RDFa would eventually be able to represent, but they're having second thoughts, and the latest version of the RDFa Primer no longer mentions it. The plan for RDFa was always to make it a subset of RDF, and reification may not make the cut. (XML came to exist via a similar cutting out of potentially complex and confusing features, as its designers were creating a subset of SGML.) Still, I couldn't resist demonstrating a reification-like technique with RDFa that can be useful in web or other hypertext applications.

An HTML a linking element describes a relationship between the document containing the a element and the resource that it points to. If you're really interested in tracking metadata about your hypertext links, you can add an about attribute to the a element and add empty span element children, as shown here, to store metadata about the linking element.

<p>Mr. Breakfast has a nice
  <a about="link23"
     href="http://www.mrbreakfast.com/article.asp?articleid=17">
<span property="fb:addedBy" content="BD"/>
<span property="fb:lastChecked" content="2007-03-15"/>
scrambled eggs recipe</a>.</p>

This is not really reification because it's not metadata about metadata. In this case, it's metadata about a specific HTML element: the a element with an about value of "link23", which happens to link to another resource. It's still useful, and may whet your appetite for proper reification as a feature of more full-featured RDF syntaxes.

Showing Some Class

In addition to specifying properties and values of a resource, RDFa can identify the resource as an individual of a particular class. When you have an ontology of information about a set of classes, you have additional information about individuals of those classes, so knowing an individual's class membership lets you do more with it. For example, if you know that a resource is a widgetShipment, ontology information about this class may have relevant storage and safety information.

This is a nice example of RDFa building on an obvious bit of HTML syntax to add some RDF power: you simply use the class attribute, which has been around since HTML 2.0.

For example, the class attribute in the following example tells us that the fbi:x432 resource is an individual of the fb:widgetShipment class:

<tr about="[fbi:x432]" class="fb:widgetShipment">
  <td><span property="fb:shipmentID">x432</span></td>
  <td><span property="fb:date"
            datatype="xs:date">2007-04-23</span></td>
  <td><span property="fb:amount"
            datatype="xs:integer">34</span></td>
</tr>

Extracting the triples and converting them to RDF/XML would result in something like this:

<fb:widgetShipment rdf:about="http://www.foobarco.com/ns/ID#x432">
  <fb:anodized rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</fb:anodized>
  <fb:amount rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">34</fb:amount>
  <fb:date rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2007-04-23</fb:date>
</fb:widgetShipment>

(Because this is a newer aspect of RDFa, no RDFa extractors support it as of this writing, but I'm looking forward to it being supported in the future.)
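
In the meantime, the idea is easy enough to sketch by hand. The Python below treats each prefixed class value on an element with an about attribute as an rdf:type triple. It is an illustration under my own assumptions, not the behavior of any existing extractor: class_triples is a made-up name, and skipping unprefixed class values (ordinary CSS class names) is a guess at how such values might be treated.

```python
import xml.etree.ElementTree as ET

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

ROW = """<tr about="[fbi:x432]" class="fb:widgetShipment">
  <td><span property="fb:shipmentID">x432</span></td>
</tr>"""

def class_triples(fragment):
    """Return (subject, rdf:type, class) triples, with subject and class left as CURIEs."""
    root = ET.fromstring(fragment)
    triples = []
    for element in root.iter():
        about = element.get("about")
        classes = element.get("class")
        if about and classes:
            for name in classes.split():
                # Assumption: only prefixed values name RDF classes;
                # plain CSS class names are ignored.
                if ":" in name:
                    triples.append((about.strip("[]"), RDF_TYPE, name))
    return triples

types = class_triples(ROW)
```

For the row above, this produces one triple saying that fbi:x432 has an rdf:type of fb:widgetShipment, which is what the typed node element in the RDF/XML expresses.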

Auto-Generation of RDFa Metadata

All of my examples so far have been hand-coded, but when you consider the huge infrastructure of HTML-generating systems, it's not difficult to find opportunities for automatically generating large amounts of useful, machine-readable RDF triples inside of web pages. Templating languages typically give you a way to add HTML (or, if you prefer, XHTML) markup around the templating language's codes that indicates which values to plug in from another data source.

For example, the rhtml template files of a Ruby on Rails application let you specify the markup for one row of an HTML table, and then tell the Ruby interpreter to generate a row with that markup for each row of a table retrieved as part of a database query. You can add about attributes and span wrapper elements to the table markup as easily as you can add td elements and align attributes, and pretty soon your Ruby on Rails application is automatically generating triples of machine-readable typed values similar to those in the widget shipment table shown above. The same principle works with PHP scripts, Active Server Pages, and HTML generated by XQuery servers.
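
The same idea can be sketched outside of any particular framework. The Python below, which uses only the standard library and is not Rails code, fills in one row template per database record; the rows list stands in for a query result, and ROW_TEMPLATE and render_row are names invented for this illustration.

```python
from html import escape

# One row of RDFa-enhanced table markup, with placeholders for the values.
ROW_TEMPLATE = (
    '<tr about="[fbi:{id}]">\n'
    '  <td><span property="fb:shipmentID">{id}</span></td>\n'
    '  <td><span property="fb:date" datatype="xs:date">{date}</span></td>\n'
    '  <td><span property="fb:amount" datatype="xs:integer">{amount}</span></td>\n'
    '  <td><span property="fb:anodized" datatype="xs:boolean"'
    ' content="{flag}">{label}</span></td>\n'
    '</tr>'
)

def render_row(shipment):
    """Render one shipment record as an RDFa-annotated table row."""
    anodized = bool(shipment["anodized"])
    return ROW_TEMPLATE.format(
        id=escape(str(shipment["id"])),
        date=escape(str(shipment["date"])),
        amount=int(shipment["amount"]),
        flag="true" if anodized else "false",  # machine-readable Boolean
        label="Yes" if anodized else "No",     # what the reader sees
    )

# Stand-in for rows returned by a database query.
rows = [
    {"id": "x432", "date": "2007-04-23", "amount": 34, "anodized": True},
    {"id": "x921", "date": "2007-04-25", "amount": 41, "anodized": False},
]

table_body = "\n".join(render_row(row) for row in rows)
```

The extra attributes live in the template, so they cost nothing per row: every record that comes back from the query gets the same about, property, datatype, and content markup.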

Weblogging platforms also provide customizable templates to control the HTML that they generate. My host provider offers Movable Type as a weblogging platform, so I've been using it for a few years. When I insert RDFa markup into a template with Movable Type tags such as <$MTEntryPermalink$> and <$MTSubCategoryPath$> inside that markup, the Movable Type engine replaces its tags with the appropriate values for each weblog entry page being generated. For example, I added some RDFa markup with Movable Type tags in the head section of the template, like this:

<meta about= "<$MTEntryPermalink$>">
  <link rel="trackback:ping" href="http://madskills.com/public/xml/rss/module/trackback/"/>
  <link rel="dc:identifier" href="<$MTEntryPermalink$>"/>
  <link rel="dc:subject" href='http://www.snee.com/bobdc.blog/<$MTSubCategoryPath$>'/>
</meta>

and I wrapped some span elements around body content, like this:

<h3 class="entry-header"><span property="dc:title"><$MTEntryTitle$></span></h3>

For one recent weblog entry, Movable Type generated this for the header:

<meta about= "http://www.snee.com/bobdc.blog/2007/03/new_eric_van_der_vlist_book_on.html">
  <link rel="trackback:ping" href="http://madskills.com/public/xml/rss/module/trackback/"/>
  <link rel="dc:identifier" href="http://www.snee.com/bobdc.blog/2007/03/new_eric_van_der_vlist_book_on.html"/>
  <link rel="dc:subject" href='http://www.snee.com/bobdc.blog/xml'/>
</meta>

and it generated this for the h3 part shown above:

<h3 class="entry-header"><span property="dc:title">New Eric van der Vlist book on 
Schematron out</span></h3>

An RDFa extractor gets (among other triples) the following RDF out of the document, shown here in RDF/XML:

<rdf:Description rdf:about="http://www.snee.com/bobdc.blog/2007/03/new_eric_van_der_vlist_book_on.html">
  <trackback:ping rdf:resource="http://madskills.com/public/xml/rss/module/trackback/"/>
  <dc:subject rdf:resource="http://www.snee.com/bobdc.blog/xml"/>
  <dc:identifier rdf:resource="http://www.snee.com/bobdc.blog/2007/03/new_eric_van_der_vlist_book_on.html"/>
  <dc:title rdf:datatype="http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral">New Eric van der 
   Vlist book on Schematron out</dc:title>
</rdf:Description>

Movable Type creates the RDFa I've shown here for each new file that it creates. And, for that matter, for each old file that it creates as well, because it's easy enough to tell Movable Type to regenerate all of them. So shortly after I made this change to the template, I had nice RDFa metadata in all the weblog entries I'd ever written on this system. To harvest that metadata, I could use a script with a single wget or curl call for each weblog entry to combine that metadata into a single file, and then I could create specialized tables of contents, reports, Topic Maps, and other applications around this content collection.
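
A harvesting script along those lines might look something like the following Python sketch, which uses urllib in place of wget or curl. It makes several assumptions that a real RDFa extractor would not: each page must parse as well-formed XHTML, only link elements inside a meta element and span elements with a property attribute are examined, and every such span is taken to describe the page itself. The names fetch, page_triples, and harvest are invented for this example.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

XH = "{http://www.w3.org/1999/xhtml}"

def fetch(url):
    """Download one page; this stands in for the wget or curl call."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")

def page_triples(url, xhtml):
    """Collect (subject, predicate, object) triples from one weblog entry page."""
    root = ET.fromstring(xhtml)
    triples = []
    for meta in root.iter(XH + "meta"):
        subject = meta.get("about", url)
        for link in meta.iter(XH + "link"):
            if link.get("rel") and link.get("href"):
                triples.append((subject, link.get("rel"), link.get("href")))
    for span in root.iter(XH + "span"):
        if span.get("property"):
            # Collapse line breaks and runs of spaces inside the title text.
            text = " ".join("".join(span.itertext()).split())
            triples.append((url, span.get("property"), text))
    return triples

def harvest(urls, fetch_page=fetch):
    """Merge the triples from every page into a single list."""
    merged = []
    for url in urls:
        merged.extend(page_triples(url, fetch_page(url)))
    return merged
```

Calling harvest with a list of entry URLs returns one combined list of triples. The fetch_page parameter is there so that the download step can be swapped out, for example to read pages from local copies instead of the network.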

Whenever you see HTML being generated automatically, you have an opportunity to create RDFa. Movie timetables, price lists, and so many other web pages where we look up information are generated from a backend database. This is fertile ground for easy RDFa generation, which could make RDFa's ease of incorporating proper RDF triples into straightforward HTML one of the great milestones in the building of the semantic web.