Introducing RDFa, Part Two
April 4, 2007
In part 1 of this
article, we saw that RDFa, a new syntax for representing RDF triples, can be embedded
into
arbitrary XML documents more easily than RDF/XML. RDFa is particularly good for embedding
these triples into XHTML 2, which has a few new attributes that make it easier to
use RDFa.
Part 1 of this article showed several roles that RDFa metadata can play, describing
metadata
about the containing document and metadata about individual elements within the document.
We
also saw how RDFa can represent triples that use existing web page content as their
subject
and triples that specify new objects, which are useful for adding workflow metadata
about a
document or for specifying normalized values such as "2007-04-23"
as metadata
associated with a date displayed on a web page as "April 23, 2007". This article shows
how
to use RDFa to express additional, richer metadata, and we'll explore some ideas to
automate
the generation of RDFa markup.
Data Typing
One classic bit of metadata to add to a piece of data is an indication of that data's
type.
RDF lets
you use datatypes from XML Schema Part 2, a spec that offers choices for most of the typical types you'll
find in a programming language or database package. To add a datatype to the kind
of RDFa
markup that we saw in part 1 of this article, you simply add a datatype
attribute.
For example, let's say you want to identify the types of the values in the following HTML table:
Shipment ID | Date | Amount | Anodized |
---|---|---|---|
x432 | 2007-04-23 | 34 | Yes |
x921 | 2007-04-25 | 41 | No |
x0731 | 2007-04-28 | 17 | No |
Because each row of the table is about a particular shipment of widgets, the first
step
when adding RDFa triples that describe the shipments is the addition of an
about
attribute to each row to name the subject of the triples for that
shipment. A span
element around each value in the row can include a
property
attribute to show what property that value indicates for that row's
shipment, as shown in the source below.
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:fb="http://www.foobarco.com/ns/vocab#"
      xmlns:fbi="http://www.foobarco.com/ns/ID#"
      xmlns:xs="http://www.w3.org/2001/XMLSchema#">
  <!-- head element, start of body and table... -->
  <tr about="[fbi:x432]">
    <td><span property="fb:shipmentID">x432</span></td>
    <td><span property="fb:date" datatype="xs:date">2007-04-23</span></td>
    <td><span property="fb:amount" datatype="xs:integer">34</span></td>
    <td><span property="fb:anodized" datatype="xs:boolean" content="true">yes</span></td>
  </tr>
  <!-- remaining rows of table... -->
Nearly all of this should be familiar from part 1 of this article. The only new bit
of
syntax in the RDFa-enhanced HTML of the table is the datatype
attribute, which
identifies the type of the value inside the span
element. Now, an RDFa
extraction routine can get these values and pass them along to an application that
can do
more with typed data than it can with a collection of strings. Also note that, building on
the use of the content attribute described in part 1, the span element in each row's last
td element carries one of these attributes with a value of "true" or "false", because the
displayed "yes" and "no" are not valid Boolean values. This way, the RDFa
extractor will see proper Boolean values for the triples describing whether each widget
shipment is anodized.
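To make the idea of passing typed data to an application more concrete, here is a minimal Python sketch of what such a routine might do with one row of the table. The function name typed_values and the CONVERTERS table are assumptions made up for this illustration, not part of any existing RDFa extractor, and the sketch matches the literal strings xs:date, xs:integer, and xs:boolean instead of resolving prefixes through the namespace declarations as a real extractor would.

```python
import xml.etree.ElementTree as ET
from datetime import date

XHTML = "{http://www.w3.org/1999/xhtml}"

# One row from the shipment table shown above.
ROW = """<tr xmlns="http://www.w3.org/1999/xhtml" about="[fbi:x432]">
  <td><span property="fb:shipmentID">x432</span></td>
  <td><span property="fb:date" datatype="xs:date">2007-04-23</span></td>
  <td><span property="fb:amount" datatype="xs:integer">34</span></td>
  <td><span property="fb:anodized" datatype="xs:boolean" content="true">yes</span></td>
</tr>"""

# Hypothetical lookup table: datatype attribute value -> Python converter.
CONVERTERS = {
    "xs:date": date.fromisoformat,
    "xs:integer": int,
    "xs:boolean": lambda text: text in ("true", "1"),
}

def typed_values(row_xml):
    """Return the row's subject and a dict of property -> typed Python value."""
    row = ET.fromstring(row_xml)
    subject = row.get("about")
    values = {}
    for span in row.iter(XHTML + "span"):
        prop = span.get("property")
        if prop is None:
            continue
        # A content attribute, when present, overrides the displayed text.
        lexical = span.get("content", span.text)
        # Values without a datatype attribute stay plain strings.
        convert = CONVERTERS.get(span.get("datatype"), str)
        values[prop] = convert(lexical)
    return subject, values

subject, values = typed_values(ROW)
print(subject, values)
```

The application receiving these values gets a date, an integer, and a Boolean to work with, and the displayed "yes" never reaches it because the content attribute takes precedence.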
With all the RDFa markup added, it may look verbose, but this kind of tabular representation of data is usually automatically generated from backend relational databases anyway. It wouldn't be much trouble to have the HTML generation routines add these extra attributes, thereby making the data more valuable to other applications. Below, we'll find out about some other applications that, because they generate HTML from templates, are excellent platforms for generating lots of useful RDFa markup with minimal trouble.
The rev Attribute
In part 1 of this article, we saw that RDFa can use the a
element's venerable
but little-used rel
attribute to indicate a resource's relationship to another
resource—or, in RDF terms, to serve as the predicate of a triple, with the
about
attribute naming the subject and the href
attribute naming
the object. The a
element's even less-used rev
attribute expresses
the opposite: a triple in which the href
attribute names the subject and the
about
attribute of the element (or of the nearest ancestor with one) names
the object.
Both rel
and rev
can go in the same element to describe two
different relationships, such as the following one showing that the Supreme Court
case Brown
versus Board of Education overturned Plessy versus Ferguson:
<span about="http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=US&amp;vol=347&amp;invol=483"
      rel="fb:overturns"
      rev="fb:overturnedBy"
      href="http://caselaw.lp.findlaw.com/cgi-bin/getcase.pl?court=us&amp;vol=163&amp;invol=537"/>
(In a system using OWL, you wouldn't really have these rel
and
rev
attributes in the same element. You'd just have one, and a separate rule
declaring that fb:overturns
and fb:overturnedBy
are inverse
properties. This way, either could be inferred from the other.) If an ontology doesn't
include the relationship that you want to specify but does offer its inverse, the
difference
between rel
and rev
gives you flexibility that can be especially
valuable if you're representing a relationship between a resource you can edit and
one you
can't. For example, if your ontology has fb:overturnedBy but not fb:overturns, you could
add the following metadata to the document at
http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=US&vol=347&invol=483:
<span rev="fb:overturnedBy"
      href="http://caselaw.lp.findlaw.com/cgi-bin/getcase.pl?court=us&amp;vol=163&amp;invol=537"/>
The lack of an about
attribute (assuming that no ancestor element has one
either) indicates that the document itself is the object of the triple: Supreme Court
case
347 US 483 overturned 163 US 537.
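The direction of the rel and rev triples can be summarized in a few lines of Python. This is only a sketch under simplifying assumptions: the function name link_triples is invented for this illustration, the attributes arrive as a plain dictionary, and the caller supplies the document's URI to use when no about attribute is present (the search through ancestor elements is left out).

```python
def link_triples(attrs, document_uri):
    """Return (subject, predicate, object) tuples for one linking element."""
    # With no about attribute, the containing document is the resource described.
    about = attrs.get("about", document_uri)
    href = attrs["href"]
    triples = []
    if "rel" in attrs:
        # rel: the about resource is the subject, the href resource the object.
        triples.append((about, attrs["rel"], href))
    if "rev" in attrs:
        # rev: the roles are swapped.
        triples.append((href, attrs["rev"], about))
    return triples

BROWN = "http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=US&vol=347&invol=483"
PLESSY = "http://caselaw.lp.findlaw.com/cgi-bin/getcase.pl?court=us&vol=163&invol=537"

# The element carrying both rel and rev.
both = link_triples(
    {"about": BROWN, "rel": "fb:overturns", "rev": "fb:overturnedBy", "href": PLESSY},
    document_uri=BROWN,
)

# The rev-only element, placed inside the Brown document with no about attribute.
rev_only = link_triples({"rev": "fb:overturnedBy", "href": PLESSY}, document_uri=BROWN)

for triple in both + rev_only:
    print(triple)
```

Both elements yield the same fb:overturnedBy triple, with the Plessy URI as subject and the Brown URI as object.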
CURIEs
We use URIs to qualify names in order to make their context absolutely clear—for
example, to show that one use of the word "title" comes from the Dublin Core namespace,
and
therefore refers to a published work, while another might come from a real estate
document
namespace and therefore refer to a deed to property. Writing out full URIs with each
name
(for example, http://purl.org/dc/elements/1.1/title) can make things pretty verbose,
so a
namespace declaration such as xmlns:dc="http://purl.org/dc/elements/1.1/"
lets
us use a prefix that will stand in for the URI that identifies a name's namespace.
This lets
us use shorter versions of our names while still being clear where they came from.
We call a
name such as dc:title
a qualified name, or qname.
Qnames used in attribute values can lead to problems, because not all processing programs know that they should compare the prefix with the namespace declarations to see which namespace the name really comes from. It has worked out fine for XSLT use, because XSLT processors all know that qnames represent elements in source documents, but this has led to a problem in RDF use, because RDF uses URIs to identify the namespace of values as well as namespaces of elements and attributes.
For example, if I can set the standards for URL patterns at FooBar Company, and I want to
represent employee number 4942 as http://www.foobarco.com/ns/empID#4942, there's no
problem so far. If I say xmlns:fb="http://www.foobarco.com/ns/empID#", there's still no
problem, but there is a problem if I represent the employee as "fb:4942", because it doesn't
conform to the qname spec. Qnames were designed around XML names, or the names that you're
allowed to make up for elements and attributes, and those names must begin with a letter
or an underscore. So, to keep the use of namespace prefixes instead of full URIs legal with
the existing specs, we can't use them with values that begin with a numeric digit.
Lots of important values begin with numeric digits. Besides employee IDs and other ID
numbers, the CURIE Working Draft points out that International Press Telecommunications
Council metadata often begins with a digit. To address this, we now have a new URI
abbreviation syntax known as the Compact URI, or CURIE, syntax. CURIEs are pretty much like
qnames with looser rules for what comes after the colon: you can use any character that can
be in a URI. (One handy corollary of this is that qnames are valid CURIEs.) Just about the
only bit of new syntax to learn for using CURIEs is the square brackets that go around a
CURIE value when used where URIs are also allowed, such as in an about or href attribute:
<tr about="[fbi:x432]">
  <td><span property="fb:shipmentID">x432</span></td>
  <td><span property="fb:date" datatype="xs:date">2007-04-23</span></td>
  <td><span property="fb:amount" datatype="xs:integer">34</span></td>
  <td><span property="fb:anodized" datatype="xs:boolean" content="true">yes</span></td>
</tr>
The square brackets seem to be a nod to the syntax used to represent links in wikis. The example above would work exactly the same without the square brackets. But now, if you ever see square brackets, you'll know why they're there.
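For readers who think in code, the following Python sketch shows one way a processor might expand a CURIE into a full URI, and why "fb:4942" fails a qname check but works as a CURIE. The names expand_curie and is_qname, the two prefix dictionaries, and the ASCII-only qname pattern are all assumptions made for this illustration; the real qname grammar also allows many non-ASCII characters.

```python
import re

# Prefix declarations from the shipment and employee examples.
SHIPMENT_PREFIXES = {
    "fb": "http://www.foobarco.com/ns/vocab#",
    "fbi": "http://www.foobarco.com/ns/ID#",
}
EMPLOYEE_PREFIXES = {"fb": "http://www.foobarco.com/ns/empID#"}

# Simplified, ASCII-only approximation of a qname: both halves must
# start with a letter or underscore.
QNAME = re.compile(r"[A-Za-z_][\w.-]*:[A-Za-z_][\w.-]*")

def is_qname(value):
    return QNAME.fullmatch(value) is not None

def expand_curie(value, prefixes):
    """Expand prefix:reference (optionally in square brackets) to a full URI."""
    if value.startswith("[") and value.endswith("]"):
        value = value[1:-1]
    prefix, separator, reference = value.partition(":")
    if not separator or prefix not in prefixes:
        raise ValueError(f"no namespace declared for {value!r}")
    # Whatever follows the colon is appended as-is, digits included.
    return prefixes[prefix] + reference

print(expand_curie("[fbi:x432]", SHIPMENT_PREFIXES))
print(expand_curie("fb:4942", EMPLOYEE_PREFIXES))
print(is_qname("fbi:x432"), is_qname("fb:4942"))
```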
Reification (Sort of)
Reification is the assignment of metadata to metadata. This sounds pretty abstract, but if you consider that metadata is data to track, just like any other, it's easier to see the value of reification. For example, if a document has an RDF triple saying, "this document was created by Richard Mutt," another triple saying that the triple about the document's creator was created on 2007-04-19 would be metadata about that metadata.
RDFa's designers had reification on the original list of RDF features that RDFa would eventually be able to represent, but they're having second thoughts, and the latest version of the RDFa Primer no longer mentions it. The plan for RDFa was always to make it a subset of RDF, and reification may not make the cut. (XML came to exist via a similar cutting out of potentially complex and confusing features, as its designers were creating a subset of SGML.) Still, I couldn't resist demonstrating a reification-like technique with RDFa that can be useful in web or other hypertext applications.
An HTML a
linking element describes a relationship between the document
containing the a
element and the resource that it points to. If you're really
interested in tracking metadata about your hypertext links, you can add an
about
attribute to the a
element and add empty span
element children, as shown here, to store metadata about the linking element.
<p>Mr. Breakfast has a nice
  <a about="link23" href="http://www.mrbreakfast.com/article.asp?articleid=17">
    <span property="fb:addedBy" content="BD"/>
    <span property="fb:lastChecked" content="2007-03-15"/>
    scrambled eggs recipe</a>.</p>
This is not really reification because it's not metadata about metadata. In this case, it's
metadata about a specific HTML element: the a element with an about value of "link23", which
happens to link to another resource. It's still useful, and may whet your appetite for proper
reification as a feature of more full-featured RDF syntaxes.
Showing Some Class
In addition to specifying properties and values of a resource, RDFa can identify the
resource as an individual of a particular class. When you have an ontology of information
about a set of classes, you have additional information about individuals of those
classes,
so knowing an individual's class membership lets you do more with it. For example, if you
know that a resource is a widgetShipment, the ontology's information about this class
may include relevant storage and safety details.
This is a nice example of RDFa building on an obvious bit of HTML syntax to add some
RDF
power: you simply use the class
attribute, which has been around since HTML 2.0.
For instance, the class attribute in the following markup tells us that the fbi:x432
resource is an individual of the fb:widgetShipment class:
<tr about="[fbi:x432]" class="fb:widgetShipment">
  <td><span property="fb:shipmentID">x432</span></td>
  <td><span property="fb:date" datatype="xs:date">2007-04-23</span></td>
  <td><span property="fb:amount" datatype="xs:integer">34</span></td>
</tr>
Extracting the triples and converting them to RDF/XML would result in something like this:
<fb:widgetShipment rdf:about="http://www.foobarco.com/ns/ID#x432">
  <fb:anodized rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</fb:anodized>
  <fb:amount rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">34</fb:amount>
  <fb:date rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2007-04-23</fb:date>
</fb:widgetShipment>
(Because this is a newer aspect of RDFa, no RDFa extractors support it as of this writing, but I'm looking forward to it being supported in the future.)
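Until extractors catch up, the intended result is easy to sketch by hand. The Python below shows the single rdf:type triple that the class attribute is meant to contribute. The function names expand and type_triple are assumptions for this illustration only, and the sketch assumes the class attribute holds exactly one prefixed name; ordinary CSS class names would need to be filtered out first.

```python
import xml.etree.ElementTree as ET

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Prefix declarations from the shipment example.
PREFIXES = {
    "fb": "http://www.foobarco.com/ns/vocab#",
    "fbi": "http://www.foobarco.com/ns/ID#",
}

def expand(curie):
    """Expand prefix:reference, with or without square brackets, to a URI."""
    prefix, _, reference = curie.strip("[]").partition(":")
    return PREFIXES[prefix] + reference

def type_triple(element_xml):
    """Return the rdf:type triple implied by about plus class, or None."""
    element = ET.fromstring(element_xml)
    about = element.get("about")
    class_name = element.get("class")
    if about is None or class_name is None:
        return None
    return (expand(about), RDF_TYPE, expand(class_name))

triple = type_triple('<tr about="[fbi:x432]" class="fb:widgetShipment"/>')
print(triple)
```

In RDF/XML, a typed node element such as fb:widgetShipment is shorthand for exactly this kind of rdf:type triple.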
Auto-Generation of RDFa Metadata
All of my examples so far have been hand-coded, but when you consider the huge infrastructure of HTML-generating systems, it's not difficult to find opportunities for automatically generating large amounts of useful, machine-readable RDF triples inside of web pages. Templating languages typically give you a way to add HTML (or, if you prefer, XHTML) markup around the templating language's codes that indicate which values to plug in from another data source.
For example, the rhtml template files of a Ruby on Rails application let you specify
the
markup for one row of an HTML table, and then tell the Ruby interpreter to generate
a row
with that markup for each row of a table retrieved as part of a database query. You
can add
about
attributes and span
wrapper elements to the table markup
as easily as you can add td
elements and align
attributes, and
pretty soon your Ruby on Rails application is automatically generating triples of
machine-readable typed values similar to those in the widget shipment table shown
above. The
same principle works with PHP scripts, Active Server Pages, and HTML generated by
XQuery
servers.
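As a rough stand-in for such a template, here is a Python sketch that generates the RDFa-enhanced rows of the widget shipment table from query results. The SHIPMENTS list takes the place of rows returned by a database query, and the function name shipment_row is an assumption for this illustration; an rhtml or PHP template would express the same loop in its own syntax.

```python
from html import escape

# Stand-in for the rows a database query would return.
SHIPMENTS = [
    {"id": "x432", "date": "2007-04-23", "amount": 34, "anodized": True},
    {"id": "x921", "date": "2007-04-25", "amount": 41, "anodized": False},
    {"id": "x0731", "date": "2007-04-28", "amount": 17, "anodized": False},
]

def shipment_row(shipment):
    """Render one table row with about, property, datatype, and content attributes."""
    sid = escape(shipment["id"])
    day = escape(shipment["date"])
    amount = int(shipment["amount"])
    # Machine-readable Boolean goes in content; the human-readable label is displayed.
    flag = "true" if shipment["anodized"] else "false"
    label = "Yes" if shipment["anodized"] else "No"
    return (
        f'<tr about="[fbi:{sid}]">\n'
        f'  <td><span property="fb:shipmentID">{sid}</span></td>\n'
        f'  <td><span property="fb:date" datatype="xs:date">{day}</span></td>\n'
        f'  <td><span property="fb:amount" datatype="xs:integer">{amount}</span></td>\n'
        f'  <td><span property="fb:anodized" datatype="xs:boolean" content="{flag}">{label}</span></td>\n'
        f'</tr>'
    )

rows = [shipment_row(shipment) for shipment in SHIPMENTS]
print("\n".join(rows))
```

The extra attributes cost a few characters in the template and nothing at all for whoever maintains the data.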
Weblogging platforms also provide customizable templates to control the HTML that
they
generate. My host provider offers Movable Type as a weblogging platform, so I've been
using
it for a few years. When I insert RDFa markup into a template with Movable Type
tags such as <$MTEntryPermalink$> and <$MTSubCategoryPath$> inside that markup, the Movable
Type engine replaces its tags with the appropriate values for each weblog entry page being generated.
For example, I added some RDFa markup with Movable Type tags in the head
section of the template, like this:
<meta about="<$MTEntryPermalink$>">
  <link rel="trackback:ping" href="http://madskills.com/public/xml/rss/module/trackback/"/>
  <link rel="dc:identifier" href="<$MTEntryPermalink$>"/>
  <link rel="dc:subject" href='http://www.snee.com/bobdc.blog/<$MTSubCategoryPath$>'/>
</meta>
and I wrapped some span
elements around body
content, like
this:
<h3 class="entry-header"><span property="dc:title"><$MTEntryTitle$></span></h3>
For one recent weblog entry, Movable Type generated this for the header:
<meta about="http://www.snee.com/bobdc.blog/2007/03/new_eric_van_der_vlist_book_on.html">
  <link rel="trackback:ping" href="http://madskills.com/public/xml/rss/module/trackback/"/>
  <link rel="dc:identifier" href="http://www.snee.com/bobdc.blog/2007/03/new_eric_van_der_vlist_book_on.html"/>
  <link rel="dc:subject" href='http://www.snee.com/bobdc.blog/xml'/>
</meta>
and it generated this for the h3
part shown
above:
<h3 class="entry-header"><span property="dc:title">New Eric van der Vlist book on Schematron out</span></h3>
An RDFa extractor gets (among other triples) the following RDF out of the document, shown here in RDF/XML:
<rdf:Description rdf:about="http://www.snee.com/bobdc.blog/2007/03/new_eric_van_der_vlist_book_on.html">
  <trackback:ping rdf:resource="http://madskills.com/public/xml/rss/module/trackback/"/>
  <dc:subject rdf:resource="http://www.snee.com/bobdc.blog/xml"/>
  <dc:identifier rdf:resource="http://www.snee.com/bobdc.blog/2007/03/new_eric_van_der_vlist_book_on.html"/>
  <dc:title rdf:datatype="http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral">New Eric van der Vlist book on Schematron out</dc:title>
</rdf:Description>
Movable Type creates the RDFa I've shown here for each new file that it creates. And, for that matter, for each old file that it creates as well, because it's easy enough to tell Movable Type to regenerate all of them. So shortly after I made this change to the template, I had nice RDFa metadata in all the weblog entries I'd ever written on this system. To harvest that metadata, I could use a script with a single wget or curl call for each weblog entry to combine that metadata into a single file, and then I could create specialized tables of contents, reports, Topic Maps, and other applications around this content collection.
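A harvesting script along those lines might look something like the following Python sketch. It is deliberately incomplete: the retrieval step (wget, curl, or a library call) is left out so the example runs offline, the input is assumed to be the meta fragment already cut out of each page with no default namespace on it, and the function names head_triples and harvest are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# The header fragment generated for one weblog entry, as shown above.
ENTRY = """<meta about="http://www.snee.com/bobdc.blog/2007/03/new_eric_van_der_vlist_book_on.html">
  <link rel="trackback:ping" href="http://madskills.com/public/xml/rss/module/trackback/"/>
  <link rel="dc:identifier" href="http://www.snee.com/bobdc.blog/2007/03/new_eric_van_der_vlist_book_on.html"/>
  <link rel="dc:subject" href='http://www.snee.com/bobdc.blog/xml'/>
</meta>"""

def head_triples(meta_xml):
    """Return (subject, predicate, object) tuples for the link children of one meta element."""
    meta = ET.fromstring(meta_xml)
    subject = meta.get("about")
    return [
        (subject, link.get("rel"), link.get("href"))
        for link in meta.iter("link")
        if link.get("rel") is not None
    ]

def harvest(fragments):
    """Combine the triples from many entries into one list."""
    combined = []
    for fragment in fragments:
        combined.extend(head_triples(fragment))
    return combined

triples = harvest([ENTRY])
for triple in triples:
    print(triple)
```

With one fragment per weblog entry passed to harvest, the combined list becomes the raw material for a specialized table of contents or a report grouped by dc:subject.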
Whenever you see HTML being generated automatically, you have an opportunity to create RDFa. Movie timetables, price lists, and so many other web pages where we look up information are generated from a backend database. This is fertile ground for easy RDFa generation, which could make RDFa's ease of incorporating proper RDF triples into straightforward HTML one of the great milestones in the building of the semantic web.