XML.com: XML From the Inside Out

Introducing RDFa, Part Two

April 04, 2007

In part 1 of this article, we saw that RDFa, a new syntax for representing RDF triples, can be embedded into arbitrary XML documents more easily than RDF/XML. RDFa is particularly good for embedding these triples into XHTML 2, which has a few new attributes that make it easier to use RDFa. Part 1 of this article showed several roles that RDFa metadata can play, describing metadata about the containing document and metadata about individual elements within the document. We also saw how RDFa can represent triples that use existing web page content as their subject and triples that specify new objects, which are useful for adding workflow metadata about a document or for specifying normalized values such as "2007-04-23" as metadata associated with a date displayed on a web page as "April 23, 2007". This article shows how to use RDFa to express additional, richer metadata, and we'll explore some ideas to automate the generation of RDFa markup.

Data Typing

One classic bit of metadata to add to a piece of data is an indication of that data's type. RDF lets you use datatypes from XML Schema Part 2, a spec that offers choices for most of the typical types you'll find in a programming language or database package. To add a datatype to the kind of RDFa markup that we saw in part 1 of this article, you simply add a datatype attribute.

For example, let's say you want to identify the types of the values in the following HTML table:

Shipment ID   Date         Amount   Anodized
x432          2007-04-23   34       Yes
x921          2007-04-25   41       No
x0731         2007-04-28   17       No

Because each row of the table is about a particular shipment of widgets, the first step when adding RDFa triples that describe the shipments is the addition of an about attribute to each row to name the subject of the triples for that shipment. A span element around each value in the row can include a property attribute to show what property that value indicates for that row's shipment, as shown in the source below.

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:fb="http://www.foobarco.com/ns/vocab#"
      xmlns:fbi="http://www.foobarco.com/ns/ID#"
      xmlns:xs="http://www.w3.org/2001/XMLSchema#">

<!-- head element, start of body and table...  -->

      <tr about="[fbi:x432]" >
        <td><span property="fb:shipmentID">x432</span></td>
        <td><span property="fb:date"
                  datatype="xs:date">2007-04-23</span></td>
        <td><span property="fb:amount"
                  datatype="xs:integer">34</span></td>
        <td><span property="fb:anodized"
                  datatype="xs:boolean" content="true">yes</span></td>
      </tr>

<!-- remaining rows of table... -->

Nearly all of this should be familiar from part 1 of this article. The only new bit of syntax in the RDFa-enhanced HTML of the table is the datatype attribute, which identifies the type of the value inside the span element. Now, an RDFa extraction routine can get these values and pass them along to an application that can do more with typed data than it can with a collection of strings. Also note that, to build on the use of the content attribute described in part 1, each row's last td element includes one of these attributes, with a value of "true" or "false" instead of "yes" or "no", which are not valid Boolean values. This way, the RDFa extractor will see proper Boolean values for the triples describing whether each widget shipment is anodized.
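To make the benefit of typed values concrete, here is a minimal Python sketch of what a consuming application might do once an extractor hands it a literal together with its datatype URI. The names to_python, parse_boolean, and CONVERTERS are hypothetical, not part of any RDFa tool, and the sketch covers only the three XML Schema types used in the table (it ignores details such as time zones on xs:date).

```python
from datetime import date

# Namespace URI bound to the xs: prefix in the example markup.
XS = "http://www.w3.org/2001/XMLSchema#"

def parse_boolean(text):
    # xs:boolean allows only these four lexical forms; "yes" and "no" are rejected.
    if text in ("true", "1"):
        return True
    if text in ("false", "0"):
        return False
    raise ValueError(f"not a valid xs:boolean: {text!r}")

# Hypothetical lookup table from datatype URI to a conversion function.
CONVERTERS = {
    XS + "date": date.fromisoformat,
    XS + "integer": int,
    XS + "boolean": parse_boolean,
}

def to_python(lexical, datatype_uri=None):
    """Convert a literal using its datatype URI; untyped literals stay strings."""
    converter = CONVERTERS.get(datatype_uri)
    return converter(lexical) if converter else lexical
```

With this in place, to_python("34", XS + "integer") gives a number that can be summed or compared, while the displayed text "yes" would be rejected as a Boolean, which is why the markup supplies content="true".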

With all the RDFa markup added, it may look verbose, but this kind of tabular representation of data is usually automatically generated from backend relational databases anyway. It wouldn't be much trouble to have the HTML generation routines add these extra attributes, thereby making the data more valuable to other applications. Below, we'll find out about some other applications that, because they generate HTML from templates, are excellent platforms for generating lots of useful RDFa markup with minimal trouble.
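As a rough illustration of how little work this is, the following Python sketch renders one shipment record as an RDFa-annotated row. The function name rdfa_row and its parameters are assumptions made for this example; a real system would more likely use its own templating language, and the fb: and fbi: prefixes are assumed to be declared on the enclosing html element as shown earlier.

```python
from datetime import date
from xml.sax.saxutils import escape, quoteattr

def rdfa_row(shipment_id, ship_date, amount, anodized):
    """Render one shipment (hypothetical record layout) as an RDFa table row."""
    subject = quoteattr(f"[fbi:{shipment_id}]")  # escaped and quoted attribute value
    flag = "true" if anodized else "false"       # machine-readable xs:boolean
    label = "yes" if anodized else "no"          # text shown to human readers
    return "\n".join([
        f"<tr about={subject}>",
        f'  <td><span property="fb:shipmentID">{escape(shipment_id)}</span></td>',
        f'  <td><span property="fb:date" datatype="xs:date">{ship_date.isoformat()}</span></td>',
        f'  <td><span property="fb:amount" datatype="xs:integer">{int(amount)}</span></td>',
        f'  <td><span property="fb:anodized" datatype="xs:boolean" content="{flag}">{label}</span></td>',
        "</tr>",
    ])
```

Calling rdfa_row("x432", date(2007, 4, 23), 34, True) for each database record would produce rows equivalent to the hand-written one above.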

The rev Attribute

In part 1 of this article, we saw that RDFa can use the a element's venerable but little-used rel attribute to indicate a resource's relationship to another resource—or, in RDF terms, to serve as the predicate of a triple, with the about attribute naming the subject and the href attribute naming the object. The a element's even less-used rev attribute expresses the opposite: a triple in which the href attribute names the subject and the about attribute of the element (or of the nearest ancestor with one) names the object.

Both rel and rev can go in the same element to describe two different relationships, such as the following one showing that the Supreme Court case Brown versus Board of Education overturned Plessy versus Ferguson:

<span about="http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=US&amp;vol=347&amp;invol=483"
      rel="fb:overturns" rev="fb:overturnedBy"
      href="http://caselaw.lp.findlaw.com/cgi-bin/getcase.pl?court=us&amp;vol=163&amp;invol=537"/>

(In a system using OWL, you wouldn't really have these rel and rev attributes in the same element. You'd just have one, and a separate rule declaring that fb:overturns and fb:overturnedBy are inverse properties. This way, either could be inferred from the other.) If an ontology doesn't include the relationship that you want to specify but does offer its inverse, the difference between rel and rev gives you flexibility that can be especially valuable if you're representing a relationship between a resource you can edit and one you can't. For example, if your ontology has fb:overturnedBy but not fb:overturns, you could add the following metadata to the document at http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=US&vol=347&invol=483:

<span rev="fb:overturnedBy"
      href="http://caselaw.lp.findlaw.com/cgi-bin/getcase.pl?court=us&amp;vol=163&amp;invol=537"/>

The lack of an about attribute (assuming that no ancestor element has one either) indicates that the document itself is the object of the triple: Supreme Court case 347 US 483 overturned 163 US 537.
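The direction of rel and rev can be summarized in a few lines of Python. This is only a sketch of the rule as described in this article, not the behavior of any particular RDFa parser; the function name link_triples is hypothetical, and the about argument stands for whatever subject is in effect (the document's own URI when no about attribute applies).

```python
def link_triples(about, href, rel=None, rev=None):
    """Return (subject, predicate, object) triples for one element's rel/rev."""
    triples = []
    if rel:
        # rel: the about resource is the subject, href is the object.
        triples.append((about, rel, href))
    if rev:
        # rev: the roles swap, so href is the subject and about is the object.
        triples.append((href, rev, about))
    return triples

BROWN = "http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=US&vol=347&invol=483"
PLESSY = "http://caselaw.lp.findlaw.com/cgi-bin/getcase.pl?court=us&vol=163&invol=537"

for triple in link_triples(BROWN, PLESSY, rev="fb:overturnedBy"):
    print(triple)
```

For the rev-only example, the single triple has the Plessy URL as its subject and the Brown document as its object.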

CURIEs

We use URIs to qualify names in order to make their context absolutely clear: for example, to show that one use of the word "title" comes from the Dublin Core namespace, and therefore refers to a published work, while another might come from a real estate document namespace and therefore refer to a deed to property. Writing out full URIs with each name (for example, http://purl.org/dc/elements/1.1/title) can make things pretty verbose, so a namespace declaration such as xmlns:dc="http://purl.org/dc/elements/1.1/" lets us use a prefix that will stand in for the URI that identifies a name's namespace. This lets us use shorter versions of our names while still being clear where they came from. We call a name such as dc:title a qualified name, or qname.

Qnames used in attribute values can lead to problems, because not all processing programs know that they should compare the prefix with the namespace declarations to see which namespace the name really comes from. It has worked out fine for XSLT use, because XSLT processors all know that qnames represent elements in source documents, but this has led to a problem in RDF use, because RDF uses URIs to identify the namespace of values as well as namespaces of elements and attributes.

For example, if I can set the standards for URL patterns at FooBar Company, and I want to represent employee number 4942 as http://www.foobarco.com/ns/empID#4942, there's no problem so far. If I say xmlns:fb="http://www.foobarco.com/ns/empID#", there's still no problem, but there is a problem if I represent the employee as "fb:4942", because it doesn't conform to the qname spec. Qnames were designed around XML names, or the names that you're allowed to make up for elements and attributes, and those names must begin with a letter or an underscore. So, to keep the use of namespace prefixes instead of full URIs legal with the existing specs, we can't use them with values that begin with a numeric digit.
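A quick way to see the restriction is to test candidate names against the qname pattern. The Python sketch below is a deliberate simplification: it accepts only ASCII letters, digits, underscores, hyphens, and periods, whereas the actual XML name rules also allow many non-ASCII characters. The function name is_qname is hypothetical.

```python
import re

# Simplified, ASCII-only approximation of an XML name without colons:
# it must start with a letter or underscore, never a digit.
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def is_qname(value):
    """Return True if value looks like prefix:local (or a bare local name)."""
    prefix, sep, local = value.partition(":")
    if not sep:
        return NCNAME.fullmatch(value) is not None
    return bool(NCNAME.fullmatch(prefix) and NCNAME.fullmatch(local))

print(is_qname("dc:title"))   # local part starts with a letter
print(is_qname("fb:4942"))    # local part starts with a digit
```

The first call reports a valid qname and the second does not, which is exactly the employee ID problem described above.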

Lots of important values begin with numeric digits. Besides employee IDs and other ID numbers, the CURIE Working Draft points out that International Press Telecommunications Council metadata often begins with a digit. To address this, we now have a new URI abbreviation syntax known as the Compact URI, or CURIE, syntax. CURIEs are pretty much like qnames with looser rules for what comes after the colon: you can use any character that can be in a URI. (One handy corollary of this is that qnames are valid CURIEs.) Just about the only bit of new syntax to learn for using CURIEs is the square brackets that go around a CURIE value when used where URIs are also allowed, such as in an about or href attribute:

<tr about="[fbi:x432]" >
  <td><span property="fb:shipmentID">x432</span></td>
  <td><span property="fb:date"
            datatype="xs:date">2007-04-23</span></td>
  <td><span property="fb:amount"
            datatype="xs:integer">34</span></td>
  <td><span property="fb:anodized"
            datatype="xs:boolean" content="true">yes</span></td>
</tr>

The square brackets seem to be a nod to the syntax used to represent links in wikis. The example above would work exactly the same without the square brackets. But now, if you ever see square brackets, you'll know why they're there.
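Expanding a CURIE back into a full URI amounts to swapping the prefix for its namespace URI and appending whatever follows the colon. The Python sketch below assumes a plain dictionary of prefix declarations and a hypothetical function name, expand_curie; it strips the square brackets when they are present and makes no attempt to validate the characters after the colon.

```python
def expand_curie(curie, prefixes):
    """Expand prefix:reference (optionally in square brackets) to a full URI."""
    if curie.startswith("[") and curie.endswith("]"):
        curie = curie[1:-1]  # drop the bracket wrapper
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"no declared prefix in {curie!r}")
    # Unlike a qname, the reference may start with a digit.
    return prefixes[prefix] + reference

# Prefix declarations taken from the examples in this article.
prefixes = {
    "fb": "http://www.foobarco.com/ns/empID#",
    "fbi": "http://www.foobarco.com/ns/ID#",
}

print(expand_curie("[fbi:x432]", prefixes))
print(expand_curie("fb:4942", prefixes))
```

The second call yields http://www.foobarco.com/ns/empID#4942, the employee URI that could not be written as a qname.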
