Converting XML to RDF

September 1, 2004

Last month we looked at the REST interface to Amazon Web Services (AWS), and how an f parameter in a URL calling this interface can point to an XSLT stylesheet. If you set it to "xml" instead of pointing it at a stylesheet, Amazon returns data in formats that conform to either the "lite" or "heavy" DTDs (and corresponding schemas) included with their SDK; if you do point it at a stylesheet, their server applies the stylesheet to that data before returning the result to you.

In that column, I promised to show how to use this feature to pull RDF from the Amazon servers. I had written a stylesheet called aws2rdf.xsl, but the more I thought about it the more I realized that such a stylesheet needed very few dependencies on the Amazon Web Services DTDs, and that it could convert a wide variety of XML to RDF. So, I revised and renamed it to xml2rdf.xsl, and we'll look at it here.

RDF and Data-Oriented XML

RDF/XML sometimes looks strange, but it doesn't need to. RDF-friendly XML adds a few things to otherwise typical-looking XML so that an RDF parser can treat all of its information as RDF triples.
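
As a rough illustration of how little needs to be added, here is a minimal sketch of RDF-friendly XML. The example.org URIs and the ex: element names are invented for this illustration and do not come from Amazon or from the stylesheet discussed below.

```xml
<!-- Hypothetical example: all example.org URIs and ex: names are made up. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/terms/">
  <!-- rdf:about names the subject of every triple inside this element. -->
  <rdf:Description rdf:about="http://example.org/books/1">
    <!-- Predicate ex:title with a literal object. -->
    <ex:title>An Example Title</ex:title>
    <!-- Predicate ex:homepage with a resource (URI) object. -->
    <ex:homepage rdf:resource="http://example.org/books/1/home"/>
  </rdf:Description>
</rdf:RDF>
```

An RDF parser should read this as two triples that share the subject http://example.org/books/1: one whose predicate is http://example.org/terms/title with a literal value, and one whose predicate is http://example.org/terms/homepage with a URI value. Apart from the rdf:RDF and rdf:Description wrappers, the namespace declarations, and the rdf:resource attribute, it looks like ordinary XML.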

This is not too difficult as long as your XML has no text nodes with elements as siblings. For example, <p>this p element has <emph>three</emph> text nodes and <emph>two</emph> emph elements</p>. XML developers often call this "mixed content" because the p element's contents are a mix of text nodes and elements. The official definition of mixed content, however, is any element type that may have any character data, <p>even a p element like this</p>.

An element that has only character data and isn't "mixed" in the more popular sense can often be converted to RDF/XML without much trouble. Many applications use these elements, along with element content container elements that group these elements, to represent transactions and database records — what people often call "data-oriented" XML, despite the fact that all XML is data (or rather, data objects). The kind of XML used to describe narrative content for publication in one medium or another — what people call "document-oriented" XML, despite all XML being in documents — is more likely to have elements and text nodes as siblings of each other (like in the first p example in the preceding paragraph), and is not a good candidate for automated conversion to RDF.
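
To make the distinction concrete, here is a sketch of the kind of "data-oriented" XML described above. The order record and all of its element names are hypothetical.

```xml
<!-- Hypothetical record: element names and values are invented. -->
<order>
  <!-- Leaf elements hold only character data. -->
  <orderID>1001</orderID>
  <orderDate>2004-09-01</orderDate>
  <!-- A container with element content groups related leaf elements. -->
  <customer>
    <name>Example Customer</name>
    <city>Example City</city>
  </customer>
</order>
```

No element here has a text node (other than whitespace) as a sibling of an element, so each leaf can map to a property with a literal value and each container can map to a nested group of properties.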

The data being returned by Amazon Web Services, which obviously came from relational databases somewhere, is a fine candidate for conversion to RDF. Besides, Amazon is in the business of selling physical objects, and its site provides metadata about those objects. Having that data in RDF-friendly XML makes it easier to link this metadata with other metadata, thereby extending the potential reach of the Semantic Web.

A Somewhat Generic XML to RDF Converter

When processing XML documents that are good candidates for conversion to RDF/XML, a stylesheet can handle certain tasks generically. Other tasks require modifications to the conversion stylesheet to prepare it for the specific input that's coming. The generic parts of the stylesheet below, which come after the comment beginning with the words "End of template rules addressing," automate the advice given in the XML.com article Make Your XML RDF-Friendly. Rule numbers mentioned below refer to the numbered pieces of advice in that article.

The first half of the stylesheet has the parts that require editing to prepare the stylesheet for your particular source documents. The bold parts show my customizations to tailor the stylesheet for documents returned by Amazon Web Services:

As Rule Number 1 says, make sure that every element comes from a specific namespace, so the namespace must be declared. I clipped the filename off the URIs used for the U.S./Japan versions of the DTDs and schemas to come up with http://xml.amazon.com/schemas3/ as an Amazon Web Services namespace URI.
The result of the transformation will be metadata about a single resource, and the "resourceURL" variable is where the stylesheet stores the URL of that resource. While there are several variations on the basic URI that take you to the web page describing a particular book on Amazon, the developer's kit describes a format of http://www.amazon.com/exec/obidos/ASIN/ followed by the ASIN number, so the stylesheet below constructs this URL by appending the ASIN number (using an XPath expression to pull it out of the XML) to that URI string.
The generic code later in the stylesheet uses the namespace prefix for the described resource's properties in several different places, so storing it in a variable lets us leave the generic code alone. This should be the prefix declared with the namespace URI added to the xsl:stylesheet start-tag — in this case, "aws."
You won't necessarily want every element in your source document passed along to your RDF version, so add the names of the ones to suppress to the stylesheet's first template rule.
Similarly, certain container elements in the source won't add anything to the RDF version, so adding their names to the second template rule tells the stylesheet to pass along their contents without their enclosing tags. (As we'll see, certain containers are very useful, so we'll keep them.)

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
                xmlns:aws="http://xml.amazon.com/schemas3/">

  <!-- Convert XML to RDF that all describes one resource. Template
       rules after "End of template rules" comment are generic; those
       before are for customizing treatment of source XML
       (e.g. deleting elements). -->

  <!-- URL of the resource being described. -->
  <xsl:variable name="resourceURL">
    <xsl:text>http://www.amazon.com/exec/obidos/ASIN/</xsl:text>
    <xsl:value-of select="/ProductInfo/Details/Asin"/>
  </xsl:variable>

  <!-- Namespace prefix for predicates. Needs a corresponding xmlns
       declaration in the xsl:stylesheet start-tag above. If your set
       of predicates comes from more than one namespace, then this
       stylesheet is too simple for your needs. -->
  <xsl:variable name="nsPrefix">aws</xsl:variable>

  <!-- Elements to suppress. priority attribute necessary 
       because of template that adds rdf:parseType below. -->
  <xsl:template priority="1" match="Request|TotalResults|TotalPages"/>

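  <!-- Just pass along contents without tags. The priority attribute
       keeps a listed container that has no attributes from matching
       the rdf:parseType template below instead of this one. -->
  <xsl:template priority="1" match="ProductInfo|Details">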
    <xsl:apply-templates/>
  </xsl:template>

  <!-- ========================================================
       End of template rules addressing specific element types.
       Remaining template rules are generic xml2rdf template rules. 
       ======================================================== -->

  <xsl:template match="/">
    <rdf:RDF>
      <rdf:Description
       rdf:about="{$resourceURL}">
        <xsl:apply-templates/>
      </rdf:Description>
    </rdf:RDF>
  </xsl:template>

  <!-- Elements with URLs as content: convert them to store 
       their value in rdf:resource attribute of empty element -->
  <xsl:template match="*[starts-with(.,'http://') or starts-with(.,'urn:')]">
    <xsl:element name="{$nsPrefix}:{name()}">
      <xsl:attribute name="rdf:resource">
        <xsl:value-of select="."/>
      </xsl:attribute>
    </xsl:element>
  </xsl:template>

  <!-- Container elements: if the element has children and an element parent 
       (i.e. it isn't the root element) and it has no attributes, add
       rdf:parseType = "Resource". -->

  <xsl:template match="*[* and ../../* and not(@*)]">
    <xsl:element name="{$nsPrefix}:{name()}">
      <xsl:attribute name="rdf:parseType">Resource</xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <!-- Copy remaining elements, putting them in a namespace. -->
  <xsl:template match="*">
    <xsl:element name="{$nsPrefix}:{name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>

The generic part of the stylesheet has four template rules:

The first template rule in the generic part (the third template rule in the stylesheet) wraps the contents in an rdf:RDF element and identifies the resource being described.
The next template rule implements RDF-friendliness Rule Number 4, converting any elements whose contents consist of a URI (or rather, any elements whose contents begin with "http://" or "urn:") into empty elements with the URI stored in an rdf:resource attribute.
The stylesheet's second-to-last template rule follows the advice given near the end of RDF-friendliness Rule 6 by adding an rdf:parseType attribute with a value of "Resource" to container elements that aren't the root element of the document. This way, these containers won't throw off the striping pattern of nested predicate/object pairs that an RDF processor expects to find in an RDF/XML document.
The stylesheet's last template rule copies any elements not covered by the other template rules to the result tree with the namespace prefix from the nsPrefix variable added onto their names.
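
To see the rules working together, it may help to trace a small input by hand. The document below is a simplified, hypothetical sketch and not an actual Amazon response: only ProductInfo, Details, Asin, and TotalResults are names taken from the stylesheet, while the url attribute, ProductName, Artists, Artist, ImageUrlSmall, and all values other than the ASIN are assumptions made for illustration.

```xml
<!-- Hypothetical, simplified input; not a real AWS response. -->
<ProductInfo>
  <TotalResults>1</TotalResults>
  <Details url="http://www.amazon.com/exec/obidos/ASIN/B00005Q567">
    <Asin>B00005Q567</Asin>
    <ProductName>Example Title</ProductName>
    <Artists>
      <Artist>Example Artist</Artist>
    </Artists>
    <ImageUrlSmall>http://images.example.com/small.jpg</ImageUrlSmall>
  </Details>
</ProductInfo>
```

Applying the stylesheet above to that input should give a result along these lines (whitespace and the XML declaration will differ depending on the processor):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:aws="http://xml.amazon.com/schemas3/">
  <!-- Subject built from the fixed URI string plus the Asin value. -->
  <rdf:Description rdf:about="http://www.amazon.com/exec/obidos/ASIN/B00005Q567">
    <!-- Plain leaf elements: copied with the aws prefix added. -->
    <aws:Asin>B00005Q567</aws:Asin>
    <aws:ProductName>Example Title</aws:ProductName>
    <!-- Container below the root with no attributes: gets rdf:parseType. -->
    <aws:Artists rdf:parseType="Resource">
      <aws:Artist>Example Artist</aws:Artist>
    </aws:Artists>
    <!-- Content starting with "http://": moved into rdf:resource. -->
    <aws:ImageUrlSmall rdf:resource="http://images.example.com/small.jpg"/>
  </rdf:Description>
</rdf:RDF>
```

TotalResults disappears because of the suppression rule, and the ProductInfo and Details tags are dropped while their contents pass through. Note that Details escapes the rdf:parseType rule in this sketch because it carries an attribute; a container with no attributes would match that rule's pattern as well.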

I tested this with both "lite" and "heavy" XML returned by Amazon Web Services for various books, CDs, authors, and bands, and the ARP2 RDF parser had no problem with any of the results. (For authors and bands, though, the RDF isn't quite semantically correct, because all of the triples created by the stylesheet have the same subject, so it makes more sense to use this for Amazon pages that describe a single work such as a book or CD.) For example, with the stylesheet stored at http://www.snee.com/xsl/xml2rdf.xsl, the following REST URL (with carriage returns deleted and a working developer ID substituted for "dev-ID-here") retrieves kosher RDF metadata (saved version here; when viewing with a browser, do a View Source to see the RDF/XML) about the boxed set of Robert Quine's live recordings of the Velvet Underground:

http://xml.amazon.com/onca/xml3?locale=us&t=bobducharmeA
&dev-t=dev-ID-here&AsinSearch=B00005Q567&mode=music
&type=heavy&f=http://www.snee.com/xsl/xml2rdf.xsl

With the appropriate revisions to the bold parts of the stylesheet above, there's a lot of regularly structured XML out there that could be converted to RDF. The great thing about using it on XML returned by Amazon Web Services is that we can execute the XSLT transformation on Amazon's servers, so a single REST URL can retrieve RDF directly from Amazon. This is the power that Amazon has put into our hands by letting us use its server-side XSLT processor with its database.
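
As a sketch of what such revisions might look like, here is the customizable top half of the stylesheet adapted to a hypothetical source document whose root element is catalog, containing a status element and a single book with an isbn child. The element names, the ex prefix, and the example.org URIs are all invented for this illustration; the generic template rules would be pasted in unchanged where the final comment indicates.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                xmlns:ex="http://example.org/catalog/">

  <!-- Hypothetical customization: all names and URIs are assumptions. -->

  <!-- URL of the resource being described. -->
  <xsl:variable name="resourceURL">
    <xsl:text>http://example.org/books/</xsl:text>
    <xsl:value-of select="/catalog/book/isbn"/>
  </xsl:variable>

  <!-- Must match the prefix declared in the start-tag above. -->
  <xsl:variable name="nsPrefix">ex</xsl:variable>

  <!-- Elements to suppress. -->
  <xsl:template priority="1" match="status"/>

  <!-- Containers whose tags are dropped. The priority is set here
       because book has no attributes and would otherwise match the
       generic rdf:parseType rule. -->
  <xsl:template priority="1" match="catalog|book">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Generic xml2rdf template rules go here, unchanged. -->

</xsl:stylesheet>
```

Only four things change from the Amazon version: the namespace declaration, the two variables, and the match patterns of the first two template rules.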

(For more on mapping XML to RDF using XSLT, see Michael Sperberg-McQueen and Eric Miller's Extreme 2004 paper On mapping from colloquial XML to RDF using XSLT.)