XML.com: XML From the Inside Out

Make Your XML RDF-Friendly

6. Be careful about the use of container elements.

The good news is that a given resource can be both the object of one or more RDF statements and the subject of others. For example, the following shows that Bridget Fonda's father is Peter Fonda and that Peter Fonda's father is Henry Fonda. Peter is the object of the statement made by the outer triple and the subject of the inner one.

<Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Bridget">
  <gc:father>
    <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Peter">
      <gc:father>
        <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Henry"/>
      </gc:father>
    </Entertainer>
  </gc:father>
</Entertainer>

There's no limit to the level of nesting, as long as even-numbered elements in the line of descendants are resources and odd-numbered elements are predicates. This alternating relationship is known in RDF circles as striping.
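To make the alternation concrete, here is a minimal Python sketch that walks the Fonda example and labels each element by its depth, counting from zero at the outermost resource. It is an illustration rather than an RDF parser: the stripes helper is hypothetical, the rdf:RDF wrapper and namespace URIs are borrowed from the example under guideline 8 so that the snippet parses on its own, and the roles are assigned purely by depth.

```python
import xml.etree.ElementTree as ET

# The Fonda snippet, wrapped in rdf:RDF with the namespace declarations
# used in the article's guideline 8 example so that it is well-formed.
DOC = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://www.cyc.com/2002/04/08/cyc.daml#"
         xmlns:gc="http://www.daml.org/2001/01/gedcom/gedcom#">
  <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Bridget">
    <gc:father>
      <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Peter">
        <gc:father>
          <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Henry"/>
        </gc:father>
      </Entertainer>
    </gc:father>
  </Entertainer>
</rdf:RDF>
"""


def stripes(element, depth=0):
    """Yield (depth, role, local_name) for an element and its descendants.

    Hypothetical helper: the role depends only on depth, which is the
    alternation that the striping rule describes.
    """
    role = "resource" if depth % 2 == 0 else "predicate"
    local_name = element.tag.split("}")[-1]  # drop the "{namespace}" prefix
    yield depth, role, local_name
    for child in element:
        yield from stripes(child, depth + 1)


root = ET.fromstring(DOC)
# Start below rdf:RDF, so the outermost Entertainer is at depth 0.
rows = [row for top in root for row in stripes(top)]
for depth, role, name in rows:
    print("  " * depth + f"{name} ({role})")
```

Each Entertainer element lands on a resource stripe and each father element on a predicate stripe, which is the pattern the nested example relies on.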

The bad news is that many common uses of container elements throw this striping pattern off. The following example, which omits the document element and namespace declarations, is perfectly good RDF up to the attachments element.

<email rdf:about="msg001">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <date>20021024T081423</date>
  <msgSubject>Dinner tonight</msgSubject>
  <attachments>
    <attachment>data\sample1.txt</attachment><!-- RDF parser chokes here -->
    <attachment>data\sample2.txt</attachment>
  </attachments>
  <cc>frank@snee.com</cc>
</email>

Up to that point, an RDF parser knows that the resource with the ID "msg001" has a from value of "bram@snee.com", a to value of "bela@snee.com", and so on, but what is the attachments value? If its contents were an XML element, it would have to be just one element, with an identifier that named it as a specific resource. Having more than one element -- which is the whole point of the wrapper, because a given e-mail message may have more than one attachment -- is something that RDF can't handle when represented this way. It thinks that the attachments property of the email resource has two properties of its own (the two attachment elements). Properties can't have properties, but resources can.

There are two obvious options for giving this email element the resource-predicate-resource-predicate descendant structure that RDF expects: either remove a layer of containment or add one. Removing the attachments container would make each attachment element a sibling of from, to, and the email element's other children, and email wouldn't have any grandchildren:

<email rdf:about="msg002">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <date>20021024T081423</date>
  <msgSubject>Dinner tonight</msgSubject>
  <attachment>data\sample1.txt</attachment>
  <attachment>data\sample2.txt</attachment>
  <cc>frank@snee.com</cc>
</email>

The problem with this is that you may have a good reason to use that container. For example, when processing your XML e-mail messages using an event-based model such as the SAX API, maybe there's something specific you want to do when you reach the end of the attachment list. How do you know you've reached the end of that list when processing this version of the email element? When you reach the cc element? What if cc is optional? Nothing says "end of attachment list" like an </attachments>.
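As a rough illustration of that point, the following sketch uses Python's standard xml.sax module as a stand-in for whatever SAX implementation you use. The AttachmentListHandler class and its attribute names are hypothetical, and a namespace declaration has been added to the first e-mail example so that it stands alone. With the container in place, the end-of-list logic is a single branch in endElement.

```python
import xml.sax

# The container version of the e-mail example; the raw bytes literal keeps
# the backslashes in the attachment paths intact.
EMAIL = rb"""<email xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
       rdf:about="msg001">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <attachments>
    <attachment>data\sample1.txt</attachment>
    <attachment>data\sample2.txt</attachment>
  </attachments>
  <cc>frank@snee.com</cc>
</email>"""


class AttachmentListHandler(xml.sax.ContentHandler):
    """Collect attachment names and react when the list ends."""

    def __init__(self):
        super().__init__()
        self.attachments = []
        self.in_attachment = False
        self.text = []
        self.list_ended = False

    def startElement(self, name, attrs):
        if name == "attachment":
            self.in_attachment = True
            self.text = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_attachment:
            self.text.append(content)

    def endElement(self, name):
        if name == "attachment":
            self.attachments.append("".join(self.text))
            self.in_attachment = False
        elif name == "attachments":
            # The container's end tag is the unambiguous end-of-list signal.
            self.list_ended = True
            print(f"end of attachment list: {len(self.attachments)} item(s)")


handler = AttachmentListHandler()
xml.sax.parseString(EMAIL, handler)
```

Without the container, the handler would have to infer the end of the list from the next non-attachment element or from the end of the email element itself.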

If you must have a container around your attachment elements, and want to make it proper RDF, one solution is to use one of RDF's specialized container elements. In this case, you can wrap an rdf:Bag element around the attachment elements in the original e-mail example, inside of the attachments element. (In keeping with guideline 2, the attachments element has been given an rdf:ID attribute to make it easier for a parser to refer to it.) The rdf:Bag element describes a container whose contents aren't ordered in any meaningful way. The example's rdf:Bag element has an rdf:ID value of "i2", telling an RDF parser that in addition to having a from property with a value of "bram@snee.com", as well as the other properties we saw, the resource with the ID "msg003" also has an attachments property with resource #i2 as its value. This i2 resource has a type of rdf:Bag, which RDF parsers understand to be a container of unordered content. The i2 resource has one attachment with a value of "data\sample1.txt" and another with a value of "data\sample2.txt". And, unlike the first e-mail example above, this one causes no error message in the RDF parser.

<email rdf:about="msg003">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <date>20021024T081423</date>
  <msgSubject>Dinner tonight</msgSubject>
  <attachments rdf:ID="i1">
    <rdf:Bag rdf:ID="i2">
      <attachment>data\sample1.txt</attachment>
      <attachment>data\sample2.txt</attachment>
    </rdf:Bag>
  </attachments>
  <cc>frank@snee.com</cc>
</email>

In addition to the rdf:Bag container for unordered content, RDF also offers the rdf:Seq element for ordered (or "sequenced") content and the less popular rdf:Alt container to show available alternatives to a specified value.

There is actually a third, even simpler option for converting this email element's structure into something that won't confuse the RDF parser: we can use the rdf:parseType attribute to tell the parser explicitly that the value of the email element's attachments property is itself a resource:

<email rdf:about="msg004">
  <from>bram@snee.com</from>
  <to>bela@snee.com</to>
  <date>20021024T081423</date>
  <msgSubject>Dinner tonight</msgSubject>
  <attachments rdf:parseType="Resource">
    <attachment>data\sample1.txt</attachment>
    <attachment>data\sample2.txt</attachment>
  </attachments>
  <cc>frank@snee.com</cc>
</email>

Think about the original problem: the attachments property of the email element couldn't have its own properties, which is why the RDF parser choked at the first attachment element -- it thought that the document was trying to name a property of a property, which is illegal. Now that the attachments element is explicitly named as a resource, it can have properties, so the RDF parser will have no problem with the two attachment children of this element.
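The difference between the first and last versions of the email element can be sketched as a small check. This is a deliberately narrow illustration, not an RDF/XML parser: the striping_problems function is hypothetical, it tests only the one rule discussed here (a property element may hold several child elements only when it is marked rdf:parseType="Resource"), and the e-mail markup is abbreviated with a namespace declaration added so that it parses on its own.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PARSE_TYPE = f"{{{RDF_NS}}}parseType"

# Abbreviated e-mail example; {attrs} is filled in for each variant.
TEMPLATE = r"""<email xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
       rdf:about="msg">
  <from>bram@snee.com</from>
  <attachments{attrs}>
    <attachment>data\sample1.txt</attachment>
    <attachment>data\sample2.txt</attachment>
  </attachments>
  <cc>frank@snee.com</cc>
</email>"""


def striping_problems(node):
    """Return tags of property elements that break the striping pattern.

    `node` is treated as a resource, so each of its children is a property.
    """
    problems = []
    for prop in node:
        children = list(prop)
        if prop.get(PARSE_TYPE) == "Resource":
            # The value is an anonymous resource, so the children are
            # read as that resource's own properties.
            problems += striping_problems(prop)
        elif len(children) > 1:
            # Several child elements where a single resource is expected.
            problems.append(prop.tag)
        elif children:
            # Exactly one child element: treat it as the value resource.
            problems += striping_problems(children[0])
    return problems


broken = ET.fromstring(TEMPLATE.format(attrs=""))
fixed = ET.fromstring(TEMPLATE.format(attrs=' rdf:parseType="Resource"'))
print(striping_problems(broken))
print(striping_problems(fixed))
```

The first call reports the attachments element and the second reports nothing, mirroring the behavior described above. A real parser enforces many more rules than this.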

7. Eschew mixed content.

Mixed content presents a more advanced version of the problem caused by containers that throw off the striping pattern. Once you see that the resources described in RDF statements must either be siblings of each other or skip an odd number of generations when descendants of each other, and that predicates must be descendants found at the levels between those, it's clear how the typically irregular patterns of mixed content can throw off RDF striping. Mixed content can also put strings of PCDATA in odd places -- or at least in places that seem odd if you're looking for regular recurring patterns.

This doesn't mean that you can't have RDF in a document with mixed content. The "Moby Dick" example at the beginning of this article has mixed content, and the rdf:RDF element showing publishing metadata such as the work's creator and availability date is kept separately in an RDF header section.

RDF statements in a mixed content document can even use elements within the mixed content as resources. The following example has an rdf:RDF header element that contains a made-up imgLink element linking the character in-line element to an image on a remote server.

<article xmlns="http://www.snee.com/ns/dummy#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

  <rdf:RDF>
    <imgLink rdf:about="#c1">
      <image rdf:resource=
           "http://www.keele.ac.uk/depts/as/Literature/Moby-Dick/images/Moby.gif"/>
    </imgLink>
  </rdf:RDF>
  <body>
    <title>Moby Dick</title>
    <para>Call me <character rdf:ID="c1">Ishmael</character>.</para>
    <para>Just don't call me late for supper.</para>
  </body>
</article>

An RDF parser will find the statement linking the character element to the Moby.gif picture and will have no problem with the mixed content along the way.

8. Find an RDF parser to check that your RDF statements are okay.

When learning any new language, you want to be sure that what you think you're saying is really what you're saying. Most RDF parsers make this easy by outputting a subject-predicate-object triple for each RDF statement they find. For example, the W3C's RDF Validation Service turns this document

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:imdb="http://us.imdb.com/Name?"
     xmlns="http://www.cyc.com/2002/04/08/cyc.daml#"
     xmlns:gc="http://www.daml.org/2001/01/gedcom/gedcom#">

  <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Bridget">
    <gc:father>
      <Entertainer rdf:about="http://us.imdb.com/Name?Fonda,%20Peter"/>
    </gc:father>
  </Entertainer>
</rdf:RDF>

into this (carriage returns added):

<http://us.imdb.com/Name?Fonda,%20Peter> 
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 
<http://www.cyc.com/2002/04/08/cyc.daml#Entertainer> .

<http://us.imdb.com/Name?Fonda,%20Bridget> 
<http://www.daml.org/2001/01/gedcom/gedcom#father> 
<http://us.imdb.com/Name?Fonda,%20Peter> .

<http://us.imdb.com/Name?Fonda,%20Bridget> 
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 
<http://www.cyc.com/2002/04/08/cyc.daml#Entertainer> .

Or, in English, using only the URI fragment identifiers:

  • Peter Fonda has a type value of Entertainer.

  • Bridget Fonda has a father value of Peter Fonda.

  • Bridget Fonda has a type value of Entertainer.

In general, using a utility to convert RDF to triples helps you to understand exactly what is being said if you read the subject-predicate-object triple "X, Y, Z" as "X has a Y value of Z." All the natural language descriptions of RDF statements in this article were checked this way.
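That reading rule is mechanical enough to sketch in a few lines of Python. The snippet below takes the three triples shown above, one per line, and prints an "X has a Y value of Z" sentence for each. The short_name helper is a hypothetical heuristic that keeps whatever follows the last "#", "?", or "/" in a URI, and the regular expression handles only triples whose three parts are all URIs, not literal values.

```python
import re
from urllib.parse import unquote

# The validator output from the article, with each triple on one line.
NTRIPLES = """\
<http://us.imdb.com/Name?Fonda,%20Peter> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cyc.com/2002/04/08/cyc.daml#Entertainer> .
<http://us.imdb.com/Name?Fonda,%20Bridget> <http://www.daml.org/2001/01/gedcom/gedcom#father> <http://us.imdb.com/Name?Fonda,%20Peter> .
<http://us.imdb.com/Name?Fonda,%20Bridget> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cyc.com/2002/04/08/cyc.daml#Entertainer> .
"""

# Matches subject, predicate, and object when all three are URIs.
TRIPLE = re.compile(r"<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.")


def short_name(uri):
    """Return the tail of a URI, percent-decoded (a rough heuristic)."""
    tail = re.split(r"[#?/]", uri)[-1]
    return unquote(tail)


sentences = [
    f"{short_name(s)} has a {short_name(p)} value of {short_name(o)}."
    for s, p, o in TRIPLE.findall(NTRIPLES)
]
for sentence in sentences:
    print(sentence)
```

The names come out in the "Fonda, Peter" form used in the IMDb URIs rather than as "Peter Fonda", but the sentences otherwise match the three bullet points above.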

As RDF tools become more widely available and easier to use, you'll have more resources for improving the metadata management of your own data. Even if you're not ready to build serious RDF applications just yet, making more of your own data RDF-friendly will do more than widen the range of applications that can use it. For many people, the kinds of things that RDF is good at become clearer when they see it used with data that is important to their business or to them personally, such as an address or appointment file. Using RDF tools to play with your own data will help you better understand the strong points of RDF and, perhaps, even the strong points of your own data.
