Should Atom Use RDF?

August 20, 2003

Mark Pilgrim

Four Independent Issues

Here are four related but completely independent issues:

  1. The RDF model: statements are triples; use graphs not trees

  2. The RDF/XML serialization: a popular syntax for expressing individual RDF documents

  3. RDF tool support: RDFLib for Python, Drive for .NET, etc.

  4. The Semantic Web

And here are four related but completely independent counterarguments:

  1. The RDF conceptual model is overkill for specific applications, or is always overkill, or is simply the wrong model.

  2. The RDF/XML serialization is wretchedly complex and breaks the "view-source" principle for RDF documents.

  3. No RDF tools exist for my favorite language.

  4. The Semantic Web is an unattainable pipe dream, or is too fluidly defined to ever come about, or something.

The problem with discussing RDF (where that means, "I think this data format should be RDF") is that you can support any of these four RDF issues (model, syntax, tools, vision), in any combination, while vigorously arguing against the others. People who believe that the RDF conceptual model is a good thing may think that the RDF/XML serialization is wretched, or that there are no good RDF tools for their favorite language, or that the Semantic Web is an unattainable pipe dream, or any combination of these things. People who are familiar with robust RDF tools (such as RDFLib for Python) -- and, thus, never have to look at the RDF/XML serialization because their tools hide it from them completely -- may nonetheless think that RDF/XML is wretched. People who defend the RDF/XML syntax may have nothing polite to say about the vision of the Semantic Web. And around and around it goes...

This is a problem with "I think this format should be RDF" discussions. Many people who are thought to be pro-RDF are, in fact, against it in one or more ways (the model is limiting, the syntax is wretched, the tools are buggy or nonexistent, the vision is stupid). And many people who are perceived as anti-RDF are in fact in favor of it in one or more ways (the model is good, the serialization is no more complex than straight XML, the tools work well enough, the Semantic Web is worth the wait).

For the record, I think that the RDF model is sound, the tools work for me, the serialization is wretched, and the Semantic Web is an unattainable pipe dream. If I appear to be wavering over time, sometimes pro-RDF, sometimes anti-RDF, it may be that I'm simply arguing different facets.

RDF and Atom

Why do I bring this up? Because, as it happens, the Atom project is creating a new format for syndicating content and an API for a new web service. For the past week and a half, it has been completely engulfed in an all-out flame war over whether it should use RDF. The discussion has been almost entirely unproductive, because the question "should Atom use RDF?" is really four questions, corresponding to the four issues:

  1. Can Atom benefit from the RDF conceptual model?
  2. Should Atom feeds use the RDF/XML syntax directly?
  3. Can I use RDF tools to consume Atom feeds?
  4. Is Atom part of the Semantic Web?

My answers? Yes, no, it depends, and I don't care.

A Wise Teacher

I sat in on an IRC chat with Sam Ruby, Shelley Powers, Sean Palmer, Joe Gregorio, and others who have contributed heavily to Atom over the past few months. About half of these people are traditionally considered pro-RDF, half anti-RDF; but as you've seen, these simplistic labels are really just another source of confusion, so I won't tell you which person is which. The focus of the chat was to come up with an RDF serialization of Atom by taking the examples from the Atom 0.2 snapshot (which are straight XML) and creating an XSLT transformation into RDF.

During the course of this chat, all four issues (model, syntax, tools, vision) came up. As you might imagine, some discussions were more constructive than others. The discussion of the model was the most constructive, in that it taught us two key things:

  1. Cardinality is vitally important to figure out up front, and the RDF model forces you to figure it out up front. This is a good thing. For example, an Atom <feed> can contain one or more <entry> elements. If you had a feed with one entry, it would look like this in XML:

    <feed version="0.2" xmlns="http://purl.org/atom/ns#">
      <!-- some feed-level metadata omitted for brevity -->
      <entry>
        <title>Atom 0.2 snapshot</title>
        <link>http://diveintomark.org/2003/08/05/atom02</link>
        <id>tag:diveintomark.org,2003:3.2397</id>
        <issued>2003-08-05T08:29:29-04:00</issued>
        <modified>2003-08-05T18:30:02Z</modified>
        <summary>The Atom 0.2 snapshot is out.  Here are some sample feeds.</summary>
      </entry>
    </feed>

    Now suppose you wanted to add a second entry. You just add a second <entry> element:

    <feed version="0.2" xmlns="http://purl.org/atom/ns#">
      <!-- ... -->
      <entry>
        <title>Atom 0.2 snapshot</title>
        <link>http://diveintomark.org/2003/08/05/atom02</link>
        <!-- ... -->
      </entry>
      <entry>
        <title>Atom API primer</title>
        <!-- ... -->
      </entry>
    </feed>

    In other words, straight XML doesn't force you to think about cardinality until it's too late. If you looked at the first example (with only one entry) and said "Aha! A feed has an entry in it!" and went off to write code based on that assumption, you'd be borked when your code hit the second example (with two entries).

    But in RDF, collections of things are always explicit, so a feed with one entry would look like this:

    <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:atom="http://purl.org/atom/ns#"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:dcterms="http://purl.org/dc/terms/">
    <atom:Feed rdf:about="tag:diveintomark.org,2003:3">
      <!-- ... -->
      <atom:entries rdf:parseType="Collection">
        <atom:Entry rdf:about="tag:diveintomark.org,2003:3.2397">
          <dc:title>Atom 0.2 snapshot</dc:title>
          <atom:link rdf:resource="http://diveintomark.org/2003/08/05/atom02"/>
          <dcterms:issued>2003-08-05T08:29:29-04:00</dcterms:issued>
          <dcterms:modified>2003-08-05T18:30:02Z</dcterms:modified>
          <dcterms:created>2003-08-05T12:29:29Z</dcterms:created>
          <dc:description>The Atom 0.2 snapshot is out.  Here are some sample feeds.</dc:description>
        </atom:Entry>
      </atom:entries>
    </atom:Feed>
    </rdf:RDF>

    See the difference? Entries are always wrapped in an <atom:entries rdf:parseType="Collection"> container element. If there's one entry, you get a collection of one; if there are two entries, you get a collection of two. But you know up front that it's a collection.

  2. The other big thing that the RDF model forced us to clarify was the concept of ordering. In XML entries within a feed are in a particular order. Is that order accidental or intentional? This, honestly, is not something we'd given any thought to. The primary use-case for syndicated feeds is that the client parses a number of feeds from different sources and puts all the entries in chronological (or reverse chronological) order. Each entry has a required <modified> date for this purpose, so the issue of the structural order of entries within an individual feed wasn't a big concern.

    However, RDF forces it to be a concern because there are different container types for ordered and unordered lists. Once again the rigorous RDF model forced us to consider this up front, exposing an ambiguity in our current specification. The process of converting Atom-XML into Atom-RDF forced us to clarify these issues in our conceptual model.
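Both lessons carry over to code that reads plain Atom-XML. The following is a minimal sketch, not part of any Atom tool: it uses Python's standard `xml.etree.ElementTree`, the function name `entry_titles_newest_first` is hypothetical, and the second entry's `<modified>` value is invented for illustration because the sample above elides it. It always treats entries as a collection, and it orders them by `<modified>` instead of trusting document order.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://purl.org/atom/ns#"}

# Trimmed sample feed. The second <modified> value is made up for this sketch.
FEED_XML = """
<feed version="0.2" xmlns="http://purl.org/atom/ns#">
  <entry>
    <title>Atom 0.2 snapshot</title>
    <modified>2003-08-05T18:30:02Z</modified>
  </entry>
  <entry>
    <title>Atom API primer</title>
    <modified>2003-08-06T09:00:00Z</modified>
  </entry>
</feed>
"""


def entry_titles_newest_first(feed_xml):
    """Return entry titles ordered by <modified>, newest first."""
    feed = ET.fromstring(feed_xml)
    # findall() always returns a list, so a feed with one entry and a feed
    # with many entries take the same code path. find() would return only
    # the first entry and hide the cardinality problem.
    entries = feed.findall("atom:entry", ATOM_NS)
    # Order by the <modified> date, not by position in the document.
    # Comparing the strings directly is only safe here because every sample
    # timestamp is UTC in the same format; real feeds need date parsing.
    entries.sort(
        key=lambda e: e.findtext("atom:modified", default="", namespaces=ATOM_NS),
        reverse=True,
    )
    return [e.findtext("atom:title", namespaces=ATOM_NS) for e in entries]


print(entry_titles_newest_first(FEED_XML))
```

A reader written this way never depends on how many entries a feed happens to contain or on the order in which they were serialized.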

So is the RDF model a good thing? I think that it is; considering it made our format better, regardless of the syntax.

But the Syntax...

However, as you can see from the above snippets and the full final Atom-RDF prototype, the RDF/XML syntax is far more complex than the equivalent just-XML version. (Depending on your browser, you may need to view the source of either or both of those examples.)

Part of the problem stems from the very thing that RDF is supposed to be good at, namely, reusing and combining ontologies in a single document. You see, we kind of cheated when we created Atom-XML. The specification defines a number of elements (such as <title>) in terms of Dublin Core, but when you look at the actual Atom-XML document, you can see that we really redefined them in the Atom namespace. As a result, the XML version looks simpler at first glance because all the elements are in a single namespace which is defined as the default namespace.

In theory, you could cheat in this same way in RDF and put everything in a single namespace. But then you've pretty much negated one of the main benefits of RDF because you've redefined parts of existing ontologies and made it harder for people to integrate your RDF documents with other RDF data. Now they'll need to transform or map all your redefined elements back to their original ontologies. Since we were creating an XSLT transformation and could make the RDF look like whatever we wanted, we all agreed that we should do the right thing and reuse existing ontologies as much as possible. (This was actually the bulk of the discussion time, bickering about which ontologies to use.)

This highlights the crux of the perennial flame wars about RDF/XML: it can be almost as simple as pure XML. In fact, with a few DTD tricks to default the parseType attributes, it can look virtually identical, but only if you cheat and redefine everything in your own ontology and force everyone else to map it back to other ontologies later. Or you can do the right thing and reuse existing ontologies from the beginning, and then the syntax gets hellishly complex. There's always an additional cost; you can put it wherever you want, but you can't get rid of it.
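To make that deferred cost concrete, here is a rough sketch of the mapping step that consumers inherit when a format redefines shared terms in its own namespace. It is an illustration only: the function name `map_to_shared_vocabularies` is hypothetical, and the table covers just the four element pairs visible in the samples above (title, summary, issued, modified), not a complete or official mapping.

```python
import xml.etree.ElementTree as ET

ATOM = "http://purl.org/atom/ns#"
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"

# Atom-namespace element -> shared-vocabulary element, limited to the pairs
# that appear in the Atom-XML and Atom-RDF samples. Illustrative, not complete.
TO_SHARED = {
    f"{{{ATOM}}}title": f"{{{DC}}}title",
    f"{{{ATOM}}}summary": f"{{{DC}}}description",
    f"{{{ATOM}}}issued": f"{{{DCTERMS}}}issued",
    f"{{{ATOM}}}modified": f"{{{DCTERMS}}}modified",
}


def map_to_shared_vocabularies(entry):
    """Rename an entry's child elements, in place, to their shared-vocabulary names.

    Elements without a known mapping keep their Atom-namespace name.
    """
    for child in entry:
        child.tag = TO_SHARED.get(child.tag, child.tag)
    return entry


entry = ET.fromstring(
    '<entry xmlns="http://purl.org/atom/ns#">'
    "<title>Atom 0.2 snapshot</title>"
    "<id>tag:diveintomark.org,2003:3.2397</id>"
    "</entry>"
)
map_to_shared_vocabularies(entry)
print([child.tag for child in entry])
```

Someone has to write and maintain a table like `TO_SHARED`. The only choice is whether the format's authors do it once or every consumer does it separately.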

So should Atom use the RDF/XML syntax directly? I vote "NO".

The Best of Both Worlds

RDF (the model) is a good thing; RDF (the syntax) is a bad thing. "But," I hear you cry, "I don't care about the syntax because I have good RDF tools!" How can we allow you to use your RDF tools on Atom, and do the right thing with reusing existing ontologies, and keep the syntax simple for people who simply want to parse Atom feeds in isolation, as XML?

We can make the XSLT transformation normative. Here it is, the result of a 4-hour IRC chat. We should include it in the specification, maintain it as the format changes, and mandate that it is the One True Way to use Atom syndicated feeds as RDF.

Is this more work for the RDF folk? Sure. Now they need an XSLT processor as well as their favorite RDF tool. But every platform that has robust RDF tools (a small but growing number) also has robust XSLT tools.

But Atom-as-RDF is not the primary mode of consuming Atom feeds. There are dozens, perhaps more than 100, tools that consume syndication feeds now. Some of them have already been updated to consume Atom feeds and the format hasn't even been finalized yet. Most will be updated once the format is stable. And, to my knowledge, only one (NewsMonster) handles them as RDF, and it already has the infrastructure to transform XML because it does this for six of the seven formats called "RSS" (the seventh is already RDF).

In other words, we're hedging our bets. Whether a vocal minority likes it or not, RDF is very much a minority camp right now. It has a lot to offer -- I saw that first-hand as it forced us to clarify our model -- but it hasn't hit the mainstream yet. On the other hand, it seems perpetually poised to spring into the mainstream. Tool support is obviously critical here (since tools help hide the wretched syntax), and the tools are definitely maturing.

So should Atom be consumed as RDF? It depends. If you want to, and have the right tools, you can. You'll need to transform it into RDF first, but we'll provide a normative way to do that. If you don't want to, then you don't have to worry about it. Atom is XML.
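As a conceptual illustration of what "consuming Atom as RDF" means, the sketch below derives (subject, predicate, object) triples from a plain Atom-XML feed. It is not the XSLT transformation discussed above and makes no claim to match its output. The function name `entry_triples` and the two-property table are assumptions for this example; using the entry's `<id>` as the subject mirrors the `rdf:about` value in the Atom-RDF sample.

```python
import xml.etree.ElementTree as ET

NS = {"atom": "http://purl.org/atom/ns#"}

# Atom element name -> predicate URI. Only two properties, for illustration.
PREDICATES = {
    "title": "http://purl.org/dc/elements/1.1/title",
    "modified": "http://purl.org/dc/terms/modified",
}

FEED_XML = """
<feed version="0.2" xmlns="http://purl.org/atom/ns#">
  <entry>
    <title>Atom 0.2 snapshot</title>
    <id>tag:diveintomark.org,2003:3.2397</id>
    <modified>2003-08-05T18:30:02Z</modified>
  </entry>
</feed>
"""


def entry_triples(feed_xml):
    """Return a list of (subject, predicate, object) tuples, one per known property."""
    feed = ET.fromstring(feed_xml)
    triples = []
    for entry in feed.findall("atom:entry", NS):
        # The entry's <id> identifies the resource the statements are about.
        subject = entry.findtext("atom:id", namespaces=NS)
        for local_name, predicate in PREDICATES.items():
            value = entry.findtext(f"atom:{local_name}", namespaces=NS)
            if value is not None:
                triples.append((subject, predicate, value))
    return triples


for triple in entry_triples(FEED_XML):
    print(triple)
```

A consumer that only wants the XML can stop after `ET.fromstring`; the triples are an optional extra layer derived from the same document.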

What About the Semantic Web?

I don't care about the Semantic Web. Next question?