XML.com: XML From the Inside Out

Introduction

The boom in weblogs has boosted interest in techniques for syndicating news-like material. In response, a family of applications known as aggregators or newsreaders has been developed. These applications consume and display metadata feeds derived from the content. Currently there are two major formats for these data feeds: RSS 1.0 and RSS 2.0. Mark Pilgrim covers these two flavors of RSS in his XML.com article "What is RSS?"

The names are misleading -- the specifications differ not only in version number but also in philosophy and implementation. If you want to syndicate simple news items there is little difference between the formats in terms of capability or implementation requirement. However, if you want to extend into distributing more sophisticated or diverse forms of material, then the differences become more apparent.

The decision over which RSS version to favor really boils down to a single trade-off: syntactic complexity versus descriptive power. RSS 2.0 is extremely easy for humans to read and generate manually. RSS 1.0 isn't quite so easy, as it uses RDF. It is, however, interoperable with other RDF languages and is eminently readable and processible by machines.

This article shows how the RDF foundation of RSS 1.0 helps when you want to extend RSS 1.0 for uses outside of strict news item syndication, and how existing RDF vocabularies can be incorporated into RSS 1.0. It concludes by providing a way to reuse these developments in RSS 2.0 feeds while keeping the formal definitions made with RDF.

RSS 1.0 Terms Have a Formal Definition

RSS 1.0 documents conform to the RDF/XML Syntax Specification. This means that they are expressed in the language described in RDF Concepts and Abstract Syntax, which has the precise formal semantics defined in RDF Semantics. Unless you're a logician or have masochistic tendencies, you probably won't want to follow the path all the way to the formal base. For most developers the RDF Primer contains plenty to get started. The take-home message is that, unlike plain XML, which is just syntax, an RDF/XML document carries well-defined meaning that programs can derive.

There is another part of the RDF specification that we need to consider when talking about RSS 1.0: RDF Schema. In the jargon, the RDF Schema specification defines an ontology language. An ontology gives names to concepts and relationships between those concepts. An ontology is really just a tightly controlled vocabulary; to some extent in this context the words "ontology", "vocabulary", and "schema" are interchangeable (in the RSS world, module is often used to refer to essentially the same thing).

RSS 1.0 may be a format defined in human language in the main specification document, but it is also an ontology that is specified in formal language in the RSS 1.0 RDF schema. Consider the RSS 1.0 snippet below.

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://purl.org/rss/1.0/">

...
<item rdf:about="http://example.com/2003/09/29#9">
<title>The Joy of Blogs</title>
...

The example uses the <item> and <title> terms, which are defined in the schema like this:

...
<rdfs:Class rdf:about="http://purl.org/rss/1.0/item" rdfs:label="Item" rdfs:comment="An RSS item.">
<rdfs:isDefinedBy rdf:resource="http://purl.org/rss/1.0/" />
</rdfs:Class>
<rdf:Property rdf:about="http://purl.org/rss/1.0/title" rdfs:label="Title" rdfs:comment="A descriptive title for the channel.">
<rdfs:subPropertyOf
rdf:resource="http://purl.org/dc/elements/1.1/title" />
<rdfs:isDefinedBy rdf:resource="http://purl.org/rss/1.0/" />
</rdf:Property>
...

The main things being said here are that item in RSS 1.0 is an RDF class and that title is a property. RDF classes more or less correspond to concepts, and properties are used to describe the relationships between those concepts. So returning to our example, it can be demonstrated that there's more being said in the example than is immediately obvious:

...
<item rdf:about="http://example.com/2003/09/29#9">
<title>The Joy of Blogs</title>
...

This says first of all that the resource identified as http://example.com/2003/09/29#9 is an instance of the class item. The RDF/XML syntax provides a specific interpretation of the nesting of the XML, which allows us to determine that the resource has a property title, and the value of the property is the literal string "The Joy of Blogs". This still doesn't seem to offer much advantage over plain XML. But what we have isn't just described in human-readable documentation; it's defined unambiguously throughout, traceable back to the logical formalism of RDF. These semantics allow us not only to make statements about the item but also to reason programmatically with those statements.

The RDF Schema snippet above also says that the title property is a subproperty of the resource http://purl.org/dc/elements/1.1/title, an element defined by the Dublin Core Metadata Initiative. We can then infer from these statements that the literal "The Joy of Blogs" is also related to the item as a Dublin Core title. If, for example, a browser-like application were reading the data but didn't know how to render rss:title, it could reasonably substitute the renderer for dc:title.
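The inference described here can be sketched in a few lines of Python. This is only an illustration over plain tuples, not a real RDF toolkit: the function name infer_superproperties and the dictionary SUB_PROPERTY_OF are made up for the example, and the dictionary assumes each property has at most one superproperty, which RDF Schema itself does not require.

```python
# Illustrative sketch only: rdfs:subPropertyOf inference over plain tuples.
RSS = "http://purl.org/rss/1.0/"
DC = "http://purl.org/dc/elements/1.1/"

# The one schema statement we rely on: rss:title is a subproperty of dc:title.
# Simplifying assumption: a property has at most one superproperty.
SUB_PROPERTY_OF = {RSS + "title": DC + "title"}


def infer_superproperties(triples, sub_property_of):
    """Return the input triples plus those entailed by rdfs:subPropertyOf."""
    inferred = set(triples)
    pending = list(triples)
    while pending:
        subject, predicate, obj = pending.pop()
        parent = sub_property_of.get(predicate)
        if parent is None:
            continue
        entailed = (subject, parent, obj)
        if entailed not in inferred:
            inferred.add(entailed)
            # Re-queue so that chains of subproperties are followed too.
            pending.append(entailed)
    return inferred


item = "http://example.com/2003/09/29#9"
triples = {(item, RSS + "title", "The Joy of Blogs")}
result = infer_superproperties(triples, SUB_PROPERTY_OF)
```

After this runs, result holds both the original rss:title statement and the inferred dc:title statement, so an application that only knows Dublin Core can still find a title for the item.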

What do we gain from all this formal grounding? If RSS processing alone is our universe, maybe not a lot. But as soon as we want to start integrating our RSS with other RDF data, or merge other data into our RSS, we start to reap rewards.

Extending RSS: Software Releases

As an example of extending RSS, we'll take a software company's product announcement RSS feed. Periodically they release updates to their product, and they would like the announcement of the update to be an automated part of the release process. So when a new release build is made, an item will be inserted into their news feed that contains the product name and the release version.

We create an RSS module by defining the properties we need, explaining their usage and associating them with a unique namespace. On the face of it, this is a trivial exercise -- for the update module we can just define a couple of simple elements:

  • product - the name of the product. A character string.

  • version - the version of the release expressed as a string in the format x.y where x is the major version number and y the minor version number.
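As a small illustration of the x.y format, a consumer might parse the version string into a pair of integers. The helper name parse_version is hypothetical, and rejecting anything other than exactly two numeric parts is an assumption about how strictly the format should be read.

```python
import re

# Assumption: exactly two non-negative integers separated by a single dot.
VERSION_PATTERN = re.compile(r"(\d+)\.(\d+)")


def parse_version(text):
    """Split a 'major.minor' string into a (major, minor) tuple of ints."""
    match = VERSION_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError("not in major.minor format: %r" % text)
    return int(match.group(1)), int(match.group(2))


latest = parse_version("2.3")
```

Comparing the resulting tuples orders releases numerically, so 2.10 sorts after 2.3, which a plain string comparison would get wrong.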

For a namespace we just need a URI, ideally one under our control. So if we have registered the domain name supersemantics.com then we could use that as a base. It's a good idea to recommend a prefix to use for the namespace within XML documents, and here we shall use rel.

Here's what this might look like in our RSS 1.0 feed.

  xmlns:rel="http://supersemantics.com/ns/release/" 
...
<item rdf:about="http://supersemantics.com/release/2003/06/19#9">
<title>New Release</title>
<dc:date>2003-06-19T14:02:33+01:00</dc:date>
<rel:product>IronBoard</rel:product>
<rel:version>2.3</rel:version> 
...

The date in RSS 1.0 is expressed using the W3C Date and Time Formats (W3CDTF), a profile of the ISO 8601 standard.

Because the document uses RDF syntax, we are actually saying more here than we would with plain XML. The product and version elements are RDF properties, relating the item resource to literal strings. There are two statements being made here, each of which can be expressed as subject (what's being described), predicate (the property), and object (the value of that property):

http://supersemantics.com/release/2003/06/19#9 rel:product "IronBoard"
http://supersemantics.com/release/2003/06/19#9 rel:version "2.3"

The (subject, predicate, object) statement is an important concept in the RDF world and is usually referred to as a triple. The subject of one triple may be the object of another and vice versa. This means the triples can also be thought of as a joined-up structure, and that structure is the RDF graph.
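To make the triples concrete, here is a rough Python sketch that pulls them out of a cut-down version of the feed using only the standard library. It is not an RDF/XML parser: it handles just the simple pattern used in this article (a node with rdf:about and literal-valued child elements) and ignores the type statement for the item. The function name item_triples and the sample feed are assumptions made for the example; a real application would use an RDF toolkit.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Cut-down feed for illustration; channel and other required parts are omitted.
FEED = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:rel="http://supersemantics.com/ns/release/">
  <item rdf:about="http://supersemantics.com/release/2003/06/19#9">
    <title>New Release</title>
    <rel:product>IronBoard</rel:product>
    <rel:version>2.3</rel:version>
  </item>
</rdf:RDF>"""


def item_triples(rdf_xml):
    """Collect (subject, predicate, object) for literal-valued child elements."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for node in root:
        subject = node.get("{%s}about" % RDF_NS)
        if subject is None:
            continue
        for prop in node:
            # ElementTree writes names as "{namespace}local"; the property URI
            # is the namespace and local name joined together.
            namespace, local = prop.tag[1:].rsplit("}", 1)
            text = (prop.text or "").strip()
            if text:
                triples.append((subject, namespace + local, text))
    return triples


triples = item_triples(FEED)
```

Each tuple in triples has the item URI as its subject, a full property URI such as http://supersemantics.com/ns/release/product as its predicate, and the element text as its object.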

So what's the big deal? The relationship between the item and the product name and version number is already defined. We can load our RSS file into any RDF-aware toolkit (and there are plenty, see Dave Beckett's Resource Guide) and have it immediately know that an item has the properties product and version. We don't need any additional program logic to extend the data model.

If we wish to offer our new module for reuse by others we can, in the same way that the item and title properties are defined in the RSS 1.0 RDF Schema, provide a schema with formal definitions for our terms.

Working with Existing Vocabularies

We noted earlier that the RSS 1.0 title property was actually a subproperty of Dublin Core's title. Some parts of the RSS 1.0 vocabulary such as dc:date and dc:creator are used directly from Dublin Core. Generally speaking it's good practice to use existing vocabularies directly wherever possible, as it's the best route to interoperability. A common scenario is that a general-purpose vocabulary contains a term close to what we're looking for, but our requirement is more specific. The solution here is to define our own term as a subclass or subproperty of the existing term (depending on whether the term applies to an entity or a relationship between entities). Thus the child class (or property) takes on the same characteristics as its parent, in addition to anything specific to the child.

As it happens, there is at least one existing vocabulary designed to describe software releases. In fact, the release schema at eikster.com contains terms called name and version that correspond directly to our product and version. We can inherit their descriptions by making our properties subproperties of them.

There is one significant difference between eikster.com's properties and ours -- their schema provides a Release class, to which the properties apply. Looking back at the RSS 1.0 example, we have our product and version applied to an RSS item -- the resource on the left-hand side of the triples is an item, and on the right-hand side we have a string literal. We can use RDF Schema to say we want the domain (left-hand side) of our properties to be instances of item and the range (right-hand side) to be literals. Note that the domain and range are primarily descriptive; they don't in themselves offer any real constraint of the kind found in W3C XML Schema (WXS). It's up to applications to interpret them as they wish (true constraints can be added using the Web Ontology Language, OWL).
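The descriptive nature of a domain can be illustrated with a short Python sketch. Under RDF Schema semantics a domain statement lets an application infer the type of a subject; it does not reject data that lacks a type declaration. The names DOMAINS and infer_types, and the untyped example resource, are assumptions made for this illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RSS_ITEM = "http://purl.org/rss/1.0/item"
REL = "http://supersemantics.com/ns/release/"

# Domain statements we intend to make in our schema: both properties
# apply to instances of the RSS 1.0 item class.
DOMAINS = {REL + "product": RSS_ITEM, REL + "version": RSS_ITEM}


def infer_types(triples, domains):
    """Return the rdf:type triples entailed by rdfs:domain statements."""
    return {
        (subject, RDF_TYPE, domains[predicate])
        for (subject, predicate, _obj) in triples
        if predicate in domains
    }


# A resource that was never declared to be an item (hypothetical URI).
untyped = "http://supersemantics.com/release/2003/07/01#1"
data = [(untyped, REL + "product", "IronBoard")]
types = infer_types(data, DOMAINS)
```

Nothing is flagged as invalid here. The resource is simply concluded to be an item because it carries a property whose domain is item, which is quite different from the validation behaviour of an XML schema language.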

A few more things that are easy to add to the schema and are likely to be useful are human-readable labels and comments for each property and references to their definition. Including a reference to the definition might seem a little redundant in part of the definition itself, but the statements in an RDF Schema may be used outside of their original context.

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">

<rdf:Property rdf:about="http://supersemantics.com/ns/release/product">
<rdfs:label>Product Name</rdfs:label>
<rdfs:comment>The official name of a software package</rdfs:comment>

<rdfs:subPropertyOf rdf:resource="http://eikster.com/2003/release#name" />
<rdfs:domain rdf:resource="http://purl.org/rss/1.0/item"/>
<rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal"/>
<rdfs:isDefinedBy rdf:resource="http://supersemantics.com/ns/release"/>
</rdf:Property>
<rdf:Property rdf:about="http://supersemantics.com/ns/release/version">
<rdfs:label>Release Version</rdfs:label>
<rdfs:comment>The release version of a software package,
given in major.minor format, e.g. 2.3</rdfs:comment>
<rdfs:subPropertyOf rdf:resource="http://eikster.com/2003/release#version" />
<rdfs:domain rdf:resource="http://purl.org/rss/1.0/item"/>
<rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal"/>
<rdfs:isDefinedBy rdf:resource="http://supersemantics.com/ns/release"/>
</rdf:Property>


</rdf:RDF>

Together with this schema, RDF Schema-aware software that understands eikster.com's Release classes will also be able to understand our RSS items, as we have defined how they relate.
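As a rough sketch of what schema-aware software might do with this, the following Python reads the subPropertyOf statements out of a trimmed copy of the schema and restates our release triples in eikster.com's terms. It uses only the standard library and treats the schema as ordinary XML, which works for this particular layout but is not general RDF/XML parsing. The function names read_subproperties and as_parent_vocabulary are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
REL = "http://supersemantics.com/ns/release/"

# Trimmed copy of the schema: only the subPropertyOf statements are kept.
SCHEMA = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Property rdf:about="http://supersemantics.com/ns/release/product">
    <rdfs:subPropertyOf rdf:resource="http://eikster.com/2003/release#name"/>
  </rdf:Property>
  <rdf:Property rdf:about="http://supersemantics.com/ns/release/version">
    <rdfs:subPropertyOf rdf:resource="http://eikster.com/2003/release#version"/>
  </rdf:Property>
</rdf:RDF>"""


def read_subproperties(schema_xml):
    """Map each property URI to the URI it is declared a subproperty of."""
    root = ET.fromstring(schema_xml)
    mapping = {}
    for prop in root.findall("{%s}Property" % RDF):
        about = prop.get("{%s}about" % RDF)
        parent = prop.find("{%s}subPropertyOf" % RDFS)
        if about and parent is not None:
            mapping[about] = parent.get("{%s}resource" % RDF)
    return mapping


def as_parent_vocabulary(triples, mapping):
    """Restate triples using the parent property wherever one is declared."""
    return [(s, mapping[p], o) for (s, p, o) in triples if p in mapping]


item = "http://supersemantics.com/release/2003/06/19#9"
ours = [(item, REL + "product", "IronBoard"), (item, REL + "version", "2.3")]
theirs = as_parent_vocabulary(ours, read_subproperties(SCHEMA))
```

The list theirs now describes the same item with http://eikster.com/2003/release#name and http://eikster.com/2003/release#version, without any code that knows about our module specifically.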

Bringing RSS 2.0 to the Party

There are various reasons, substantially matters of personal preference, why some may prefer the RSS 2.0 format. If we can map RSS 2.0 with our extension module unambiguously to the equivalent RSS 1.0 version, then we have effectively turned the XML syntax into a task-specific serialization of RDF. We can get all the semantic goodness of RDF in the simple XML packaging of RSS 2.0. This is the approach taken by my project, Simple Semantic Resolution (SSR), which is actually defined as an RSS 2.0 module. A step-by-step description, SSR-Enabling an RSS 2.0 Module, is available, but we have already looked at most of these steps here. What we haven't done yet is define the mapping. In SSR this is done by supplying an XSLT stylesheet that can transform documents using our module in combination with RSS 2.0 into their RSS/RDF counterparts.

A stylesheet is available (thanks to Sjoerd Visscher) that can convert core RSS 2.0 into RSS 1.0, so all that remains is whatever extra work is needed to convert our XML elements and their contents into RDF properties and objects via a syntactic transformation. For our software release module, that extra work is absolutely nothing. Sjoerd's XSLT passes through unchanged any XML that isn't recognised as RSS, and that's exactly what we want for our syntax.

So all we have to do to give instances of our extended RSS 2.0 the RDF semantics is to use SSR to identify the transform that defines the mapping. All this takes is the insertion of an extra element into the RSS just below the root level, so our enriched RSS 2.0 will look like this:


<rss version="2.0"
xmlns:rel="http://supersemantics.com/ns/release/" xmlns:ssr="http://purl.org/stuff/ssr">
<ssr:rdf transform="http://ideagraph.net/xmlns/ssr/source/rss2rdf.xsl" /> ...
<item>
<title>New Release</title>
<pubDate>Thu, 19 Jun 2003 14:02:33 +0100</pubDate>
<link>http://supersemantics.com/release/2003/06/19#9</link>
<rel:product>IronBoard</rel:product>
<rel:version>2.3</rel:version>
...
</rss>

A regular RSS 2.0 client can understand this, as there is no change to the core format.
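To give a feel for the kind of mapping such a transform defines, here is a loose Python sketch that turns one RSS 2.0 item into RSS 1.0 style triples. It is not a reimplementation of the stylesheet. Using the link as the subject URI, mapping pubDate to dc:date, and the function name rss2_item_to_triples are all assumptions made for illustration, and the sample pubDate is assumed to carry a numeric RFC 822 zone offset.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

RSS1 = "http://purl.org/rss/1.0/"
DC = "http://purl.org/dc/elements/1.1/"

RSS2_ITEM = """<item xmlns:rel="http://supersemantics.com/ns/release/">
  <title>New Release</title>
  <pubDate>Thu, 19 Jun 2003 14:02:33 +0100</pubDate>
  <link>http://supersemantics.com/release/2003/06/19#9</link>
  <rel:product>IronBoard</rel:product>
  <rel:version>2.3</rel:version>
</item>"""


def rss2_item_to_triples(item_xml):
    """Map one RSS 2.0 item to (subject, predicate, object) tuples."""
    item = ET.fromstring(item_xml)
    subject = item.findtext("link")  # assumption: the link identifies the item
    triples = []
    for child in item:
        text = (child.text or "").strip()
        if child.tag in ("title", "link"):
            # Core RSS 2.0 elements map onto their RSS 1.0 namesakes.
            triples.append((subject, RSS1 + child.tag, text))
        elif child.tag == "pubDate":
            # RFC 822 date in RSS 2.0 becomes a W3CDTF value for dc:date.
            when = parsedate_to_datetime(text).isoformat()
            triples.append((subject, DC + "date", when))
        elif child.tag.startswith("{"):
            # Namespaced extension elements pass through as properties.
            namespace, local = child.tag[1:].rsplit("}", 1)
            triples.append((subject, namespace + local, text))
    return triples


triples = rss2_item_to_triples(RSS2_ITEM)
```

The extension elements need no special handling: the last branch treats any namespaced element as a property of the item, which mirrors the pass-through behaviour described above.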

Conclusion

RSS 1.0's strong point is its use of the RDF model, which enables information to be represented in a consistent fashion. This model is backed by a formal specification which provides well-defined semantics. From this point of view, RSS 1.0 becomes just another vocabulary that uses the framework. In contrast, outside of the relationships between the handful of syndication-specific terms defined in its specification, RSS 2.0 simply doesn't have a model. There's no consistent means of interpreting material from other namespaces that may appear in an RSS 2.0 document. It's a semantic void. But it doesn't have to be that way since it's relatively straightforward to map to the RDF framework and use that model.

The scope of applications is often extended, and depending on how you look at it, that's either enhancement or feature creep. Either way, it usually means diminishing returns -- the further you get from the core domain, the more additional work is required for every new piece of functionality. But if you look at the web as one big application, then we can get a lot more functionality with only a little more effort.


