
Vox Populi: Web Services From the Grassroots

July 8, 2003

Rich Salz

Last month, Sam Ruby threw the blogging world into a tizzy when he created a wiki to serve as the home for a new syndication format and protocol. This month we'll take a look at the project; its working name is "Necho", though it has been "Echo" and "Pie" at various times. We'll use it to motivate a look at tradeoffs in XML and web services design.

"Syndication" is the term used when a site makes an RSS ("Really Simple Syndication") document available at a URL. For more information about the history of RSS, see Mark Pilgrim's inaugural XML.com column (What is RSS?). In what follows, I mean RSS 2.0 when I write "RSS".

Interest in RSS has been waxing, perhaps because the commercial possibilities are starting to occur to some folks. I doubt it was altruism that made Ruby's boss assign him to this project full-time, for example. The canonical web services example is a stock quote service, and translating that into an RSS feed that reports price updates is an obvious thing to do. Those of you who still have a portfolio worth managing could keep track of price movements by using any of the dozen or so RSS news aggregators that are available. As another example, my family is thinking of moving. I'd pay for a short-term subscription to an RSS feed that contained new sale listings in towns of interest to me.

RSS is described in a specification that is written and maintained by Dave Winer. Winer has been the individual most responsible for creating and proselytizing large portions of the RSS family.

Unfortunately, the RSS 2.0 specification is rather informal and imprecise. For example, here is an item within an RSS feed:

<item>
    <title>Are you <i>Crazy?</i></title>
    <description>"Are you &lt;i&gt;crazy&lt;/i&gt;?"
    she said. "Nobody in their right mind would hand-create HTML
    markup in an RSS example...</description>
    <pubDate>10 Feb 60 11:23 MST</pubDate>
</item>

This example shows some of the problems with the current RSS definition. The description may be either a summary or the complete item, and the spec doesn't say how to tell which it is. If it is the complete item, entity-encoded HTML is allowed, but apparently not XML: imagine an RSS feed that contained another feed's item as its summary. There's also no way to tell whether the content is HTML or plain text. The spec neither makes clear how to write "5 < 7" nor specifies whether the markup tricks in that example are necessary. (Notice the markup difference between the title and description elements.) The date format is the date-time format of RFC 822, amended to allow but not require four-digit years. Not surprisingly, many RSS feeds aren't compliant, since that format is broken (think Y2K, timezones, and so on).
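To make the title/description difference concrete, here is a minimal Python sketch that parses a well-formed, shortened variant of the item above with the standard library's ElementTree. The variable names are illustrative only. It shows that the title carries real XML child elements, while the description arrives as a flat string that still contains literal tags after one round of entity decoding, leaving the consumer to guess whether to treat it as HTML.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed variant of the item above (assumption: the title's
# italic markup is meant to be a properly closed <i> element).
ITEM = """<item>
    <title>Are you <i>Crazy?</i></title>
    <description>"Are you &lt;i&gt;crazy&lt;/i&gt;?" she said.</description>
</item>"""

item = ET.fromstring(ITEM)
title = item.find("title")
description = item.find("description")

# In the title, the markup is part of the XML tree.
title_children = [child.tag for child in title]
title_text = "".join(title.itertext())

# In the description, the parser decodes the entities once and hands back
# a plain string that merely looks like HTML.
description_children = list(description)
description_text = description.text

print(title_children, repr(title_text))
print(description_children, repr(description_text))
```

Nothing in the parsed result says whether the description string should be decoded a second time as HTML or shown as is, which is the ambiguity described above.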

Sometimes RSS uses XML in a way that is rather, well, funky. For example, the guid element is a global identifier, intended to be an opaque string generated by the RSS producer to uniquely identify an RSS item. If, however, the guid element has an isPermaLink attribute with the value true, then the element content is really a permanent URL that points to the item. Attributes are usually best used for metadata. I don't think using an attribute to switch between xsd:string and xsd:anyURI qualifies as metadata.
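A rough Python sketch of what this means for a consumer: the type of the element content cannot be known until the attribute has been inspected. The function name interpret_guid and the sample values are hypothetical. The default of "true" for a missing isPermaLink attribute follows the RSS 2.0 specification.

```python
import xml.etree.ElementTree as ET

def interpret_guid(guid):
    """Return (kind, value) for an RSS 2.0 <guid> element.

    kind is "url" when the content should be read as a permanent link,
    and "opaque" when it is only an identifier string.
    """
    # isPermaLink is optional; the RSS 2.0 spec treats a missing
    # attribute as "true".
    flag = guid.get("isPermaLink", "true").strip().lower()
    value = (guid.text or "").strip()
    return ("url" if flag == "true" else "opaque", value)

# Hypothetical sample elements for illustration.
permalink = ET.fromstring(
    '<guid isPermaLink="true">http://example.com/glob/42</guid>')
opaque = ET.fromstring(
    '<guid isPermaLink="false">371a0eb3-594a-4923-b1c0-8684d3d50f22</guid>')

print(interpret_guid(permalink))
print(interpret_guid(opaque))
```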

Other than requiring every item to have either a title or description, every element within an RSS entry is optional. Additional elements can be added, provided that they appear in their own namespace. In order to support backward compatibility, RSS is not defined in any namespace. That's unfortunate, as it makes versioning very difficult.

Many RSS developers are fond of the Dublin Core Metadata Initiative, which has been working to establish a set of metadata terms for nearly a decade. As a result, RSS pubDate was replaced with the more precisely defined dc:date element. This turned out to be an improper implementation of the spec, although it took a number of individuals several days to determine the exact reason. Apparently it's not valid to replace an existing element with an extension element of similar semantics. Using a technique that can most charitably be described as curious, Winer never publicly explained the exact problem, leaving others to figure it out.
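The practical effect on consumers can be sketched in a few lines of Python. The sample item and its date are made up for illustration; the Dublin Core namespace URI is the standard one. A consumer that knows only the core vocabulary finds no date at all, while one that also understands the extension has to add a fallback lookup by hand.

```python
import xml.etree.ElementTree as ET

# ElementTree spells a namespaced name as {namespace-uri}local-name.
DC_DATE = "{http://purl.org/dc/elements/1.1/}date"

# Hypothetical item that carries dc:date in place of pubDate.
ITEM = """<item xmlns:dc="http://purl.org/dc/elements/1.1/">
    <title>An item dated with Dublin Core</title>
    <dc:date>2003-07-08T00:00:00Z</dc:date>
</item>"""

item = ET.fromstring(ITEM)

# Core RSS elements have no namespace; the extension element does.
tags = [child.tag for child in item]

# A consumer that only knows core RSS sees no date.
core_only = item.findtext("pubDate")

# A more lenient consumer falls back to the extension element.
with_fallback = item.findtext("pubDate") or item.findtext(DC_DATE)

print(tags)
print(core_only, with_fallback)
```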

While this interpretation makes sense, it's important to realize that it prevents any evolution in the RSS core. Any part of RSS which turns out to be underspecified or just plain wrong cannot be phased out. A community consensus to move to something like dc:date can never happen. While the RSS spec always said it was frozen, it wasn't until the community went through this exercise that it realized how frozen.

Certainly a lot of the increased use of RSS is due, as the name says, to its being rather simple to generate. According to Winer, one of the design goals was that anyone with an understanding of HTML could generate an RSS feed pretty quickly. That's a valid goal, but ambiguities like the aforementioned make things correspondingly more difficult for RSS consumers like the news aggregator developers. Whatever the reasons, we're nearly at the point where it's more notable that a content provider doesn't have an RSS feed than when it does.

But if RSS is going to evolve, it had better happen now, while it is on the upswing, before it becomes a commodity, baked into every system, and impossible to change. Here is a portion of a Necho entry:

<entry xmlns="http://example.com/necho">
    <author>
        <name>Rich Salz</name>
        <homepage>http://www.datapower.com</homepage>
    </author>
    <link>http://example.com/glob/42</link>
    <id>371a0eb3-594a-4923-b1c0-8684d3d50f22</id>
    <created>2003-02-05T12:29:29Z</created>
    <issued>2003-02-05T08:29:29-04:00</issued>
    <modified>2003-02-05T12:29:29Z</modified>

There are a couple of things to notice here:

  • Necho is defined in a namespace; the URI isn't specified yet, but the intent is clearly there.
  • There is a lot more metadata; we've shown a partial entry, and the 10 lines above haven't even shown any content.
  • Dates are in a WXS (W3C XML Schema) compatible format.
  • All "overlap" is removed: RSS's guid element is replaced by link and id.
  • All ambiguity is replaced by explicit elements, such as the three different dates.
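The point about dates can be illustrated with a short Python sketch that uses only the standard library. The helper name parse_necho_date is made up for this example. The Necho timestamps parse without guesswork and compare as instants, while the RFC 822 date from the earlier RSS item forces the parser to pick a century for the two-digit year.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

def parse_necho_date(text):
    """Parse an XML Schema dateTime such as 2003-02-05T12:29:29Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat,
    # so spell the UTC offset out explicitly.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))

created = parse_necho_date("2003-02-05T12:29:29Z")
issued = parse_necho_date("2003-02-05T08:29:29-04:00")

# Timezone-aware datetimes compare by instant, so the two spellings
# turn out to describe the same moment.
same_instant = created == issued

# The RSS example date: the year "60" carries no century, so whatever
# year comes back is the parser's guess, not information from the feed.
rss_date = parsedate_to_datetime("10 Feb 60 11:23 MST")

print(same_instant)
print(rss_date.isoformat())
```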

The entry content element is where the biggest difference occurs. The Necho content element has a type attribute to contain the MIME-compatible content-type. This is brilliant, as it allows Necho to smoothly integrate with work on adding attachments to SOAP. It's also multicultural, allowing the xml:lang attribute to specify the language being used. And, finally, multiple content elements act as a MIME multipart/alternative construct, allowing an RSS reader to find the representation it can best support.

Here are some example elements:

    <content type="text/plain" xml:lang="en-us">
    Are you *crazy*?
    </content>
    <content type="text/html" xml:lang="en-us">
    Are you &lt;i&gt;crazy?&lt;/i&gt;
    </content>
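Here is one way a reader might choose among alternative content elements, sketched in Python. The function pick_content and its preference-list argument are assumptions for illustration, not part of any Necho draft, and the namespace URI is the placeholder used in the example entry.

```python
import xml.etree.ElementTree as ET

NECHO = {"n": "http://example.com/necho"}  # placeholder URI from the example
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

ENTRY = """<entry xmlns="http://example.com/necho">
    <content type="text/plain" xml:lang="en-us">Are you *crazy*?</content>
    <content type="text/html" xml:lang="en-us">Are you &lt;i&gt;crazy?&lt;/i&gt;</content>
</entry>"""

def pick_content(entry, preferred_types):
    """Return the first content element whose type the reader supports.

    preferred_types lists MIME types from most to least preferred;
    None is returned when no alternative matches.
    """
    by_type = {c.get("type"): c for c in entry.findall("n:content", NECHO)}
    for mime_type in preferred_types:
        if mime_type in by_type:
            return by_type[mime_type]
    return None

entry = ET.fromstring(ENTRY)
rich = pick_content(entry, ["text/html", "text/plain"])
plain = pick_content(entry, ["application/xhtml+xml", "text/plain"])

print(rich.get("type"), rich.get(XML_LANG), repr(rich.text))
print(plain.get("type"), repr(plain.text))
```

Because each alternative declares its own type, the reader never has to sniff the content to decide how to render it.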

Is this technically better than RSS? It clearly is. The ambiguities are gone, the metadata is more precise, rich and accurate content can now be provided, and the use of XML is quite clean. Unlike RSS, it's feasible to define a schema for Necho. DTDs, XML Schema, and RELAX NG are all in the works. In other words, validation won't require a purpose-built validator. News aggregators and other RSS consumers -- if they are written as XML applications -- should have an easier job of presenting more information to their users. Generating a Necho feed does not look to be that much harder than generating an RSS feed, only requiring the tweaking of a few output statements or templates. Creating a Necho-to-RSS stylesheet in XSLT should be fairly straightforward. So from the technical front, it looks like everyone will win.
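As a rough indication of how small such a conversion could be, here is a Python sketch of the same idea, written with ElementTree rather than XSLT. The element mapping (link to link, id to a non-permalink guid, issued to pubDate, content to description) is an assumption made for this example, not a mapping defined by either format, and the namespace URI is the placeholder from the example entry.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import format_datetime

NECHO = {"n": "http://example.com/necho"}  # placeholder URI from the example

ENTRY = """<entry xmlns="http://example.com/necho">
    <link>http://example.com/glob/42</link>
    <id>371a0eb3-594a-4923-b1c0-8684d3d50f22</id>
    <issued>2003-02-05T08:29:29-04:00</issued>
    <content type="text/plain" xml:lang="en-us">Are you *crazy*?</content>
</entry>"""

def necho_to_rss_item(entry):
    """Build an RSS 2.0 <item> from a Necho <entry> (assumed mapping)."""
    item = ET.Element("item")
    ET.SubElement(item, "link").text = entry.findtext("n:link", namespaces=NECHO)

    # The Necho id is an opaque identifier, so mark it as not a permalink.
    guid = ET.SubElement(item, "guid", isPermaLink="false")
    guid.text = entry.findtext("n:id", namespaces=NECHO)

    # Convert the XML Schema dateTime into the RFC 822 style RSS expects.
    issued_text = entry.findtext("n:issued", namespaces=NECHO)
    issued = datetime.fromisoformat(issued_text.replace("Z", "+00:00"))
    ET.SubElement(item, "pubDate").text = format_datetime(issued)

    # Use the first content element as the description.
    ET.SubElement(item, "description").text = entry.findtext(
        "n:content", namespaces=NECHO)
    return item

rss_item = necho_to_rss_item(ET.fromstring(ENTRY))
print(ET.tostring(rss_item, encoding="unicode"))
```

Going in this direction loses information, such as the content type and language, which is consistent with Necho carrying more metadata than RSS.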

Is it politically and socially better? The jury is still out. Radical format changes rarely win converts, and there are many who believe that the window of opportunity for change has already passed. At the beginning of this column, I said that Necho was defining a protocol as well as a data format. I'll look at that in full detail in the next column.

The blogging community has defined several APIs for distributed manipulation of weblogs. These include the ability to add comments, post trackbacks (essentially a comment that says "my article at this URL linked to you"), and ping servers to inform them of updates. Most of these were quick hacks or first drafts that were eagerly adopted by multiple vendors. Most of these vendors are now interested in developing a new generation of APIs to provide these features -- and others, such as search, archiving, etc. -- in a new and consistent manner.

The wiki has a fairly active discussion about how to best define that protocol. Not surprisingly, I advocate that it be done using SOAP to convey XML documents. As long as the content being delivered is namespace-qualified, SOAP is a surprisingly lightweight messaging envelope:

<S:Envelope xmlns:S="http://www.w3.org/2003/05/soap-envelope">
    <S:Body>
        ...content here...
    </S:Body>
</S:Envelope>
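To show how little is involved, here is a Python sketch that wraps a namespace-qualified payload in that envelope using ElementTree. The function name wrap_in_soap and the sample payload are hypothetical, and the Necho namespace URI is again the placeholder from the earlier example.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"
NECHO_NS = "http://example.com/necho"  # placeholder URI from the example

# Ask ElementTree to serialize the SOAP namespace with the "S" prefix.
ET.register_namespace("S", SOAP_NS)

def wrap_in_soap(payload):
    """Return a SOAP Envelope whose Body contains the given element."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    body.append(payload)
    return envelope

# Hypothetical payload: a minimal namespace-qualified entry.
entry = ET.Element(f"{{{NECHO_NS}}}entry")
ET.SubElement(entry, f"{{{NECHO_NS}}}link").text = "http://example.com/glob/42"

envelope = wrap_in_soap(entry)
print(ET.tostring(envelope, encoding="unicode"))
```

The payload is carried unchanged; the envelope adds two elements and one namespace declaration.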

The fact that the full web services machinery (WSDL, WXS, all those WS-xxx specs) rides atop SOAP doesn't mean that SOAP itself should be avoided. As we'll see next time, using SOAP as the messaging envelope enables all those features but doesn't require them. And along the way, we'll discover where REST becomes less useful.