Identifying Atom

August 18, 2004

Atom is an emerging standard for XML workflow: web editing, web publishing (including syndication), and archiving. It has recently become the basis of an Internet Engineering Task Force (IETF) working group, co-chaired by Paul Hoffman and Tim Bray. (Yes, that Tim Bray.) Since the move to the IETF, the formerly ad hoc but relatively noisy mailing list has exploded into a cacophony of ideas and opinions.

This is a good thing. Making good standards is harder than it looks, and healthy debate helps shake out the kinds of egregious error that make many pseudo-standards difficult or even impossible to implement. In this article, and in the coming months, I'll take an in-depth look at some of the issues that have shaken out on the atom-syntax mailing list.

The Identity Issue

I publish an Atom feed, full of entries. You want to read each entry only once. Or maybe not. Sometimes I update entries after the fact. So maybe you want to read it more than once, but only if it's changed. But how do you know if it's changed? You need the equivalent of a primary key in a database table to know that it's the same entry, before you can compare the new version to the old version. Otherwise you might mistakenly think it was a new entry, and redisplay it. This happens all the time with RSS, and solving this problem is one of the major design goals for Atom.

So what do you use for a primary key? You might think the entry link is a good candidate, but that itself can change, especially if it's auto-generated from the title and the title changes. Many weblog systems have an option to do this. You can't use the title for the same reason. You can't just generate an MD5 hash of the content, since the content may have changed. You can't use the publication date since many systems allow you to change that after the fact. Besides, the publication date is only unique to you; as we'll see in a minute, we need something that's unique across all feeds everywhere.

You really need a separate element that is specifically an identifier, something that never changes, even if everything else does.

But wait, it gets worse. I publish an Atom feed, full of entries. But my entries aren't just published in one place. I have a main feed that contains all my entries, but I also have category-specific feeds that contain a selection of entries. So each entry shows up in at least two places. But some entries are cross-posted to multiple categories, so they might show up in three places or even more.

Then there's Planet Python, which takes feeds from a variety of Python-related news sites and weblogs, and republishes them on a new site. So not only are my entries being published in multiple locations on my own site, they're actually being published on multiple different sites.

You need identifiers that are not just unique, but that are globally unique. If the entry gets republished on a different site, the ID should stay the same, and that shouldn't cause any problems. If the ID changed every time the entry got republished, then you couldn't tell that it was the same entry, and if you were subscribed to both feeds you could end up with duplicate entries in your aggregator. Our primary keys need to be persistent, unchanging, and globally unique.
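To make the primary-key idea concrete, here is a minimal sketch of how an aggregator might use a persistent, globally unique ID to tell new entries from updated ones. The function name classify, the seen dictionary, and the sample ID are all hypothetical; a real aggregator would persist this state and would probably compare more than a content hash.

```python
import hashlib


def classify(seen, entry_id, content):
    """Return 'new', 'updated', or 'unchanged' for an incoming entry.

    `seen` maps entry IDs to a hash of the last content we displayed.
    The ID is the primary key; the hash is only used to detect changes.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    previous = seen.get(entry_id)
    seen[entry_id] = digest
    if previous is None:
        return "new"
    return "unchanged" if previous == digest else "updated"


# Hypothetical ID, used only for illustration.
seen = {}
first = classify(seen, "tag:example.org,2004:entry-1", "Hello")
again = classify(seen, "tag:example.org,2004:entry-1", "Hello")
edited = classify(seen, "tag:example.org,2004:entry-1", "Hello, world")
```

Because the lookup is keyed on the ID alone, the same entry arriving through a category feed or a republishing site is recognized as one entry, as long as the ID survives republication unchanged.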

The most recent version of RSS has an element called <guid> that attempts to solve this problem. However, there are several problems with it:

Older RSS versions don't have it, and even in the latest version, it's still optional. So very few feeds actually have it.
The RSS spec doesn't give clear guidance on how to make a unique identifier, or how unique it really needs to be, or why you would bother. So many publishers generate useless IDs.
It's difficult to compare them, because the data type of the <guid> element isn't stable. If a certain attribute is present and contains a certain value, then the element must be treated as a string. But in other cases, the element must be treated as a URL. As we'll see in a minute, these data types have different rules for equality, so comparing GUIDs is more difficult than it sounds.

The Hidden Complexity of URIs

All of this has been discussed on atom-syntax. Should an ID be required? Yes, everyone agreed early on that entry IDs were useful enough that they should be required. Should they be globally unique? Again, everyone seems to agree that they should. The trouble comes when you ask what form these required, globally unique identifiers should take.

There are two main camps within the Atom community: one believes that identifiers should be URIs, the other believes they should be strings. The benefits of URI identifiers are that many programming languages have classes and libraries to deal with them, compare them, generate them from a set of criteria, and so forth. They are well defined, and lots of thought has gone into using them as identifiers. This is more important than it sounds, because making globally unique identifiers is harder than it sounds. Virtually all of the recent work in this space has built on URIs, and we want to build on that work.

However, URIs come with a major downside: they are difficult to compare. A quick example: http://example.com/ and http://EXAMPLE.COM/ are the same URI, because domain names are case-insensitive. Try them in your browser; one is as good as the other. (Your browser will probably auto-convert it to lowercase for you.) But http://example.com/~smith/ and http://example.com/~SMITH/ are different, because paths are case-sensitive. If you're using URIs as identifiers, you need to know these rules; you need to know how to normalize them.

RFC 2396bis defines the rules you need to know to normalize a URI:

Domain names are case-insensitive (as in the previous example of example.com and EXAMPLE.COM), so convert them all to lowercase.
URIs can contain a port number, but most URI schemes have a default port. For example, the default port for HTTP is 80. So http://example.com:80/ is equivalent to http://example.com/. If the port specified is the default port for that URI scheme, drop it. (Note that this requires knowledge of specific URI schemes.)
Did you know you can also write a URI without a port but with the colon before the port you didn't include? As in http://example.com:/, which is equivalent to http://example.com/.
In some URI schemes you can also include authentication information within the URI, like http://user:password@example.com/. Or simply a user (and get prompted for the password). If the user or password or both are empty, this is equivalent to not specifying them. So http://@example.com/ is equivalent to http://example.com/.
URIs can also contain relative paths, such as http://example.com/foo/../bar/. This is equivalent to http://example.com/bar/.
If the URI has no path (just a domain name), the ending slash is implied. So http://example.com is equivalent to http://example.com/. Note that this is not necessarily the case if a path is present; http://example.com/~smith is not equivalent to http://example.com/~smith/. Some web servers are set up to do a helpful redirect in this case, but for the purposes of identifiers, they are not the same.
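The hexadecimal digits in percent-encoded values are case-insensitive. http://example.com/%7esmith/ and http://example.com/%7Esmith/ are the same. Convert all the percent-encoded values to uppercase.
Some characters are percent-encoded but don't need to be. For example, http://example.com/~smith/ and http://example.com/%7Esmith/ are the same. Decode any percent-encoded character that doesn't need the encoding (letters, digits, hyphen, period, underscore, and tilde) to its ASCII equivalent. Leave encoded reserved characters such as %2F alone, since decoding them would change the meaning of the URI.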
URIs can also contain percent-encoded non-ASCII characters, for example %C3%87. In RFC 2396, these characters are undefined, since there is no language in the spec about what character encoding should be used to interpret them. RFC 2396bis (the successor-in-progress to RFC 2396) makes it clear that these high-bit characters should be treated as UTF-8. So %C3%87 is the UTF-8 representation of the Unicode character U+00C7, the C-Cedilla (Ç). Such characters need to be percent-decoded and converted to Unicode before comparing them.
But wait, it's even worse than that. Unicode is itself complex, and there are multiple ways to encode the same character. For instance, that C-Cedilla (Ç) character? Could be U+00C7. Could be U+0043 U+0327. The second form is called the "decomposed form" and is a sequence of a capital letter C and a sort of half-character that is the cedilla by itself. They combine to form a single character. So either the composed version (called Unicode Normalization Form C) or the decomposed version (called Unicode Normalization Form D) can be percent-encoded and stored in a URI. Are http://example.com/C%CC%A7 and http://example.com/%C3%87 equivalent? Even RFC 2396bis doesn't say, but after percent decoding, converting from UTF-8, and Unicode normalizing, they're character-for-character the same. Yeah, that's obvious...
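The rules above can be sketched in a few dozen lines of Python. This is an illustration under stated assumptions, not a complete or authoritative implementation: the names normalize_uri, normalize_escapes, and remove_dot_segments are my own, the table of default ports is a small illustrative subset, the query and fragment are passed through untouched, and applying Unicode Normalization Form C is one possible answer to the question the spec leaves open.

```python
import re
import string
import unicodedata
from urllib.parse import quote, unquote_to_bytes, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}  # illustrative subset
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
KEEP = "/:?#[]@!$&'()*+,;=%~"  # literal characters quote() must leave alone
PCT_RUN = re.compile(r"((?:%[0-9A-Fa-f]{2})+)")


def normalize_escapes(text):
    """Uppercase escapes, decode unneeded ones, and NFC-normalize UTF-8 text."""
    parts = [""]  # even indexes: literal text, odd indexes: escapes kept as-is
    for i, chunk in enumerate(PCT_RUN.split(text)):
        if i % 2 == 0:
            parts[-1] += chunk
            continue
        try:
            decoded = unquote_to_bytes(chunk).decode("utf-8")
        except UnicodeDecodeError:
            parts += [chunk.upper(), ""]  # not UTF-8: only fix the hex case
            continue
        for ch in decoded:
            if ch in UNRESERVED or ord(ch) > 127:
                parts[-1] += ch  # needless escape, or non-ASCII text
            else:
                parts += ["%%%02X" % ord(ch), ""]  # reserved: stays encoded
    for i in range(0, len(parts), 2):
        composed = unicodedata.normalize("NFC", parts[i])
        parts[i] = quote(composed, safe=KEEP)  # re-encode non-ASCII as UTF-8
    return "".join(parts)


def remove_dot_segments(path):
    """Resolve '.' and '..' segments, e.g. /foo/../bar/ becomes /bar/."""
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == "..":
            if len(out) > 1:  # never pop the leading empty segment
                out.pop()
        elif seg != ".":
            out.append(seg)
    if segments[-1] in (".", ".."):
        out.append("")  # a trailing dot segment leaves a trailing slash
    return "/".join(out)


def normalize_uri(uri):
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # urlsplit already lowercases the host
    if ":" in host:
        host = "[" + host + "]"  # restore brackets around IPv6 literals
    port = parts.port  # None when the port is absent or empty
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host += ":%d" % port
    user, password = parts.username or "", parts.password or ""
    userinfo = user + (":" + password if password else "")
    netloc = (userinfo + "@" if userinfo else "") + host
    path = remove_dot_segments(normalize_escapes(parts.path))
    if netloc and not path:
        path = "/"  # an empty path after a domain name implies "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))
```

With this sketch, normalize_uri("http://EXAMPLE.COM:80/%7esmith/") and normalize_uri("http://example.com/~smith/") return the same string, and so do the two C-Cedilla URIs, while http://example.com/~smith and http://example.com/~smith/ stay distinct.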

One thing is clear: you can't just take two URIs and do strcmp() and compare them byte for byte. URIs aren't just strings; URIs are their own data type, and they have their own comparison rules. This means that, in RSS:

<guid isPermaLink="true">http://EXAMPLE.COM:80/%7Esmith</guid> and
<guid isPermaLink="true">http://example.com/~smith</guid> are the same, because they're URIs (and therefore must be normalized before comparing them), but:

<guid isPermaLink="false">http://EXAMPLE.COM:80/%7Esmith</guid> and
<guid isPermaLink="false">http://example.com/~smith</guid> are different, because they're strings. The only difference is the attribute.

And people wonder why RSS turned my hair gray.
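Here is a rough sketch of what that attribute-dependent comparison asks of an RSS client. The names guid_key and rough_normalize are hypothetical, and rough_normalize is deliberately incomplete: it handles only the three differences in the example above (host case, an explicit port 80 on what it assumes is an HTTP URL, and an encoded tilde). The one RSS fact relied on is that isPermaLink defaults to "true" when the attribute is absent.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit


def rough_normalize(uri):
    """Deliberately partial normalizer covering only the example's cases."""
    parts = urlsplit(uri)
    host = parts.hostname or ""  # urlsplit lowercases the host
    if parts.port not in (None, 80):  # assumes HTTP, whose default port is 80
        host += ":%d" % parts.port
    path = re.sub(r"%7[eE]", "~", parts.path) or "/"
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


def guid_key(guid_xml):
    """Return a comparison key whose data type depends on isPermaLink."""
    elem = ET.fromstring(guid_xml)
    text = (elem.text or "").strip()
    if elem.get("isPermaLink", "true") != "true":
        return ("string", text)  # opaque string: compare as-is
    return ("uri", rough_normalize(text))  # URI: normalize before comparing


a = guid_key('<guid isPermaLink="true">http://EXAMPLE.COM:80/%7Esmith</guid>')
b = guid_key('<guid isPermaLink="true">http://example.com/~smith</guid>')
c = guid_key('<guid isPermaLink="false">http://EXAMPLE.COM:80/%7Esmith</guid>')
d = guid_key('<guid isPermaLink="false">http://example.com/~smith</guid>')
same_as_uris = (a == b)
same_as_strings = (c == d)
```

The first pair compares equal and the second pair does not, even though the element content is identical in both cases.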

The Compromise

So what did the Atom community finally decide to do with identifiers? By an overwhelming margin we decided to make them required, make them URIs, and make publishers normalize them. RFC 2396bis defines a URI canonical form that takes all of the aforementioned variances into account. Any Atom ID that is not in canonical form is an error.

The downside is for publishers, who must ensure that the IDs they generate are in canonical form. But as it turns out, this isn't much of a burden. Most URIs coming out of publishing tools are in canonical form already, and those that aren't can easily be canonicalized. The feed validator can be updated (and will be updated) to check for non-canonical URIs in atom:id elements, to help highlight and pinpoint bugs in publishing software.

The upside is for clients, because now it becomes much, much easier to compare Atom IDs. Given that atom:id's sole purpose in life is to be compared to other atom:ids, it seemed reasonable to optimize for this. URI-savvy clients that use URI comparison libraries to compare Atom IDs will work fine, but clients that naively use simple-string comparison to compare Atom IDs will also work.

Ultimately, the strongest selling point for this solution was the principle of least surprise. Let's assume you know nothing about the intricacies of RFC 2396bis or URIs or IDs. When you initially look at an Atom feed, you see that every entry has an <id>, and if you bother to read the spec, you'll see that it's defined as "a globally unique identifier for the entry."

At this point, you quite reasonably assume that you can take that value, throw it in a string column of a database (or whatever), and use it to compare to other IDs. But then someone comes along and tells you that simple-string comparison is not enough, that URLs look just like strings but are really their own data type, and that you actually need to write a 10-step function just to compare a thing that isn't used for anything except comparison ... surprise!

Making publishers normalize the URI ahead of time takes away that surprise, and lets clients do what they thought they could do in the first place. You can't do that with RSS. Really Simple Syndication is really only simple if you're doing it incorrectly. We're striving to make Atom simple to do correctly.