XML.com: XML From the Inside Out

"There must have been a moment, at the beginning, where we could have said ... no. Somehow we missed it. Well, we'll know better next time."
"Until then ..."
-- Rosencrantz and Guildenstern are Dead

Introduction

XML is predicated on two main strengths. First, publishers can use any character in the world, natively and efficiently, by declaring the character encoding that makes the most sense for their content. English publishers can use an ASCII encoding and store each character in one byte. But Russian publishers can use an encoding such as KOI8-R, and they can store their Cyrillic characters each in one byte. Publishers that don't know what language they'll be dealing with can use "universal" encodings like UTF-8 or UTF-16. Second, consumers can use off-the-shelf parsing libraries that can assume well-formed XML as input.

Finally, an end to the "tag soup" hell that infected the HTML publishing world, where every parser handles ill-formed markup differently and nobody is wrong. Draconian error handling, we were and are assured, is crucial for interoperability.

XML has had many successes. I just finished writing a 400-page book in DocBook XML. It is over one megabyte of well-formed XML, and I use off-the-shelf tools to transform it into HTML, PDF, and a number of other formats. This was one of the original use cases for XML, way back in 1997 when SGML ruled the Earth and dinosaurs like ISO-8879 were gospel. It works; I love it; I recommend it for any serious content creator.

The other apparent success of XML is the rise of syndicated feeds, in a range of XML vocabularies like CDF, RSS, and now Atom. There are feed directories; there are feed search engines; there are feed validators. Every major blogging tool -- from freeware like Blogger to commercial software like Movable Type to open source software like WordPress -- every tool publishes at least one of these XML formats.

Syndicated feeds are wildly popular, but they're not a success for XML. XML on the Web has failed: miserably, utterly, and completely.

The Primacy of HTTP

What's the main strength of XML for publishers? Publish in any character encoding. I use ASCII, you use Big-5 (Chinese), he uses EUC-KR (Korean). XML handles them all. Just declare your encoding at the top of your feed, like this:

<?xml version="1.0" encoding="koi8-r"?>

Well, not quite. That's how it works when an XML document is stored on disk. But HTTP has its own way of declaring the character encoding, which looks like this:

Content-Type: text/xml; charset="koi8-r"

So if we're serving XML over HTTP, suddenly we have two places to look for the character encoding. Not a problem if they're both the same, but what if they're not? The answer is simple and universal: when an XML document is served over HTTP, HTTP's method always wins. The XML specification even admits this, in appendix F:

When multiple sources of information are available, their relative priority and the preferred method of handling conflict should be specified as part of the higher-level protocol used to deliver XML. In particular, please refer to [IETF RFC 3023] or its successor, which defines the text/xml and application/xml MIME types and provides some useful guidance.
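To make the two competing sources concrete, here is a minimal Python sketch that pulls the encoding out of each one. The function names are hypothetical, and the regular expression only handles ASCII-compatible encodings (it will not find a declaration in a UTF-16 document), so treat it as an illustration rather than a complete detector.

```python
import re
from email.message import Message

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    charset = msg.get_param("charset")  # None when the parameter is absent
    return charset.lower() if charset else None

# Matches an XML declaration at the very start of the document and
# captures the value of its encoding pseudo-attribute.
_XML_DECL = re.compile(
    rb'^<\?xml\s+version\s*=\s*["\'][^"\']*["\']'
    rb'\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def encoding_from_xml_declaration(body):
    """Return the encoding named in the XML declaration, or None."""
    match = _XML_DECL.match(body)
    return match.group(1).decode("ascii").lower() if match else None

http_says = charset_from_content_type('text/xml; charset="koi8-r"')
xml_says = encoding_from_xml_declaration(
    b'<?xml version="1.0" encoding="koi8-r"?><feed/>'
)
```

Here both sources agree on koi8-r. The trouble starts when they disagree, or when one of them returns None.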

But wait, there's more. In both HTTP and XML, the encoding declaration is optional. So now we run into cases where the XML file declares an encoding but the HTTP headers do not, or vice versa. This is actually very common, especially in the case of feeds. Why? Many publishing systems don't generate feeds dynamically. Instead, they generate the feed once, and then they cache it in a file (probably with a ".xml" or ".rdf" extension) and let the underlying web server actually serve it when anyone comes along and asks for it.

This is efficient, since feeds are requested over and over, and the web server is much faster at serving static files than it is at calling a script that happens to spit out a feed as output. It may also be required. Consider the case of Blogger: the publishing system is on a central server, but the feeds (and the rest of your site) are served from a completely different server. (When you make a change, Blogger regenerates all the necessary files and transfers them to your server over FTP.)

Stay with me, this is important.

So we're being efficient by caching feeds as static files. But the underlying web server doesn't know anything about the encoding preferences of one specific application, so it always serves ".xml" files without an explicit character encoding in the HTTP headers. That sounds like a good thing. It's certainly faster and less error-prone than introspecting into the XML document each time before serving it. And presumably the publishing system knows the correct character encoding, so it should have stored it in the encoding attribute of the XML declaration on the first line of the document when it cached the feed. The encoding isn't getting lost; it's just stored in one place instead of another.

So what's the problem?

Enter RFC 3023

RFC 3023 (XML Media Types) is the problem. RFC 3023 is a valiant attempt to sort out the precedence mess between HTTP and XML, while still respecting all the other software that allows the Internet to work. (More on that in a minute.) RFC 3023 is the one and only place to learn how to determine the character encoding of an XML document served over HTTP.

According to RFC 3023, if the media type given in the Content-Type HTTP header is application/xml, application/xml-dtd, application/xml-external-parsed-entity, or any one of the subtypes of application/xml such as application/atom+xml or application/rss+xml or even application/rdf+xml, then the character encoding is determined in this order:

  1. the encoding given in the charset parameter of the Content-Type HTTP header, or
  2. the encoding given in the encoding attribute of the XML declaration within the document, or
  3. utf-8.

On the other hand, if the media type given in the Content-Type HTTP header is text/xml, text/xml-external-parsed-entity, or a subtype like text/AnythingAtAll+xml, then the encoding attribute of the XML declaration within the document is ignored completely, and the character encoding is:

  1. the encoding given in the charset parameter of the Content-Type HTTP header, or
  2. us-ascii.
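The two lists above can be sketched as a single Python function. It follows the simplified rules as summarized here, not the full text of RFC 3023 (for instance, it does not look at byte order marks). The function name and its parameters are assumptions made for illustration.

```python
APPLICATION_SUBTYPES = {"xml", "xml-dtd", "xml-external-parsed-entity"}
TEXT_SUBTYPES = {"xml", "xml-external-parsed-entity"}

def rfc3023_encoding(media_type, http_charset=None, xml_encoding=None):
    """Pick the effective encoding for an XML document served over HTTP.

    media_type   -- e.g. "text/xml" or "application/atom+xml"
    http_charset -- charset parameter of the Content-Type header, if any
    xml_encoding -- encoding attribute of the XML declaration, if any
    Returns None when the media type is not covered by the two rule sets.
    """
    major, _, subtype = media_type.strip().lower().partition("/")
    plus_xml = subtype.endswith("+xml")

    if major == "application" and (subtype in APPLICATION_SUBTYPES or plus_xml):
        # HTTP charset first, then the XML declaration, then utf-8.
        return http_charset or xml_encoding or "utf-8"

    if major == "text" and (subtype in TEXT_SUBTYPES or plus_xml):
        # The XML declaration is ignored completely.
        return http_charset or "us-ascii"

    return None
```

The same koi8-r feed therefore gets two different answers depending on the media type alone: rfc3023_encoding("application/rss+xml", None, "koi8-r") yields "koi8-r", while rfc3023_encoding("text/xml", None, "koi8-r") yields "us-ascii".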

There is actually a good reason for this second set of rules. There are things called "transcoding proxies," used by ISPs and large organizations in Japan and Russia and other countries. A transcoding proxy will automatically convert text documents from one character encoding to another. If a feed is served as text/xml, the proxy treats it like any other text document, and transcodes it. It does this strictly at the HTTP level: it gets the current encoding from the HTTP headers, transcodes the document byte for byte, sets the charset parameter in the HTTP headers, and sends the document on its way. It never looks inside the document, so it doesn't know anything about this secret place inside the document where XML just happens to store encoding information.
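A toy simulation shows why such a proxy makes the in-document declaration untrustworthy. This is only a sketch in Python: the header dictionary, the function name, and the sample feed are hypothetical, and a real proxy works on actual HTTP messages.

```python
def transcode_at_http_level(headers, body, target="utf-8"):
    """Re-encode a text/* body using only the HTTP charset parameter."""
    source = headers["charset"]
    new_body = body.decode(source).encode(target)   # never inspects the markup
    new_headers = dict(headers, charset=target)     # only the header is updated
    return new_headers, new_body

original = (
    '<?xml version="1.0" encoding="koi8-r"?><title>Привет</title>'
).encode("koi8-r")

headers, body = transcode_at_http_level(
    {"type": "text/xml", "charset": "koi8-r"}, original
)

# The header now tells the truth; the XML declaration no longer does.
header_charset = headers["charset"]
declaration_untouched = b'encoding="koi8-r"' in body
readable_as_utf8 = "Привет" in body.decode("utf-8")
readable_as_koi8r = "Привет" in body.decode("koi8-r")
```

After the proxy has done its work, the bytes are UTF-8 and the header says so, but the document still claims to be koi8-r. A client that trusted the declaration would read garbage.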

So there's a good reason, but this means that in some cases -- such as feeds served as text/xml -- the encoding attribute in the XML document is completely ignored.

Strike 1.

The Curse of Default Content-Types

Remember that scenario of caching feeds as static ".xml" files, and letting the web server serve them later? Well, until very recently, every web server on Earth was configured by default to send static ".xml" files with a Content-Type of text/xml with no charset parameter. In fact, Microsoft IIS still does this. Apache was fixed last November, but depending on your configuration, the fix may not be installed over an existing installation when you upgrade.

According to RFC 3023, every single one of those cached feeds being served as text/xml has a character encoding of us-ascii, not the encoding declared in the XML declaration.
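What that means in practice can be checked with a few lines of Python. This is a rough sketch, not the method used for the survey that follows; the helper name and the two sample feeds are made up for illustration.

```python
def fits_us_ascii(body):
    """Return True if every byte of the body is valid us-ascii."""
    try:
        body.decode("us-ascii")
    except UnicodeDecodeError:
        return False
    return True

# Carefully declared and correctly encoded, but full of non-ASCII bytes.
russian_feed = (
    '<?xml version="1.0" encoding="koi8-r"?><title>Привет</title>'
).encode("koi8-r")

# Declares a non-ASCII encoding but happens to contain only ASCII bytes.
english_feed = b'<?xml version="1.0" encoding="iso-8859-1"?><title>Hello</title>'

russian_ok = fits_us_ascii(russian_feed)   # False: ill-formed when served as bare text/xml
english_ok = fits_us_ascii(english_feed)   # True: survives the downgrade to us-ascii
```

Served as text/xml with no charset parameter, the first feed is ill-formed no matter what its XML declaration says, while the second one gets by only because it never uses a character outside ASCII.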

How bad is this problem? I recently surveyed 5,096 active feeds from Syndic8.com. The results were astonishing.

Description                     Status        Number   Percentage
text/plain                      ill-formed      1064          21%
text/xml and non-ASCII          ill-formed       961          19%
Mismatched tags                 ill-formed       206           4%
text/xml but only used ASCII    well-formed     2491          49%
Other                           well-formed      374           7%

Over 20% of the feeds I surveyed weren't even served with an XML content type, so they're automatically ill-formed. The vast majority of the rest were served as text/xml with no charset parameter. These could be well-formed, but RFC 3023 says they must be treated as us-ascii. But many contained non-ASCII characters, so they were not well-formed. They were properly encoded, but not properly served. That is, they were encoded in some non-ASCII character encoding that the publisher carefully declared in the XML declaration -- which must be ignored because of the HTTP Content-Type.

Strike 2.

The Tools Will Save Us

There's one more piece to this puzzle: client software. Remember those wonderful "off-the-shelf" libraries that are supposed to reject ill-formed XML documents? After all, "draconian error handling" is one of the fundamental principles of XML. Since determining the actual character encoding of an XML document is a prerequisite for determining its well-formedness, you would think that every XML parser on Earth supported RFC 3023.

And you'd be wrong. Dead wrong, in fact. None of the most popular XML parsers support RFC 3023. Many can download XML documents over HTTP, but they don't look at the Content-Type header to determine the character encoding. So they all get it wrong in the case where the feed is served as text/* and the HTTP encoding is different from the XML encoding.
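A small experiment with Python's standard xml.etree.ElementTree illustrates the gap. It is used here only because it is convenient, not as a claim about any particular parser. When handed raw bytes, it has no way of knowing what the HTTP headers said, so it trusts the XML declaration. The parse_as_text_xml helper is a hypothetical sketch of what a stricter client would have to do for a text/xml response.

```python
import xml.etree.ElementTree as ET

# A feed body declared and encoded as iso-8859-1 (sample data is made up).
body = (
    '<?xml version="1.0" encoding="iso-8859-1"?><title>café</title>'
).encode("iso-8859-1")

# Lenient path: the parser only sees bytes, so it trusts the declaration.
lenient_title = ET.fromstring(body).text

def parse_as_text_xml(body, http_charset=None):
    """Strict path for text/xml: only the HTTP charset counts."""
    effective = http_charset or "us-ascii"
    text = body.decode(effective)  # raises UnicodeDecodeError if the bytes do not fit
    # Re-encode as UTF-8 and have the parser override the declared encoding.
    parser = ET.XMLParser(encoding="utf-8")
    return ET.fromstring(text.encode("utf-8"), parser=parser)

try:
    parse_as_text_xml(body)  # served as text/xml with no charset
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True
```

The lenient path happily returns the title, while the strict path rejects the same bytes because they are not us-ascii. Passing http_charset="iso-8859-1" makes the strict path succeed, which is exactly the information a bare text/xml response leaves out.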

Which is the case 90% of the time. Well, 89%, but who's counting?

Strike 3.

Postel's Law Has Two Parts

Postel's Law was defined in RFC 793 as "be conservative in what you do, be liberal in what you accept from others."

The first part of Postel's Law applies to XML. Publishers are responsible for generating well-formed XML. But think about what that actually requires. Think about Blogger, generating a feed, and moving it to an unknown server that's not under Blogger's control. If publishers want to be strict in what they produce, and they want or need to publish feeds as static files, then they need to programmatically ensure that the feeds will still be well-formed if they are later served as text/xml.

In other words, publishing systems need to encode all of their feeds as us-ascii. (They could use numeric character references to represent non-ASCII characters, but what a cost! 4-8 bytes per character, instead of 1. That's some serious bloat.) So much for the promise of being able to use any character encoding you want.
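The size penalty is easy to measure. The following Python sketch uses the standard "xmlcharrefreplace" error handler to turn non-ASCII characters into numeric character references; the helper name and the sample string are assumptions made for illustration, and the helper does not escape markup characters such as & or <.

```python
def ascii_safe(text):
    """Encode text as us-ascii, replacing other characters with &#NNNN; references."""
    return text.encode("ascii", errors="xmlcharrefreplace")

word = "Привет"

native = word.encode("koi8-r")   # one byte per Cyrillic character
escaped = ascii_safe(word)       # one reference per Cyrillic character

native_size = len(native)
escaped_size = len(escaped)
```

For this six-letter word, the koi8-r form takes 6 bytes and the ASCII-safe form takes 42, since each of these characters becomes a seven-byte reference such as &#1055;.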

But Postel's Law has two parts, and XML was supposed to take the second half ("be liberal in what you accept from others") and throw it out the window. Clients are supposed to be strict in what they accept, and reject any XML that's not well-formed.

But clients are not being strict. They're all liberal; they're all buggy. Remember those 961 feeds I surveyed that were served as text/xml but were actually using non-ASCII characters? And the 1,064 feeds served as text/plain? All the popular XML client libraries say they're well-formed, even though they're not. So much for draconian error handling.

What's worse, most publishers aren't aware of it. Publishers rely on the widespread bugginess of the most popular XML client libraries, and they continue to blithely publish XML documents over HTTP with a Content-Type that should downgrade their feeds to us-ascii, but in reality doesn't. They're not being strict in what they publish, but it works out, because clients aren't being strict in what they accept.

The entire world of syndication only works because everyone happens to ignore the rules in the same way. So much for ensuring interoperability.

Conclusion

XML on the Web has failed. Miserably, utterly, completely. Multiple specifications and circumstances conspire to force 90% of the world into publishing XML as ASCII. But further widespread bugginess counteracts this accidental conspiracy, and allows XML to fulfill its first promise ("publish in any character encoding") at the expense of the second ("draconian error handling will save us").

Let this fester for a few years and suddenly we find that we couldn't fix the problem even if we wanted to. Forty percent of the syndicated feeds in the world would simply disappear if clients suddenly started living up to their end of XML's anti-Postel bargain.

Perhaps XML's successor can do better. "Well, we'll know better next time ... Until then ..."


