XML.com: XML From the Inside Out


Subject: Wrong Assumptions, Wrong Facts, Wrong Conclusions
Date: 2004-07-23 01:02:14
From: rjelliffe

This article is 100% upside down.


It is not XML that has failed; it is the encoding declarations on text/* that have failed: both for the practical reason that browser vendors do not respect them (they variously use guesswork, hard-coded settings, selection from the last few encodings encountered, defaults and protocol settings) and for the reason that APIs that generate text/* do not integrate with web servers (e.g. to produce the necessary local config file for Apache that would tell it what encoding label to use).


That text/* was unworkable in this regard is not news: in fact, it was the reason that XML had to introduce its own mechanism in the first place. When I proposed the encoding algorithm for XML, this was one of the problems we were trying to fix.


Similarly, when we worked on RFC 3023, some people wanted to make XML use text/* only, but during discussion the various characteristics of text/* and application/* became apparent. In particular, note that RFC 3023 recommends against using text/* for any XML that is not meant to be read as an "unprocessed, source XML document" by "casual users": almost no XML should fall into that category. Anyone who sends XML as text/* when it is intended for use by an agent, or to be read as text by non-casual humans, is not following RFC 3023. (The particular problems of transcoding of Japanese data were, of course, very clear in the mind of Murata Makoto, one of the editors of that RFC.)


Recently, Sir Tim and the W3C TAG have made this even clearer, recommending "In general, a representation provider SHOULD NOT assign Internet media types beginning with "text/" to XML representations." See http://www.w3.org/TR/webarch/#no-text-xml


The XML Recommendation also now states: "If an XML entity is in a file, the Byte-Order Mark and encoding declaration are used (if present) to determine the character encoding."
(I would class a representation at a URL as a kind of file: a named sequence of bytes; as distinct from characters in RAM, for example.)
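
To make the quoted rule concrete, here is a minimal sketch in Python of determining the encoding from the bytes alone. The function name sniff_xml_encoding is hypothetical, and the sketch is deliberately simplified: it assumes either a BOM or an ASCII-compatible encoding, and it does not handle UTF-32 or BOM-less UTF-16, which the full detection procedure in the XML Recommendation does cover.

```python
import codecs
import re

# Byte-Order Marks checked first. UTF-32 is left out to keep the sketch short.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches an encoding declaration at the very start of an ASCII-compatible entity.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity from its first bytes (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No BOM: assume an ASCII-compatible encoding and read the declaration.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Neither BOM nor declaration: XML's default is UTF-8.
    return "utf-8"
```

Nothing here consults a MIME header or server configuration: the person who wrote the file is the only one who has to get the label right.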


So given that XML was designed because text/* was known to be unworkable 10 years ago, and that RFC 3023 recommends against using text/* for things like RSS, and that the TAG has recently reaffirmed this, what the hell is the opening quote about? We didn't "miss it"!


It is not a problem for XML or RFC 3023 that MIME encodings are frequently/usually wrong. That was the status quo when we developed XML: the innovation in XML was to provide a more reliable mechanism rather than to sit around expecting people to get their webservers to label their data correctly. (Indeed, that is another problem with text/*: the people who produce files or text streams are often not the same people authorized to set up the webserver. It creates a co-ordination problem that application/* with no encoding label does not have.)


So I think the premise of the article (that text/* is workable, therefore any problem is XML's) is wrong, and the conclusion (that we are all forced to use ASCII) is doubly wrong. Follow what the XML recommendation says, what RFC 3023 says, and what the TAG says: use application/* and stick to the encoding declaration.
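
As a rough illustration of why the media type matters, the following Python sketch encodes my reading of the RFC 3023 rules: a charset parameter, when present, is authoritative; text/* without one defaults to US-ASCII regardless of what the document says; application/* without one defers to the document's own BOM and encoding declaration. The function name effective_charset and the use of None as the "look inside the document" signal are assumptions made for this example.

```python
from typing import Optional


def effective_charset(media_type: str, charset_param: Optional[str]) -> Optional[str]:
    """Return the charset dictated by the transport, or None if the receiver
    should use the BOM / encoding declaration in the XML itself.

    Hypothetical helper; a simplified reading of RFC 3023, not a full parser
    for Content-Type headers.
    """
    if charset_param:
        # An explicit charset parameter overrides the encoding declaration.
        return charset_param.strip().lower()
    if media_type.strip().lower().startswith("text/"):
        # text/* with no charset: the default is US-ASCII, and the
        # encoding declaration inside the document is ignored.
        return "us-ascii"
    # application/* with no charset: the document describes itself.
    return None
```

Under these rules, a UTF-8 feed served as text/xml with no charset is formally US-ASCII, while the same bytes served as application/xml with no charset are read according to their own declaration.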


Figures demonstrating that people cannot set their web servers correctly do not undermine XML: that was one of the very reasons why we introduced the encoding declaration in the first place! On the contrary, such figures show that our approach was correct.


The article says "The entire world of syndication only works because everyone happens to ignore the rules in the same way. So much for ensuring interoperability." Err, yes. XML was born into a world where those rules were not being followed, and there was no chance that they would be followed reliably, then or in the future. Isn't that the point that RFC 3023 makes?


There is another wrinkle: when XML is sent as text/* and the encoding declared by any source does not accord with the byte patterns found, then the document is not well-formed. The more redundant (i.e. spare) code points there are, the more chance that these kinds of errors can be detected. This is why XML 1.1 got it exactly right in banning the C1 control characters U+0080 to U+009F (apart from NEL, U+0085) from appearing literally: doing so creates enough redundancy to detect many common encoding problems.
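
A small Python sketch of the kind of check this redundancy makes possible: decode the bytes with the declared encoding and report any C1 control characters that turn up literally. The function name find_c1_controls is hypothetical, and the example assumes the classic case of Windows-1252 data mislabeled as ISO-8859-1.

```python
from typing import List, Tuple


def find_c1_controls(data: bytes, declared_encoding: str) -> List[Tuple[int, str]]:
    """Decode with the declared encoding and list (index, code point) for each
    literal C1 control, skipping NEL (U+0085). Hypothetical helper."""
    text = data.decode(declared_encoding)
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if 0x80 <= ord(ch) <= 0x9F and ord(ch) != 0x85
    ]


# Curly quotes saved as Windows-1252 but labeled ISO-8859-1 decode to C1 controls.
mislabeled = "\u201cquoted\u201d".encode("cp1252")
print(find_c1_controls(mislabeled, "iso-8859-1"))
```

Genuine ISO-8859-1 text almost never contains these code points, so a non-empty result is strong evidence that the label is wrong.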


One other point: it seems to me that the "facts" are unreliable because the methodology is bogus. If 80% of the feeds at Syndic8 are English (or Bahasa Indonesia, Bahasa Malaysia, Swahili, Hawaiian and the other ASCII-using languages) then of course you will get results that 80% of XML documents use ASCII only. You simply cannot say that because one American English-language site with mainly American English RSS feeds does not have much non-ASCII XML data, therefore non-ASCII XML has failed. If I went to a Japanese aggregator and found that no-one delivered ASCII-only XML, I would not be justified in claiming that XML has failed for ASCII data!


Another real mistake in the article comes at the end: Mark invents a first promise of XML "publish in any character encoding". As the guy who proposed XML's character encoding-detection algorithm, XML 1.0's ERCS-based naming rules, XML 1.1's control exclusion rules, XML's hex numeric character references and (with Gavin Nicol) the idea of using any character encoding on top of Unicode, I think I am qualified to stress that guaranteed interoperability of XML documents in any encoding was never the intent nor the promise of XML: XML processors only need to support ASCII, UTF-8 and UTF-16, after all.

