Is A Broken Encoding Declaration An Error?

This sentence represents the outcome of a whole lot of passionate debate. Here's the problem: when you transmit something over the network (the two best-known recipients are email programs and Web browsers), you normally send along some extra information saying how it's encoded. This is usually packaged up in something called a MIME header (the "HTTP headers" used on the Web are very similar).

This means that an XML entity might signal its encoding in two different places: internally, with a Byte Order Mark or an encoding declaration, and externally, with one of these headers. While these should normally be in agreement, sometimes they aren't. For example, some Web servers actually change the encoding of some pages as they transmit them; if this happens (it's called "transcoding"), the encoding declaration will probably become wrong.
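To make the two signals concrete, here is a minimal Python sketch that pulls out the internal label (Byte Order Mark or encoding declaration) and the external one (the charset parameter of a Content-Type header) and checks whether they agree. The function names are made up for illustration, and the declaration sniffing assumes an ASCII-compatible encoding; a real processor has to do more work for encodings such as UTF-16 or EBCDIC.

```python
import codecs
import re
from email.message import Message

# Matches an encoding declaration at the very start of the entity.
# Assumption: the entity is in an ASCII-compatible encoding.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def internal_label(data: bytes):
    """Encoding the entity claims for itself, or None if it makes no claim."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    m = _DECL.match(data)
    return m.group(1).decode("ascii") if m else None


def external_label(content_type: str):
    """Charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def same_encoding(a: str, b: str) -> bool:
    """Compare two encoding names, treating known aliases as equal."""
    try:
        return codecs.lookup(a).name == codecs.lookup(b).name
    except LookupError:
        # Unknown to Python: fall back to a case-insensitive comparison.
        return a.lower() == b.lower()


def labels_agree(data: bytes, content_type: str) -> bool:
    """True unless both labels are present and name different encodings."""
    inner = internal_label(data)
    outer = external_label(content_type)
    if inner is None or outer is None:
        return True  # only one signal (or none), so nothing to contradict
    return same_encoding(inner, outer)
```

For an entity declaring UTF-8 that arrives with "text/xml; charset=ISO-8859-1", labels_agree reports a disagreement; it does not say which of the two labels is the wrong one.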

Also, some Web servers are just broken, and do stupid things like claiming that everything is in ISO-8859-1 without even checking to see whether this is true. In these cases, the encoding declaration would be right, the header wrong.

Thus this sentence. What it says is that while an incorrect encoding declaration is an error, if an entity comes down the pipe with an external label that allows a processor to read it, then the processor is not allowed to toss it on the floor because of a broken encoding declaration. In other words, the receiver of the entity shouldn't be penalized because an upstream Web server did something stupid.
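One way a receiver might act on this rule is sketched below in Python: if an external label is present, use it to decode, and treat a conflicting encoding declaration as something to report rather than a reason to reject the entity. The function name, the warning text, and the fallback to UTF-8 when neither label is present are assumptions of this sketch, and the declaration sniffing only works for ASCII-compatible encodings.

```python
import re

# Assumption: the declaration itself is readable as ASCII.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def decode_entity(data: bytes, external_charset=None):
    """Decode an entity, letting the external label win over the declaration.

    Returns (text, encoding_used, warnings). A mismatch between the two
    labels produces a warning, not an exception.
    """
    m = _DECL.match(data)
    declared = m.group(1).decode("ascii") if m else None

    # External label first, then the declaration, then a UTF-8 default.
    chosen = external_charset or declared or "utf-8"
    text = data.decode(chosen)

    warnings = []
    if external_charset and declared and external_charset.lower() != declared.lower():
        warnings.append(
            f"encoding declaration says {declared!r} "
            f"but external label says {external_charset!r}"
        )
    return text, chosen, warnings
```

The name comparison here is deliberately naive (case-insensitive only), so two aliases for the same encoding would produce a spurious warning; that is harmless, since the warning never stops the entity from being read.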

While this rule is sensible, it can lead to some problems. Suppose some document is in "Shift_JIS" and has an encoding declaration saying so. Then it gets served out by some Web server that translates it into "EUC-JP" and sends along a header saying (correctly) that it has done so. In this case, presumably the XML processor reads the entity correctly using the header, ignores the (now incorrect) encoding declaration, and everything is fine. Fine, that is, until the user saves the file to disk. Now it has a broken encoding declaration, but there's no header to help work around the problem. The moral of this story is that the program that read the entity should probably adjust the encoding declaration before saving it.

You can get the same kind of error for an entity which was in pure ASCII and (legally) has no encoding declaration; if it gets transcoded into something else, the absence of the declaration becomes an error.
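Here is a rough Python sketch of the repair suggested above, covering both a stale declaration and a missing one: before the decoded text is written out, its declaration is rewritten (or added) so that it names the encoding actually used for the saved bytes. The function names are invented for this example, and prepending a full XML declaration with version="1.0" when none exists is an assumption that only suits a document entity.

```python
import re

_XML_DECL = re.compile(r'^<\?xml\s[^>]*\?>')
_ENC_ATTR = re.compile(r'(encoding\s*=\s*)(["\'])[^"\']*\2')
_VERSION_ATTR = re.compile(r'(version\s*=\s*(["\'])[^"\']*\2)')


def with_encoding_declaration(text: str, encoding: str) -> str:
    """Return text whose XML declaration names the given encoding."""
    m = _XML_DECL.match(text)
    if m is None:
        # No declaration at all (legal for an ASCII/UTF-8 entity): add one.
        return f'<?xml version="1.0" encoding="{encoding}"?>\n' + text

    decl = m.group(0)
    if _ENC_ATTR.search(decl):
        # Replace the stale encoding name, keeping the original quote style.
        decl = _ENC_ATTR.sub(
            lambda k: f"{k.group(1)}{k.group(2)}{encoding}{k.group(2)}",
            decl,
            count=1,
        )
    else:
        # Declaration present but silent about encoding: add it after version.
        decl = _VERSION_ATTR.sub(
            lambda k: f'{k.group(1)} encoding="{encoding}"',
            decl,
            count=1,
        )
    return decl + text[m.end():]


def serialize_entity(text: str, encoding: str) -> bytes:
    """Bytes that are safe to save: declaration and byte encoding match."""
    return with_encoding_declaration(text, encoding).encode(encoding)
```

A program that received the transcoded page would call serialize_entity(text, "EUC-JP") and write the result to disk, so the saved file no longer depends on a header that has since disappeared.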

There is another problem, even more pernicious. It turns out that there are some encoding declaration errors that just can't be detected. For example, suppose a general entity stored in EBCDIC didn't declare its encoding and was referenced in the middle of an ASCII XML document. If the EBCDIC didn't contain any bytes that looked like < or &, there would be no way for the XML processor to spot the error.
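A small Python experiment shows the effect. It uses Python's cp037 codec as a stand-in for EBCDIC and assumes the referencing document is being read with an 8-bit ASCII superset such as ISO-8859-1; a strict 7-bit ASCII decoder would at least notice the high bytes. The helper name is made up for this sketch.

```python
def passes_unnoticed(data: bytes, assumed_encoding: str) -> bool:
    """True if data decodes without error and contains no markup characters.

    Such an entity gives the processor nothing to complain about, even if
    the decoded text is nonsense.
    """
    try:
        text = data.decode(assumed_encoding)
    except UnicodeDecodeError:
        return False
    return "<" not in text and "&" not in text


# An undeclared entity whose bytes are really EBCDIC (code page 037).
ebcdic_bytes = "Hello world".encode("cp037")

# Read as ISO-8859-1, it decodes cleanly into the wrong characters.
garbled = ebcdic_bytes.decode("latin-1")
```

Here passes_unnoticed(ebcdic_bytes, "latin-1") is true even though garbled bears no resemblance to the original text, which is exactly the kind of error no processor can catch.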

To summarize: the processor should work with whatever information is available to figure out how to decode external entities, and system architects should bear in mind that on the World Wide Web, you can never be 100% sure of avoiding being bitten by encoding errors.

Copyright © 1998, Tim Bray. All rights reserved.