Language Identification Is Vital

XML documents are made up mostly of text, which, in most cases, carries a message encoded in human language. In practical terms, you can do very little that's useful with text if you don't know what language it's written in. Things you can't do properly in a language-oblivious way include:

This may be, on the face of it, rather surprising; doesn't XML use Unicode, and doesn't Unicode encode all the world's characters in an unambiguous way? That's true, and in fact the use of Unicode is one of XML's big advantages; but nonetheless, experience has shown that you still need to know about languages. That's why XML has the xml:lang attribute.
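To make the attribute concrete, here is a minimal sketch in Python of how a program might find out which language each piece of text is in. The sample document, its element names, and the helper `texts_with_language` are all invented for illustration. The sketch relies on the rule that an xml:lang value applies to an element's content, including its children, unless a descendant overrides it.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the reserved "xml:" prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample document: the root declares English, one child overrides it.
SAMPLE = """<doc xml:lang="en">
  <p>The quick brown fox</p>
  <p xml:lang="de">Der schnelle braune Fuchs</p>
</doc>"""


def texts_with_language(element, inherited=None):
    """Yield (language, text) pairs, passing xml:lang down to children."""
    # An element's own xml:lang wins; otherwise it inherits from its parent.
    lang = element.get(XML_LANG, inherited)
    text = (element.text or "").strip()
    if text:
        yield lang, text
    # Simplification: text that follows a child element (its "tail") is ignored.
    for child in element:
        yield from texts_with_language(child, lang)


pairs = list(texts_with_language(ET.fromstring(SAMPLE)))
for lang, text in pairs:
    print(lang, text)
```

A language-sensitive tool would take each pair as its starting point; text with no xml:lang anywhere above it comes back with a language of `None`, which is exactly the language-oblivious situation described above.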


Copyright © 1998, Tim Bray. All rights reserved.