Character Encodings in XML and Perl
This article examines the handling of character encodings in XML and Perl. We will look at what character encodings are and how they relate to XML. We will then move on to how encodings are handled in Perl, and end with some practical examples of translating between encodings.
Encodings! The hidden face of XML. For most people, at least here in the US, XML is simply a data format that specifies elements and attributes, and how to write them properly in a nice tree structure.
But the truth is that, in order to encode text or data, you first need to specify an encoding for it. The most common of all encodings (at least in Western countries) is without a doubt ASCII. Other encodings you may have come across include the following: EBCDIC, which will remind some of you of the good old days when computer and IBM meant the same thing; Shift-JIS, one of the encodings used for Japanese characters; and Big 5, a Chinese encoding.
What all of these encodings have in common is that they are largely incompatible. There are very good reasons for this, the first being that Western languages can live with 256 characters, encoded in 8 bits, while Eastern languages use many more, thus requiring multi-byte encodings. Recently, a new standard was created to replace all of those various encodings: Unicode, a.k.a. ISO 10646. (Actually they are two different standards -- ISO 10646 from ISO and Unicode from the Unicode Consortium -- but they are so close that we can consider them equivalent for most purposes.)
Unicode and UTF-8
Unicode is a standard that aims to replace all other character encodings by providing an all-encompassing, yet extensible, scheme. It includes characters in Western as well as Asian languages, plus a whole range of mathematical and technical symbols (no more GIFs for Greek letters!), as well as extension mechanisms to add even more characters when needed (apparently new Chinese characters are created each year).
On Unix systems, Unicode is usually encoded in UTF-8, which is another layer of encoding that allows any POSIX-compliant system to process it. UTF-8 characters also display properly in any recent web browser, no matter the platform, provided that the Unicode fonts installed on your computer include the appropriate characters. A useful page to look at is the UTF-8 Sampler.
Most systems now come with Unicode fonts, but don't be too excited -- a lot of those fonts are incomplete, tending to cover only the usual Western character sets. The default Windows 98 Unicode fonts, for example, include exactly one Asian character, although they do include Hebrew. (Unfortunately, Windows displays UTF-8 Hebrew left-to-right instead of right-to-left.)
Incidentally, much to the delight of our English and American readers, the 128 characters of ASCII are encoded identically in UTF-8, so plain ASCII text requires no conversion, no special processing... nothing! It just works.
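To make the ASCII compatibility concrete, here is a minimal sketch that prints the UTF-8 bytes of a few strings. It assumes the Encode module, which ships with Perl 5.8 and later (not 5.6), and `utf8_hex` is a helper name invented for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 bytes of a string as hex pairs.
sub utf8_hex {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $ascii_hex  = utf8_hex('XML');        # plain ASCII letters
my $accent_hex = utf8_hex("\x{E9}");     # U+00E9, e with acute accent
my $euro_hex   = utf8_hex("\x{20AC}");   # U+20AC, euro sign

print "XML    -> $ascii_hex\n";    # 58 4D 4C (same bytes as ASCII)
print "U+00E9 -> $accent_hex\n";   # C3 A9    (two bytes)
print "U+20AC -> $euro_hex\n";     # E2 82 AC (three bytes)
```

The ASCII letters come out as the same single bytes they always were, while characters outside ASCII take two or more bytes.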
My purpose here is not to describe Unicode in great detail, so if you want more information, have a look at the Unicode Home Page, or at the UTF-8 and Unicode FAQ for Unix/Linux, which gives plenty of information on the subject.
Encodings and XML
The XML specification (which you can find on this web site, along with Tim Bray's excellent comments) does not force you to use Unicode. You can declare any encoding in the list defined by IANA (the Internet Assigned Numbers Authority).
The only small problem is that XML processors, as per the specification, are only required to understand two encodings: UTF-8 and UTF-16 (an alternate way of encoding Unicode characters using 16-bit units, so that most characters take two bytes).
For example, expat, the parser used by the XML::Parser module, natively understands UTF-8, UTF-16, and also ISO-8859-1 (also known as ISO Latin 1), which covers most Western European and African languages, with the obvious exception of Arabic. It can be extended to accept even more encodings, but more on that later.
An important feature of expat is that it converts any string it parses into UTF-8. No matter what you put in, as long as it is in a known encoding, what you get out is UTF-8.
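The conversion can be observed with a short sketch. It assumes XML::Parser is installed and a Perl recent enough to mark strings as character data; the document and variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

# A small document declared, and actually encoded, as ISO-8859-1.
# \xE9 is the single Latin-1 byte for "e with acute accent".
my $latin1_doc = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n}
               . qq{<name>Ren\xE9</name>};

my $text = '';
my $parser = XML::Parser->new(
    Handlers => {
        # The Char handler receives character data after expat's conversion.
        Char => sub { my ($expat, $string) = @_; $text .= $string },
    },
);
$parser->parse($latin1_doc);

# Copy the string and reduce the copy to raw bytes to inspect its UTF-8 form.
my $bytes = $text;
utf8::encode($bytes);

printf "%d characters in, %d bytes as UTF-8\n", length($text), length($bytes);
```

The single Latin-1 byte for the accented letter goes in, and the two-byte UTF-8 sequence comes out, so the byte count is one higher than the character count.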
Perl and Unicode
Perl currently (as of 5.6) offers full UTF-8 support. This means that, among other things, you can now use regular expressions on UTF-8 strings without having to worry about multi-byte characters wreaking havoc with your processing.
All the functions that assumed a character was one byte and one byte only have been updated to behave properly even when working with characters of different lengths. For a more detailed description of what Unicode support means to Perl (and some of its shortcomings), look at Simon Cozens' "What's New in Perl 5.6.0?"
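A minimal sketch of the difference between counting bytes and counting characters follows. It uses the Encode module from later Perl releases (5.8 and up) rather than the 5.6 facilities discussed here, so treat it as an illustration of the idea, not of the 5.6 API.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "caf" followed by e-acute (C3 A9 is one character).
my $bytes = "caf\xC3\xA9";
my $chars = decode('UTF-8', $bytes);

my $byte_length = length $bytes;   # 5: length counts bytes here
my $char_length = length $chars;   # 4: length counts characters here

# "." matches exactly one character, so the pattern fits the decoded string...
my $matches_chars = $chars =~ /\Acaf.\z/ ? 1 : 0;
# ...but not the raw bytes, where the accented letter occupies two positions.
my $matches_bytes = $bytes =~ /\Acaf.\z/ ? 1 : 0;

print "bytes: $byte_length, characters: $char_length\n";
print "pattern matches characters: $matches_chars, bytes: $matches_bytes\n";
```

Once the string is treated as characters, `length` and the regular expression engine both see the accented letter as a single unit.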
So everything is fine, right? We get our documents in Unicode, at worst in some other encoding known by expat, which converts it to UTF-8, which we then process using a Unicode-aware Perl... easy! Well, not quite.
Many XML applications interface with other software, DBMSs, text editors, etc., that do not grok Unicode. So the XML applications will need to accept non-Unicode input, and often output non-Unicode too, to feed it back to the rest of the environment.
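The round trip this implies can be sketched as follows: decode the legacy bytes into Perl characters, process them, then encode for whichever consumer is on the other side. This again assumes the Encode module from Perl 5.8 and later, and the sample data is invented.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input from a non-Unicode tool: "Zo" plus e-diaeresis in ISO-8859-1.
my $latin1_in = "Zo\xEB";

# Decode into characters, do some processing, then encode for each consumer.
my $chars      = decode('ISO-8859-1', $latin1_in);
my $processed  = uc $chars;                          # uppercases the accented letter too
my $latin1_out = encode('ISO-8859-1', $processed);   # for the legacy application
my $utf8_out   = encode('UTF-8', $processed);        # for Unicode-aware consumers

printf "Latin-1 output: %d bytes, UTF-8 output: %d bytes\n",
    length($latin1_out), length($utf8_out);
```

The same three characters occupy three bytes in ISO-8859-1 and four in UTF-8, which is why the two outputs cannot simply be swapped for one another.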
In the next section we will look at the Perl tools we can use to work with documents in various encodings.