XML::Parser and Character Encodings

April 26, 2000

Overview

The XML::Parser model, derived from the expat model, is that no matter what the original document encoding is, the data forwarded to the calling software will be in UTF-8.

Natively, XML::Parser accepts only UTF-8, UTF-16, and ISO-8859-1. In order to be able to process documents in other encodings, you will need to add an encoding table, defined using the XML::Encoding module.
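To make the model concrete, here is a minimal sketch, assuming XML::Parser (and the expat library underneath it) is installed. The sample document and the variable names are made up for illustration. It feeds the parser a small ISO-8859-1 document and dumps the bytes that the Char handler receives; the e-acute, a single byte (E9) in the source, should come out as the two UTF-8 bytes C3 A9.

```perl
use strict;
use warnings;
use XML::Parser;

# A tiny ISO-8859-1 document: \xE9 is e-acute as one Latin1 byte.
my $doc = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<word>caf\xE9</word>};

# Collect whatever the parser hands to the Char handler.
my $text = '';
my $parser = XML::Parser->new(
    Handlers => {
        Char => sub { my ($expat, $string) = @_; $text .= $string },
    },
);
$parser->parse($doc);

# Recent XML::Parser releases flag the string as characters, so turn it
# back into raw UTF-8 bytes before dumping it as hex.
my $bytes = $text;
utf8::encode($bytes) if utf8::is_utf8($bytes);
my $hex = unpack 'H*', $bytes;
print "$hex\n";
```

The text is accumulated across calls because the parser is free to deliver character data in several chunks.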

In order to output data in other encodings, you can use the XML::UM module, the Unicode::String module, or a Perl transliteration (a tr///). Another (riskier but faster) option is to use an XML::Parser method that gives you the original string the parser saw prior to UTF-8 conversion.

We will cover all of these techniques in this section.

Defining new encodings with XML::Encoding

The XML::Encoding module can be used to add more encodings to XML::Parser.

XML::Encoding provides the encoding maps that XML::Parser needs in order to parse documents in non-native encodings. Just specifying use XML::Encoding; gives you access to all the encoding maps in the XML::Encoding source directory.

The currently defined encodings include Big5 (traditional Chinese); ISO-8859-2 through ISO-8859-9 (which cover the European languages, Cyrillic, Arabic, and Hebrew, although it seems that ISO-8859-6, which encodes Arabic characters, cannot be used by XML::Parser); and variants of x-euc-jp and x-sjis (both Japanese; make sure you read the Japanese_Encodings.msg file in the XML::Encoding distribution to understand why there are variants and which one you should use). An important encoding that is still missing is the simplified Chinese GB encoding, which is used in China (as opposed to Big5, which is used in Taiwan).

Other encodings can be added, as explained in the documentation of XML::Encoding.
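As an illustration of a non-native encoding, the following sketch parses a small ISO-8859-2 document. It assumes that an iso-8859-2 encoding map is installed where XML::Parser looks for its maps; if the map is missing, the parse dies with an unknown-encoding error. The sample document is invented for the example. In ISO-8859-2 the byte B3 is the Polish l-with-stroke (U+0142), which is C5 82 in UTF-8.

```perl
use strict;
use warnings;
use XML::Parser;

# "zloty" spelled with l-with-stroke: a single byte (B3) in ISO-8859-2.
my $doc = qq{<?xml version="1.0" encoding="ISO-8859-2"?>\n<word>z\xB3oty</word>};

my $text = '';
my $parser = XML::Parser->new(
    Handlers => {
        Char => sub { my ($expat, $string) = @_; $text .= $string },
    },
);

# The encoding map is looked up from the name in the XML declaration.
$parser->parse($doc);

# Normalize to raw UTF-8 bytes, then dump as hex.
my $bytes = $text;
utf8::encode($bytes) if utf8::is_utf8($bytes);
my $hex = unpack 'H*', $bytes;
print "$hex\n";
```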

The Unicode::String module

The Unicode::String module allows conversion between Unicode and Latin1 strings.

Converting a string from UTF-8 to ISO-8859-1 (Latin1) is very simple:

use Unicode::String;

# wrap the UTF-8 encoded string, then read it back as Latin1
$u = Unicode::String::utf8($utf8_encoded_string);
$latin1 = $u->latin1;

The use of the Unicode::String module is deprecated in Perl 5.6, as there is a simpler way to perform the same operation:

$string =~ tr/\0-\xff//UC;
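On Perl versions that bundle the Encode module (an assumption: Encode is not one of the modules discussed in this article), the same UTF-8 to Latin1 conversion can be sketched as below. This is an alternative technique rather than a replacement for the code above, and the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for "cafe" with an e-acute: the accent is the pair C3 A9.
my $utf8_bytes = "caf\xC3\xA9";

# Decode the UTF-8 bytes into Perl characters, then re-encode as Latin1.
my $chars  = decode('UTF-8', $utf8_bytes);
my $latin1 = encode('ISO-8859-1', $chars);

# In Latin1 the e-acute is the single byte E9.
printf "%s\n", unpack('H*', $latin1);
```

Characters that have no Latin1 equivalent cannot be represented this way; by default encode substitutes a replacement character for them.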

The XML::UM module

The XML::UM module uses the maps that come with XML::Encoding to perform the reverse operation. It creates mapping routines that encode a UTF-8 string in the chosen encoding.

This module is still in alpha state, but it is certainly worth trying. It would also be worth recoding it in C so it can be faster.

Warning: the version of XML::UM in libxml-enno-1.02 has an installation problem. To fix it, once you have downloaded and uncompressed the module, and before running perl Makefile.PL, edit the UM.pm file in the lib/XML directory and replace the $ENCDIR value with the location of your XML::Encoding maps (it should be /usr/local/src/XML-Encoding-1.01/maps or /opt/src/XML-Encoding-1.01/maps/).

A typical usage scenario would be:

use XML::UM;

# create the encoding routine (only once!)
$encode = XML::UM::get_encode(Encoding => 'big5');

# convert $utf8_string to big5 (get_encode returns a code reference)
$encoded_string = $encode->($utf8_string);

The XML::Code module

An interesting way to encode documents, at least for Western languages like French, German, or Spanish, is to use only basic ASCII (characters in the 0-127 range) and encode everything else using character references (entities whose name is the Unicode code point of the character: an e-acute becomes "&#233;", for example). One important limitation is that XML prohibits character references in element and attribute names, so you are limited to basic ASCII there. Still, this is an easy way to work with Unicode characters in Perl and still be able to store the documents in a non-Unicode-aware system.

I know of no module on CPAN that offers this functionality in a standalone fashion, but Enno Derksen's XML::DOM incorporates it. So, I just extracted the code from DOM.pm and packaged it in XML::Code. This is a very simple module that just encodes CDATA, PCDATA, and tags. It should actually be part of XML::UM, and will be in the future.
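The core of the technique fits in a single substitution. The sketch below is not the XML::Code interface: to_char_refs is a hypothetical helper written for this illustration, and it assumes its argument is a string of Perl characters (already decoded), not raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper (not the XML::Code API): replace every character
# outside the 0-127 range with a decimal character reference.
sub to_char_refs {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# \x{E9} is e-acute, \x{153} is the oe ligature.
my $encoded = to_char_refs("caf\x{E9} et c\x{153}ur");
print "$encoded\n";   # prints: caf&#233; et c&#339;ur
```

The result is pure ASCII, so it can be stored or transmitted by systems that know nothing about Unicode, and any XML parser will turn the references back into the original characters.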

Using the original_string method

If all else fails, and if you don't need to use regular expressions, a last resort method is to use the original_string method of XML::Parser.

Called from a handler, this method returns the exact string matched by the parser before entity expansion and UTF-8 conversion. (The recognized_string method returns the string after those two operations have taken place.)
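The difference can be seen by recording both strings from the same handler. This is a minimal sketch, assuming XML::Parser is installed; the sample document and variable names are invented. The string passed to the Char handler has been converted to UTF-8, while original_string returns the bytes exactly as they appear in the ISO-8859-1 source.

```perl
use strict;
use warnings;
use XML::Parser;

# ISO-8859-1 document: the e-acute is the single byte E9.
my $doc = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<word>caf\xE9</word>};

my ($converted, $original) = ('', '');
my $parser = XML::Parser->new(
    Handlers => {
        Char => sub {
            my ($expat, $string) = @_;
            $converted .= $string;                   # after UTF-8 conversion
            $original  .= $expat->original_string;   # bytes as found in the document
        },
    },
);
$parser->parse($doc);

# Normalize the converted text to raw UTF-8 bytes before dumping it.
utf8::encode($converted) if utf8::is_utf8($converted);
printf "converted: %s\n", unpack('H*', $converted);
printf "original:  %s\n", unpack('H*', $original);
```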

Now the bad news:

	• An important restriction is that expat has to be able to parse the document. This means that in any case you need an encoding map for the document encoding.
	• You might not be able to rely on regular expressions any more: the regexp engine assumes that characters are single-byte. If they are not, it might get completely confused.
	• If you are using non-UTF-8 element names, or attribute names, or even attribute values, then you are no longer able to rely on XML::Parser to parse tags, and you will have to write the piece of code that parses a tag in order to extract the element name and the attributes, still with the risk that the regular expressions you use for that may break on multi-byte characters.
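To see how a regular expression can go wrong on multi-byte data, consider Shift-JIS, where the second byte of some two-byte characters is 5C, the ASCII code for a backslash. The sketch below uses only core Perl; the byte pair 95 5C is one such Shift-JIS character, and the variable names are made up for the example.

```perl
use strict;
use warnings;

# One Shift-JIS character stored as two raw bytes; the second byte
# happens to be 5C, which is also the ASCII backslash.
my $sjis = "\x95\x5C";

# The regexp engine sees two separate single-byte characters...
my @units = $sjis =~ /(.)/gs;
printf "units matched by /./: %d\n", scalar @units;

# ...and a search for a backslash matches inside the character.
my $false_hit = ($sjis =~ /\\/) ? 1 : 0;
print "backslash found: $false_hit\n";
```

Both results are wrong from the document's point of view: there is one character and no backslash in it.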

Although using this method seems like a desperate measure, it's actually not necessarily that bad: a surprising number of XML applications don't use regular expressions. After all, a parsed document is already split into small chunks, and processing it often consists of moving those chunks around and changing their properties, so you might be able to live (dangerously) with those restrictions.