
Character Encodings in XML and Perl

April 26, 2000

Michel Rodriguez

Table of Contents

Introduction
XML::Parser and Character Encodings
Encodings in XML::Parser: Examples
Conclusion

Introduction

This article examines the handling of character encodings in XML and Perl. I will look at what character encodings are and what their relationship to XML is. We will then move on to how encodings are handled in Perl, and end with some practical examples of translating between encodings.

Character Encodings

Encodings! The hidden face of XML. For most people, at least here in the US, XML is simply a data format that specifies elements and attributes, and how to write them properly in a nice tree structure.

But the truth is that, in order to encode text or data, you first need to specify an encoding for it. The most common of all encodings (at least in Western countries) is without a doubt ASCII. Other encodings you may have come across include the following: EBCDIC, which will remind some of you of the good old days when computer and IBM meant the same thing; Shift-JIS, one of the encodings used for Japanese characters; and Big 5, a Chinese encoding.

What all of these encodings have in common is that they are largely incompatible. There are very good reasons for this, the first being that Western languages can live with 256 characters, encoded in 8 bits, while Eastern languages use many more, thus requiring multi-byte encodings. Recently, a new standard was created to replace all of those various encodings: Unicode, a.k.a. ISO 10646. (Actually they are two different standards -- ISO 10646 from ISO and Unicode from the Unicode consortium -- but they are so close that we can consider them equivalent for most purposes.)

Unicode and UTF-8

Unicode is a standard that aims to replace all other character encodings by providing an all-encompassing, yet extensible, scheme. It includes characters in Western as well as Asian languages, plus a whole range of mathematical and technical symbols (no more GIFs for Greek letters!), as well as extension mechanisms to add even more characters when needed (apparently new Chinese characters are created each year).

On Unix systems, Unicode is usually encoded in UTF-8, which is another layer of encoding that allows any POSIX compliant system to process it. UTF-8 characters also display properly in any recent web browser, no matter the platform, provided that the Unicode fonts installed on your computer include the appropriate characters. A useful page to look at is this UTF-8 Sampler.

Most systems now come with Unicode fonts, but don't be too excited -- a lot of those fonts are incomplete, tending to cover only the usual Western character sets. The default Windows 98 Unicode fonts, for example, include exactly one Asian character, although they do include Hebrew. (Unfortunately, Windows displays UTF-8 Hebrew left-to-right instead of right-to-left.)

Incidentally, much to the delight of our English and American readers, the 128 characters of the ASCII encoding happen to be encoded identically in UTF-8, thus requiring no conversion, no special processing... nothing! It just works.

My purpose here is not to describe Unicode in great detail, so if you want more information, have a look at the Unicode Home Page, or at the UTF-8 and Unicode FAQ for Unix/Linux, which gives plenty of information on the subject.

Encodings and XML

The XML specification (which you can find on this web site, along with Tim Bray's excellent comments) does not force you to use Unicode. You can declare any encoding in the list defined by IANA (the Internet Assigned Numbers Authority).

The only small problem is that XML processors, as per the specification, are only required to understand two encodings: UTF-8 and UTF-16 (an alternate way of encoding Unicode characters using two bytes).

For example, expat, the parser used by the XML::Parser module, natively understands UTF-8, UTF-16, and also ISO-8859-1 (also known as ISO Latin 1), which covers most Western European and African languages, with the obvious exception of Arabic. It can be extended to accept even more encodings, but more on that later.

An important feature of expat is that it converts any string it parses into UTF-8. No matter what you put in, as long as it is in a known encoding, what you get out is UTF-8.

Perl and Unicode

Perl currently (as of 5.6) offers full UTF-8 support. This means that, among other things, you can now use regular expressions on UTF-8 strings without having to worry about multi-byte characters wreaking havoc with your processing.

All the functions that assumed a character was one byte and one byte only have been updated to behave properly even when working with characters of different lengths. For a more detailed description of what Unicode support means to Perl (and some of its shortcomings), look at Simon Cozens' "What's New in Perl 5.6.0?"
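As a quick illustration (a minimal sketch of my own, not from the original article; the exact behavior of early 5.6 releases varied slightly between versions), a regular expression now matches characters rather than bytes:

use strict;

my $string= "caf\x{e9}";                  # "café" -- \x{e9} is an e-acute
print length( $string), "\n";             # 4 characters, even if stored as 5 UTF-8 bytes
if( $string=~ /^caf(.)$/)                 # the dot matches one character, not one byte
  { printf "last character: U+%04X\n", ord( $1); }   # prints U+00E9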

So everything is fine, right? We get our documents in Unicode, at worst in some other encoding known by expat, which converts it to UTF-8, which we then process using a Unicode-aware Perl... easy! Well, not quite.

Many XML applications interface with other software, DBMSs, text editors, etc., that do not grok Unicode. So the XML applications will need to accept non-Unicode input, and often output non-Unicode too, to feed it back to the rest of the environment.

In the next section we will look at the Perl tools we can use to work with documents in various encodings.

XML::Parser and Character Encodings

Overview

The XML::Parser model, derived from the expat model, is that no matter what the original document encoding is, the data forwarded to the calling software will be in UTF-8.

Natively, XML::Parser accepts only UTF-8, UTF-16, and ISO-8859-1. In order to be able to process documents in other encodings, you will need to add an encoding table, defined using the XML::Encoding module.

In order to output data in other encodings, you can use the XML::UM module, the Unicode::String module, or a Perl tr/// transliteration. Another (riskier but faster) option is to use an XML::Parser method that gives you the original string the parser saw prior to UTF-8 conversion.

We will cover all of these techniques in this section.

Defining new encodings with XML::Encoding

The XML::Encoding module can be used to add more encodings to XML::Parser.

XML::Encoding provides encoding maps that XML::Parser uses to parse documents in non-native encodings. Just specifying use XML::Encoding; gives you access to all the encoding maps in the XML::Encoding source directory.

The list of currently defined encodings includes the following:

  • Big5 (traditional Chinese);

  • ISO-8859-2 through ISO-8859-9, which cover all European languages, Cyrillic, Arabic, and Hebrew (it seems, though, that ISO-8859-6, which encodes Arabic characters, cannot be used by XML::Parser);

  • variants of x-euc-jp and x-sjis (both Japanese; make sure you read the Japanese_Encodings.msg file in the XML::Encoding distribution to understand why there are variants and which one you should use).

An important encoding that is still missing is the simplified Chinese GB encoding, which is used in China (as opposed to Big5, which is used in Taiwan).

Other encodings can be added, as explained in the XML::Encoding documentation.
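To put the pieces together, here is a minimal sketch (my own, not one of the numbered examples later in the article): the document declares its encoding in its XML declaration, and the script only has to load the extra maps before parsing.

# parse a Big5 document whose XML declaration reads:
#    <?xml version="1.0" encoding="big5"?>
use strict;
use XML::Parser;
use XML::Encoding;    # makes the extra encoding maps (big5, ISO-8859-*, ...) available

my $p= new XML::Parser( Handlers => { Char => sub { print $_[1]; } });
$p->parsefile( "doc_ch_big5.xml");   # the Char handler still receives UTF-8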

The Unicode::String module

The Unicode::String module allows conversion between Unicode and Latin1 strings.

Converting a string from UTF-8 to ISO-8859-1 (Latin1) is very simple:

use Unicode::String;

my $u= Unicode::String::utf8( $utf8_encoded_string);  # wrap the UTF-8 string
my $latin1= $u->latin1;                               # convert it to Latin 1

With Perl 5.6 the Unicode::String module is no longer necessary for this, as the same operation can be performed with a single tr///:

$string =~ tr/\0-\xff//UC;   # convert a UTF-8 string to Latin 1 (Perl 5.6)

The XML::UM module

The XML::UM module uses the maps that come with XML::Encoding to perform the reverse operation. It creates mapping routines that encode a UTF-8 string in the chosen encoding.

This module is still in an alpha state, but it is certainly worth trying. It would also be worth recoding it in C to make it faster.

Warning: the version of XML::UM in libxml-enno-1.02 has an installation problem. To fix this, once you have downloaded and uncompressed the module, and before running perl Makefile.PL, edit UM.pm in the lib/XML directory and replace the $ENCDIR value with the location of your XML::Encoding maps (it should be /usr/local/src/XML-Encoding-1.01/maps or /opt/src/XML-Encoding-1.01/maps/).

A typical usage scenario would be:

# create the encoding routine (only once!)
my $encode= XML::UM::get_encode( Encoding => 'big5');

# convert $utf8_string to big5
my $encoded_string= $encode->( $utf8_string);

The XML::Code module

An interesting way to encode documents, at least for Western languages like French, German, or Spanish, is to use only basic ASCII (characters in the 0-127 range) and encode everything else using character references (a numeric reference built from the character's Unicode code point: an e-acute becomes "&#233;", for example). One important limitation is that XML prohibits character references in element and attribute names, so you are limited to basic ASCII there. Still, this is an easy way to work with Unicode characters in Perl and still be able to store the documents in a non-Unicode aware system.

I know of no module on CPAN that offers this functionality in a standalone fashion, but Enno Derksen's XML::DOM incorporates it. So, I just extracted the code from DOM.pm and packaged it in XML::Code. This is a very simple module that just encodes CDATA, PCDATA, and tags. It should actually be part of XML::UM, and will be in the future.
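To give an idea of what such an encoding looks like, here is a minimal hand-rolled sketch (it is not the actual XML::Code code, and char_refs is a hypothetical helper) that escapes text content into ASCII plus character references:

# escape a text string into ASCII + numeric character references (Perl 5.6+)
sub char_refs
  { my( $text)= @_;
    $text=~ s/&/&amp;/g;                                       # escape markup characters first
    $text=~ s/</&lt;/g;
    $text=~ s/>/&gt;/g;
    $text=~ s/([^\x00-\x7f])/sprintf( "&#%d;", ord( $1))/ge;   # everything non-ASCII
    return $text;
  }

print char_refs( "caf\x{e9}"), "\n";                            # prints caf&#233;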

Using the original_string method

If all else fails, and if you don't need to use regular expressions, a last resort method is to use the original_string method of XML::Parser.

Called from a handler, this method returns the exact string matched by the parser before entity expansion and UTF-8 conversion. (The recognized_string method returns the string after those two operations have taken place.)

Now the bad news:

  • An important restriction is that expat has to be able to parse the document. This means that in any case you need an encoding map for the document encoding.

  • You might not be able to rely on regular expressions any more: the regexp engine assumes that characters are single-byte. If they are not, it might get completely confused.

  • If you are using non-UTF-8 element names, attribute names, or even attribute values, then you can no longer rely on XML::Parser to parse tags for you: you will have to write your own code to extract the element name and the attributes from the original string -- still with the risk that the regular expressions you use for that may break on multi-byte characters.

Although using this method seems like a desperate measure, it's actually not necessarily that bad: a surprising number of XML applications don't use regular expressions. After all, a parsed document is already split into small chunks, and processing it often consists of moving those chunks around and changing their properties, so you might be able to live (dangerously) with those restrictions.
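As a hedged sketch of what this looks like in practice (assuming, as noted above, that expat can still parse the document), here is a handler set that echoes a document back in its original encoding:

#!/bin/perl -w
use strict;
use XML::Parser;

my $p= new XML::Parser( Handlers =>
                         { Start   => \&original,
                           End     => \&original,
                           Default => \&original
                         },
                      );
$p->parsefile($ARGV[0]);

# print the exact bytes matched by the parser, before entity expansion
# and UTF-8 conversion
sub original
  { my $p= shift;
    print $p->original_string();
  }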


Encodings in XML::Parser: Examples

It's time for some practical examples. All of these examples have the same goal: read a document and write it back out. The input and output documents are identical, except that the encoding may differ. They all use XML::Parser directly, with no style set.

These examples are not much use as-is, but they will give you a feel for what dealing with various encodings involves, and they can be used as a starting point when developing with XML::Parser.

Note: this section requires a basic knowledge of XML::Parser. For an introduction to this module, see Clark Cooper's article Using The Perl XML::Parser Module.

The Documents

I created three documents, one including Western characters (French), one including Japanese characters, and another one in Chinese. In order to allow viewing of the examples without playing with the character encodings in your browser, the documents are displayed as graphics. I then saved them in a variety of encodings.

[Images of the French, Chinese, and Japanese example documents]

Basic Example (UTF-8 in, UTF-8 out)

Script: enc_ex1.pl
Input documents: doc_fr_utf.xml, doc_ch_utf.xml, or doc_jp_utf.xml

This example doesn't do much -- it just takes the input document and writes it as-is to the output -- but at least it parses the document. It also demonstrates the framework of what we are going to do in the next examples.

#!/bin/perl -w
use strict;
use XML::Parser;

my $p= new XML::Parser( Handlers =>
                         { Start   => \&default,
                           End     => \&default,
                           Default => \&default
                         },
                      );
$p->parsefile($ARGV[0]);
exit;

# by default print the UTF-8 encoded string received from the parser
sub default
  { my $p= shift;
    my $string= $p->recognized_string();
    print $string;
  }

Arbitrary Encoding to UTF-8

This example is pretty similar to the previous one, with one added bit of processing: the encoding declaration in the XML declaration has to be rewritten to reflect the fact that the output encoding is UTF-8.

Script: enc_ex2.pl
Input documents: doc_fr_latin1.xml (there is no need to use XML::Encoding with this one), doc_ch_big5.xml, doc_jp_sjis.xml, or doc_jp_euc.xml

#!/bin/perl -w
use strict;
use XML::Parser;
use XML::Encoding;

my $p= new XML::Parser( Handlers =>
                         { Start   => \&default,
                           End     => \&default,
                           Default => \&default,
                           XMLDecl => sub { decl( "UTF-8", @_); }
                         },
                      );
$p->parsefile($ARGV[0]);
exit;

# update the encoding in the XML declaration
sub decl
  { my( $new_encoding, $p, $version, $encoding, $standalone)=@_;
    print "<?xml version=\"$version\" encoding=\"$new_encoding\"";
    print " standalone=\"yes\"" if( $standalone);   # standalone is passed as a boolean
    print "?>\n";
  }

# by default print the UTF-8 encoded string received from the parser
sub default
  { my $p= shift;
    my $string= $p->recognized_string();
    print $string;
  }

Parsing HTML-like Documents

In this next example we will need to parse a document in basic ASCII (hence it is also a UTF-8 document) that includes entities such as &eacute;, which XML does not predefine. This often happens when dealing with XHTML, or just "slightly-enhanced HTML" documents, so this is quite a common problem.

XML has no built-in notion of these entities, so we need to include an entity declaration file that will just convert those to UTF-8. I usually use a single file named html2utf8.ent, which I grabbed from the W3C Character entity references. Note that this file is in no way authoritative, just convenient. Here is how these entities are referenced at the start of an XML document:


<!DOCTYPE doc [
  <!ENTITY % html2utf8 SYSTEM "html2utf8.ent">
  %html2utf8;
]>
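The entity file itself is nothing magic: it simply maps each HTML entity name to the corresponding character reference. A few representative lines (an abridged, illustrative excerpt, not the full W3C file) look like this:

<!-- excerpt of html2utf8.ent -->
<!ENTITY nbsp   "&#160;">
<!ENTITY eacute "&#233;">
<!ENTITY egrave "&#232;">
<!ENTITY ccedil "&#231;">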

An important restriction of this type of document is that element and attribute names are restricted to basic ASCII. So the document in doc_fr_html.xml is a little different from the others.

Here is the script: enc_ex3.pl.

#!/bin/perl -w
use strict;
use XML::Parser;

my $p= new XML::Parser( Handlers =>
                         { Start => \&default,
                           End   => \&default,
                           Char  => \&default
                         },
                        ParseParamEnt => 1
                      );
$p->parsefile($ARGV[0]);
exit;

# by default print the UTF-8 encoded string received from the parser
sub default
  { my $p= shift;
    my $string= $p->recognized_string();
    print $string;
  }

Unicode In, Latin 1 Out

The script (enc_ex4.pl) can be written in one of two ways, depending on whether you are using Perl 5.6 or an older version. You can use it with this document: doc_fr_latin1.xml.

For pre-5.6 Perl you will need to use the Unicode::String module:

#!/bin/perl -w
use strict;
use XML::Parser;
use Unicode::String;

my $p= new XML::Parser( Handlers =>
                         { Start   => \&default_latin1,
                           End     => \&default_latin1,
                           XMLDecl => sub { decl( "ISO-8859-1", @_); },
                           Default => \&default_latin1
                         },
                      );
$p->parsefile($ARGV[0]);
exit;

# update the encoding in the XML declaration
sub decl
  { my( $new_encoding, $p, $version, $encoding, $standalone)=@_;
    print "<?xml version=\"$version\" encoding=\"$new_encoding\"";
    print " standalone=\"yes\"" if( $standalone);   # standalone is passed as a boolean
    print "?>\n";
  }

sub default_latin1
  { my $p= shift;
    my $string= $p->recognized_string();    # get the UTF-8 string
    my $u= Unicode::String::utf8( $string); # create a Unicode::String object
    print $u->latin1;                       # convert the string to Latin 1
  }

With Perl 5.6, you don't need the Unicode::String module and can use a tr/// to convert the characters, so default_latin1 becomes:

sub default_latin1
  { my $p= shift;
    my $string= $p->recognized_string();   # get the UTF-8 string
    $string=~ tr/\0-\x{ff}//UC;            # convert the string to Latin 1
    print $string;
  }

Non-Unicode Encoding In, Same Encoding Out

If your original document is in Latin 1, you can use the previous script; if it is in some other encoding, you will have to use XML::UM, as demonstrated by the following script (enc_ex5.pl):

#!/bin/perl -w
use strict;
use XML::Parser;
use XML::Encoding;
use XML::UM;

# we need to tell XML::UM where the encoding maps are
$XML::UM::ENCDIR="/usr/local/src/XML-Encoding-1.01/maps/";

my $encode; # the encoding function, created in the XMLDecl handler

my $p= new XML::Parser( Handlers =>
                         { Start   => \&default,
                           End     => \&default,
                           XMLDecl => \&decl,
                           Default => \&default
                         },
                      );
$p->parsefile($ARGV[0]);
exit;

sub default
  { my $p= shift;
    print $encode->($p->recognized_string());
  }

# get the encoding and create the encode function
sub decl
  { my($p, $version, $encoding, $standalone)=@_;
    print "<?xml version=\"$version\" encoding=\"$encoding\"";
    print " standalone=\"yes\"" if( $standalone);   # standalone is passed as a boolean
    print "?>\n";
    $encode= XML::UM::get_encode( Encoding => $encoding);
  }

Conclusion and Resources



Unicode is really cool. I mean really cool! I can't begin to tell you what a pain it is to deal with special characters -- even something as trivial as accented characters in a name -- in a non-Unicode environment. And using GIF images for Greek letters is really not satisfying.

So, the way to go is really to try to use Unicode as much as possible. If your environment is not Unicode-enabled, your first priority should be to try to upgrade your tools to get a fully Unicode system. Make it a criterion when you get new tools (pressure your vendors to add Unicode support). It will save you a tremendous amount of energy in the long run.

Of course, sometimes you just can't get all the tool support you need in a straightforward manner. However, now that you've read this article, Perl can help you. Here are your options:

  • You are dealing with XML documents in English only. No problem then, XML::Parser will work for you (until of course you need to write foreign names, in which case you will need to use XML::Code or a similar solution).

  • Your tools use only Latin 1 encoding. In this case you can store your documents in Latin 1 (don't forget to set the encoding declaration to ISO-8859-1), use UTF-8 when processing them with XML::Parser, then export them in Latin 1 using Unicode::String or tr/// in Perl 5.6+.

  • Your tools use some other encoding not supported natively by XML::Parser. In this case you can use XML::Encoding and XML::UM to allow you to "round-trip" your documents.

  • The encoding of your documents is supported by neither XML::Parser nor XML::Encoding. Your best bet in this situation is probably to write the encoding map you need (and release it so the problem goes away for this encoding!).

This article should at least get you started on encodings. Now all I have to do is read it once more, and go back to XML::Twig to incorporate all the ideas I had while writing this!

Resources

  • XML::Parser: the Perl XML parser

  • XML::Encoding: adds various encodings to XML::Parser

  • Unicode::String: converts from UTF-8 to ISO-8859-1 (Latin 1)

  • XML::UM (in libxml-enno): converts from UTF-8 to any encoding covered by XML::Encoding

  • XML::Code: converts from UTF-8 to ASCII + XML character entities