Encodings in XML::Parser: Examples

April 26, 2000

Michel Rodriguez


Table of Contents
Introduction
XML::Parser and Character Encodings
Encodings in XML::Parser: Examples
Conclusion

It's time for some practical examples. All of these examples have one goal: take a document in and output it back. The input and output documents are identical, except that the encoding might be different. They all use only XML::Parser, with no style set.

These examples are not much use as-is, but they will give you a feel for what dealing with various encodings means, and can be used as a starting point when developing with XML::Parser.

Note: this section requires a basic knowledge of XML::Parser. For an introduction to this module, see Clark Cooper's article Using The Perl XML::Parser Module.

The Documents

I created three documents, one including Western characters (French), one including Japanese characters, and another one in Chinese. In order to allow viewing of the examples without playing with the character encodings in your browser, the documents are displayed as graphics. I then saved them in a variety of encodings.

The French example:

The Chinese example:

The Japanese example:

Basic Example (UTF-8 in, UTF-8 out)

Script: enc_ex1.pl
Input documents: doc_fr_utf.xml, doc_ch_utf.xml, or doc_jp_utf.xml

This example doesn't do much -- it just takes the input document and writes it as-is to the output -- but at least it parses the document. It also demonstrates the framework of what we are going to do in the next examples.

#!/bin/perl -w
use strict;
use XML::Parser;

my $p= XML::Parser->new( Handlers =>
                         { Start   => \&default,
                           End     => \&default,
                           Default => \&default
                         },
                       );
$p->parsefile($ARGV[0]);
exit;

# by default print the UTF-8 encoded string received from the parser
sub default
  { my $p= shift;
    my $string= $p->recognized_string();
    print $string;
  }

Arbitrary Encoding to UTF-8

This example is similar to the previous one, with one addition: the XML declaration has to be rewritten so that its encoding declaration reflects the fact that the output encoding is UTF-8.

Script: enc_ex2.pl
Input documents: doc_fr_latin1.xml (there is no need to use XML::Encoding with this one), doc_ch_big5.xml, doc_jp_sjis.xml, or doc_jp_euc.xml

#!/bin/perl -w
use strict;
use XML::Parser;
use XML::Encoding;

my $p= XML::Parser->new( Handlers =>
                         { Start   => \&default,
                           End     => \&default,
                           Default => \&default,
                           XMLDecl => sub { decl( "UTF-8", @_); }
                         },
                       );
$p->parsefile($ARGV[0]);
exit;

# update the encoding
sub decl
  { my( $new_encoding, $p, $version, $encoding, $standalone)=@_;
    print "<?xml version=\"$version\" encoding=\"$new_encoding\"";
    # $standalone is undef if absent from the declaration, otherwise true or false
    print " standalone=\"", ($standalone ? "yes" : "no"), "\"" if( defined $standalone);
    print "?>\n";
  }

# by default print the UTF-8 encoded string received from the parser
sub default
  { my $p= shift;
    my $string= $p->recognized_string();
    print $string;
  }
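
To see the conversion to UTF-8 in isolation, here is a minimal sketch that parses a Latin 1 document held in a string instead of a file. The sample document and the variable names are made up for this illustration, and the sketch assumes a reasonably recent Perl and XML::Parser, where handlers receive Perl character strings.

```perl
use strict;
use warnings;
use XML::Parser;

# hypothetical sample: the word "été" stored as three Latin 1 bytes
my $latin1_doc= qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<doc>\xe9t\xe9</doc>};

# collect the character data handed over by the parser
my $text= '';
my $p= XML::Parser->new( Handlers => { Char => sub { $text .= $_[1] } });
$p->parse( $latin1_doc);

# make the UTF-8 byte sequence explicit so that the bytes can be counted
my $utf8_bytes= $text;
utf8::encode( $utf8_bytes);

printf "%d characters, %d bytes once encoded as UTF-8\n",
       length( $text), length( $utf8_bytes);
```

The byte count is higher than the character count because each accented character takes two bytes in UTF-8, against one in Latin 1. No encoding map is involved here, which matches the remark above that XML::Encoding is not needed for the Latin 1 document.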

Parsing HTML-like Documents

In this next example we parse a document in basic ASCII (and therefore also a valid UTF-8 document) that refers to non-ASCII characters through HTML entities such as &eacute;. This is quite a common problem when dealing with XHTML, or just "slightly-enhanced HTML" documents.

XML itself predefines only five entities (&lt;, &gt;, &amp;, &apos; and &quot;), so we need to include an entity declaration file that maps the HTML ones to the corresponding characters, which the parser then delivers as UTF-8. I usually use a single file named html2utf8.ent, which I grabbed from the W3C Character entity references. Note that this file is in no way authoritative, just convenient. Here is how these entities are referenced at the start of an XML document:


<!DOCTYPE doc [
  <!ENTITY % html2utf8 SYSTEM "html2utf8.ent">
  %html2utf8;
]>

An important limitation of this type of document is that element and attribute names must stick to basic ASCII, since entity references cannot be used inside names. So the document in doc_fr_html.xml is a little different from the others.
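
The effect of such an entity declaration can be checked with a small self-contained sketch. It declares a single entity directly in the internal subset, standing in for the declarations that would normally come from html2utf8.ent; the sample document and variable names are hypothetical.

```perl
use strict;
use warnings;
use XML::Parser;

# hypothetical sample: one entity declared in the internal subset,
# standing in for the declarations pulled in from html2utf8.ent
my $doc= <<'XML';
<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ENTITY eacute "&#233;">
]>
<doc>caf&eacute;</doc>
XML

# the Char handler receives the text with the entity already expanded
my $text= '';
my $p= XML::Parser->new( Handlers => { Char => sub { $text .= $_[1] } });
$p->parse( $doc);

binmode( STDOUT, ':encoding(UTF-8)');   # write the characters out as UTF-8
print "$text\n";
```

Because the entity is declared in the internal subset, this sketch does not need the ParseParamEnt option; that option matters when the declarations live in an external file referenced through a parameter entity, as in the DOCTYPE shown above.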

Here is the script: enc_ex3.pl.

#!/bin/perl -w
use strict;
use XML::Parser;

my $p= XML::Parser->new( Handlers =>
                         { Start => \&default,
                           End   => \&default,
                           Char  => \&default,
                         },
                         ParseParamEnt => 1
                       );
$p->parsefile($ARGV[0]);
exit;

# by default print the UTF-8 encoded string received from the parser
sub default
  { my $p= shift;
    my $string= $p->recognized_string();
    print $string;
  }

Unicode In, Latin 1 Out

The script (enc_ex4.pl) can be written in one of two ways, depending on whether you are using Perl 5.6 or an older version. You can use it with this document: doc_fr_latin1.xml.

For pre-5.6 Perl you will need to use the Unicode::String module:

#!/bin/perl -w
use strict;
use XML::Parser;
use Unicode::String;

my $p= XML::Parser->new( Handlers =>
                         { Start   => \&default_latin1,
                           End     => \&default_latin1,
                           XMLDecl => sub { decl( "ISO-8859-1",@_); },
                           Default => \&default_latin1
                         },
                       );
$p->parsefile($ARGV[0]);
exit;

# update the encoding
sub decl
  { my( $new_encoding, $p, $version, $encoding, $standalone)=@_;
    print "<?xml version=\"$version\" encoding=\"$new_encoding\"";
    # $standalone is undef if absent from the declaration, otherwise true or false
    print " standalone=\"", ($standalone ? "yes" : "no"), "\"" if( defined $standalone);
    print "?>\n";
  }

sub default_latin1
  { my $p= shift;
    my $string= $p->recognized_string();   # get the UTF8 string
    my $u= Unicode::String::utf8( $string);# create Unicode::String
    print $u->latin1;                      # convert string to latin1
  }

With Perl 5.6, you don't need to use the Unicode::String module and can use a tr/// to convert the characters, so default_latin1 becomes:

sub default_latin1
  { my $p= shift;
    my $string= $p->recognized_string();   # get the UTF8 string
    $string=~ tr/\0-\x{ff}//UC;            # convert string to latin1
    print $string;
  }
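
Note that the U and C modifiers of tr/// did not survive into later Perl releases (they were dropped in Perl 5.8, as far as I know). On those versions the core Encode module is the usual tool for the same job. Here is a minimal sketch under that assumption; to_latin1 is a name made up for this example, and the choice to turn unmappable characters into character references is mine, not something the scripts above do.

```perl
use strict;
use warnings;
use Encode ();

# to_latin1 is a hypothetical helper, not part of XML::Parser
sub to_latin1
  { my( $string)= @_;   # work on a copy: encode may modify its argument when a CHECK value is given
    # characters that do not exist in Latin 1 become &#x...; references
    return Encode::encode( 'ISO-8859-1', $string, Encode::FB_XMLCREF);
  }

binmode( STDOUT);       # the output is already a byte string
print to_latin1( "\x{e9}t\x{e9} \x{263a}"), "\n";
```

Character references are only legal in character data and attribute values, so this fallback is suitable for text content; a document whose element names contain characters outside Latin 1 cannot be written in that encoding at all.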

Non-Unicode Encoding In, Same Encoding Out

If your original document is in Latin 1, you can use the previous script; if it is in some other encoding, you will have to use XML::UM, as demonstrated by the following script (enc_ex5.pl):

#!/bin/perl -w
use strict;
use XML::Parser;
use XML::Encoding;
use XML::UM;

# we need to tell XML::UM where the encoding maps are
$XML::UM::ENCDIR="/usr/local/src/XML-Encoding-1.01/maps/";

my $encode; # the encoding function, created in the XMLDecl handler

my $p= XML::Parser->new( Handlers =>
                         { Start   => \&default,
                           End     => \&default,
                           XMLDecl => \&decl,
                           Default => \&default,
                         },
                       );
$p->parsefile($ARGV[0]);
exit;

# print the string converted back to the original encoding
sub default
  { my $p= shift;
    print $encode->($p->recognized_string());
  }

# get the encoding and create the encode function
sub decl
  { my($p, $version, $encoding, $standalone)=@_;
    print "<?xml version=\"$version\" encoding=\"$encoding\"";
    # $standalone is undef if absent from the declaration, otherwise true or false
    print " standalone=\"", ($standalone ? "yes" : "no"), "\"" if( defined $standalone);
    print "?>\n";
    $encode= XML::UM::get_encode( Encoding => $encoding);
  }
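
For comparison, an encoding function of the same shape can be sketched with the core Encode module on Perl 5.8 and later. This is an alternative under that assumption, not what enc_ex5.pl does: make_encoder is a hypothetical helper playing the role of XML::UM::get_encode, and the encoding names Encode accepts may not match the XML::Encoding maps exactly.

```perl
use strict;
use warnings;
use Encode ();

# make_encoder is a hypothetical helper, standing in for XML::UM::get_encode
sub make_encoder
  { my( $encoding)= @_;
    my $enc= Encode::find_encoding( $encoding)
      or die "unknown encoding: $encoding\n";
    # the returned closure turns a Perl character string into bytes
    return sub { my( $string)= @_; return $enc->encode( $string); };
  }

my $encode= make_encoder( 'Shift_JIS');
my $bytes = $encode->( "\x{3042}");     # HIRAGANA LETTER A
printf "%d bytes\n", length( $bytes);
```

In a script like the one above, the XMLDecl handler would then set $encode= make_encoder( $encoding) and the default handler would stay unchanged.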