Encodings in XML::Parser: Conclusion

April 26, 2000

Michel Rodriguez

Unicode is really cool. I mean really cool! I can't begin to tell you what a pain it is to deal with special characters -- even something as trivial as accented characters in a name -- in a non-Unicode environment. And using GIF images for Greek letters is really not satisfying.

So the way to go is to use Unicode as much as possible. If your environment is not Unicode-enabled, your first priority should be to upgrade your tools to get a fully Unicode system. Make it a criterion when you get new tools (and pressure your vendors to add Unicode support). It will save you a tremendous amount of energy in the long run.

Of course, sometimes you just can't get all the tool support you need in a straightforward manner. However, now that you've read this article, Perl can help you. Here are your options:

  • You are dealing with XML documents in English only. No problem then: XML::Parser will work for you (until, of course, you need to write foreign names, in which case you will need XML::Code or a similar solution).

  • Your tools use only Latin 1 encoding. In this case you can store your documents in Latin 1 (don't forget to set the encoding declaration to ISO-8859-1), use UTF-8 when processing them with XML::Parser, then export them in Latin 1 using Unicode::String or tr/// in Perl 5.6+.

  • Your tools use some other encoding not supported natively by XML::Parser. In this case you can use XML::Encoding and XML::UM to allow you to "round-trip" your documents.

  • The encoding of your documents is not supported by XML::Parser and XML::Encoding at all. Your best bet in this situation is probably to write the encoding map you need (and release it so the problem goes away for this encoding!).
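To make the Latin 1 option and the ASCII fallback concrete, here is a minimal sketch of what happens to the bytes. It is written in Python purely as an illustration, using only the standard codecs; it is not the API of XML::Parser, Unicode::String or XML::Code, and the function names and the sample document are made up for the example.

```python
# Illustrative stand-ins for the round trip described in the list above.
# All function names here are hypothetical; the article's real tools are Perl modules.

def latin1_to_utf8(data: bytes) -> bytes:
    """Decode ISO-8859-1 bytes and re-encode them as UTF-8 (the processing form)."""
    return data.decode("iso-8859-1").encode("utf-8")


def utf8_to_latin1(data: bytes) -> bytes:
    """Export UTF-8 bytes back to ISO-8859-1.

    Raises UnicodeEncodeError if a character has no Latin 1 equivalent.
    """
    return data.decode("utf-8").encode("iso-8859-1")


def utf8_to_ascii_with_char_refs(data: bytes) -> bytes:
    """Fallback for ASCII-only tools: each non-ASCII character becomes a
    numeric character reference such as &#233;."""
    return data.decode("utf-8").encode("ascii", errors="xmlcharrefreplace")


# A made-up document fragment containing one accented character.
stored = "<name>Andr\u00e9</name>".encode("iso-8859-1")

processed = latin1_to_utf8(stored)                    # what the parser hands you
exported = utf8_to_latin1(processed)                  # back to the stored form
ascii_form = utf8_to_ascii_with_char_refs(processed)  # ASCII + character references
```

The export step fails loudly when the text contains a character outside Latin 1, which is usually what you want: a silent substitution would break the round trip.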

This article should at least get you started on encodings. Now all I have to do is read it once more, and go back to XML::Twig to incorporate all the ideas I had while writing this!

  • XML::Parser: the Perl XML parser
  • XML::Encoding: adds various encodings to XML::Parser
  • Unicode::String: converts from UTF-8 to ISO-8859-1 (Latin 1)
  • XML::UM (in libxml-enno): converts from UTF-8 to any encoding covered in XML::Encoding
  • XML::Code: converts from UTF-8 to ASCII + XML character entities