Unicode Secrets

May 18, 2005

Poor understanding of Unicode is probably the biggest obstacle users face when trying to learn how to process XML, and Python users are no exception. In my experience, Unicode matters are the most common component in users' cries for help with Python XML tools. In this article and the next I'll present a variety of tips, tricks, and best practices in order to help users minimize Unicode problems. I've covered a few of these issues in the past, but here I discuss the area more broadly and more deeply. I shall not be presenting full tutorials of Unicode in XML or Python because others have covered these areas rather well. I've gathered a lot of relevant resources in the sidebar. If you're not comfortable with the basics, or need a good reference, please consult it.

Picking the Right Tools

Proper Unicode support is so important for any XML tool that I would go as far as to say it's the single most important criterion in tool selection. Do not use any XML tools that do not have solid Unicode support at the core. This is one of the things I look out for when examining tools for this column, but unfortunately it's not always easy to tell when a package has poor Unicode support. Luckily, all the most widely used XML tools in Python have sufficient Unicode support. One tool in a grey area is PyRXP, as I discussed in my earlier article Introducing PyRXP. The standard build does not offer proper Unicode support, but you can create an alternative build called "PyRXPU" which does. I include details on how to do so in the referenced article. If that package is your preference, it is essential to be sure you're using PyRXPU rather than PyRXP. I'm still hoping that PyRXPU will become the default build in the next release.

I have tests that I run in order to look for tell-tale signs of problems. Some of these I've used in one form or another in previous articles, but they boil down to checking how the parser handles several characters, in both raw and character entity form. I have an example of a file that all compliant parsers should be able to handle, but rather than paste it directly in this article, or provide it as a download, I present listing 1, which is Python code to generate this test file. This way you have an early example of some of the Python Unicode APIs I'll be discussing in this article. Running listing 1 results in an XML file in UTF-8 encoding which contains three Unicode characters: one above the ASCII range but below the 256th Unicode character, another above the 256th character, and a third right at the 65536th character boundary, which is a well-known boundary at which some Unicode tools get tripped up. Each character is present in raw form and as a character entity.

Listing 1: Python code to write out a test XML file

import codecs

#Look up the UTF-8 codec; the fourth item wraps a file for encoded writing
enc, dec, read_wrap, write_wrap = codecs.lookup('utf-8')

f = open('utf8test.xml', 'wb')
f = write_wrap(f)

#Each character appears once in raw form and once as a character entity
lines = [
    u"<?xml version='1.0' encoding='utf-8'?>\n",
    u"<unitest>\n",
    u"  <!-- U+00F7 DIVISION SIGN -->\n",
    u"  <divsign><raw>\u00F7</raw><charent>&#xf7;</charent></divsign>\n",
    u"  <!-- U+2026 HORIZONTAL ELLIPSIS-->\n",
    u"  <ell><raw>\u2026</raw><charent>&#x2026;</charent></ell>\n",
    u"  <!-- U+10000  LINEAR B SYLLABLE B008 A-->\n",
    u"  <linb><raw>\U00010000</raw><charent>&#x10000;</charent></linb>\n",
    u"</unitest>\n"
]

f.writelines(lines)
f.close()
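
To see why these three characters make good probes, it can help to look at how each one is encoded. The following is only an illustrative sketch, not part of the test itself; the names TEST_CHARS, utf8_forms, utf8_lengths, and utf16_linb are invented for this example, and the code is written so that it should run on Python 2.6 or later as well as on Python 3. The characters take two, three, and four bytes respectively in UTF-8, and the last one needs a surrogate pair in UTF-16, which is where some tools stumble.

```python
# The three test characters from the generated file
TEST_CHARS = [u'\u00F7', u'\u2026', u'\U00010000']

# UTF-8 uses a different number of bytes for each of them
utf8_forms = [ch.encode('utf-8') for ch in TEST_CHARS]
utf8_lengths = [len(form) for form in utf8_forms]

# In big-endian UTF-16, U+10000 becomes the surrogate pair D800 DC00
utf16_linb = u'\U00010000'.encode('utf-16-be')

print(utf8_lengths)
```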

After creating this test file, use your favorite tool to parse it. If you get any errors, the parser has a problem. If you don't get any errors, you should still check that the right characters were read. Use your tool's API to extract Unicode objects for each of the six non-ASCII characters in the test file, and put them into variables named, respectively:

divsign_r (content of raw element within the divsign element)
divsign_c (content of charent element within the divsign element)
ell_r (content of raw element within the ell element)
ell_c (content of charent element within the ell element)
linb_r (content of raw element within the linb element)
linb_c (content of charent element within the linb element)

Then compare these values to the correct ones:

u'\xf7'
u'\xf7'
u'\u2026'
u'\u2026'
u'\U00010000'
u'\U00010000'

As an example, listing 2 runs the test against 4Suite's Domlette (see "A Tour of 4Suite").

Listing 2: Test code for 4Suite against the document generated by listing 1

#The top section is specific to the XML tool being tested
from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseUri("file:utf8test.xml")

divsign_r = doc.xpath(u'string(/unitest/divsign/raw)')
divsign_c = doc.xpath(u'string(/unitest/divsign/charent)')
ell_r = doc.xpath(u'string(/unitest/ell/raw)')
ell_c = doc.xpath(u'string(/unitest/ell/charent)')
linb_r = doc.xpath(u'string(/unitest/linb/raw)')
linb_c = doc.xpath(u'string(/unitest/linb/charent)')

#From this point down should be the same for all test cases

EXPECTED = (u'\xf7', u'\xf7',
            u'\u2026', u'\u2026',
            u'\U00010000', u'\U00010000')
assert (divsign_r, divsign_c, ell_r, ell_c, linb_r, linb_c) == EXPECTED

#If you see this message, the tool passed the test
print "Passed assert, so passed test"
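
The same check can be pointed at other parsers by swapping out the tool-specific top section. As a sketch only, here is a variant that uses the standard library's xml.dom.minidom. The helper text_of is a name invented for this example, and the document is built in memory (the same elements as listing 1, with the comments omitted) rather than read from utf8test.xml, purely so that the snippet is self-contained. It is written so that it should run on Python 3 as well as on recent Python 2 releases.

```python
from xml.dom import minidom

# Same elements as the generated test file, encoded to UTF-8 bytes in memory
doc_text = (
    u"<?xml version='1.0' encoding='utf-8'?>\n"
    u"<unitest>\n"
    u"  <divsign><raw>\u00F7</raw><charent>&#xf7;</charent></divsign>\n"
    u"  <ell><raw>\u2026</raw><charent>&#x2026;</charent></ell>\n"
    u"  <linb><raw>\U00010000</raw><charent>&#x10000;</charent></linb>\n"
    u"</unitest>\n"
)
doc = minidom.parseString(doc_text.encode('utf-8'))

def text_of(parent_name, child_name):
    """Return the text content of the named child of the named element."""
    parent = doc.getElementsByTagName(parent_name)[0]
    child = parent.getElementsByTagName(child_name)[0]
    return u''.join(node.data for node in child.childNodes
                    if node.nodeType == node.TEXT_NODE)

divsign_r = text_of('divsign', 'raw')
divsign_c = text_of('divsign', 'charent')
ell_r = text_of('ell', 'raw')
ell_c = text_of('ell', 'charent')
linb_r = text_of('linb', 'raw')
linb_c = text_of('linb', 'charent')

# Same comparison as in listing 2
EXPECTED = (u'\xf7', u'\xf7',
            u'\u2026', u'\u2026',
            u'\U00010000', u'\U00010000')
assert (divsign_r, divsign_c, ell_r, ell_c, linb_r, linb_c) == EXPECTED
```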

If you want to close the loop on making sure your tool of choice is Unicode safe, you should also try to use it to output characters such as those in the test file. You may not be able to reproduce the exact test file, as some variation between tools is expected; just be sure your tool can output the test characters in some correct form.
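
One way to run such an output check is a round trip: build a small document containing the test characters, serialize it, parse the result again, and compare. The following is a minimal sketch of that idea using xml.dom.minidom as a stand-in for your tool of choice; the names TEST_CHARS, serialized, and round_tripped are invented for this example, and the code is written so that it should run on Python 3 as well as on recent Python 2 releases.

```python
from xml.dom import minidom

# The three test characters, as a single Unicode string
TEST_CHARS = u'\u00F7\u2026\U00010000'

# Build a tiny document whose root element contains the test characters
doc = minidom.Document()
root = doc.createElement(u'unitest')
root.appendChild(doc.createTextNode(TEST_CHARS))
doc.appendChild(root)

# Serialize to UTF-8, then parse the serialized form back in
serialized = doc.toxml('utf-8')
reparsed = minidom.parseString(serialized)
round_tripped = u''.join(node.data
                         for node in reparsed.documentElement.childNodes
                         if node.nodeType == node.TEXT_NODE)

# The characters should survive the trip unchanged
assert round_tripped == TEST_CHARS
```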

Making Unicode Out of Strings

In Proper XML Output in Python I stated a rule: In all public APIs for XML processing, character data should be passed in strictly as Python Unicode objects. This is primarily an admonition for XML API designers, but it also applies to users because many APIs allow you to pass in strings or Unicode objects interchangeably. Resist the temptation to use this flexibility. Convert all strings to Unicode when passing them to XML APIs. Doing so isn't always as easy as you might think. Here are some tips and pitfalls.

The Unicode conversion function. The first thing you might try to do is to just wrap your strings in a call to unicode, which converts objects (including strings) to Unicode, where possible. If you call it with only one argument, the object to be converted, it uses the default encoding for your "site" (a special Python module with custom settings for a particular Python install). Unless you've added a sitecustomize.py that specifies a different default encoding, you probably have ASCII as your default encoding. This means that if the string you try to convert has any characters in it above the ordinal 127, you will get the dreaded error message: UnicodeDecodeError: 'ascii' codec can't decode byte XXX in position XXX: ordinal not in range(128). One way to avoid this is to always specify an encoding. The function then uses the corresponding codec to drive the conversion from the string value of the object. A codec is a Python module that is designed to convert encoded strings to Unicode objects, and vice versa. Of course, the encoding you specify has to be the right one. This is not just a matter of getting an error message if the codec can't handle the string. Even worse, the codec might not detect any problem at all and simply misinterpret the bytes, so that you end up with a garbled string as a silent error. If you're sure the strings you're passing are in iso-8859-1 (fairly likely if you work in a Western European or American environment), you can perform the conversion as follows:

unicode_obj = unicode(string_obj, 'iso-8859-1')
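
The two failure modes described above, the loud one and the silent one, can be seen in a small sketch. It assumes a byte string that actually holds UTF-8 data, and it uses the decode method of the byte string, which takes the same codec names as the unicode function, so that the snippet should run on Python 3 as well as Python 2.6 or later. The variable names are invented for this example.

```python
# UTF-8 encoded bytes for the word "cafe" with an acute accent on the e
raw = b'caf\xc3\xa9'

# Decoding as ASCII fails loudly because of the bytes above 127
ascii_error = None
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the right codec gives the intended four characters
right = raw.decode('utf-8')

# Decoding with the wrong codec raises no error but yields five
# characters, the last two of which are garbage
wrong = raw.decode('iso-8859-1')

print(len(right))
print(len(wrong))
```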

Using codecs directly. A more general and flexible technique is to use the decoding function of the codec directly. To do this, look up the codec by name, and you get a set of objects directly from the codec. The second one is the decode function for that codec.

import codecs
dec = codecs.lookup('iso-8859-1')[1]
unicode_obj = dec(string_obj)[0]

Notice the [0] applied to the result from the codec. Such decode functions return a tuple of the resulting Unicode object and the number of bytes that were read from the string in order to perform the decode operation. All you want is the first item.
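
As a small illustration of that return value, here is a sketch that keeps both items of the tuple. The sample byte string and the variable names are invented for this example, and the code is written so that it should run on Python 3 as well as Python 2.6 or later.

```python
import codecs

# The second item returned by the lookup is the decode function
dec = codecs.lookup('iso-8859-1')[1]

# The decode function returns (unicode object, number of bytes consumed)
result = dec(b'\xf7 sign')
unicode_obj, consumed = result

print(consumed)
```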

And Strings Out of Unicode

As you read in XML using Python tools, you should expect to be getting all the bits and pieces in the form of Unicode objects. You should certainly keep these in the form of Unicode as much as possible, especially if they will eventually make their way back into another XML document, or some other internationalized usage. Sometimes, however, you'll need to use them in string form, and you'll need to reverse the operation discussed in the previous section. The most straightforward way to convert a Unicode object to string is to use the encode method on the Unicode object:

string_obj = unicode_obj.encode('iso-8859-1')

You can also use the full codecs module. The first item in the tuple returned from codecs.lookup() is an encode function:

import codecs
enc, dec = codecs.lookup('iso-8859-1')[:2]
string_obj = enc(unicode_obj)[0]

Again, you need only the first item of the encode function's return value; the second item is the number of characters that were encoded in the given Unicode object.
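
Encoding can fail just as decoding can, since not every Unicode character exists in every encoding. The sketch below shows the tuple returned by the encode function, then what happens with a character that iso-8859-1 cannot represent. The 'xmlcharrefreplace' error handler used at the end is a standard Python option that I have not discussed above; it substitutes a numeric character reference, which is one possible approach when the string is destined for XML output. The variable names are invented for this example, and the code is written so that it should run on Python 3 as well as Python 2.6 or later.

```python
import codecs

# The first item returned by the lookup is the encode function
enc = codecs.lookup('iso-8859-1')[0]

# The encode function returns (encoded string, number of characters consumed)
encoded, consumed = enc(u'\xf7')

# U+2026 HORIZONTAL ELLIPSIS has no representation in iso-8859-1
ellipsis = u'\u2026'
try:
    ellipsis.encode('iso-8859-1')
    failed = False
except UnicodeEncodeError:
    failed = True

# Replace unencodable characters with numeric character references instead
as_charref = ellipsis.encode('iso-8859-1', 'xmlcharrefreplace')

print(failed)
```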

Conclusion

I'll continue this discussion in the next column with coverage of using Unicode with file-like objects, and more. I encourage you to digest at least some of the material I presented in the sidebar, which will give you a head start on the remaining discussion, and will certainly make you a better XML and Python hacker overall.

Meanwhile, in my usual round-up of happenings in the Python-XML community, I start with XIST 2.9. I covered XIST in the recent article Writing and Reading XML with XIST. It is a very capable, open-source package for XML and HTML processing and generation. The long list of changes is given in Walter Dörwald's announcement.

Henry Thompson announced XSV 2.10-1. XML Schema Validator (XSV) is a GPLed W3C XML Schema (WXS) validator written in Python, and the engine behind the W3C's WXS validation service. There seem to be a lot of changes, but it's hard to be clear exactly which ones apply to the current release, so see the web page for more information.

Sylvain Thénault announced LogiLab's xmldiff 0.6.7, "a utility for extracting differences between two xml files. It returns a set of primitives to apply on source tree to obtain the destination tree." Xmldiff uses XUpdate, which I discussed in a recent weblog entry, to represent differences between XML documents. You can then use an XUpdate tool to "patch" XML files with the diff.

Sylvain Hellegouarch posted Picket, "a simple CherryPy filter for processing XSLT as a template language. It uses 4Suite to do the job."

For francophone readers, "Rémi" announced: "I was allowed to open a new Wiki page with the modest knowledge I've picked up of Python and XML." Wiki : Python et XML is a French language resource for Python-XML tools and techniques.

Unicode Resources for Python and XML

I covered a lot of the issues in a very step-by-step manner in my earlier article, Proper XML Output in Python.
Marc-Andre Lemburg has a tutorial PDF slide set that serves as a great beginner's guide to Unicode in Python.
Chapter nine of Dive Into Python by Mark Pilgrim covers XML processing, and section 9.4. Unicode focuses on the topic of this article. The book is a freely available electronic text, and is also available in paper form, which you should certainly consider purchasing if you find the online resource useful.
Jason Orendorff has a brief but useful article, Unicode in Python, which is part of a series looking at Unicode in a variety of languages and environments. He also has a section called Unicode in HTML and XML.
Evan Jones offers some notes, How to Use UTF-8 with Python, focusing on the UTF-8 encoding and minidom.
Fredrik Lundh offers some concise and practical tips and techniques in Python Unicode Objects.
One useful general resource is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky.
Andy Robinson provides a nice overview in his Python Unicode Tutorial. It is a little bit out of date, but there are many code samples, almost all of which should still work.
PEP 100 by Marc-Andre Lemburg formally describes the Python Unicode system.
Two very handy sites to keep around for reference are the Unicode pages on FileFormat.info and the open internationalization resources directory.