Poor understanding of Unicode is probably the biggest obstacle users face when trying to learn how to process XML, and Python users are no exception. In my experience, Unicode matters are the most common component in users' cries for help with Python XML tools. In this article and the next I'll present a variety of tips, tricks, and best practices in order to help users minimize Unicode problems. I've covered a few of these issues in the past, but here I discuss the area more broadly and more deeply. I shall not be presenting full tutorials of Unicode in XML or Python because others have covered these areas rather well. I've gathered a lot of relevant resources in the sidebar. If you're not comfortable with the basics, or need a good reference, please consult it.
Picking the Right Tools
Proper Unicode support is so important for any XML tool that I would go as far as to say it's the single most important criterion in tool selection. Do not use any XML tool that lacks solid Unicode support at its core. This is one of the things I look out for when examining tools for this column, but unfortunately it's not always easy to tell when a package has poor Unicode support. Luckily, all the most widely used XML tools in Python have sufficient Unicode support. One tool in a grey area is PyRXP, as I discussed in my earlier article Introducing PyRXP. The standard build does not offer proper Unicode support, but you can create an alternative build called "PyRXPU" which does; I include details on how to do so in the referenced article. If that package is your preference, it is essential to be sure you're using PyRXPU rather than PyRXP. I'm still hoping that PyRXPU will become the default build in the next release.
I have tests that I run in order to look for tell-tale signs of problems. Some of these I've used in one form or another in previous articles, but they boil down to checking how the parser handles several characters, in both raw and character entity form. I have an example of a file that all compliant parsers should be able to handle, but rather than paste it directly into this article, or provide it as a download, I present listing 1, Python code that generates the test file. This way you have an early example of some of the Python Unicode APIs I'll be discussing in this article. Running listing 1 produces an XML file in UTF-8 encoding containing three Unicode characters: one above the ASCII range but below the 256th Unicode character, another above the 256th character, and a third right at the 65,536th character boundary, a well-known point at which some Unicode tools get tripped up. Each character is present in raw form and as a character entity.
Listing 1: Python code to write out a test XML file
import codecs
enc, dec, read_wrap, write_wrap = codecs.lookup('utf-8')
f = open('utf8test.xml', 'wb')
f = write_wrap(f)
lines = [
    u"<?xml version='1.0' encoding='utf-8'?>\n",
    u"<unitest>\n",
    u"  <!-- U+00F7 DIVISION SIGN -->\n",
    u"  <divsign><raw>\u00F7</raw><charent>&#xF7;</charent></divsign>\n",
    u"  <!-- U+2026 HORIZONTAL ELLIPSIS -->\n",
    u"  <ell><raw>\u2026</raw><charent>&#x2026;</charent></ell>\n",
    u"  <!-- U+10000 LINEAR B SYLLABLE B008 A -->\n",
    u"  <linb><raw>\U00010000</raw><charent>&#x10000;</charent></linb>\n",
    u"</unitest>\n"
    ]
f.writelines(lines)
f.close()
After creating this test file, use your favorite tool to parse it. If you get any errors, the parser has a problem. If you don't get any errors, you should still check that the right characters were read. Use your tool's API to extract Unicode objects for each of the six non-ASCII characters in the test file, and put them into variables named, respectively:
divsign_r: the content of the raw element within the divsign element
divsign_c: the content of the charent element within the divsign element
ell_r: the content of the raw element within the ell element
ell_c: the content of the charent element within the ell element
linb_r: the content of the raw element within the linb element
linb_c: the content of the charent element within the linb element
Then compare these values to the correct ones: both divsign values should be u'\xf7', both ell values should be u'\u2026', and both linb values should be u'\U00010000'.
As an example, listing 2 runs the test against 4Suite's Domlette (see "A Tour of 4Suite").
Listing 2: Test code for 4Suite against the document generated by listing 1
#The top section is specific to the XML tool being tested
from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseUri("file:utf8test.xml")
divsign_r = doc.xpath(u'string(/unitest/divsign/raw)')
divsign_c = doc.xpath(u'string(/unitest/divsign/charent)')
ell_r = doc.xpath(u'string(/unitest/ell/raw)')
ell_c = doc.xpath(u'string(/unitest/ell/charent)')
linb_r = doc.xpath(u'string(/unitest/linb/raw)')
linb_c = doc.xpath(u'string(/unitest/linb/charent)')

#From this point down should be the same for all test cases
EXPECTED = (u'\xf7', u'\xf7', u'\u2026', u'\u2026',
            u'\U00010000', u'\U00010000')
assert (divsign_r, divsign_c, ell_r, ell_c, linb_r, linb_c) == EXPECTED
#If you see this message, the tool passed the test
print "Passed assert, so passed test"
If you want to close the loop on making sure your tool of choice is Unicode-safe, you should also try to use it to output characters such as those in the test file. You may not be able to reproduce the exact test file itself, since some variability between tools is expected; just be sure your tool can output the test characters in some correct form.
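To illustrate what such an output check might look like, here is a sketch using the standard library's minidom as a stand-in for whichever tool you prefer; substitute your own tool's serialization API. (Modern Python 3 syntax is assumed here, unlike the Python 2 listings above.)

```python
# Sketch: verify that a tool can *output* the test characters correctly.
# minidom stands in for whatever XML tool you actually use.
from xml.dom.minidom import parseString

TEST_CHARS = u'\u00f7\u2026\U00010000'

# Build a tiny document containing the three test characters...
doc = parseString(u'<unitest/>'.encode('utf-8'))
doc.documentElement.appendChild(doc.createTextNode(TEST_CHARS))

# ...serialize it to UTF-8 bytes...
serialized = doc.toxml(encoding='utf-8')

# ...and parse it back to confirm the characters survived the round trip.
reparsed = parseString(serialized)
roundtripped = reparsed.documentElement.firstChild.data
assert roundtripped == TEST_CHARS
print('Output round trip passed')
```

If the assertion holds, the tool can at least emit each test character in some correct encoded form.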
Making Unicode Out of Strings
In Proper XML Output in Python I stated a rule: in all public APIs for XML processing, character data should be passed in strictly as Python Unicode objects. This is primarily an admonition for XML API designers, but it also applies to users, because many APIs allow you to pass in strings or Unicode objects interchangeably. Resist the temptation to use this flexibility: convert all strings to Unicode when passing them to XML APIs. Doing so isn't always as easy as you might think. Here are some tips and pitfalls.
The Unicode conversion function. The first thing you might try is to wrap your strings in a call to unicode, which converts objects (including strings) to Unicode, where possible. If you call it with only one argument, the object to be converted, it uses the default encoding for your site (a special Python module with custom settings for a particular Python install). Unless you've added a sitecustomize module that specifies a different default encoding, you probably have ASCII as your default. This means that if the string you try to convert has any characters above ordinal 127, you will get the dreaded UnicodeDecodeError: 'ascii' codec can't decode byte XXX in position XXX: ordinal not in range(128). One way to avoid this is to always specify an encoding; unicode then uses the corresponding codec to drive the conversion from the string value of the object. A codec is a Python module designed to convert encoded strings to Unicode objects, and vice versa. Of course, the encoding you specify has to be the right one. This is not just a matter of getting an error message if the codec can't handle the string. Worse, the codec might not detect any error at all, yet misinterpret the characters, leaving you with a garbled string as a silent failure. If you're sure the strings you're passing are in iso-8859-1 (fairly likely if you work in a Western European or American environment), you can perform the conversion as follows:
unicode_obj = unicode(string_obj, 'iso-8859-1')
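The silent-garbling danger is easy to demonstrate. Here is a small sketch, written in modern Python's bytes/str idiom (where bytes.decode plays the role the article's unicode() call plays in Python 2):

```python
# A division sign encoded as UTF-8 occupies two bytes.
utf8_bytes = u'\u00f7'.encode('utf-8')
assert utf8_bytes == b'\xc3\xb7'

# Decoding those bytes with the *wrong* codec raises no error at all:
# iso-8859-1 happily maps every possible byte to some character...
garbled = utf8_bytes.decode('iso-8859-1')

# ...so instead of one division sign we silently get two junk characters.
assert garbled == u'\xc3\xb7'   # 'A-tilde' followed by the division sign
assert garbled != u'\u00f7'

# Decoding with the correct codec recovers the original character.
assert utf8_bytes.decode('utf-8') == u'\u00f7'
```

No exception is raised anywhere above, which is exactly why a wrong encoding guess is more dangerous than a crash.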
Using codecs directly. A more general and flexible technique is to use the decoding function of the codec directly. Look up the codec by name, and you get back a tuple of objects; the second is the decode function for that codec:
import codecs
enc, dec = codecs.lookup('iso-8859-1')[:2]
unicode_obj = dec(string_obj)[0]
Note the [0] index applied to the result from the codec. Decode functions return a tuple of the resulting Unicode object and the number of bytes that were read from the string in order to perform the decode operation. All you want is the first item.
And Strings Out of Unicode
As you read XML using Python tools, you should expect to get all the bits and pieces in the form of Unicode objects. You should certainly keep these as Unicode as long as possible, especially if they will eventually make their way back into another XML document, or some other internationalized usage. Sometimes, however, you'll need to use them in string form, and you'll need to reverse the operation discussed in the previous section. The most straightforward way to convert a Unicode object to a string is the encode method on the Unicode object:
string_obj = unicode_obj.encode('iso-8859-1')
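One wrinkle worth knowing: encoding fails if the Unicode object contains characters the target encoding cannot represent, and the optional errors argument controls what happens then. A sketch (modern Python syntax assumed):

```python
# The horizontal ellipsis has no representation in iso-8859-1,
# so a plain encode raises UnicodeEncodeError.
try:
    u'\u2026'.encode('iso-8859-1')
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# The errors argument offers fallbacks: 'replace' substitutes '?',
# while 'xmlcharrefreplace' emits an XML character reference,
# often the most useful choice when generating XML.
assert u'\u2026'.encode('iso-8859-1', 'replace') == b'?'
assert u'\u2026'.encode('iso-8859-1', 'xmlcharrefreplace') == b'&#8230;'
```

The 'xmlcharrefreplace' handler is a handy escape hatch when you must emit a legacy encoding but cannot afford to lose characters.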
You can also use the full codecs module. The first item in the tuple returned by codecs.lookup() is an encode function:
import codecs
enc, dec = codecs.lookup('iso-8859-1')[:2]
string_obj = enc(unicode_obj)[0]
Again, you want only the first item of the encode function's return value; the second is the number of characters that were encoded from the given Unicode object.
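As a quick sanity check of the tuple-returning behavior of both codec functions, here is a sketch (modern Python, where the encode function returns bytes; the lookup-and-slice idiom works the same way as in the Python 2 code above):

```python
import codecs

# codecs.lookup returns the codec's functions; the first two
# are the encode and decode functions, respectively.
enc, dec = codecs.lookup('iso-8859-1')[:2]

# Each returns a (result, length) pair, not the bare result.
encoded = enc(u'\u00f7')
assert encoded == (b'\xf7', 1)

decoded = dec(b'\xf7')
assert decoded == (u'\xf7', 1)

# Round trip: item [0] is the payload you actually want.
assert dec(enc(u'caf\xe9')[0])[0] == u'caf\xe9'
```

Forgetting the [0] is a common slip; the resulting tuple usually fails loudly a few calls later, far from the actual mistake.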
I'll continue this discussion in the next column with coverage of using Unicode with file-like objects, and more. I encourage you to digest at least some of the material I presented in the sidebar, which will give you a head start on the remaining discussion, and will certainly make you a better XML and Python hacker overall.
Meanwhile, in my usual round-up of happenings in the Python-XML community, I start with XIST 2.9. I covered XIST in the recent article Writing and Reading XML with XIST. It is a very capable, open-source package for XML and HTML processing and generation. The long list of changes is given in Walter Dörwald's announcement.
Also in Python and XML
Henry Thompson announced XSV 2.10-1. XML Schema Validator (XSV) is a GPLed W3C XML Schema (WXS) validator written in Python, and the engine behind the W3C's WXS validation service. There seem to be a lot of changes, but it's hard to be clear exactly which ones apply to the current release, so see the web page for more information.
Sylvain Thénault announced LogiLab's xmldiff 0.6.7, "a utility for extracting differences between two XML files. It returns a set of primitives to apply on source tree to obtain the destination tree." Xmldiff uses XUpdate, which I discussed in a recent weblog entry, to represent differences between XML documents. You can then use an XUpdate tool to "patch" XML files with the diff.
For francophone readers, "Rémi" announced: "I was allowed to open a new Wiki page with the modest knowledge I've picked up of Python and XML." Wiki : Python et XML is a French language resource for Python-XML tools and techniques.
Unicode Resources for Python and XML
I covered a lot of the issues in a very step-by-step manner in my earlier article, Proper XML Output in Python
Marc-André Lemburg has a tutorial PDF slide set that serves as a great beginner's guide to Unicode in Python.
Chapter nine of Dive Into Python by Mark Pilgrim covers XML processing, and section 9.4. Unicode focuses on the topic of this article. The book is a freely available electronic text, and is also available in paper form, which you should certainly consider purchasing if you find the online resource useful.
Jason Orendorff has a brief but useful article, Unicode in Python, which is part of a series looking at Unicode in a variety of languages and environments. He also has a section called Unicode in HTML and XML.
Evan Jones offers some notes, How to Use UTF-8 with Python, focusing on the UTF-8 encoding and minidom.
Fredrik Lundh offers some concise and practical tips and techniques in Python Unicode Objects.
One useful general resource is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky.
Andy Robinson provides a nice overview in his Python Unicode Tutorial. It is a little bit out of date, but there are many code samples, almost all of which should still work.
PEP 100 by Marc-Andre Lemburg formally describes the Python Unicode system.
Roundup: Another XSLT filter
Since you mentioned the CherryPy filter in the roundup, I thought I'd mention this recent filter, which interprets XSLT (this one is built on the WSGI standard):
Note from John Cowan
2005-06-02 14:58:31 Uche Ogbuji
John Cowan had problems posting, but he has an important note, so here it is:
"It is very important for Western European and American Windows users to set the local conversion mode to CP-1252, not ISO 8859-1. Using 8859-1 means that the Windows characters at 0x80-0x9F (curly quotes, s-hacek, ellipsis, etc.) get converted to U+0080 to U+009F, which are valid but useless. Using CP-1252 gets them converted to Unicode correctly."
2005-05-20 05:16:42 Uche Ogbuji
That also works. I considered mentioning it, but I thought it wasn't all that necessary having pointed out unicode(). My intent was to go from a simple technique (the Unicode conversion function) to the full artillery (the codecs module), so that after reading the article the user has both options. After reading your comment, they should have three. Thanks.
Why not use:
unicode_obj = string_obj.decode('iso-8859-1')
instead of:
dec = codecs.lookup('iso-8859-1')
unicode_obj = dec(string_obj)