
More Unicode Secrets

June 15, 2005

Uche Ogbuji

In the last article I started a discussion of the Unicode facilities in Python, especially with XML processing in mind. In this article I continue the discussion. I do want to mention that I don't claim these articles to be an exhaustive catalogue of Unicode APIs; I focus on the Unicode APIs I tend to use most in my own XML processing. You should follow up these articles by looking at the further resources I mentioned in the first article.

I also want to mention another general principle to keep in mind: if possible, use a Python install compiled to use UCS4 character storage. When you configure Python before building it, you can choose whether it stores Unicode characters using (informally speaking) a two-byte or a four-byte internal form, UCS2 or UCS4. UCS2 is the default, but you can override this by passing the --enable-unicode=ucs4 flag to configure. UCS4 uses more space to store characters, but UCS2 causes some problems for XML processing, which the Python core team is reluctant to address because the only known fixes would be too much of a burden on performance. Luckily, most distributors have heeded this advice and ship UCS4 builds of Python.
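If you're not sure how a particular interpreter was built, a quick check of sys.maxunicode will tell you:

import sys

# sys.maxunicode is 65535 (0xFFFF) on a UCS2 ("narrow") build
# and 1114111 (0x10FFFF) on a UCS4 ("wide") build
if sys.maxunicode > 0xffff:
    print 'UCS4 (wide) build'
else:
    print 'UCS2 (narrow) build'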

Wrapping Files

In the last article I showed how to manage conversions from strings to Unicode objects. In dealing with XML APIs you often deal with file-like objects (stream objects) as well. Most file systems and stream representations are byte-oriented rather than character-oriented, which means that Unicode must be encoded for file output and that file input must be decoded for interpretation as Unicode. Python provides facilities for wrapping stream objects so that such conversions are largely transparent. Consider the codecs.open function.


import codecs

f = codecs.open('utf8file.txt', 'w', 'utf-8')
f.write(u'abc\u2026')
f.close()

The first two arguments to codecs.open are just like the arguments to the built-in function open. The third argument is the encoding name. The return value is a wrapped file object. You then use the write method, passing in Unicode objects, which are encoded as specified and written to the file. I can't possibly reiterate the distinction between bytes and characters enough. Look closely at what is written to the file in the snippet above.

>>> len(u'abc\u2026')
4

There are four characters: three lowercase letters and the horizontal ellipsis symbol. Examine the resulting file. I use hexdump on Linux. There are many similar utilities on all operating systems.

$ hexdump -c utf8file.txt
0000000   a   b   c 342 200 246
0000006

This means that there are six bytes in the file. The first three are as you would expect, and the last three together encode a single Unicode character in UTF-8 form (the bytes are given in octal form above; in hex form they are e2 80 a6). If you were to read this file with a tool that was not aware that it is UTF-8 encoded, it might misinterpret the contents; detecting the encoding of a file is a hard problem in general. (See Rick Jelliffe's article, referenced in the sidebar, for more discussion of this issue.)
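You can confirm those byte values from the Python prompt with the string's encode method:

>>> u'abc\u2026'.encode('utf-8')
'abc\xe2\x80\xa6'
>>> [hex(ord(b)) for b in u'\u2026'.encode('utf-8')]
['0xe2', '0x80', '0xa6']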

Understanding BOMs

Some encodings have additional details you have to keep in mind. The following code creates a file with the same characters, but encoded in UTF-16.

import codecs

f = codecs.open('utf16file.txt', 'w', 'utf-16')
f.write(u'abc\u2026')
f.close()

Examine the contents of the resulting file. If you're using hexdump, this time it's more useful to choose a different output formatting option, which displays hexadecimal bytes alongside an ASCII column.

$ hexdump -C utf16file.txt
00000000  ff fe 61 00 62 00 63 00  26 20                    |..a.b.c.& |
0000000a

There are 10 bytes in this case. In UTF-16 most characters are encoded in two bytes each. The four Unicode characters are encoded into eight bytes, which are the last eight in the file. This leaves the first two bytes unaccounted for. Unicode provides a way of flagging an encoded stream to specify the order in which the bytes of each character should be read. The flag takes the form of a special code point, the "byte order mark" (BOM), which is itself encoded at the start of the stream. This is necessary in part because different machines use different means of ordering "words" (pairs of consecutive bytes starting at even machine addresses) and "double words" (pairs of consecutive words starting at machine addresses divisible by four). The difference in word order is all that is relevant in the case of UTF-16.

If you were to place the last eight bytes from the above example in a file and send it from a machine with one byte ordering to a machine with another, programming tools (including Python code) would read the two bytes of each character in the wrong order, scrambling the contents. Unicode uses BOMs to mark byte order so that machines with different orderings can still figure out the right way to read characters. The BOM in this file comprises the bytes ff and fe, which completes the puzzle of the contents generated in the above example. The position of the ff byte signals the least significant byte and fe signals the most significant. You can see how this works by looking at the next word, 61 00. By following the BOM you can tell that 61 is least significant and 00 is most significant. This happens to be what is called little-endian byte order (which is usual for Intel machines). Many other machines, including those based on Motorola microprocessors, use big-endian byte order, in which case the bytes of the BOM, as well as those of all the other characters, would be reversed. Unicode tools know how to look for and interpret the BOM in files, and the above file contents should be properly interpreted by any UTF-16-aware tool, even one written in a language other than Python.
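If you ever need to control the byte order yourself, Python also provides the explicit 'utf-16-le' and 'utf-16-be' codecs, which write no BOM, and the codecs module exposes the BOM byte sequences as constants. A quick illustration (the output of the plain 'utf-16' codec in the last line assumes a little-endian machine):

import codecs

print repr(codecs.BOM_UTF16_LE)    # '\xff\xfe'
print repr(codecs.BOM_UTF16_BE)    # '\xfe\xff'

# Explicit byte order; no BOM is written
print repr(u'abc\u2026'.encode('utf-16-le'))    # 'a\x00b\x00c\x00& '
print repr(u'abc\u2026'.encode('utf-16-be'))    # '\x00a\x00b\x00c &'

# The plain 'utf-16' codec prepends a BOM and uses the machine's native order
print repr(u'abc\u2026'.encode('utf-16'))       # '\xff\xfea\x00b\x00c\x00& '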

Deciding which encoding to choose is a complex issue, although I recommend that you stick to UTF-8 and UTF-16 for uses associated with XML processing. One consideration that might help you choose between these two is that UTF-8 tends to use fewer bytes when encoding text heavy in European and Middle-Eastern characters and some Asian scripts, while UTF-16 tends to use fewer bytes when encoding text heavy in Chinese, Japanese, Korean, Vietnamese (the "CJKV" languages) and the like.
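A quick, unscientific illustration of that trade-off, using sample strings of my own and the BOM-less 'utf-16-le' codec so that the counts reflect only the character data:

>>> european = u'R\u00e9sum\u00e9'    # six characters, mostly Latin
>>> cjk = u'\u65e5\u672c\u8a9e'       # three CJK characters
>>> len(european.encode('utf-8')), len(european.encode('utf-16-le'))
(8, 12)
>>> len(cjk.encode('utf-8')), len(cjk.encode('utf-16-le'))
(9, 6)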

You can use codecs.open again for reading the files created above:

import codecs

f = codecs.open('utf8file.txt', 'r', 'utf-8')
u1 = f.read()
f.close()

f = codecs.open('utf16file.txt', 'r', 'utf-16')
u2 = f.read()
f.close()

assert u1 == u2

Again Python takes care of all the BOM details transparently.

Wrapping File-like Objects

codecs.open does the trick for wrapping files, but not other types of stream objects (such as sockets or StringIO string buffers). You can handle these with wrapper classes obtained from the codecs.lookup function. In the last article I showed how to use this function to get encoding and decoding routines (the first two items in the returned tuple).

import codecs
import cStringIO

enc, dec, reader, writer = codecs.lookup('utf-8')
buffer = cStringIO.StringIO()
# Wrap the buffer for automatic encoding
wbuffer = writer(buffer)
content = u'abc\u2026'
wbuffer.write(content)

bytes = buffer.getvalue()
# Create the buffer afresh, with the bytes written out
buffer = cStringIO.StringIO(bytes)

# Wrap the buffer for automatic decoding
rbuffer = reader(buffer)
content = rbuffer.read()

print repr(content)

In this example I've completed a round trip from a Unicode object to an encoded byte string, which was built using a StringIO object, and back to a Unicode object read in from the byte string.

If you need just one of these functions from codecs.lookup, and don't want to bother with the other three, you can get it directly using the functions codecs.getencoder, codecs.getdecoder, codecs.getreader, and codecs.getwriter.
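For example, codecs.getwriter returns the same stream writer class that appears last in the tuple returned by codecs.lookup, so the buffer-wrapping step above could be written this way:

import codecs
import cStringIO

utf8_writer = codecs.getwriter('utf-8')
buffer = cStringIO.StringIO()
wbuffer = utf8_writer(buffer)
wbuffer.write(u'abc\u2026')
print repr(buffer.getvalue())    # 'abc\xe2\x80\xa6'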

If you need to deal with stream objects you can read from and write to without having to close and reopen them (in a database storage scenario, for example), you'll want to look into the class codecs.StreamReaderWriter, which wraps separate codec reader and writer objects around a single stream to provide a combined object.
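Here is a minimal sketch of that combined wrapper, using a cStringIO buffer to stand in for whatever read/write stream you actually have:

import codecs
import cStringIO

utf8_reader = codecs.getreader('utf-8')
utf8_writer = codecs.getwriter('utf-8')

stream = cStringIO.StringIO()
# One object that encodes on write and decodes on read
rw = codecs.StreamReaderWriter(stream, utf8_reader, utf8_writer)
rw.write(u'abc\u2026')
rw.seek(0)
print repr(rw.read())    # u'abc\u2026'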

Unicode Character Representation in Python and XML

XML and Python have different means of representing characters according to their Unicode code points. You have seen the horizontal ellipsis character above in Python Unicode form, \u2026, where "2026" is the character ordinal in hexadecimal. This is a 16-bit Python Unicode character escape. You can also use a 32-bit escape, marked by a capital "U": \U00002026. In XML you either use a decimal character escape format, &#8230;, where "8230" is just hex "2026" in decimal, or you can use hex directly: &#x2026;. Notice the added "x".
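The arithmetic is easy to check at the Python prompt:

>>> int('2026', 16)
8230
>>> ord(u'\u2026'), ord(u'\U00002026')
(8230, 8230)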

In XML you would use these escapes when you are using an encoding that does not allow you to enter a character literally. For example, XML allows you to include an ellipsis character even in a document encoded in plain ASCII, as illustrated in Listing 1. Since there is no way to express the character with code point U+2026 in ASCII, I use a character escape. A conforming XML application must be able to handle this document, reporting the right Unicode for the escaped character (and this is another good test for conformance of your tools); see the quick check after the listing.

Listing 1: XML file in ASCII encoding that uses a high character

<?xml version='1.0' encoding='us-ascii'?>
<doc>abc&#x2026;</doc>
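Any conforming parser hands the escaped character back to you as real Unicode. Here is a quick check using Python's own minidom, with the document from Listing 1 inlined as a string for convenience:

from xml.dom import minidom

XML_SOURCE = "<?xml version='1.0' encoding='us-ascii'?><doc>abc&#x2026;</doc>"
doc = minidom.parseString(XML_SOURCE)
# The text node comes back as a Unicode object containing the real character
print repr(doc.documentElement.firstChild.data)    # u'abc\u2026'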

Python can take care of such escaping for you. If you want to write out XML text, and you're using an encoding (ASCII, ISO-8859-1, EUC-JP, cp1252) that may not include all valid XML characters, you can take advantage of the ability of Python codecs to specify what happens on encoding errors.

>>> import codecs
>>> enc = codecs.getencoder('us-ascii')
>>> print enc(u'abc\u2026')[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 3: ordinal not in range(128)

You can avoid this error by specifying 'xmlcharrefreplace' as the error handler.

>>> print enc(u'abc\u2026', 'xmlcharrefreplace')[0]
abc&#8230;

More Unicode Resources for Python and XML

  • XML and Unicode expert Rick Jelliffe provides some good sense for dealing with Unicode in data formats with the somewhat deceptively titled "Unicode has too many characters". In this article he refers to yet another "Unicode for XML people" article "Entry-Level Unicode for XML", by Jon Hanna.
  • Python's handy unicodedata module is based on data files from The Unicode Consortium, but its documentation is a bit out of date. (Never mind the fact that Python's Unicode database is still based on Unicode 3.2.) This is the correct link to the raw Unicode Character Database (for Unicode version 4.1.0 at the time of writing). For people referring to the database, the best starting point is the HTML overview of the database.
  • I mentioned problems with UCS2 builds of Python. Eric van der Vlist first brought this to my attention a few years ago. There was a very long thread about this on the XML-SIG. I summarized the whole situation in my Akara article "Character issues in Python". See the part starting with "I would recommend that you always build Python with UCS4 support..."
  • Although it is a bit dated, the best resource available on processing with the "CJKV" languages is probably still the book CJKV Information Processing, by Ken Lunde (O'Reilly and Associates, 1998).

There are other available error handlers, but they are not as interesting for XML processing.
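For completeness, here is what a couple of those other handlers do with the same string:

>>> u'abc\u2026'.encode('ascii', 'replace')
'abc?'
>>> u'abc\u2026'.encode('ascii', 'backslashreplace')
'abc\\u2026'
>>> u'abc\u2026'.encode('ascii', 'ignore')
'abc'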

Conclusion

Let me reiterate that for each of the areas of interest I've covered in Python's Unicode support, there are additional nuances and possibilities that you might find useful. I've generally restricted the discussion to techniques that I have found useful when processing XML, and you should read and explore further in order to uncover even more Unicode secrets. Let me also say that even though some of the techniques I've gone over will enable you to generate correct XML, there is more to well-formedness than just getting the Unicode character model right. For example, there are some Unicode characters that are not allowed in XML documents, even in escaped form. I still recommend that you use one of the many tools I've discussed in this column for generating XML output.
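For instance, if you want a quick sanity check of whether a given code point is allowed by the XML 1.0 Char production at all, a simplified sketch like the following (checking one code point at a time) will do:

def is_xml_char(codepoint):
    # XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF]
    #                          | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    return (codepoint in (0x9, 0xA, 0xD)
            or 0x20 <= codepoint <= 0xD7FF
            or 0xE000 <= codepoint <= 0xFFFD
            or 0x10000 <= codepoint <= 0x10FFFF)

print is_xml_char(0x2026)    # True: the horizontal ellipsis is fine
print is_xml_char(0x0C)      # False: form feed is forbidden, even as &#xC;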

It's quiet time again in the Python-XML community. I did present some code snippets for reading a directory subtree and generating an XML representation (see "XML recursive directory listing, part 2"), as well as some Python/Amara equivalents of XQuery and XSLT 2.0 code. There has also been a lot of buzz about Google Sitemaps (currently in beta). Web site owners can create an XML representation of their site, including indicators of updated content. The Google crawlers then use this information to improve coverage of the indexed Web sites. The relevance to this column is that Google has developed sitemap_gen.py, a Python script that "analyzes your web server and generates one or more Sitemap files. These files are XML listings of content you make available on your web server. The files can then be directly submitted to Google." The code uses plain byte string buffer write operations to generate XML. I don't recommend this practice in general, but it seems that the subset of data the Google script includes in the XML file (URLs and last-modified dates) is safely in the ASCII subset. (Although as IRIs become more prevalent, this assumption may prove fragile.) It also uses xml.sax and minidom to read XML (mostly config files in the former case and test examples in the latter).