More Unicode Secrets
In the last article I started a discussion of the Unicode facilities in Python, especially with XML processing in mind. In this article I continue the discussion. I don't claim these articles to be an exhaustive catalogue of Python's Unicode facilities; I focus on the APIs I tend to use most in my own XML processing. You should follow up these articles by looking at the further resources I mentioned in the first article.
I also want to mention another general principle to keep in
mind: if possible, use a Python install compiled to
use UCS4 character storage. When you configure Python before
building it, you can choose whether it stores Unicode
characters using (informally) a two-byte or a four-byte encoding,
UCS2 or UCS4. UCS2 is the default, but you can override this by
passing the --enable-unicode=ucs4 flag
to configure. UCS4 uses more space to store
characters, but there are some problems for XML processing in UCS2,
which the Python core team is reluctant to address because the only
known fixes would be too much of a burden on performance. Luckily,
most distributors have heeded this advice and ship UCS4 builds of
Python.
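You can check which kind of build you are running without recompiling anything. Here is a minimal check (my own habit, not from the article): sys.maxunicode reports the highest code point a single Python Unicode character can hold, which is 65535 (0xFFFF) on UCS2 builds and 1114111 (0x10FFFF) on UCS4 builds.
import sys
#0xFFFF means a UCS2 ("narrow") build; 0x10FFFF means a UCS4 ("wide") build
if sys.maxunicode > 0xFFFF:
    print 'UCS4 (wide) build'
else:
    print 'UCS2 (narrow) build'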
In the last article I showed how to manage conversions from strings to Unicode
objects. In dealing with XML APIs you often deal with file-like objects (stream
objects) as well. Most file systems and stream representations are byte-oriented
rather than character-oriented, which means that Unicode must be encoded
for file output and that file input must be decoded for interpretation as
Unicode. Python provides facilities for wrapping stream objects so that such
conversions are largely transparent. Consider the codecs.open function.
import codecs
f = codecs.open('utf8file.txt', 'w', 'utf-8')
f.write(u'abc\u2026')
f.close()
The first two arguments to codecs.open are just like the arguments
to the built-in function open. The third argument is the encoding
name. The return value is a wrapped file object. You then use its write method,
passing in Unicode objects, which are encoded as specified and written to
the file. I can't possibly reiterate the distinction between bytes and characters
enough. Look closely at what is written to the file in the snippet above.
>>> len(u'abc\u2026')
4
>>>
There are four characters: three lowercase letters and the horizontal ellipsis
symbol. Examine the resulting file. I use hexdump on Linux.
There are many similar utilities on all operating systems.
$ hexdump -c utf8file.txt
0000000 a b c 342 200 246
0000006
This means that there are six bytes in the file. The first three are as
you would expect, and the last three together encode a single Unicode
character in UTF-8 form (the bytes are given in octal form above; in hex
form they are e2 80 a6). If you were to read this file with
a tool that was not aware that it is a UTF-8 encoded file, it might misinterpret
the contents; such ambiguity is a general hazard when dealing with encoded files.
(See Rick Jelliffe's article, referenced in the sidebar, for more discussion
of this issue.)
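You can also verify those byte values directly in the interpreter (a quick check of my own), without reaching for hexdump:
>>> u'abc\u2026'.encode('utf-8')
'abc\xe2\x80\xa6'
>>> len(u'abc\u2026'.encode('utf-8'))
6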
Some encodings have additional details you have to keep in mind. The following code creates a file with the same characters, but encoded in UTF-16.
import codecs
f = codecs.open('utf16file.txt', 'w', 'utf-16')
f.write(u'abc\u2026')
f.close()
Examine the contents of the resulting file. If you're using hexdump,
this time it's actually more useful to use a different (hexadecimally-based)
output formatting option.
$ hexdump -C utf16file.txt
00000000 ff fe 61 00 62 00 63 00 26 20 |..a.b.c.& |
0000000a
There are 10 bytes in this case. In UTF-16 most characters are encoded in two bytes each. The four Unicode characters are encoded into eight bytes, which are the last eight in the file. This leaves the first two bytes unaccounted for. Unicode has a means of flagging an encoded stream in order to specify the order in which the bytes of each character should be read. The flag takes the form of an encoded special code point called the byte order mark (BOM). This is necessary in part because different machines use different orderings of "words" (pairs of consecutive bytes starting at even machine addresses) and "double words" (pairs of consecutive words starting at machine addresses divisible by four). The difference in word order is all that is relevant in the case of UTF-16.
If you were to place the latter eight bytes from the above example in
a file and send it from a machine with one byte ordering to a machine with
another type of ordering, programming tools (including Python code) would read the characters
backwards, scrambling the contents. Unicode uses BOMs to mark byte order
so that machines with different ordering will be able to figure out the right
way to read characters. The BOM for UTF-16 comprises the bytes ff and fe,
which completes the puzzle of the contents of the file generated in the above
example. The relative position of the ff byte signals the least
significant position and fe signals the most significant.
You can see how this works when looking at the next word 61 00.
By following the BOM you can tell that 61 is
least significant and 00 is most significant. This happens to
be what is called little-endian byte order (which is usual for Intel machines).
Many other machines, including Motorola microprocessors, use big-endian byte
order, and the order would be reversed in the BOM, as well as in all the other characters.
Unicode tools know how to look for and interpret the BOM in files, and the
above file contents should be properly interpreted by any UTF-16-aware tool,
even in a language other than Python.
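You can also observe the BOM by encoding in memory. The quick demonstration below is my own; note that the plain 'utf-16' codec writes a BOM and uses the machine's native byte order (the output shown is from a little-endian machine), while the 'utf-16-le' and 'utf-16-be' codecs fix the byte order explicitly and therefore write no BOM.
>>> u'abc\u2026'.encode('utf-16-le')
'a\x00b\x00c\x00& '
>>> u'abc\u2026'.encode('utf-16-be')
'\x00a\x00b\x00c &'
>>> u'abc\u2026'.encode('utf-16')[:2]
'\xff\xfe'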
Deciding which encoding to choose is a very complex issue, although I recommend that you stick to UTF-8 and UTF-16 for uses associated with XML processing. One consideration that might help you choose between these two is that UTF-8 tends to use fewer bytes when encoding text heavy in European and Middle-Eastern characters, and some Asian scripts, while UTF-16 tends to use fewer bytes when encoding text heavy in Chinese, Japanese, Korean, and Vietnamese (the "CJKV" languages) and the like.
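As a rough illustration of that size trade-off (my own quick check, not any kind of benchmark), compare the encoded lengths below; the UTF-16 figures include the two-byte BOM.
>>> ascii_text = u'hello world'
>>> len(ascii_text.encode('utf-8')), len(ascii_text.encode('utf-16'))
(11, 24)
>>> cjk_text = u'\u65e5\u672c\u8a9e\u306e\u6587\u66f8'
>>> len(cjk_text.encode('utf-8')), len(cjk_text.encode('utf-16'))
(18, 14)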
You can use codecs.open again for reading the files created
above:
import codecs
f = codecs.open('utf8file.txt', 'r', 'utf-8')
u1 = f.read()
f.close()
f = codecs.open('utf16file.txt', 'r', 'utf-16')
u2 = f.read()
f.close()
assert u1 == u2
Again Python takes care of all the BOM details transparently.
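For contrast, the plain built-in open gives you undecoded bytes; reading the same UTF-8 file that way (a quick side-by-side of my own, using the file created above) yields a six-byte string that you would still have to decode yourself.
>>> raw = open('utf8file.txt', 'rb').read()
>>> len(raw)
6
>>> len(raw.decode('utf-8'))
4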
codecs.open does the trick for wrapping files, but not other
types of stream objects (such as sockets or StringIO string
buffers). You can handle these with wrapper classes obtained from the codecs.lookup function.
In the last article I showed how to use this function to get encoding and decoding routines (the first two items in the returned tuple).
import codecs
import cStringIO
enc, dec, reader, writer = codecs.lookup('utf-8')
buffer = cStringIO.StringIO()
#Wrap the buffer for automatic encoding
wbuffer = writer(buffer)
content = u'abc\u2026'
wbuffer.write(content)
bytes = buffer.getvalue()
#Create the buffer afresh, with the bytes written out
buffer = cStringIO.StringIO(bytes)
#Wrap the buffer for automatic decoding
rbuffer = reader(buffer)
content = rbuffer.read()
print repr(content)
In this example I've completed a round trip from a Unicode object to an
encoded byte string, which was built using a StringIO object,
and back to a Unicode object read in from the byte string.
If you need just one of these four objects from codecs.lookup,
and don't want to bother with the other three, you can get it directly
using the functions codecs.getencoder, codecs.getdecoder, codecs.getreader, and codecs.getwriter.
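As an example of the direct route, a common idiom (my own sketch here, not code from the article) is to use codecs.getwriter to wrap sys.stdout so that Unicode objects are encoded to UTF-8 bytes as you write them.
import codecs
import sys
#codecs.getwriter returns the StreamWriter class for the encoding;
#instantiate it around any byte-oriented stream
utf8_stdout = codecs.getwriter('utf-8')(sys.stdout)
utf8_stdout.write(u'abc\u2026\n')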
If you need to deal with stream objects you can read from and write to without
having to close and reopen them (in a database storage scenario, for example), you'll
want to look into the class codecs.StreamReaderWriter, which
combines a codec's reader and writer classes into a single wrapper around one stream.
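Here is a minimal sketch of that combination object, again using a cStringIO buffer as the underlying stream so the example is self-contained (rewinding the buffer directly is my own shortcut; a database blob or socket would behave differently).
import codecs
import cStringIO
enc, dec, reader, writer = codecs.lookup('utf-8')
buffer = cStringIO.StringIO()
#One wrapper that encodes on write and decodes on read
rwbuffer = codecs.StreamReaderWriter(buffer, reader, writer)
rwbuffer.write(u'abc\u2026')
#Rewind the underlying byte buffer, then read the text back as Unicode
buffer.seek(0)
print repr(rwbuffer.read())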
XML and Python have different means of representing characters according
to their Unicode code points. You have seen the horizontal ellipsis character
above in Python Unicode form \u2026 where the "2026" is the
character ordinal in hexadecimal. This is a 16-bit Python Unicode character
escape. You can also use a 32-bit escape, marked by a capital "U", \U00002026.
In XML you either use a decimal character escape format, &#8230;,
where "8230" is just hex "2026" in decimal, or you can use hex directly: &#x2026;.
Notice the added "x".
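A quick interactive check (my own illustration) ties these notations together: the 16-bit and 32-bit Python escapes name the same character, and 8230 is simply hex 2026 expressed in decimal.
>>> u'\u2026' == u'\U00002026'
True
>>> int('2026', 16)
8230
>>> unichr(8230)
u'\u2026'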
In XML you would use these escapes when you are using an encoding that does not allow you to enter a character literally. As an example, XML allows you to include an ellipsis character even in a document that is encoded in plain ASCII, as illustrated in Listing 1. Since there is no way to express the character with code point U+2026 in ASCII, I use a character escape. A conforming XML application must be able to handle this document, reporting the right Unicode character for the escape (and this is another good test for the conformance of your tools).
Listing 1: XML file in ASCII encoding that uses a high character

<?xml version='1.0' encoding='us-ascii'?>
<doc>abc&#8230;</doc>
Python can take care of such escaping for you. If you want to write out XML text, and you're using an encoding (ASCII, ISO-8859-1, EUC-JP, cp1252, and so on) that may not include all valid XML characters, you can use a special ability of Python codecs to specify how encoding errors are handled.
>>> import codecs
>>> enc = codecs.getencoder('us-ascii')
>>> print enc(u'abc\u2026')[0]
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 3:
ordinal not in range(128)
You can avoid this error by specifying 'xmlcharrefreplace' as
the error handler.
>>> print enc(u'abc\u2026', 'xmlcharrefreplace')[0]
abc&#8230;
There are other available error handlers, but they are not as interesting for XML processing.
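For completeness, here is how two of the other standard handlers behave with the same ASCII encoder (my own quick illustration): 'replace' substitutes a question mark and 'ignore' silently drops the character, neither of which preserves the information the way 'xmlcharrefreplace' does.
>>> print enc(u'abc\u2026', 'replace')[0]
abc?
>>> print enc(u'abc\u2026', 'ignore')[0]
abc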
Let me reiterate that for each of the areas of interest I've covered in Python's Unicode support, there are additional nuances and possibilities that you might find useful. I've generally restricted the discussion to techniques that I have found useful when processing XML, and you should read and explore further in order to uncover even more Unicode secrets. Let me also say that even though some of the techniques I've gone over will enable you to generate correct XML, there is more to well-formedness than just getting the Unicode character model right. For example, there are some Unicode characters that are not allowed in XML documents, even in escaped form. I still recommend that you use one of the many tools I've discussed in this column for generating XML output.
It's quiet time again in the Python-XML community. I did present some code snippets for reading a directory subtree and generating an XML representation (see "XML recursive directory listing, part 2"), as well as some Python/Amara equivalents of XQuery and XSLT 2.0 code. There has also been a lot of buzz about Google Sitemaps (currently in beta). Web site owners can create an XML representation of their site, including indicators of updated content. The Google crawlers then use this information to improve coverage of the indexed Web sites. The relevance to this column is that Google has developed sitemap_gen.py, a Python script that "analyzes your web server and generates one or more Sitemap files. These files are XML listings of content you make available on your web server. The files can then be directly submitted to Google." The code uses plain byte string buffer write operations to generate XML. I don't recommend this practice in general, but it seems that the subset of data the Google script includes in the XML file (URLs and last modified dates) is safely in the ASCII subset. (Although as IRIs become more prevalent, this assumption may prove fragile.) It also uses xml.sax and minidom to read XML (mostly the config files in the former case and examples for testing in the latter).