EaseXML: A Python Data-Binding Tool
by Uche Ogbuji
More on Unicode: Character Information
In the last two articles, Unicode Secrets and More Unicode Secrets, I discussed Python's Unicode facilities, from the point of view of XML processing. There is one more useful part of Python's Unicode libraries that I want to cover.
Unicode defines on the order of a hundred thousand characters, and the number grows with each version. The characters also have a complex internal structure: they are classified as alphabetic, digits, control codes, combining characters, and more, and they vary in collation (sorting), directionality, and other properties. It can be quite overwhelming, which is no surprise when you realize that Unicode aims to provide a computer representation for just about every writing system on the planet. Developers need all the tools they can get to deal with this rich variety. A useful but not all that well-known resource is Python's built-in Unicode database, in the unicodedata module. It is a Python API for the character database provided by the Unicode Consortium, the definitive catalog of all the characters in Unicode, along with standard properties for each.
Every character has a name, and you can learn what it is with the name function.
>>> import unicodedata
>>> unicodedata.name(u'a')
'LATIN SMALL LETTER A'
>>> unicodedata.name(u'\u1000')
'MYANMAR LETTER KA'
>>> unicodedata.name(u'\u00B0')
'DEGREE SIGN'
>>>
Notice that the names are returned as strings, not Unicode objects. All
Unicode character names use what you can informally call the ASCII subset.
You can basically reverse this operation, getting a Unicode character by
name, using the lookup function.
>>> unicodedata.lookup('DEGREE SIGN')
u'\xb0'
>>> unicodedata.lookup('LATIN SMALL LETTER A')
u'a'
>>>
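Both functions report failure with an exception: name raises ValueError for a character that has no name (control codes, for instance) unless you pass a default, and lookup raises KeyError for a name it does not know. Here is a minimal sketch of wrapping them defensively. It assumes Python 3, where every string is Unicode and the u prefix is optional, and the helper names safe_name and safe_lookup are my own inventions, not part of unicodedata.

```python
import unicodedata

def safe_name(ch, default='<unnamed>'):
    """Return the Unicode name of ch, or default if it has none."""
    # name() takes an optional second argument that is returned
    # instead of raising ValueError.
    return unicodedata.name(ch, default)

def safe_lookup(name):
    """Return the character called name, or None if there is no such name."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None

print(safe_name(u'\u00B0'))            # a named character
print(safe_name(u'\t'))                # a control code, which has no name
print(safe_lookup('DEGREE SIGN') == u'\u00B0')
print(safe_lookup('NO SUCH CHARACTER'))
```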
You can really put this database to work giving your programs super duper
powers of globalization, head and shoulders above the rest. For example, did
you know that the characters "0" through "9" are not the only form of digits
used in writing? Even though these European digit characters derive from
historical Arabic number representations, modern Arabic scripts use a different
set of characters sometimes called "Indic numerals" (Unicode names them
"Arabic-Indic digits"). These are distinct again from the digits used in
modern-day scripts from India. Is your head spinning yet? Unicode assigns each
of these digits the appropriate decimal value, and you can effortlessly derive
the decimal value of any digit, regardless of script, using the decimal function.
>>> unicodedata.decimal(u'0')
0
>>> unicodedata.decimal(u'\u0660')
0
>>> unicodedata.decimal(u'1')
1
>>> unicodedata.decimal(u'\u0661')
1
>>> #If you pass an invalid digit, it lets you know
>>> unicodedata.decimal(u'a')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: not a decimal
>>>
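To show how this might be put to work, here is a small sketch that turns a whole string of digits, in whatever script, into an integer by accumulating the value of each character. It assumes Python 3, and the function name parse_decimal is hypothetical, introduced only for illustration.

```python
import unicodedata

def parse_decimal(text):
    """Convert a string of decimal digits from any script to an int."""
    if not text:
        raise ValueError('empty string')
    value = 0
    for ch in text:
        # decimal() raises ValueError for anything that is not a
        # decimal digit, so bad input is rejected automatically.
        value = value * 10 + unicodedata.decimal(ch)
    return value

print(parse_decimal(u'2005'))                 # European digits
print(parse_decimal(u'\u0661\u0662\u0663'))   # Arabic-Indic digits 1, 2, 3
print(parse_decimal(u'\u0967\u0968'))         # Devanagari digits 1, 2
```

Python's built-in int() applies the same decimal values when handed a Unicode string, so the explicit loop is here mainly to make the mechanism visible.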
The digit and numeric functions are similar,
but there are some differences, and you should refer to the Unicode character
database for details. One obvious difference from the Python point of view
is that numeric returns floating point numbers.
Unicode organizes characters into general categories, such as "Letter,
Lowercase" (abbreviation "Ll"), "Symbol, Currency" (abbreviation "Sc"),
"Punctuation, Connector" (abbreviation "Pc"), "Separator, Space" (abbreviation
"Zs"), etc. (A code such as "AL", for "Right-to-Left Arabic", belongs to a
separate property, the bidirectional class, rather than to the general category.)
These categories are important for many character-processing cases. As an
example, you might want to be specific about what you mean by "white space" when
writing Unicode-aware applications. There are more space characters than just
the familiar space, newline, carriage return, and tab from ASCII, or the
nonbreaking space from HTML. Interestingly, some of the characters we think of
as spaces, such as tab, are categorized as control codes in Unicode, and XML's
own treatment of characters often doesn't fall along neat lines of Unicode
categories.
You can find the category of any character using the category function.
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'\u00B0') #DEGREE SIGN
'So'
>>> unicodedata.category(u'\t')
'Cc'
>>> unicodedata.category(u'$')
'Sc'
>>>
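As an illustration of the white space point, the following sketch scans part of the code space and collects every character whose category is "Zs". It assumes Python 3; the function name space_separators and the default scan limit of U+3000 are my own choices. The exact list you get depends on the version of the Unicode database bundled with your Python.

```python
import unicodedata

def space_separators(limit=0x3000):
    """Return the characters up to limit in category Zs (Separator, Space)."""
    return [chr(cp) for cp in range(limit + 1)
            if unicodedata.category(chr(cp)) == 'Zs']

spaces = space_separators()
for ch in spaces:
    print('U+%04X %s' % (ord(ch), unicodedata.name(ch)))

# Tab and newline are control codes (Cc), so they are not in the list.
print(u'\t' in spaces, u'\n' in spaces)
```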
There are other functions in unicodedata, but I'll leave them
to the reader's attention.
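As a starting point for that exploration, here is a brief sketch comparing decimal with the digit and numeric functions mentioned earlier, plus the bidirectional function, which reports a character's directionality class. It assumes Python 3 and relies on the optional default argument these three numeric functions accept; the helper name numeric_views is hypothetical.

```python
import unicodedata

def numeric_views(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined."""
    return (unicodedata.decimal(ch, None),
            unicodedata.digit(ch, None),
            unicodedata.numeric(ch, None))

print(numeric_views(u'7'))        # an ordinary decimal digit
print(numeric_views(u'\u00B2'))   # SUPERSCRIPT TWO: a digit, but not a decimal
print(numeric_views(u'\u00BD'))   # VULGAR FRACTION ONE HALF: numeric only

# ARABIC LETTER ALEF is category Lo, with bidirectional class AL.
print(unicodedata.category(u'\u0627'), unicodedata.bidirectional(u'\u0627'))
```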
From the Community
I mentioned the CJKV writing systems and encodings of the Pacific Rim in my last article. There are many non-Unicode character encodings in heavy use in these regions. There have been several third-party packages supporting these encodings, and Python 2.4 incorporates codecs based on a patch by Hye-Shik Chang. These support the following encodings:
- Chinese: gb2312, gbk, gb18030, big5hkscs, hz, big5, cp950
- Japanese: cp932, euc-jis-2004, euc-jp, euc-jisx0213, iso-2022-jp, iso-2022-jp-1, iso-2022-jp-2, iso-2022-jp-3, iso-2022-jp-ext, iso-2022-jp-2004, shift-jis, shift-jisx0213, shift-jis-2004
- Korean: cp949, euc-kr, johab, iso-2022-kr
Python 2.4 also adds a few other non-CJK encodings, and I recommend that everyone who is serious about internationalization upgrade to this version as soon as possible.
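Using these codecs takes nothing more than the usual encode and decode methods. The sketch below, which assumes Python 3, round-trips a few short sample strings through some of the encodings listed above. The sample strings and the helper name round_trip are mine, chosen only for illustration.

```python
import codecs

SAMPLES = [
    (u'\u4e2d\u6587', 'gb2312'),          # Chinese
    (u'\u4e2d\u6587', 'big5'),
    (u'\u65e5\u672c\u8a9e', 'euc-jp'),    # Japanese
    (u'\u65e5\u672c\u8a9e', 'shift-jis'),
    (u'\ud55c\uad6d\uc5b4', 'euc-kr'),    # Korean
]

def round_trip(text, encoding):
    """Encode text, decode it again, and return (bytes, decoded text)."""
    encoded = text.encode(encoding)
    return encoded, encoded.decode(encoding)

for text, encoding in SAMPLES:
    encoded, decoded = round_trip(text, encoding)
    # codecs.lookup() reports the canonical name Python uses for the codec.
    print(codecs.lookup(encoding).name, len(encoded), decoded == text)
```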
Christof Hoeke has been busy lately. He has developed encutils for Python 0.2, which is a library for dealing with the encodings of files obtained over HTTP, including XML files. It does not yet implement an algorithm for sniffing an XML encoding from its declaration, but I expect he should be able to add this easily enough using the well-known algorithms for this task (notably the one described by John Cowan), which are the basis for an older Python Cookbook recipe by Paul Prescod and a newer recipe by Lars Tiede. Christof also released pyxsldoc 0.69, "an application to produce documentation for XSLT files in XHTML format, similar to what javadoc does for Java files." See the announcements for encutils and pyxsldoc.
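The declaration-sniffing step mentioned above can be approximated in a few lines. What follows is only a rough sketch under stated assumptions. It is not the encutils API, and it is not the full algorithm Cowan describes. It checks for a byte order mark, then looks for an encoding pseudo-attribute in an XML declaration that is readable as ASCII, and otherwise falls back to UTF-8. It does not handle declarations in UTF-16 without a byte order mark, or in EBCDIC. It assumes Python 3, and the name sniff_xml_encoding is hypothetical.

```python
import codecs
import re

# Byte order marks, with the UTF-32 ones first so that they are not
# mistaken for their UTF-16 prefixes.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

# Matches the encoding pseudo-attribute of an ASCII-readable XML declaration.
_DECL = re.compile(
    br'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

def sniff_xml_encoding(data, default='utf-8'):
    """Guess the encoding of an XML document given as bytes (simplified)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data)
    if match:
        return match.group(1).decode('ascii').lower()
    return default

print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(sniff_xml_encoding(b'<a/>'))
```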
I discovered Ken Rimey's Personal Distributed Information Store (PDIS), which includes some XML tools for Nokia's Series 60 phones, which offer Python support. These tools include an XML parser based on PyExpat and an XPath implementation based on elementtree.