|
The fact that len(u'\U00010800') is 2 on a "UCS2" Python (which actually implements UTF16 rather than UCS2) *is not* a bug. It's correct behaviour, and if your code can't handle the fact that one user-visible "character" (in Unicode-speak, a grapheme cluster) can equate to more than one code unit in a string, it's broken *anyway*.
Python makes this quite clear in the Language Reference, where it says:
"Surrogate pairs may be present in the Unicode object, and will be reported as two separate items."
(see <http://www.python.org/doc/current/ref/types.html#l2h-59>)
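If you want to see the two code units for yourself on an interpreter that doesn't expose them through len(), here is a minimal sketch. It assumes Python 3.3 or later, where strings are indexed by code point, so the text has to be encoded to UTF-16 before counting. The helper name utf16_code_units is my own, not anything from the standard library.

```python
def utf16_code_units(text):
    """Return how many 16-bit code units `text` occupies in UTF-16.

    Hypothetical helper: this is the figure a narrow ("UCS2") build
    reports from len().
    """
    # "utf-16-le" emits no byte order mark, and every code unit is 2 bytes.
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    # U+10800 lies outside the Basic Multilingual Plane, so UTF-16
    # has to represent it as a surrogate pair.
    print(utf16_code_units("\U00010800"))
```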
Why is your code broken if you rely on a one-to-one mapping? Well, it's true that UTF16 encodes Unicode code points outside the Basic Multilingual Plane using two 16-bit values, but even if you're using UCS4, you can still end up with one grapheme cluster equating to more than one code unit in the string itself. For instance:
>>> s = u'\u0065\u0310'
>>> len(s)
2
>>> print s
e?
The above *should* render as a lower case letter 'e' with a candrabindu. len(s) is 2 for both "UCS2" and UCS4 Python.
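If what you actually care about is user-visible characters, you have to group the combining marks yourself. What follows is only a rough sketch, assuming Python 3 and the standard unicodedata module. It attaches every character with a non-zero canonical combining class to the base character before it, which is an approximation and not the full grapheme cluster algorithm from Unicode Standard Annex #29 (it ignores things like Hangul jamo and zero-width joiners). The function name approx_clusters is hypothetical.

```python
import unicodedata


def approx_clusters(text):
    """Group each base character with the combining marks that follow it.

    Approximation only: a character counts as a combining mark here when
    its canonical combining class is non-zero.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # Combining mark: attach it to the preceding cluster.
            clusters[-1] += ch
        else:
            # Base character (or a stray mark at the very start).
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    s = "\u0065\u0310"  # 'e' followed by COMBINING CANDRABINDU
    print(len(s), len(approx_clusters(s)))
```

On the string from the example, this reports two code points but one cluster, and it does so regardless of how the interpreter stores the string internally.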
I'll also add that compiling Python in UCS4 mode doesn't seem like a very clever idea. Yes, UCS4 has a one-to-one mapping between Unicode code points and code units in the string, but because of combining characters that doesn't actually gain you anything (apart, possibly, from a false sense of security). It's also extremely wasteful of memory, not to mention incompatible with the majority of Unicode library code (which, for the most part, represents strings in UTF16 internally). The only benefit, it seems to me, is that it might make Python's Py_UNICODE the same as wchar_t on some host platforms, but given the level of breakage in many wchar_t implementations, I'm not sure whether that's really a good thing.
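To put a rough number on the memory point, the sketch below compares encoded sizes. It assumes Python 3 and uses the byte length of the "utf-16-le" and "utf-32-le" encodings as a stand-in for a 2-byte versus 4-byte Py_UNICODE buffer; it does not measure what any particular interpreter build really allocates. The name encoded_sizes is made up for the illustration.

```python
def encoded_sizes(text):
    """Return the byte length of `text` in UTF-16 and UTF-32.

    Hypothetical helper; the little-endian codecs are used so that no
    byte order mark inflates the counts.
    """
    return {
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    # Text inside the Basic Multilingual Plane costs twice as much
    # at four bytes per code unit.
    print(encoded_sizes("hello"))
    # A character outside the BMP costs four bytes either way.
    print(encoded_sizes("\U00010800"))
```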
p.s. You may ask why I posted this here given that the article is so old. The reason is that it's the very first thing that came up when I searched for UCS4 Python. It is therefore important that it not give the wrong impression to the reader.
|