|
Thanks for posting this question.
One way to deal with character encoding is to build the "intelligence" directly into the various non-extractive string comparison functions.
Most people are used to the UCS-2 string representation in their code. So a "non-extractive" comparison function needs to compare UTF-8 (or UTF-16) tokens in the source document against UCS-2 strings.
In addition, it may also resolve entity references on the fly during the comparison.
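To make this concrete, here is a rough sketch in Python of what such a comparison could look like. It assumes the document is held as a UTF-8 byte buffer, that a token is just an (offset, length) pair into that buffer, and that only the five predefined XML entities need resolving. The names `token_equals`, `next_char` and `next_code_point` are made up for illustration and are not taken from any particular parser's API.

```python
# Hypothetical helpers for illustration only.
PREDEFINED_ENTITIES = {b"amp": "&", b"lt": "<", b"gt": ">", b"quot": '"', b"apos": "'"}


def next_code_point(buf, pos):
    """Decode one UTF-8 code point starting at buf[pos].

    Returns (code_point, next_pos). Assumes well-formed UTF-8; no validation.
    """
    b0 = buf[pos]
    if b0 < 0x80:  # 1-byte sequence (ASCII)
        return b0, pos + 1
    if b0 < 0xE0:  # 2-byte sequence
        return ((b0 & 0x1F) << 6) | (buf[pos + 1] & 0x3F), pos + 2
    if b0 < 0xF0:  # 3-byte sequence
        return (((b0 & 0x0F) << 12) | ((buf[pos + 1] & 0x3F) << 6)
                | (buf[pos + 2] & 0x3F)), pos + 3
    # 4-byte sequence
    return (((b0 & 0x07) << 18) | ((buf[pos + 1] & 0x3F) << 12)
            | ((buf[pos + 2] & 0x3F) << 6) | (buf[pos + 3] & 0x3F)), pos + 4


def next_char(buf, pos, end):
    """Like next_code_point, but resolves a predefined entity reference at pos."""
    if buf[pos] == 0x26:  # '&'
        semi = buf.find(b";", pos + 1, end)
        if semi != -1:
            # Only the few bytes of the entity name are copied for the lookup.
            name = bytes(buf[pos + 1:semi])
            if name in PREDEFINED_ENTITIES:
                return ord(PREDEFINED_ENTITIES[name]), semi + 1
    return next_code_point(buf, pos)


def token_equals(buf, offset, length, expected):
    """Compare the token buf[offset:offset+length] with the string `expected`
    without decoding the token into a new string first."""
    pos, end = offset, offset + length
    for ch in expected:
        if pos >= end:
            return False  # token is shorter than the expected string
        code_point, pos = next_char(buf, pos, end)
        if code_point != ord(ch):
            return False
    return pos == end  # token must be fully consumed


doc = "<a>caf\u00e9 &amp; t\u00e9</a>".encode("utf-8")
start = doc.index(b">") + 1
stop = doc.rindex(b"<")
print(token_equals(doc, start, stop - start, "caf\u00e9 & t\u00e9"))
```

The token is decoded one character at a time and the comparison stops at the first mismatch, so no intermediate string is ever built for it. A UTF-16 source would need a different `next_code_point`, but `token_equals` itself would stay the same.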
The same applies to text-to-number conversion. A non-extractive version of "parseInt" needs to convert a UTF-8 (or UTF-16) token into an integer without "extracting" it out of the source document.
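A non-extractive "parseInt" could be sketched along the same lines. This is only an illustration: it assumes a UTF-8 byte buffer, a token given as (offset, length), and plain base-10 digits with an optional sign and surrounding whitespace. The name `parse_int_token` is hypothetical.

```python
def parse_int_token(buf, offset, length):
    """Parse a base-10 integer from buf[offset:offset+length] in place.

    ASCII digits are single bytes in UTF-8, so the bytes can be read directly.
    """
    pos, end = offset, offset + length

    # Trim whitespace on both sides by moving the bounds, not by copying.
    while pos < end and buf[pos] in b" \t\r\n":
        pos += 1
    while end > pos and buf[end - 1] in b" \t\r\n":
        end -= 1

    negative = False
    if pos < end and buf[pos] in b"+-":
        negative = buf[pos] == 0x2D  # '-'
        pos += 1

    if pos >= end:
        raise ValueError("no digits in token")

    value = 0
    while pos < end:
        digit = buf[pos] - 0x30  # distance from '0'
        if not 0 <= digit <= 9:
            raise ValueError("invalid digit at byte offset %d" % pos)
        value = value * 10 + digit
        pos += 1
    return -value if negative else value


doc = b"<qty> -42 </qty>"
start = doc.index(b">") + 1
stop = doc.rindex(b"<")
print(parse_int_token(doc, start, stop - start))
```

The integer is accumulated straight from the bytes of the source document, so no substring is allocated for the token. A UTF-16 source would read 16-bit code units instead of bytes, but the loop would otherwise look the same.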
Hope I answered your question.
|