"the code for testing the equality in "non-extractive" parsing is
String s1;
byte[] xml;
if (s1.charAt(i) == xml[k])"..." should work whether it is UTF-8 encoded or a character reference. "
Nope. Here's code to test it in Java; the comparisons at the end are variations on your theme.
import java.io.UnsupportedEncodingException;

public class EuroTest {
    public static void main(String[] argv) throws UnsupportedEncodingException {
        // The euro sign (U+20AC): three bytes in UTF-8, one byte in ISO-8859-15.
        byte[] utf8b = new byte[] {(byte) 0xe2, (byte) 0x82, (byte) 0xAC};
        byte[] latin9b = new byte[] {(byte) 0xA4};
        String utf8 = new String(utf8b, 0, utf8b.length, "utf-8");
        String latin9 = new String(latin9b, 0, latin9b.length, "iso-8859-15");
        // Once decoded, the strings are equal whatever the source encoding was.
        System.out.println(latin9 + "=" + utf8 + " " + latin9.equals(utf8));
        printComparison(latin9.charAt(0), utf8.charAt(0));
        // Decoded character against raw byte, as in the quoted snippet.
        printComparison(latin9.charAt(0), latin9b[0]);
        printComparison(latin9.charAt(0), utf8b[0]);
        printComparison(utf8.charAt(0), latin9b[0]);
        printComparison(utf8.charAt(0), utf8b[0]);
    }

    private static void printComparison(char c1, char c2) {
        System.out.println(Integer.toHexString(c1) + "=" + Integer.toHexString(c2) + " ? " + (c1 == c2));
    }

    // The byte is masked only for printing; the comparison uses the raw byte value.
    private static void printComparison(char c1, byte c2) {
        System.out.println(Integer.toHexString(c1) + "=" + Integer.toHexString(0xFF & c2) + " ? " + (c1 == c2));
    }
}
The output:
€=€ true
20ac=20ac ? true
20ac=a4 ? false
20ac=e2 ? false
20ac=a4 ? false
20ac=e2 ? false
In other words, a character-to-character comparison works, but a character-to-byte comparison doesn't. This is the case for any non-ASCII character encoded in UTF-8, but it's just the tip of the iceberg in character comparison problems, since Unicode permits multiple representations of the same character (a precomposed é versus an e followed by a combining accent, for example).
In your defence, a byte-to-char comparison isn't necessary for your non-extractive parser; but I've already pointed out other problems with that.
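To make both points concrete, here's a minimal sketch of comparisons that do hold up. The class `EuroCompare` and its helper names are made up for illustration. The first helper decodes the bytes with the encoding they were written in and only then compares characters; the second normalizes both strings first, because é can be a single code point (U+00E9) or an e followed by a combining accent (U+0301).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class EuroCompare {
    // Hypothetical helper: decode the bytes with their declared encoding,
    // then compare characters rather than raw bytes.
    static boolean sameText(String s, byte[] encoded, Charset cs) {
        return s.equals(new String(encoded, cs));
    }

    // Hypothetical helper: bring both strings to NFC before comparing, so that
    // precomposed and combining forms of the same character are treated as equal.
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        byte[] utf8b = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        byte[] latin9b = {(byte) 0xA4};
        String euro = "\u20AC";

        // Both print true: the decoded text is the same euro sign.
        System.out.println(sameText(euro, utf8b, StandardCharsets.UTF_8));
        System.out.println(sameText(euro, latin9b, Charset.forName("ISO-8859-15")));

        String precomposed = "\u00E9"; // é as one code point
        String combining = "e\u0301";  // e + combining acute accent
        System.out.println(precomposed.equals(combining));        // false
        System.out.println(sameAfterNfc(precomposed, combining)); // true
    }
}
```

Neither helper works on undecoded bytes, which is the point: you need to know the encoding, and decide on a normalization form, before "equal" means anything.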