
Article:
 Non-Extractive Parsing for XML
Subject: not new... or usable
Date: 2004-05-23 02:46:20
From: Brian Ewins
Response to: not new... or usable

"the code for testing the equality in "non-extractive" parsing is
String s1;
byte[] xml;
if (s1.charAt(i) == xml[k])"
...
"€ should work whether it is UTF-8 encoded or a character reference."


Nope. Here's code to test it in Java; the comparisons at the end are variations on your theme.


import java.io.UnsupportedEncodingException;

public class EuroTest {
    public static void main(String[] argv) throws UnsupportedEncodingException {
        // The euro sign as raw bytes in two different encodings
        byte[] utf8b = new byte[] {(byte) 0xe2, (byte) 0x82, (byte) 0xAC};
        byte[] latin9b = new byte[] {(byte) 0xA4};
        // Decoding either byte sequence yields the same character, U+20AC
        String utf8 = new String(utf8b, 0, utf8b.length, "utf-8");
        String latin9 = new String(latin9b, 0, latin9b.length, "iso-8859-15");
        System.out.println(latin9 + "=" + utf8 + " " + latin9.equals(utf8));
        // One char-to-char comparison, then four char-to-byte comparisons
        printComparison(latin9.charAt(0), utf8.charAt(0));
        printComparison(latin9.charAt(0), latin9b[0]);
        printComparison(latin9.charAt(0), utf8b[0]);
        printComparison(utf8.charAt(0), latin9b[0]);
        printComparison(utf8.charAt(0), utf8b[0]);
    }

    private static void printComparison(char c1, char c2) {
        System.out.println(Integer.toHexString(c1) + "=" + Integer.toHexString(c2) + " ? " + (c1 == c2));
    }

    // The byte is masked with 0xFF for display only; the comparison uses the raw byte
    private static void printComparison(char c1, byte c2) {
        System.out.println(Integer.toHexString(c1) + "=" + Integer.toHexString(0xFF & c2) + " ? " + (c1 == c2));
    }
}


The output:


€=€ true
20ac=20ac ? true
20ac=a4 ? false
20ac=e2 ? false
20ac=a4 ? false
20ac=e2 ? false


In other words, a character-to-character comparison works, but a character-to-byte comparison doesn't. This is the case for any non-ASCII character encoded in UTF-8, but it's just the tip of the iceberg in character comparison problems, since Unicode permits multiple representations of the same character (for example, a precomposed character versus a base letter plus a combining mark), and each representation has its own UTF-8 byte sequence.
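
For what it's worth, here is a minimal sketch of the comparison done the safe way: decode the bytes using the document's encoding first, then compare text with text, normalizing where equivalent sequences might differ. The class and method names (DecodeThenCompare, sameText, sameAfterNfc) are made up for illustration, and StandardCharsets assumes a newer JDK than the one used above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class DecodeThenCompare {
    // Hypothetical helper: decode the raw bytes with the given charset,
    // then compare the decoded text with the expected string.
    static boolean sameText(String expected, byte[] raw, Charset cs) {
        return expected.equals(new String(raw, cs));
    }

    // Hypothetical helper: compare two strings after normalizing both to NFC,
    // so precomposed and combining-sequence spellings compare equal.
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        byte[] euroUtf8 = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        System.out.println(sameText("\u20AC", euroUtf8, StandardCharsets.UTF_8));

        String composed = "\u00E9";    // e-acute as a single code point
        String decomposed = "e\u0301"; // letter e followed by a combining acute accent
        System.out.println(composed.equals(decomposed));       // plain equals sees two different strings
        System.out.println(sameAfterNfc(composed, decomposed)); // normalized forms match
    }
}
```

This gives up the "no extraction" property for the span being compared, which is rather the point: some decoding step has to happen somewhere before bytes and chars can be compared meaningfully.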


In your defence, a byte-to-char comparison isn't necessary for your non-extractive parser; but I've already pointed out other problems with that.


  • not new... or usable
    2004-05-23 11:42:10 jimmy_z [Reply]

    Sorry, I misinterpreted the previous message.


    For ISO-8859-1, the code for testing equality is:


    if (s1.charAt(i) == (0xff & xml[k])) {
    }
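
A small sketch of why the mask matters and where it stops helping: ISO-8859-1 byte values coincide with the code points U+0000 to U+00FF, so the masked comparison holds for that encoding only. The names Latin1Compare and matchesLatin1 are hypothetical, not taken from any parser discussed here.

```java
import java.nio.charset.StandardCharsets;

class Latin1Compare {
    // Hypothetical helper: true when the char equals the unsigned value of the byte.
    // Only meaningful when the bytes are known to be ISO-8859-1.
    static boolean matchesLatin1(char c, byte b) {
        return c == (0xFF & b);
    }

    public static void main(String[] args) {
        byte eAcute = (byte) 0xE9; // e-acute in ISO-8859-1
        System.out.println(matchesLatin1('\u00E9', eAcute)); // masked comparison matches
        System.out.println('\u00E9' == eAcute);              // unmasked: the byte sign-extends to a negative int

        // The same character in UTF-8 is two bytes (0xC3 0xA9), so no single byte matches
        byte[] utf8 = "\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);
        System.out.println(matchesLatin1('\u00E9', utf8[0]));
    }
}
```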


    To handle a character reference, the comparison function simply
    treats "€" as the integer value 128.
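
As a rough illustration of comparing against a numeric character reference, the sketch below resolves the reference to a code point before comparing. The helper resolveCharRef is hypothetical and handles only numeric references, with no range or sign validation. Note that in XML the euro sign is code point 8364 (U+20AC); 128 is its byte value in windows-1252, and a reference to 128 names the control character U+0080.

```java
class CharRefCompare {
    // Hypothetical helper: resolve a numeric character reference, decimal or
    // hexadecimal (lowercase x prefix), to its code point; returns -1 if malformed.
    static int resolveCharRef(String ref) {
        if (!ref.startsWith("&#") || !ref.endsWith(";")) {
            return -1;
        }
        String body = ref.substring(2, ref.length() - 1);
        try {
            return body.startsWith("x")
                    ? Integer.parseInt(body.substring(1), 16)
                    : Integer.parseInt(body, 10);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        char euro = '\u20AC';
        System.out.println(resolveCharRef("&#8364;") == euro);  // decimal reference to the euro sign
        System.out.println(resolveCharRef("&#x20AC;") == euro); // hexadecimal reference to the euro sign
        System.out.println(resolveCharRef("&#128;") == euro);   // 128 is not the euro code point
    }
}
```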


    Thanks

