Article:
 Non-Extractive Parsing for XML
Subject: not new... or usable
Date: 2004-05-21 11:24:49
From: jimmy_z
Response to: not new... or usable

Hi, € should work whether it is UTF-8 encoded
or a character reference. I am not sure about the other one.


Hope I answered your question.


thanks
Jimmy


  • not new... or usable
    2004-05-23 02:46:20 Brian Ewins [Reply]

    "the code for testing the equality in "non-extractive" parsing is
    String s1;
    byte[] xml;
    if (s1.charAt(i) == xml[k])"
    ...
    "€ should work whether it is UTF-8 encoded or a character reference."


    Nope. Here's code to test it in Java; the comparisons at the end are variations on your theme.


    import java.io.UnsupportedEncodingException;

    public class EuroTest {
        public static void main(String[] argv) throws UnsupportedEncodingException {
            // The euro sign as raw bytes in two different encodings.
            byte[] utf8b = new byte[] {(byte) 0xe2, (byte) 0x82, (byte) 0xAC};
            byte[] latin9b = new byte[] {(byte) 0xA4};
            // Decode both byte sequences into Java strings.
            String utf8 = new String(utf8b, 0, utf8b.length, "utf-8");
            String latin9 = new String(latin9b, 0, latin9b.length, "iso-8859-15");
            System.out.println(latin9 + "=" + utf8 + " " + latin9.equals(utf8));
            printComparison(latin9.charAt(0), utf8.charAt(0));
            printComparison(latin9.charAt(0), latin9b[0]);
            printComparison(latin9.charAt(0), utf8b[0]);
            printComparison(utf8.charAt(0), latin9b[0]);
            printComparison(utf8.charAt(0), utf8b[0]);
        }

        // Compares two decoded characters.
        private static void printComparison(char c1, char c2) {
            System.out.println(Integer.toHexString(c1) + "=" + Integer.toHexString(c2) + " ? " + (c1 == c2));
        }

        // Compares a decoded character with a raw byte from the encoded input.
        private static void printComparison(char c1, byte c2) {
            System.out.println(Integer.toHexString(c1) + "=" + Integer.toHexString(0xFF & c2) + " ? " + (c1 == c2));
        }
    }


    The output:


    €=€ true
    20ac=20ac ? true
    20ac=a4 ? false
    20ac=e2 ? false
    20ac=a4 ? false
    20ac=e2 ? false


    In other words, a character-to-character comparison works, but a character-to-byte comparison doesn't. This is the case for any non-ASCII character encoded in UTF-8, but it's just the tip of the iceberg in character comparison problems, since Unicode permits multiple representations of the same character (for example, precomposed and decomposed forms).
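
A minimal sketch of one way around the char-to-byte mismatch, assuming the document bytes are known to be UTF-8: encode the search string once and compare byte against byte. The class name `Utf8Compare` and the helper `matchesUtf8` are made up for illustration and are not taken from the parser under discussion.

```java
import java.nio.charset.StandardCharsets;

class Utf8Compare {
    // Hypothetical helper: true if the bytes starting at xml[offset]
    // are exactly the UTF-8 encoding of s.
    static boolean matchesUtf8(String s, byte[] xml, int offset) {
        byte[] expected = s.getBytes(StandardCharsets.UTF_8);
        if (offset < 0 || offset + expected.length > xml.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (xml[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "<p>" occupies bytes 0..2, so the euro sign starts at offset 3.
        byte[] xml = "<p>\u20ac</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(matchesUtf8("\u20ac", xml, 3));  // prints true
        System.out.println("\u20ac".charAt(0) == xml[3]);   // prints false
    }
}
```

The second line of output is the failing comparison from the test above: the char is 0x20ac, while the first UTF-8 byte is 0xe2. This sketch only covers the encoding mismatch; it does nothing about equivalent strings that use different code point sequences.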


    In your defence, a byte-to-char comparison isn't necessary for your non-extractive parser; but I've already pointed out other problems with that.

    • not new... or usable
      2004-05-23 11:42:10 jimmy_z [Reply]

      Sorry, I misinterpreted the previous message.


      For ISO-8859-1, the code for testing equality is:


      if ( s1.charAt(i)== (0xff & xml[k]) ){
      }
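
A short sketch of why the mask matters, assuming the input really is ISO-8859-1: Java bytes are signed, so without `0xff &` any byte above 0x7f is sign-extended to a negative int before the comparison. The class `Latin1Compare` and the helper `sameLatin1` are illustrative names only.

```java
import java.nio.charset.StandardCharsets;

class Latin1Compare {
    // Hypothetical helper wrapping the masked comparison shown above.
    static boolean sameLatin1(char c, byte b) {
        return c == (0xff & b);
    }

    public static void main(String[] args) {
        byte[] xml = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sameLatin1('\u00e9', xml[3]));  // prints true
        System.out.println('\u00e9' == xml[3]);            // prints false (byte is -23)

        // The euro sign has no ISO-8859-1 encoding, so getBytes substitutes
        // the charset's replacement byte and the masked test cannot match.
        byte[] euro = "\u20ac".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sameLatin1('\u20ac', euro[0])); // prints false
    }
}
```

The masked comparison works here because ISO-8859-1 byte values coincide with the first 256 Unicode code points. It does not carry over to encodings where that is not true, such as ISO-8859-15 or UTF-8.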


      For handling a character reference,
      the comparison function simply implements the
      behavior of treating "€" as the integer value 128.
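
A rough sketch of what such a comparison function could look like, assuming the reference is a numeric one of the form `&#N;` or `&#xH;`. The names `CharRefCompare`, `charRefValue` and `matchesRef` are hypothetical, and the parsing is deliberately loose (it is not a full XML check).

```java
class CharRefCompare {
    // Hypothetical helper: numeric value of "&#N;" or "&#xH;",
    // or -1 if the text is not a numeric character reference.
    static int charRefValue(String ref) {
        if (ref.length() < 4 || !ref.startsWith("&#") || !ref.endsWith(";")) {
            return -1;
        }
        String body = ref.substring(2, ref.length() - 1);
        try {
            int value = body.startsWith("x")
                    ? Integer.parseInt(body.substring(1), 16)
                    : Integer.parseInt(body, 10);
            return value < 0 ? -1 : value;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // Compares a decoded character with the value a reference stands for.
    static boolean matchesRef(char c, String ref) {
        return c == charRefValue(ref);
    }

    public static void main(String[] args) {
        System.out.println(charRefValue("&#128;"));             // prints 128
        System.out.println(matchesRef('\u20ac', "&#8364;"));    // prints true
        System.out.println(matchesRef('\u20ac', "&#x20AC;"));   // prints true
        System.out.println(matchesRef('\u20ac', "&#128;"));     // prints false
    }
}
```

Note that a character decoded as the euro sign is U+20AC (decimal 8364), so it compares equal to `&#8364;` but not to a reference whose value is 128. The value 128 is the euro's byte in Windows-1252, not its Unicode code point.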


      Thanks

