
Article:
 Non-Extractive Parsing for XML
Subject: not new... or usable
Date: 2004-05-21 02:41:48
From: Brian Ewins
Response to: not new... or usable

I'll bet you €1 that doesn't work. Nah, I'm feeling flush, let's make it £1.



  • not new... or usable
    2004-05-21 11:24:49 jimmy_z [Reply]

    Hi, € should work whether it is UTF-8 encoded
    or a character reference. I am not sure about the other one.


    Hope I answered your question.


    thanks
    Jimmy

    • not new... or usable
      2004-05-23 02:46:20 Brian Ewins [Reply]

      "the code for testing the equality in "non-extractive" parsing is
      String s1;
      byte[] xml;
      if (s1.charAt(i) == xml[k])"..."€ should work whether it is UTF-8 encoded or a character reference. "


      Nope. Here's code to test it in Java; the comparisons at the end are variations on your theme.


      import java.io.UnsupportedEncodingException;

      public class EuroTest {
          public static void main(String[] argv) throws UnsupportedEncodingException {
              // The euro sign as UTF-8 (three bytes) and as ISO-8859-15 (one byte).
              byte[] utf8b = new byte[] {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};
              byte[] latin9b = new byte[] {(byte) 0xA4};
              // Decoding either byte sequence yields the same one-character string.
              String utf8 = new String(utf8b, 0, utf8b.length, "utf-8");
              String latin9 = new String(latin9b, 0, latin9b.length, "iso-8859-15");
              System.out.println(latin9 + "=" + utf8 + " " + latin9.equals(utf8));
              // One char-to-char comparison, then four char-to-byte comparisons.
              printComparison(latin9.charAt(0), utf8.charAt(0));
              printComparison(latin9.charAt(0), latin9b[0]);
              printComparison(latin9.charAt(0), utf8b[0]);
              printComparison(utf8.charAt(0), latin9b[0]);
              printComparison(utf8.charAt(0), utf8b[0]);
          }

          // Compares two decoded characters.
          private static void printComparison(char c1, char c2) {
              System.out.println(Integer.toHexString(c1) + "=" + Integer.toHexString(c2) + " ? " + (c1 == c2));
          }

          // Compares a decoded character with a raw, still-encoded byte.
          private static void printComparison(char c1, byte c2) {
              System.out.println(Integer.toHexString(c1) + "=" + Integer.toHexString(0xFF & c2) + " ? " + (c1 == c2));
          }
      }


      The output:


      €=€ true
      20ac=20ac ? true
      20ac=a4 ? false
      20ac=e2 ? false
      20ac=a4 ? false
      20ac=e2 ? false


      In other words, a character-to-character comparison works, but a character-to-byte comparison doesn't. This is the case for any non-ASCII character encoded in UTF-8, but it's just the tip of the iceberg in character comparison problems, since Unicode permits multiple representations of the same character (precomposed versus combining sequences, for example).
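The point above can be sketched in a few lines. This is only an illustration under stated assumptions, not the parser's code: the class `ByteCharCompare` and its methods `naiveEquals`, `decodedEquals` and `normalizedEquals` are hypothetical names, and the sample document `<p>€</p>` is made up for the demonstration. It contrasts a byte-by-byte comparison against UTF-8 input with a comparison made after decoding, and shows one way the same visible character can have two different representations.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class ByteCharCompare {
    // Naive comparison: treats every byte as if it were one whole character.
    static boolean naiveEquals(String s, byte[] xml, int offset) {
        if (offset + s.length() > xml.length) return false;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != (char) (xml[offset + i] & 0xFF)) return false;
        }
        return true;
    }

    // Decodes the byte range as UTF-8 first, then compares characters with characters.
    static boolean decodedEquals(String s, byte[] xml, int offset, int length) {
        String decoded = new String(xml, offset, length, StandardCharsets.UTF_8);
        return s.equals(decoded);
    }

    // Compares two strings after bringing both to the same normalization form (NFC).
    static boolean normalizedEquals(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        // Made-up sample document; the euro sign occupies bytes 3 to 5.
        byte[] xml = "<p>\u20ac</p>".getBytes(StandardCharsets.UTF_8);

        System.out.println(naiveEquals("p", xml, 1));            // ASCII: true
        System.out.println(naiveEquals("\u20ac", xml, 3));       // euro, byte by byte: false
        System.out.println(decodedEquals("\u20ac", xml, 3, 3));  // euro, decoded first: true

        // Precomposed e-acute versus "e" followed by a combining acute accent.
        System.out.println("\u00e9".equals("e\u0301"));           // false
        System.out.println(normalizedEquals("\u00e9", "e\u0301")); // true
    }
}
```

Decoding before comparing costs an allocation per comparison, which is the price the byte-level approach tries to avoid; the sketch is only meant to show where the shortcut stops being correct.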


      In your defence, a byte-to-char comparison isn't necessary for your non-extractive parser; but I've already pointed out other problems with that.

      • not new... or usable
        2004-05-23 11:42:10 jimmy_z [Reply]

        Sorry, I misinterpreted the previous message.


        For ISO-8859-1, the code for testing equality is:


        if (s1.charAt(i) == (0xff & xml[k])) {
        }
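A minimal sketch of what the mask does and where it stops helping, assuming nothing beyond the snippet above. The class `Latin1Mask` and the method `maskedEquals` are hypothetical names. The byte value 0xA4 for the euro sign in ISO-8859-15 is taken from the EuroTest program earlier in the thread.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Mask {
    // The masked comparison from the post: correct when each byte is
    // exactly one ISO-8859-1 character, because ISO-8859-1 byte values
    // coincide with the first 256 Unicode code points.
    static boolean maskedEquals(char c, byte b) {
        return c == (0xFF & b);
    }

    public static void main(String[] args) {
        // e-acute is U+00E9 and is stored as the single byte 0xE9 in ISO-8859-1.
        byte[] latin1 = "\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(maskedEquals('\u00e9', latin1[0])); // true

        // Without the mask, the byte is sign-extended to a negative int.
        System.out.println('\u00e9' == latin1[0]);             // false

        // The euro sign is byte 0xA4 in ISO-8859-15 but U+20AC as a char,
        // so the mask alone cannot make this comparison succeed.
        byte latin9Euro = (byte) 0xA4;
        System.out.println(maskedEquals('\u20ac', latin9Euro)); // false
    }
}
```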


        To handle a character reference,
        the comparison function simply
        treats "€" as the integer value 128.
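One way to read that description is sketched below. It is a guess at the idea rather than the parser's actual comparison function: the class `CharRefCompare` and the helpers `resolveDecimalRef` and `refEquals` are hypothetical, and only decimal references of the form `&#NNN;` are handled. Under XML's rules a numeric reference names a Unicode code point, so the value 128 resolves to U+0080, while the euro sign is code point 8364 (U+20AC).

```java
public class CharRefCompare {
    // Hypothetical helper: resolves a decimal reference such as "&#8364;"
    // to its integer value. Hexadecimal and named references are not handled.
    static int resolveDecimalRef(String ref) {
        if (!ref.startsWith("&#") || !ref.endsWith(";") || ref.length() < 4) {
            throw new IllegalArgumentException("not a decimal character reference: " + ref);
        }
        return Integer.parseInt(ref.substring(2, ref.length() - 1));
    }

    // Compares a decoded character with the integer value of a reference.
    static boolean refEquals(char c, String ref) {
        return c == resolveDecimalRef(ref);
    }

    public static void main(String[] args) {
        System.out.println(refEquals('\u20ac', "&#8364;")); // euro code point: true
        System.out.println(refEquals('\u20ac', "&#128;"));  // 128 is not the euro: false
        System.out.println(refEquals('\u0080', "&#128;"));  // 128 is U+0080: true
    }
}
```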


        Thanks

