Article:
 Non-Extractive Parsing for XML
Subject: not new... or usable
Date: 2004-05-20 14:09:14
From: jimmy_z
Response to: not new... or usable

Thanks for the question.


On a network device, store-and-forward applications usually need the XML loaded into a byte array anyway, and that loading is generally less CPU-intensive than parsing.


Decoding on the fly is usually pretty simple, partly because most characters are ASCII (I could be wrong, but that is my experience thus far).


To test the equality of two UCS-2 characters:
String s1;
String s2;
if (s1.charAt(i) == s2.charAt(k))
{
}


The code for testing equality in "non-extractive" parsing is:
String s1;
byte[] xml;
if (s1.charAt(i) == xml[k])
{
}




  • not new... or usable
    2004-05-21 02:41:48 Brian Ewins

    I'll bet you €1 that doesn't work. Nah, I'm feeling flush, let's make it £1.

    • not new... or usable
      2004-05-21 11:24:49 jimmy_z

      Hi, € should work whether it is UTF-8 encoded
      or a character reference. I am not sure about the other one.


      Hope I answered your question.


      thanks
      Jimmy

      • not new... or usable
        2004-05-23 02:46:20 Brian Ewins

        "the code for testing the equality in "non-extractive" parsing is
        String s1;
        byte[] xml;
        if (s1.charAt(i) == xml[k])"..."€ should work whether it is UTF-8 encoded or a character reference. "


        Nope. Here's code to test it in Java; the comparisons at the end are variations on your theme.


        import java.io.UnsupportedEncodingException;

        public class EuroTest {
            public static void main(String[] argv) throws UnsupportedEncodingException {
                // The euro sign as raw bytes: three bytes in UTF-8, one byte in ISO-8859-15.
                byte[] utf8b = new byte[] {(byte) 0xe2, (byte) 0x82, (byte) 0xAC};
                byte[] latin9b = new byte[] {(byte) 0xA4};
                // Decode both byte arrays; each String holds the single char U+20AC.
                String utf8 = new String(utf8b, 0, utf8b.length, "utf-8");
                String latin9 = new String(latin9b, 0, latin9b.length, "iso-8859-15");
                System.out.println(latin9 + "=" + utf8 + " " + latin9.equals(utf8));
                // One char-to-char comparison, then four char-to-byte comparisons.
                printComparison(latin9.charAt(0), utf8.charAt(0));
                printComparison(latin9.charAt(0), latin9b[0]);
                printComparison(latin9.charAt(0), utf8b[0]);
                printComparison(utf8.charAt(0), latin9b[0]);
                printComparison(utf8.charAt(0), utf8b[0]);
            }

            private static void printComparison(char c1, char c2) {
                System.out.println(Integer.toHexString(c1) + "=" + Integer.toHexString(c2) + " ? " + (c1 == c2));
            }

            private static void printComparison(char c1, byte c2) {
                // The byte is masked only for display; the comparison uses the raw (sign-extended) byte.
                System.out.println(Integer.toHexString(c1) + "=" + Integer.toHexString(0xFF & c2) + " ? " + (c1 == c2));
            }
        }


        The output:


        €=€ true
        20ac=20ac ? true
        20ac=a4 ? false
        20ac=e2 ? false
        20ac=a4 ? false
        20ac=e2 ? false


        In other words, a character-to-character comparison works, but a character-to-byte comparison doesn't. This is the case for any non-ASCII character encoded in UTF-8, but it's just the tip of the iceberg in character comparison problems, since Unicode permits multiple representations of the same character (precomposed and combining-sequence forms, for example).
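A comparison against UTF-8 bytes has to decode the whole multi-byte sequence first. The following is only a minimal sketch of that idea, not code from either poster: the class `Utf8CompareSketch` and its methods `codePointAt` and `sameCharacter` are hypothetical names, and for brevity the decoder does not reject overlong forms or encoded surrogates, which a real parser would have to do.

```java
final class Utf8CompareSketch {

    /**
     * Decodes the UTF-8 sequence that starts at xml[k] and returns its code point,
     * or -1 if the sequence is malformed or truncated. (Hypothetical helper.)
     */
    static int codePointAt(byte[] xml, int k) {
        int b0 = xml[k] & 0xFF;
        if (b0 < 0x80) {
            return b0; // ASCII fast path: one byte, one character
        }
        int extra; // number of continuation bytes that must follow
        int cp;    // payload bits collected so far
        if ((b0 & 0xE0) == 0xC0) {
            extra = 1;
            cp = b0 & 0x1F;
        } else if ((b0 & 0xF0) == 0xE0) {
            extra = 2;
            cp = b0 & 0x0F;
        } else if ((b0 & 0xF8) == 0xF0) {
            extra = 3;
            cp = b0 & 0x07;
        } else {
            return -1; // stray continuation byte or invalid lead byte
        }
        if (k + extra >= xml.length) {
            return -1; // sequence runs past the end of the buffer
        }
        for (int i = 1; i <= extra; i++) {
            int b = xml[k + i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                return -1; // not a continuation byte
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        return cp;
    }

    /** Compares the character at s1[i] with the UTF-8 encoded character at xml[k]. */
    static boolean sameCharacter(String s1, int i, byte[] xml, int k) {
        return s1.codePointAt(i) == codePointAt(xml, k);
    }

    public static void main(String[] args) {
        byte[] euro = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC}; // U+20AC in UTF-8
        System.out.println(Integer.toHexString(codePointAt(euro, 0)));
        System.out.println(sameCharacter("\u20ac", 0, euro, 0));
    }
}
```

The ASCII branch is the single-byte case jimmy_z describes; everything after it is the work that a bare `s1.charAt(i) == xml[k]` skips.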


        In your defence, a byte-to-char comparison isn't necessary for your non-extractive parser; but I've already pointed out other problems with that.

        • not new... or usable
          2004-05-23 11:42:10 jimmy_z

          Sorry, I misinterpreted the previous message.


          For ISO-8859-1, the code for testing equality is:


          if (s1.charAt(i) == (0xff & xml[k])) {
          }
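The effect of the mask can be checked with a small sketch. This is an illustration under stated assumptions rather than code from the thread: the class `Latin1CompareSketch` and its methods are hypothetical names. Masking works for ISO-8859-1 because that encoding maps each byte value 0x00 to 0xFF onto the Unicode code point with the same number; it does not carry over to ISO-8859-15 or UTF-8.

```java
final class Latin1CompareSketch {

    /** Masked comparison: the byte is read as an unsigned value 0..255. */
    static boolean sameCharLatin1(String s1, int i, byte[] xml, int k) {
        return s1.charAt(i) == (0xFF & xml[k]);
    }

    /** Unmasked comparison: the byte is sign-extended, so values above 0x7F go negative. */
    static boolean sameCharUnmasked(String s1, int i, byte[] xml, int k) {
        return s1.charAt(i) == xml[k];
    }

    public static void main(String[] args) {
        byte[] eAcute = {(byte) 0xE9};  // U+00E9 encoded in ISO-8859-1
        byte[] euroL9 = {(byte) 0xA4};  // U+20AC encoded in ISO-8859-15

        System.out.println(sameCharLatin1("\u00e9", 0, eAcute, 0));   // mask makes it match
        System.out.println(sameCharUnmasked("\u00e9", 0, eAcute, 0)); // sign extension breaks it
        System.out.println(sameCharLatin1("\u20ac", 0, euroL9, 0));   // 0xA4 is not 0x20AC
    }
}
```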


          To handle a character reference,
          the comparison function simply
          treats "€" as the integer value 128.
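One way such a comparison function might read a numeric character reference straight out of the byte array is sketched below. It is an assumption about the approach, not jimmy_z's implementation: `CharRefSketch`, `charRefAt` and `sameCharacter` are hypothetical names, and only numeric references are handled (named entities are ignored). Note that in XML a numeric reference names a Unicode code point, so the euro sign is written `&#8364;` or `&#x20AC;`.

```java
import java.nio.charset.StandardCharsets;

final class CharRefSketch {

    /**
     * Returns the code point named by the numeric character reference that
     * starts at xml[k] (for example "&#8364;" or "&#x20AC;"), or -1 if the
     * bytes at k are not a well-formed numeric reference. (Hypothetical helper.)
     */
    static int charRefAt(byte[] xml, int k) {
        // The shortest possible reference, such as "&#9;", is four bytes long.
        if (k + 3 >= xml.length || xml[k] != '&' || xml[k + 1] != '#') {
            return -1;
        }
        int pos = k + 2;
        int radix = 10;
        if (xml[pos] == 'x') { // hexadecimal form: &#x...;
            radix = 16;
            pos++;
        }
        int cp = 0;
        int digits = 0;
        while (pos < xml.length && xml[pos] != ';') {
            int d = Character.digit((char) (xml[pos] & 0xFF), radix);
            if (d < 0) {
                return -1; // not a digit in this radix
            }
            cp = cp * radix + d;
            if (cp > Character.MAX_CODE_POINT) {
                return -1; // outside the Unicode range
            }
            digits++;
            pos++;
        }
        if (digits == 0 || pos >= xml.length) {
            return -1; // no digits, or the terminating ';' is missing
        }
        return cp;
    }

    /** Compares the character at s1[i] with the reference starting at xml[k]. */
    static boolean sameCharacter(String s1, int i, byte[] xml, int k) {
        return s1.codePointAt(i) == charRefAt(xml, k);
    }

    public static void main(String[] args) {
        byte[] xml = "&#x20AC;".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sameCharacter("\u20ac", 0, xml, 0));
    }
}
```

The reference is reduced to an integer and compared as a number, so nothing is extracted or copied out of the buffer.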


          Thanks

