
Article:
 Non-Extractive Parsing for XML
Subject: not new... or usable
Date: 2004-05-20 13:35:38
From: Richard Hough

Several years back I used an XML parser, called the Generic Restricted XML parser or something similar, that did this. I did not find it very useful.


As other posters pointed out, it does not handle encodings or character entities. It does not do normalization, which we needed. It is also inefficient when dealing with large files of which you need only a small portion; in my experience, that describes the majority of XML files.


The suggestion to do such transformations "on the fly" is a poor one since it requires the processing to be done every time the element is used, rather than just once when the file is loaded. It will also not work if the element refers to external files, such as with XIncludes.




  • not new... or usable
    2004-05-20 14:09:14 jimmy_z [Reply]

    Thanks for the question.


    For a network device, loading XML into a byte array is usually needed for store-and-forward applications, and is less CPU-intensive than parsing in general.


    Decoding on the fly is usually pretty simple, partly because most characters are ASCII (I could be wrong, but that is my experience thus far).


    To test equality of two UCS-2 characters:
    String s1;
    String s2;
    if (s1.charAt(i) == s2.charAt(k))
    {}


    the code for testing the equality in "non-extractive" parsing is
    String s1;
    byte[] xml;
    if (s1.charAt(i) == xml[k])
    {
    }

    • not new... or usable
      2004-05-21 02:41:48 Brian Ewins [Reply]

      I'll bet you €1 that doesn't work. Nah, I'm feeling flush, let's make it £1.

      • not new... or usable
        2004-05-21 11:24:49 jimmy_z [Reply]

        Hi, € should work whether it is UTF-8 encoded
        or a character reference. I am not sure about the other one.


        Hope I answered your question.


        thanks
        Jimmy

        • not new... or usable
          2004-05-23 02:46:20 Brian Ewins [Reply]

          "the code for testing the equality in "non-extractive" parsing is
          String s1;
          byte[] xml;
          if (s1.charAt(i) == xml[k])"
          ...
          "€ should work whether it is UTF-8 encoded or a character reference."


          Nope. Here's code to test it in Java; the comparisons at the end are variations on your theme.


          import java.io.UnsupportedEncodingException;

          public class EuroTest {
              public static void main(String[] argv) throws UnsupportedEncodingException {
                  // The euro sign as raw bytes in two different encodings.
                  byte[] utf8b = new byte[] {(byte) 0xe2, (byte) 0x82, (byte) 0xAC};
                  byte[] latin9b = new byte[] {(byte) 0xA4};
                  // Decode each byte array with its own charset.
                  String utf8 = new String(utf8b, 0, utf8b.length, "utf-8");
                  String latin9 = new String(latin9b, 0, latin9b.length, "iso-8859-15");
                  System.out.println(latin9 + "=" + utf8 + " " + latin9.equals(utf8));
                  // One char-to-char comparison, then four char-to-byte comparisons.
                  printComparison(latin9.charAt(0), utf8.charAt(0));
                  printComparison(latin9.charAt(0), latin9b[0]);
                  printComparison(latin9.charAt(0), utf8b[0]);
                  printComparison(utf8.charAt(0), latin9b[0]);
                  printComparison(utf8.charAt(0), utf8b[0]);
              }

              private static void printComparison(char c1, char c2) {
                  System.out.println(Integer.toHexString(c1) + "=" + Integer.toHexString(c2) + " ? " + (c1 == c2));
              }

              // The byte is masked with 0xFF for display only; the comparison uses the raw byte.
              private static void printComparison(char c1, byte c2) {
                  System.out.println(Integer.toHexString(c1) + "=" + Integer.toHexString(0xFF & c2) + " ? " + (c1 == c2));
              }
          }


          The output:


          €=€ true
          20ac=20ac ? true
          20ac=a4 ? false
          20ac=e2 ? false
          20ac=a4 ? false
          20ac=e2 ? false


          In other words, a character-to-character comparison works, but a character-to-byte comparison doesn't. This is the case for any non-ASCII character encoded in UTF-8, but it's just the tip of the iceberg in character comparison problems, since Unicode permits multiple representations of the same character (precomposed versus combining sequences, for example).
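
One way to make a comparison against the raw bytes meaningful is to decode the UTF-8 sequence at the byte offset into a code point first. The following is only a minimal sketch under stated assumptions: the class `Utf8Compare` and its methods are hypothetical names, not part of any parser discussed in this thread, and the input is assumed to be well-formed UTF-8 (no validation of continuation bytes, overlong forms, or array bounds).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from any parser mentioned in the thread.
final class Utf8Compare {

    // Decodes the UTF-8 sequence that starts at xml[k] and returns its code point.
    // Assumes well-formed input: continuation bytes are not validated.
    static int codePointAt(byte[] xml, int k) {
        int b0 = xml[k] & 0xFF;
        if (b0 < 0x80) {
            return b0; // one byte: plain ASCII
        }
        if ((b0 & 0xE0) == 0xC0) { // two-byte sequence
            return ((b0 & 0x1F) << 6) | (xml[k + 1] & 0x3F);
        }
        if ((b0 & 0xF0) == 0xE0) { // three-byte sequence
            return ((b0 & 0x0F) << 12)
                    | ((xml[k + 1] & 0x3F) << 6)
                    | (xml[k + 2] & 0x3F);
        }
        // four-byte sequence
        return ((b0 & 0x07) << 18)
                | ((xml[k + 1] & 0x3F) << 12)
                | ((xml[k + 2] & 0x3F) << 6)
                | (xml[k + 3] & 0x3F);
    }

    // Compares the code point at s[i] with the code point encoded at xml[k].
    static boolean sameCharacter(String s, int i, byte[] xml, int k) {
        return s.codePointAt(i) == codePointAt(xml, k);
    }

    public static void main(String[] args) {
        byte[] xml = "<p>\u20AC</p>".getBytes(StandardCharsets.UTF_8);
        // The euro sign occupies bytes 3..5 (E2 82 AC).
        System.out.println(sameCharacter("\u20AC", 0, xml, 3)); // prints true
        System.out.println("\u20AC".charAt(0) == xml[3]);       // prints false
    }
}
```

This handles only the encoding step. It does nothing about equivalent but differently composed character sequences, which would still need normalization before comparison.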


          In your defence, a byte-to-char comparison isn't necessary for your non-extractive parser; but I've already pointed out other problems with that.

          • not new... or usable
            2004-05-23 11:42:10 jimmy_z [Reply]

            Sorry, I misinterpreted the previous message.


            For ISO-8859-1, the code for testing equality is:


            if (s1.charAt(i) == (0xff & xml[k])) {
            }
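
To illustrate why the `0xff` mask matters, here is a small sketch. The class and method names (`Latin1Compare`, `equalsLatin1`) are made up for this example. Java bytes are signed, so a byte above 0x7F widens to a negative int unless it is masked first.

```java
// Hypothetical helper illustrating the masked comparison for ISO-8859-1 bytes.
final class Latin1Compare {

    // ISO-8859-1 maps each byte value 0x00..0xFF to the code point with the same value,
    // so masking off the sign extension is enough to compare a char with a byte.
    static boolean equalsLatin1(char c, byte b) {
        return c == (b & 0xFF);
    }

    public static void main(String[] args) {
        char poundSign = '\u00A3';   // the pound sign as a Java char
        byte raw = (byte) 0xA3;      // the same character as an ISO-8859-1 byte
        System.out.println(poundSign == raw);             // prints false: raw widens to -93
        System.out.println(equalsLatin1(poundSign, raw)); // prints true
    }
}
```

The mask works only because ISO-8859-1 byte values coincide with Unicode code points. It does not carry over to ISO-8859-15, where byte 0xA4 is the euro sign (U+20AC), or to multi-byte encodings such as UTF-8.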


            For handling character references, the comparison function
            simply treats "€" as the integer value 128.
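
A comparison function of this kind might look roughly like the sketch below. It assumes the reference is a numeric character reference (`&#...;` or `&#x...;`) sitting in ASCII-compatible bytes; the class and method names (`CharRefCompare`, `charRefAt`) are hypothetical. In XML, numeric character references name Unicode code points, so the euro sign is `&#8364;` or `&#x20AC;`, whereas 128 is its byte value in Windows-1252.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a numeric character reference straight from the byte array.
final class CharRefCompare {

    // Parses the reference starting at xml[k] (which must be '&').
    // Returns the code point it names, or -1 if the bytes are not a numeric reference.
    static int charRefAt(byte[] xml, int k) {
        if (k + 3 >= xml.length || xml[k] != '&' || xml[k + 1] != '#') {
            return -1;
        }
        int pos = k + 2;
        int radix = 10;
        if (xml[pos] == 'x') { // hexadecimal form
            radix = 16;
            pos++;
        }
        int value = 0;
        int digits = 0;
        while (pos < xml.length && xml[pos] != ';') {
            int d = Character.digit((char) (xml[pos] & 0xFF), radix);
            if (d < 0 || value > 0x10FFFF) {
                return -1; // not a digit, or beyond the Unicode range
            }
            value = value * radix + d;
            digits++;
            pos++;
        }
        boolean terminated = pos < xml.length; // stopped on ';'
        return (terminated && digits > 0 && value <= 0x10FFFF) ? value : -1;
    }

    public static void main(String[] args) {
        byte[] xml = "<p>&#8364;</p>".getBytes(StandardCharsets.US_ASCII);
        System.out.println("\u20AC".codePointAt(0) == charRefAt(xml, 3)); // prints true
    }
}
```

With this sketch, `&#128;` yields 128 (U+0080, a control character), which does not compare equal to the euro sign in a Java `String`.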


            Thanks

