
Article:
 Non-Extractive Parsing for XML
Subject: Many misunderstandings...
Date: 2004-05-20 08:22:25
From: Brian Ewins

The author mischaracterises pull parsers: they do pretty much exactly what is described in this article. They don't "extract" strings at all unless you ask them to. Using StAX, exactly the desired pointers can be fetched like so:


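import javax.xml.stream.Location;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

XMLInputFactory f = XMLInputFactory.newInstance();
XMLEventReader r = f.createXMLEventReader(...);
while (r.hasNext()) {
    XMLEvent e = r.nextEvent();
    Location loc = e.getLocation();
    // loc.getSystemId() and loc.getCharacterOffset() now locate this event in the source
}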


However, the problem isn't anything like as simple as described in this article anyway: parsing an XML document involves resolving entity references, which can synthesise markup, include other documents, and so on. The use of default values in both XML Schema and DTDs also means that the PSVI (the "deserialized" document data) contains items that have /no/ location in the original document. Pointers into the document just aren't enough, and /may not even be possible/.
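
To make the "no location" point concrete, here is a minimal sketch using the JDK's DOM parser. The document, the element name `item`, the attribute `unit`, and the class name are all invented for illustration. The DTD supplies a default for `unit`, so the parsed tree carries a value that occupies no character range inside the `<item/>` element in the source.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; none of these names come from the article.
final class DefaultedAttributeDemo {
    // The element is written as <item/>; "unit" only appears in the DTD.
    static final String XML =
          "<!DOCTYPE item [\n"
        + "  <!ELEMENT item EMPTY>\n"
        + "  <!ATTLIST item unit CDATA \"kg\">\n"
        + "]>\n"
        + "<item/>";

    // Returns the attribute value plus whether it was written in the instance.
    static String describeUnit() {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)))
                .getDocumentElement();
            Attr unit = root.getAttributeNode("unit");
            // getSpecified() tells us whether the value was given explicitly
            // in the instance document or supplied by the DTD default.
            return unit.getValue() + " specified=" + unit.getSpecified();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describeUnit());
    }
}
```

An offset/length pair pointing into the element cannot describe this attribute, because there is nothing inside the element to point at.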


In any case, some of the code in the article is wrong. ne_parseInt(String s1, int offset, int length) still requires extraction in Java, because strings always copy the content they are constructed with. ne_parseInt(char[] s1, int offset, int length) would have been closer to the article's intent, but is still both inefficient and wrong. First, on large files it requires loading the entire document into memory instead of just "mmap"ping it (using nio in Java). Second, it treats XML as character data when in fact it's binary data: think about the byte order mark, the 'encoding="..."' attribute in the "xml" PI, character entities, etc. Reading an XML file as an array of characters does not tell you the characters in the PSVI without more work. Third, it ignores the fact that a field which is apparently an integer could look like "1<!-- 2 -->&nothing;<![CDATA[3]]>" (intended to be parsed as "13"), so pointing at ranges of chars from the original document for parsing numbers may not work at all.
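
The integer example can be checked with a small sketch. It assumes the entity `nothing` is declared with an empty replacement text (the post does not show a declaration), and the element name `n` and the class name are made up. The logical text of the element comes out of the DOM, while the raw character range between the tags is not something a range-based integer parser could accept.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class for the "13" example.
final class LogicalIntegerDemo {
    // Raw content of the field, exactly as it would sit in the source file.
    static final String RAW_FIELD = "1<!-- 2 -->&nothing;<![CDATA[3]]>";

    // Assumption: "nothing" expands to the empty string.
    static final String XML =
          "<!DOCTYPE n [ <!ENTITY nothing \"\"> ]>"
        + "<n>" + RAW_FIELD + "</n>";

    // Text content as the parser sees it: comment dropped, entity expanded,
    // CDATA section merged into the surrounding text.
    static String logicalText() {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)))
                .getDocumentElement()
                .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // True if the raw source range happens to be a plain integer literal.
    static boolean rawRangeParses() {
        try {
            Integer.parseInt(RAW_FIELD);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("logical value: " + Integer.parseInt(logicalText()));
        System.out.println("raw range parses: " + rawRangeParses());
    }
}
```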


The upshot is that what this article describes is only useful in a restricted subset of XML, and only when you care about the serialization of the XML, not its meaning. The only apps I can think of that behave like this are editors. If it's just about preserving things like comments, the XMLBeans API looks better to me; if it's about performance, then StAX is better designed to understand all XML.





  • Many misunderstandings...
    2004-05-22 13:42:49 mweiher [Reply]

    You are not abstracting. You can hide an optimized implementation behind an interface that looks just like a regular string. In fact, this is exactly what MPWXmlKit does. To the client, it just provides NSString-compatible objects. Inside they are highly optimized and just point inside the original data, but if something happens that can't be handled that way, it can resort to returning a more "normal" NSString.


    Polymorphism is a wonderful thing ;-)


    Marcel


    • Many misunderstandings...
      2004-05-23 03:14:53 Brian Ewins [Reply]

      Unfortunately, in this case it's not me that's not abstracting, it's Sun. Yeah, sure, you can do all kinds of funky things behind an interface, but String is a final class in Java (i.e. you can't implement your own or subclass it, it just is what it is; hey, it wasn't my idea!).


      ne_parseInt(String s1, int offset, int length) still isn't the signature to replace "Java's parseInt", if you want to avoid extraction. ne_parseInt(CharSequence s1, int offset, int length), possibly.
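
A minimal sketch of what a CharSequence-based signature could look like. The class name `NonExtractive` and the method name `neParseInt` are illustrative (the thread writes it as ne_parseInt), and this is not code from the article. It reads digits straight out of the backing sequence without building a String for the token.

```java
import java.nio.CharBuffer;

// Hypothetical helper; a sketch of the idea, not the article's implementation.
final class NonExtractive {
    private NonExtractive() {}

    // Parses a decimal int from s[offset, offset + length) without copying the token.
    static int neParseInt(CharSequence s, int offset, int length) {
        if (s == null || offset < 0 || length <= 0 || offset > s.length() - length) {
            throw new NumberFormatException("empty or out-of-range token");
        }
        int i = offset;
        final int end = offset + length;
        boolean negative = false;
        char first = s.charAt(i);
        if (first == '-' || first == '+') {
            negative = (first == '-');
            i++;
            if (i == end) {
                throw new NumberFormatException("sign without digits");
            }
        }
        long value = 0; // accumulate in a long so overflow is easy to detect
        for (; i < end; i++) {
            int digit = Character.digit(s.charAt(i), 10);
            if (digit < 0) {
                throw new NumberFormatException("not a digit at index " + i);
            }
            value = value * 10 + digit;
            if (value > (long) Integer.MAX_VALUE + 1) {
                throw new NumberFormatException("out of int range");
            }
        }
        if (negative) {
            value = -value;
        }
        if (value > Integer.MAX_VALUE) {
            throw new NumberFormatException("out of int range");
        }
        return (int) value;
    }

    public static void main(String[] args) {
        char[] doc = "<n>-42</n>".toCharArray();
        // CharBuffer.wrap gives a CharSequence view over the array without copying it.
        System.out.println(neParseInt(CharBuffer.wrap(doc), 3, 3));
    }
}
```

Since Java 9 the JDK itself has an overload along these lines, Integer.parseInt(CharSequence, int beginIndex, int endIndex, int radix), which takes begin and end indices rather than offset and length.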

      • Many misunderstandings...
        2004-05-23 10:57:14 mweiher [Reply]

        So Java is broken...tell me something I don't know ;-)


        MPWXmlKit isn't implemented in Java, and the examples posted in the article aren't Java either...so how does Java enter into the equation?


        Marcel


        • Many misunderstandings...
          2004-05-23 11:41:15 Brian Ewins [Reply]

          It enters because that's what Jimmy was writing about - a java replacement for a java API. Quoting the article:


          "Yet most string-to-data conversion macros or functions, e.g. atoi, atof and Java's parseInt assume tokens in the "extractive" sense. To support the new "non-extractive" tokenization, one can create a mirror set of functions, e.g. ne_atoi, ne_atof and ne_parseInt (ne stands for non-extractive). "


          That's why I put quotes around "Java's parseInt" in the previous reply. I'd spotted your stuff was in Objective-C, and I know you can do better :)


          In Java even the supposedly high-performance 'standard' APIs (like SAX) use 'String' everywhere, and so can't be zero-copy like Jimmy's proposal; we have to resort to other hacks. For example, I once wrote a SAX parser that used a ternary tree-based StringPool for element/attribute names, which massively reduced the number of strings being created. Most parsers use Java's String.intern() to pool strings instead, which creates masses of garbage to collect.
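
As a rough illustration of the pooling idea (not the parser mentioned above; the class and method names are invented), a ternary search tree can be keyed directly on a range of the parser's char buffer. A String is only allocated the first time a given name is seen; later lookups walk the tree and return the existing instance. With String.intern(), by contrast, a String has to exist before it can be interned.

```java
// Hypothetical sketch of a ternary-tree string pool keyed on char ranges.
final class TernaryStringPool {
    private static final class Node {
        final char ch;
        Node lo, eq, hi;   // smaller char, next char in key, larger char
        String value;      // canonical String for the key ending here, or null
        Node(char ch) { this.ch = ch; }
    }

    private Node root;
    private int created;   // number of String objects this pool has allocated

    // Returns the canonical String for buf[offset, offset + length).
    String get(char[] buf, int offset, int length) {
        if (length <= 0) {
            return "";
        }
        int i = offset;
        final int last = offset + length - 1;
        if (root == null) {
            root = new Node(buf[i]);
        }
        Node node = root;
        while (true) {
            char c = buf[i];
            if (c < node.ch) {
                if (node.lo == null) node.lo = new Node(c);
                node = node.lo;
            } else if (c > node.ch) {
                if (node.hi == null) node.hi = new Node(c);
                node = node.hi;
            } else if (i == last) {
                if (node.value == null) {
                    // First sighting of this name: the only allocation of a String.
                    node.value = new String(buf, offset, length);
                    created++;
                }
                return node.value;
            } else {
                i++;
                if (node.eq == null) node.eq = new Node(buf[i]);
                node = node.eq;
            }
        }
    }

    int created() {
        return created;
    }

    public static void main(String[] args) {
        TernaryStringPool pool = new TernaryStringPool();
        char[] buf = "item name item".toCharArray();
        String first = pool.get(buf, 0, 4);
        String again = pool.get(buf, 10, 4);
        System.out.println(first == again);
        System.out.println(pool.created());
    }
}
```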


          Anyway I think I've written enough replies to this article... back to work...

          • Many misunderstandings...
            2004-05-23 13:37:45 mweiher [Reply]

            "It enters because that's what Jimmy was writing about - a java replacement for a java API. Quoting the article:"


            Huh?? He talks about XML/string parsing in general, not about Java in particular. Certainly the references to atoi(), lex, Macros and C are a pretty strong hint that it's not just Java we're talking about here...


            But not really important, I think we understand each other :-))


            Cheers,


            Marcel


            • Many misunderstandings...
              2004-06-22 12:57:15 jimmy_z [Reply]

              Marcel,


              We (XimpleWare) just released our XML processing software and it is at vtd-xml.sf.net. I would like to personally invite you to take a look. Your input and suggestions are very welcome.


              Cheers
              Jimmy

  • Many misunderstandings...
    2004-05-20 11:35:21 jimmy_z [Reply]

    Thanks for posting the question.


    I am not particularly familiar with StAX, so there might be ways to refer to tokens after the StAX parser consumes the entire document.


    For "non-extractive" parsing, encoded characters can be decoded on the fly when compared against a Java string.


    Entities (especially built-in ones) can also be resolved on the fly during comparison.
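
One way the on-the-fly comparison described here could look is sketched below. The class `TokenCompare` and the method `matches` are hypothetical names, not VTD-XML's API. The sketch handles only the five built-in entities and assumes the token is already available as decoded characters; numeric character references and DTD-declared entities would need more work.

```java
// Hypothetical sketch: compare a document range to a String without extracting it.
final class TokenCompare {
    private static final String[] NAMES = {"lt", "gt", "amp", "quot", "apos"};
    private static final char[] VALUES = {'<', '>', '&', '"', '\''};

    private TokenCompare() {}

    // True if doc[offset, offset + length), with built-in entities resolved,
    // equals expected. No substring of doc is allocated.
    static boolean matches(CharSequence doc, int offset, int length, String expected) {
        int i = offset;
        final int end = offset + length;
        int j = 0;
        while (i < end) {
            char c = doc.charAt(i);
            if (c == '&') {
                int semi = indexOf(doc, ';', i + 1, end);
                if (semi < 0) return false;          // unterminated reference
                int k = builtin(doc, i + 1, semi);
                if (k < 0) return false;             // not built-in: would need the DTD
                c = VALUES[k];
                i = semi + 1;
            } else {
                i++;
            }
            if (j >= expected.length() || expected.charAt(j) != c) return false;
            j++;
        }
        return j == expected.length();
    }

    private static int indexOf(CharSequence s, char target, int from, int to) {
        for (int i = from; i < to; i++) {
            if (s.charAt(i) == target) return i;
        }
        return -1;
    }

    // Index into NAMES of the entity name at doc[from, to), or -1 if unknown.
    private static int builtin(CharSequence doc, int from, int to) {
        for (int k = 0; k < NAMES.length; k++) {
            String name = NAMES[k];
            if (name.length() != to - from) continue;
            int m = 0;
            while (m < name.length() && doc.charAt(from + m) == name.charAt(m)) m++;
            if (m == name.length()) return k;
        }
        return -1;
    }

    public static void main(String[] args) {
        String doc = "<a>x &lt; y &amp; z</a>";
        int start = 3;
        int len = doc.indexOf("</a>") - start;
        System.out.println(matches(doc, start, len, "x < y & z"));
    }
}
```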

    For "ne_parseInt(String s1, int offset, int length)", one may get a reference to the character array internally, or retrieve individual characters via the member method charAt(int).


