Article:
 Non-Extractive Parsing for XML
Subject: Many misunderstandings...
Date: 2004-05-20 08:22:25
From: Brian Ewins

The author mischaracterises pull parsers. They do pretty much exactly what is described in this article: they don't "extract" strings at all unless you ask them to. Using StAX, exactly the desired pointers can be fetched like so:


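import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;

XMLInputFactory f = XMLInputFactory.newInstance();
XMLEventReader r = f.createXMLEventReader(...);
while (r.hasNext()) {
    XMLEvent e = r.nextEvent();
    Location loc = e.getLocation();
    // now have system id and offset: loc.getSystemId(), loc.getCharacterOffset()
}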


However, the problem isn't anything like as simple as described in this article anyway: parsing an XML document involves resolving entity references, which can synthesise markup, include other documents, and so on. The use of default values in both XML Schema and DTDs also means that the PSVI (the "deserialized" document data) contains items that have /no/ location in the original document. Pointers into the document just aren't enough, and /may not even be possible/.
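
To make the point about items with no location concrete, here is a minimal sketch. It uses the JDK's DOM parser rather than StAX purely for brevity, and the class name, the document, the "currency" default and the "tax" entity are all made up for illustration. After parsing, the tree holds an attribute that was never written on the element and a child element whose markup exists only inside an entity declaration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; the document below is invented for this example.
final class NoLocationDemo {
    static final String XML =
          "<!DOCTYPE price [\n"
        + "  <!ATTLIST price currency CDATA \"USD\">\n"   // attribute default
        + "  <!ENTITY tax \"<tax>5</tax>\">\n"            // entity that contains markup
        + "]>\n"
        + "<price>10&tax;</price>";

    // Parses XML with the default (non-validating) DOM parser and returns the root element.
    static Element parseRoot() {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element price = parseRoot();
        Attr currency = price.getAttributeNode("currency");
        // The value comes from the DTD, not from the <price> start tag.
        System.out.println(currency.getValue());
        // getSpecified() reports whether the attribute was written explicitly in the document.
        System.out.println(currency.getSpecified());
        // The <tax> element was produced by expanding &tax;.
        System.out.println(price.getElementsByTagName("tax").getLength());
    }
}
```

Neither the "currency" attribute nor the "tax" element corresponds to a range of characters at the place where it logically appears in the document body, so an offset/length pair has nothing to point at.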


In any case, some of the code in the article is wrong. ne_parseInt(String s1, int offset, int length) still requires extraction in Java (strings always clone the content they are constructed with). ne_parseInt(char[] s1, int offset, int length) would have been closer to the article's intent, but is still both inefficient and wrong. On large files it requires loading the entire document into memory instead of just "mmap"ping it (using nio in Java). It also treats XML as character data when in fact it's binary data (think about the byte order mark, the 'encoding="..."' attribute in the "xml" PI, character entities, etc.: reading an XML file as an array of characters does not tell you the characters in the PSVI without more work). And it ignores the fact that a field which is apparently an integer could look like "1<!-- 2 -->&nothing;<![CDATA[3]]>" (intended to be parsed as "13"), so pointing at ranges of chars from the original document for parsing numbers may not work at all.
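
A rough illustration of that last point. It assumes the entity "nothing" is declared with empty replacement text (the field itself does not say how it is declared) and uses the JDK's DOM parser for brevity; the class, method and wrapper element names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class comparing "parse the raw range" with "parse the XML".
final class TrickyIntegerDemo {
    // The field content quoted in the comment above.
    static final String FIELD = "1<!-- 2 -->&nothing;<![CDATA[3]]>";

    // Assumed wrapper document: declares &nothing; as an empty entity.
    static final String XML =
        "<!DOCTYPE n [<!ENTITY nothing \"\">]>\n<n>" + FIELD + "</n>";

    // What an XML-aware reader sees: comments dropped, entity expanded, CDATA unwrapped.
    static int parsedValue() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return Integer.parseInt(doc.getDocumentElement().getTextContent());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // What a parser pointed at the raw character range sees: not a number at all.
    static boolean rawRangeParses() {
        try {
            Integer.parseInt(FIELD);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("XML-aware value: " + parsedValue());
        System.out.println("raw range parses as int: " + rawRangeParses());
    }
}
```

The raw range between the start and end tags is 33 characters of markup, while the value the document actually carries is two characters that are not even adjacent in the file.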


The upshot is that what this article describes is only useful in a restricted subset of XML, and only when you care about the serialization of the XML, not its meaning. The only apps I can think of that behave like this are editors. If it's just about preserving things like comments, the XMLBeans API looks better to me; if it's about performance, then StAX is better designed to understand all XML.



  • Many misunderstandings...
    2004-05-22 13:42:49 mweiher [Reply]

    You are not abstracting. You can hide an optimized implementation behind an interface that looks just like a regular string. In fact, this is exactly what MPWXmlKit does. To the client, it just provides NSString-compatible objects. Inside they are highly optimized and just point inside the original data, but if something happens that can't be handled that way, the kit can fall back to returning a more "normal" NSString.
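
A loose Java analogue of that idea, as a sketch only: it is not MPWXmlKit's code, and the class name "Slice", the factory "of" and the fallback rule (copy and decode whenever the range contains an '&', handling only the five predefined entities) are assumptions made for illustration. The client only ever sees a CharSequence.

```java
// Hypothetical Java analogue; MPWXmlKit itself is Objective-C and works with NSString.
final class Slice implements CharSequence {
    private final char[] buf;   // the original document, shared, never copied
    private final int offset;
    private final int length;

    private Slice(char[] buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }

    // Returns a view into doc when possible, otherwise an ordinary decoded String.
    static CharSequence of(char[] doc, int offset, int length) {
        if (offset < 0 || length < 0 || offset > doc.length - length) {
            throw new IndexOutOfBoundsException();
        }
        for (int i = offset; i < offset + length; i++) {
            if (doc[i] == '&') {
                // Fallback path: the range needs decoding, so hand back a "normal" string.
                // Assumption: only the five predefined entities occur; "&amp;" is replaced last.
                return new String(doc, offset, length)
                        .replace("&lt;", "<").replace("&gt;", ">")
                        .replace("&quot;", "\"").replace("&apos;", "'")
                        .replace("&amp;", "&");
            }
        }
        return new Slice(doc, offset, length);
    }

    @Override public int length() { return length; }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length) throw new IndexOutOfBoundsException();
        return buf[offset + index];
    }

    @Override public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) throw new IndexOutOfBoundsException();
        return new Slice(buf, offset + start, end - start);
    }

    // Copies the characters only when a real String is explicitly requested.
    @Override public String toString() { return new String(buf, offset, length); }
}
```

Code that takes a CharSequence cannot tell which of the two implementations it received, which is the point being made about polymorphism.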


    Polymorphism is a wonderful thing ;-)


    Marcel


  • Many misunderstandings...
    2004-05-20 11:35:21 jimmy_z [Reply]

    Thanks for posting the question.


    I am not particularly familiar with StAX, so there might be ways to refer to tokens after the StAX parser consumes the entire document.


    For "non-extractive" parsing, encoded characters can be decoded on the fly when compared against a Java string.


    Entities (especially built-in ones) can also be resolved on the fly during comparison.
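
A minimal sketch of what such a comparison could look like, under stated assumptions: the class and method names are hypothetical, the document is already available as a char array, and only the five predefined entities are resolved. Character references and declared entities would need more work.

```java
// Hypothetical helper: compares a range of the document with a Java string
// without first extracting the range into a new String.
final class LazyCompare {
    // Returns true if doc[offset, offset + length), with predefined entities
    // decoded on the fly, equals expected.
    static boolean matches(char[] doc, int offset, int length, String expected) {
        int i = offset;
        final int end = offset + length;
        int j = 0;                       // position in expected
        while (i < end) {
            char c = doc[i];
            if (c == '&') {
                int semi = indexOf(doc, ';', i + 1, end);
                if (semi < 0) return false;          // unterminated reference
                int decoded = predefined(doc, i + 1, semi);
                if (decoded < 0) return false;       // not one of the five built-ins
                c = (char) decoded;
                i = semi + 1;
            } else {
                i++;
            }
            if (j >= expected.length() || expected.charAt(j) != c) return false;
            j++;
        }
        return j == expected.length();
    }

    private static int indexOf(char[] a, char ch, int from, int to) {
        for (int k = from; k < to; k++) {
            if (a[k] == ch) return k;
        }
        return -1;
    }

    // Maps the entity name in a[from, to) to its character, or -1 if unknown.
    private static int predefined(char[] a, int from, int to) {
        if (regionIs(a, from, to, "lt")) return '<';
        if (regionIs(a, from, to, "gt")) return '>';
        if (regionIs(a, from, to, "amp")) return '&';
        if (regionIs(a, from, to, "quot")) return '"';
        if (regionIs(a, from, to, "apos")) return '\'';
        return -1;
    }

    private static boolean regionIs(char[] a, int from, int to, String name) {
        if (to - from != name.length()) return false;
        for (int k = 0; k < name.length(); k++) {
            if (a[from + k] != name.charAt(k)) return false;
        }
        return true;
    }
}
```

For example, with the document "<t>a &lt; b</t>" the range between the tags matches the Java string "a < b" without any intermediate string being created.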

    For "ne_parseInt(String s1, int offset, int length)", one may get the reference to the character array internally, or retrieve individual characters with the member method charAt(int).
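
A sketch of the charAt(int) variant, assuming plain ASCII decimal digits with an optional sign. The signature is the one quoted from the article; the body and the class name are assumptions and not the article's actual code.

```java
// Hypothetical holder class for the sketch.
final class NeParse {
    // Parses s[offset, offset + length) as a decimal int, reading characters
    // in place with charAt() instead of calling substring().
    static int ne_parseInt(String s, int offset, int length) {
        if (length <= 0) throw new NumberFormatException("empty range");
        int i = offset;
        final int end = offset + length;
        boolean negative = false;
        char first = s.charAt(i);
        if (first == '-' || first == '+') {
            negative = (first == '-');
            i++;
            if (i == end) throw new NumberFormatException("sign without digits");
        }
        long value = 0;  // accumulate in a long so the range check stays simple
        while (i < end) {
            int digit = s.charAt(i++) - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at index " + (i - 1));
            }
            value = value * 10 + digit;
            if (value > 2147483648L) throw new NumberFormatException("out of int range");
        }
        if (!negative && value > Integer.MAX_VALUE) {
            throw new NumberFormatException("out of int range");
        }
        return (int) (negative ? -value : value);
    }
}
```

Called as NeParse.ne_parseInt("<n>-42</n>", 3, 3), this reads the three characters "-42" in place. On Java 9 and later, Integer.parseInt(CharSequence, int, int, int) in the standard library offers the same kind of range-based parsing.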


