The author mischaracterises pull parsers: they do pretty much exactly what is described in this article, and they don't "extract" strings at all unless you ask them to. Using StAX, exactly the desired pointers can be fetched like so:
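import javax.xml.stream.Location;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

XMLInputFactory f = XMLInputFactory.newInstance();
XMLEventReader r = f.createXMLEventReader(...);
while (r.hasNext()) {
    XMLEvent e = r.nextEvent();
    Location loc = e.getLocation();
    // loc.getSystemId() and loc.getCharacterOffset() now give the system ID and offset
}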
However, the problem is nowhere near as simple as the article describes anyway: parsing an XML document involves resolving entity references, which can synthesise markup, include other documents, and so on. The use of default values in both XML Schema and DTDs also means that the PSVI (the "deserialized" document data) contains items that have /no/ location in the original document. Pointers into the document just aren't enough, and /may not even be possible/.
In any case, some of the code in the article is wrong. ne_parseInt(String s1, int offset, int length) still requires extraction in Java (strings always copy the content they are constructed with); ne_parseInt(char[] s1, int offset, int length) would have been closer to the article's intent, but is still both inefficient and wrong. On large files it requires loading the entire document into memory instead of just "mmap"ping it (using NIO in Java). It also treats XML as character data when in fact it's binary data (think about the byte order mark, the 'encoding="..."' attribute in the "xml" PI, character entities, etc.: reading an XML file as an array of characters does not tell you the characters in the PSVI without more work). And it ignores the fact that a field which is apparently an integer could look like "1<!-- 2 -->&nothing;<![CDATA[3]]>" (intended to be parsed as "13"), so pointing at ranges of chars from the original document for parsing numbers may not work at all.
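To make the last point concrete, here is a minimal sketch of reading such a field through StAX rather than through a char range. The class name ElementTextDemo and the helper rootElementAsInt are made up for illustration, and the entity reference from the example above is left out so that the sketch needs no DTD. It relies on XMLStreamReader.getElementText(), which concatenates character data and CDATA sections and skips comments:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ElementTextDemo {
    // Hypothetical helper (not from the article): parses the text content
    // of the root element as an int.
    public static int rootElementAsInt(String xml) {
        try {
            XMLInputFactory f = XMLInputFactory.newInstance();
            XMLStreamReader r = f.createXMLStreamReader(new StringReader(xml));
            try {
                // Advance to the root element's start tag.
                while (r.next() != XMLStreamConstants.START_ELEMENT) {
                    // skip anything before the root element
                }
                // getElementText() joins CHARACTERS and CDATA events and
                // skips comments and processing instructions.
                return Integer.parseInt(r.getElementText().trim());
            } finally {
                r.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // The digits "1" and "3" are not contiguous in the source text.
        String xml = "<n>1<!-- 2 --><![CDATA[3]]></n>";
        System.out.println(rootElementAsInt(xml)); // prints 13
    }
}
```

No single (offset, length) range over the original characters covers the value "13" here; it only exists after the parser has assembled it.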
The upshot is that what this article describes is only useful for a restricted subset of XML, and only when you care about the serialization of the XML, not its meaning. The only apps I can think of that behave like this are editors. If it's just about preserving things like comments, the XMLBeans API looks better to me; if it's about performance, then StAX is better designed to understand all XML.