Non-Extractive Parsing for XML
Introduction
Text processing is one of the most common tasks in application development. Whether it is a Java Servlet or a VOIP application, the conversion from a raw text-based input message to a machine-readable internal representation almost always requires parsing (or tokenization), which, in its current form, refers to a process of extracting tokens (the smallest unit of relevent textual information) and storing them in null-terminated memory blocks, also known as strings. Over the years, people have invented various automation techniques and tools, e.g. regular expression and Lex, to reduce the complexity of manual parsing. Proven both useful and stable, those techniques and tools have stood the test of time. As a result, the current framework of text processing is generally considered to be fairly well-established.
In this article, I propose an alternative to existing "extractive" style of parsing that sits at the lowest layer of the text processing "stack." Furthermore, I intend to show that an arguably small perspective change on the representation of a token can lead to some interesting properties not readily available from today's XML processing techniques. In fact, applying this technique to XML processing can yield some of the advantages being promoted by "binary XML" advocates while still retaining the XML 1.x textual format.
Text Processing: A Non-extractive style
|
Related Reading
|
As the logical first step, people use a piece of code called a "tokenizer" to split input text into little tokens. Under the traditional text processing framework, the tokenization step usually involves the change of storage location of the token content. One usually performs the following steps to achieve the purpose of tokenization in C:
-
Detect the token length (n) in the source document
-
Dynamically allocate a small memory block of the size n+1.
-
Copy the content of the token from the source document to the allocated memory block and terminate the token string with a null character
-
(Typically) discard the source document
Consider the following snippet of an HTTP header as an example.
Accept: */*
Accept-Language: en-us
Connection: Keep-Alive
Host: localhost
Referer: http://localhost/links.asp
After the tokenization step, we have created ten null-terminated string objects, each representing a field in this HTTP header fragment. From this point on, subsequent post processing steps requiring read access would assume those tokens to be the atomic information storage units.
Accept\0"
"*/*\0"
"Accept-Language\0"
"en-us\0"
"Connection\0"
"Keep-Alive\0"
"Host\0"
"localhost\0"
"Referer\0"
"http://localhost/links.asp\0"
However, we would like to observe that, with some help from pointer arithmetic, the aforementioned tokenization step isn't necessarily the only way for one to gain read-access to the content of a token. Rather than extracting tokens out of the source document, one can instead choose to record a pair of integers representing the starting offset value and length for each token as appeared in the source document, while maintaining the source document in memory.
Using the same HTTP header fragment, the new output, consisting of the source document and an array of integers, is shown below. As an example, we use the first pair of integers (0, 6) to describe the token "Accept", indicating that it starts at offset 0 of the source document, and is 6 characters in length.
Accept: */*
Accept-Language: en-us
Connection: Keep-Alive
Host: localhost
Referer: http://localhost/links.asp
0, 6
9, 3
13, 15
30, 5
36, 10
48, 10
59, 4
65, 9
75, 7
84, 25
This "non-extractive" style of tokenization has some interesting properties:
-
One can remove, add or update any token without reserializing the entire file. Using the traditional "extractive" style of parsing, one may need to perform numerous string concatenations in order to compose the updated document. The new "non-extractive" style allows one to directly copy the content from unmodified regions. In the example above, if one wants to replace the token "
localhost" whose offset and length values are 65 and 7 respectively, one simply:a) copies the content from the beginning to offset value of 64,
b) appends the new token content, then
c) completes the operation by appending the remaining "unchanged" content in the source document starting from offset value of 74 (65+9).
When there are a large number of tokens in the source document, the performance improvement resulting from this non-extractive style of parsing can be quite significant as compared to the extractive style.
-
One can remove, add or update a group of adjacent tokens without reserializing the entire file. Similar to #1, one can treat a group of adjacent tokens as a single large token by making calculations based on the starting offset and length values of the first and last token of the group.
-
One can address any fragment of the source message by a pair of integer (offset + length). Because a fragment in a source message consists of a group of adjacent tokens, one can address, or even take out, the fragment from the source message by calculating, the starting offset and length of the fragment (as in #2).
-
With #3, one can efficiently pull out and splice together fragments from multiple source messages to compose new messages. An interesting application of #3 is that one can efficiently compose a new document by appending together fragments of several documents, as long as each document is tokenized "non-extractively."
-
The original text messages is fully- preserved and dealing with whitespace is easier. Because one maintains the original document in memory, no information is lost after the "non-extractive" tokenization step. To remove the leading or trailing whitespace, one simply recalculates the starting offset and length of a "trimmed" token.
-
One can write a pair of integers for every token in the source message into a binary file that can be persisted on disk. To understand this property, observe that "extractive" tokenization requires the use of absolute pointers that are difficult to be made persistent, whereas the "non-extractive" tokenization uses "soft" pointers (starting offset) and relies on runtime pointer arithmetic to gain access to the token content.
-
With #6, pre-parsing a text message or "parse once, use many times" become possible. When you load both the source message and the binary file (containing offset and length of tokens), you don't have to parse every time you process the source message in a read-only manner. Its immediate benefit is that one can generate the binary file and save it along with the source document prior to any ensuing post-processing.
Does It Work?
Let's examine how non-extractive parsing works in practice. As an
example, consider how to compare a token with a known string value.
In C's <string.h> library (shown below), we can
straightforwardly use the function "strncmp()"
for that purpose.
#include <string.h>
int strncmp(const char *s1, const char *s2, size_t n);
Suppose sm points to the source message and
s1 points to a known string value to be compared
against, one simply sets the value s2 to the value
(sm + starting_offset),
and n to the length of the token. In other words, the
support of comparison for both "extractive" tokenization
(i.e. strcmp) and "non-extractive"
tokenization (i.e. strncmp) actually exists since very
early. Yet most string-to-data conversion macros or functions,
e.g. atoi, atof and Java's
parseInt assume tokens in the "extractive" sense. To
support the new "non-extractive" tokenization, one can
create a mirror set of functions,
e.g. ne_atoi, ne_atof
and ne_parseInt (ne stands
for non-extractive). We compare function signatures
between those two functions sets in the table below.
| Macros/Functions to work with "Extractive" Tokenization | Macros/Functions to work with Non-Extractive Tokenization |
atoi(char* s1)
|
ne_atoi(char* sm, int offset, int
length)
|
atof(char* s1)
|
ne_atoi(char* sm, int offset, int
length)
|
parseInt(String s1)
|
ne_parseInt(String s1, int offset,
int length)
|
Implication for XML Processing
Because XML is a text-based markup language, a "non-extractive" style of XML processing will possess the following properties:
-
One can remove, add, or update any token in an XML document without reserializing the entire document. Notice that this comes straight from the first property for "non-extractive" parsing. Using current "extractive" XML processors, such as DOM, SAX or PULL, a simple update can potentially require that the entire document be taken apart into tokens before putting those tokens back. When most part of the document isn't relevant to the modification, redundant re-serialization can incur significant performance overhead.
-
One can remove, add or update any element without reserializing the whole thing every time. An element in an XML document consists of a group of adjacent tokens, such as the starting/ending tags, its text content, attributes and child elements. Applying "non-extractive" tokenization allows one to, similar to #1, avoid reserialization on irrelevant parts of the document associated with "extractive" tokenization. This, again, buys performance.
-
One can address any fragments of an XML document by a pair of integers (offset + length). Those two integers can be calculated from the offset and length of both the first and the last tokens in the fragment. Also the fragment "descriptor" (consisting of those two integers) is persistent, meaning that one can potentially save the descriptor on disk for later use.
-
With # 3, one can efficiently pull out and splice together fragments from multiple XML messages to compose new XML document. Assuming we have a fragment "descriptor" for each document of interest, we can efficiently compose a new XML document by concatenating those fragments like buffers.
-
The original XML messages is fully-preserved and dealing with whitespace is easier. Whitespace handling has been source of controversy among various "extractive" style XML processor implementations. A non-extractive XML processor has the option that extractive ones don't have: Because a non-extractive XML processor maintains the source document in memory, it allows one to recover insignificant whitespace in the worst case.
-
One can write a pair of integers for every token in the XML document into a binary file that can be persisted on disk. In other words, one now has the option to persist the "parsed state" on disk.
-
With #6, preparsing an XML message or "parse once, use many times" becomes possible. When one loads in memory both the XML document and the binary, he can directly process the XML document. If any subsequent processing only involves read access, then one can potentially parse the document once and use it many times. Imagine an XML document tagged along with the parsed state can potentially help reduce the processing overhead incurred by network hops/intermediaries.
In addition, one can combine some of the above properties to create new properties and applications. In his article, Steven J. Vaughan-Nichols wrote that "... Yet another problem is that there's no way to pull out a single string of data, say the contents of a field, from an XML document without having to retrieve and parse the entire document." By combining #1, #6 and #7, one comes up a pretty good solution to the problem.
"Binary-Enhanced" XML
There has been ongoing debate on the topic of binary XML as manifested by the binary XML workshop held by W3C last September. Poor performance of XML processing is often mentioned as one of the reasons behind the recent effort to look for an optimized infoset serialization format. One way to look at the "parse once, use many times" property afforded by a non-extractive parsing style is that it can potentially provide a pre-parsed XML that is also backward compatible with XML--in other words, one loses no advantages inherent to XML, e.g. flexibility and human readability. As an example, an XML search engine using "non-extractive" style pre-parsed XML could allow users to query a large set of XML documents across the Internet using XPath at significantly better throughput than today's XML processing technology, even higher than many proposed binary XML formats.
Summary
I have proposed a new "non-extractive" style of tokenization for text processing purposes. I've also outlined a few of its properties in the context of XML processing. As XML finds more uses as an open document/data encoding format, people will increasingly demand more efficient and sophisticated XML processing capabilities. Although the basic concept of non-extractive parsing is simple, it can potentially lead to a new direction in thinking of XML processing.
| Titles Only | Titles Only | Newest First |
- Working Example
2005-03-14 03:46:48 Julian Turner
I independently happened upon non-extractive thinking, when my web hosting company did not provide an XML parser component.
I wrote my own parse, and mindful of the need to avoid server overheads, came up with the idea of a simple string based XML parser in Javascript, where effectively the XML is retained as a single string, and each "node" you create identifies a start and end point in the string.
I.e. the document is not parsed. What happens is that if you call say "firstChild" the method will use string functions to find the start and end of the the node, and then return a node object. So node objects are only created as you need them.
Because Javascript does not support pointers, the main processing overhead is the need to update these start and end points (if you are doing editing).
If you are doing simple reading, then the only overhead is storing the XML string in memory, which is considerably less demanding than parsing the document.
RegExp patterns are used to implement simple node searching.
See my web site www.baconbutty.com which is based on this + HTTPRequest.
- Buffering document?
2004-06-20 03:26:52 olegtkachenko
Well, another problem is potentially huge memory consumption in streaming scenarios - when input document is represented as forward-only I/O stream.
While SAX and XmlReader in .NET handle it very well, NE parser would need to buffer the entire XML document.
- Buffering document?
2004-06-20 13:26:05 jimmy_z
Also we just released our software under GPL.
it is at http://vtd-xml.sf.net
It is written in Java.
- Buffering document?
2004-06-20 13:22:54 jimmy_z
Thanks for the posting.
NE parsing is more like DOM than SAX, and isn't primiarily intended for streaming parsing, for the reason you point out.
- Buffering document?
- BEWARE: DANGEROUS ERRORS IN ARTICLE!
2004-05-31 07:33:34 OskariOlematon
Hi,
I would like to point out that the coding examples in the article contain at least two very elemental but dangerous C library usage errors. Please do something about them so that such mistakes are not copied by other inexperienced C programmers. The errors are related to C library functions strncmp() and atoi().
strncmp() is a very dangerous function which is better avoided if possible. If not possible, then it should be wrapped to make it easier to use. The error this article propagates is the failure to understand that even a partial match is considered a match. As an example, the following comparison reports the compared strings as equal:
strncmp("tokenNOT!", "...token..." + 3, 5)
This is exactly how the article proposes to compare a known string to a token. In this case they are erroneously reported as equal. Please don't do this! One must consider the whole known string.
The other error is to advocate the use of atoi(). atoi() should not be used because it offers no error detection what so ever. Use strtol() or something else instead.
These kinds of errors a the worst because they are very difficult to find in testing. They can go hiding for years and then emerge to bring your software down once it has been installed at hundreds of client sites. I sincerely wish the article was edited to correct these errors.
Best Regards,
Semi
- BEWARE: DANGEROUS ERRORS IN ARTICLE!
2004-06-01 14:07:00 Michael Maron
Normally, a token is defined something like a sequence of symbols from the certain symbol range, for example, a-zA-Z0-9. All other symbols are considered as token separators. This is exactly how regexps work. So, I really don't know what is the point of using partial rather than complete comparisons.
What is also highly strange is that the article considers an HTTP header, not an XML document as an example:
Consider the following snippet of an HTTP header as an example.
Accept: */*
Accept-Language: en-us
Connection: Keep-Alive
Host: localhost
Referer: http://localhost/links.asp
<question remark="shrugging shoulders">
What all this has to do with XML?
</question>
- BEWARE: DANGEROUS ERRORS IN ARTICLE!
2004-06-01 14:12:16 jimmy_z
The reason I used the example of HTTP is because the representation of token is one layer below XML parsing; in other words, it applies to many types of text processing.
Thanks,
Jimmy
- BEWARE: DANGEROUS ERRORS IN ARTICLE!
2004-06-01 16:45:12 Michael Maron
Well, it would sound reasonable to talk about XML processing using XML documents as examples. As for HTTP headers, working with them is a completely different task from XML document processing.
Best,
Michael
- BEWARE: DANGEROUS ERRORS IN ARTICLE!
- BEWARE: DANGEROUS ERRORS IN ARTICLE!
- BEWARE: DANGEROUS ERRORS IN ARTICLE!
2004-05-31 10:44:19 jimmy_z
Thanks for the posting.
Yes, you have some very good points. The purpose of using those example is to demonstrate the concept, not necessarily to offer details of implementation.
In fact there are lots of things uncovered by this article:
1. encoding: how to compare a UCS-2 string to a UTF-8 or UTF-16 encoded token
2. entity reference: to compare a UCS-2 string to a token containing entities such as (#&s;)
3. Normalized compare: to compare a UCS-2 string against the normalized UC2 view of the token
overall, I am only trying to offer a starting point, definitely not a whole picture.
Thanks again,
Jimmy
- BEWARE: DANGEROUS ERRORS IN ARTICLE!
- Wrong approach
2004-05-28 16:00:36 Michael Maron
One simple question: why would anybody want to fight with basic ideas of XML in the first place? As far as C++ and especially C are concerned, the answer is obvious: working with XML in C/C++ world is not particularly convenient, so one can look for simpler alternatives to standard parsers - like one considered in this article.
From the other side, regular expressions actually can provide a reasonable alternative to full-scale parsing. Unfortunately, regexps are not common for C/C++ as well, so this is hardly a way out.
But in Perl and in Java regexps are ceratinly very useful for simple XML manipualtions. For example, some XML converters can produce non-valid XML documents which can be transformed to valid ones using Perl or Java regexps.
Another serious alternative to XML DOM and SAX is JDBC in Java.
- Wrong approach
2004-05-29 11:43:18 jimmy_z
Thanks for the posting.
The concept of non-extractive parsing is meant to be a general-purpose alternative to extractive parsing; in other words, it is one layer below reg expression and is not tied to any specific language.
- C regexps and XML binaries
2004-05-29 21:28:26 Michael Maron
Just did a Google search on C regexps. Apparently, there are some packages available, for example http://www.ncbi.nlm.nih.gov/IEB/ToolBox/C_DOC/lxr/source/regexp/ Basically, this resoves the problem of simple regexp-based XML parsing.
Now as far as binaries are concerned, it is really hard to see any reasonable justifications to bypass the regular XMK approach.
First way to deal with binaries is to keep them as regular files and refer to them by hyperlinks. Seconldly, images can be kept as blobs in the SQL database and extracted by request.
Best, Michael
- C regexps and XML binaries
- C++ / .NET
2004-05-29 06:39:35 Michael Maron
Sure, one can use C++ to work with XML - in .NET environment.
Another point is that C++ and even C certainly can be used to generate new XML documents from scratch, for example, by simple sprintf(). As for extraction, no, using string manipulation functions is a really miserable solution.
- Wrong approach
- MPWXmlKit has done this for over 5 years
2004-05-22 13:34:36 mweiher
Along with a very fast low-level scanner and special optimizations for CDATA sections, this allowed a parser with a pretty high-level interface (all data provided as Objective-C objects) to compare favorably in speed with the fastest plain C parsers.
The primary reason I wanted this was in order to be able to do partial processing, dealing only with certain sections and effectively + correctly ignoring the rest with both very high speed and bit-by-bit fidelity.
The encoding issues can be dealt with very elegantly the same way the NSString class cluster deals with them: specific subclasses can handle different encoding (styles), and remember what their encoding is. Strings with compatible/identical encodings can just ignore the issue, incompatible strings need to go through a canonical encoding (unicode).
MPWXmlKit can be downloaded from www.metaobject.com.
Keep in mind though, that this is mostly an internal parser trying to deal with specialized requirements, it doesn't have all the features todays parser have, because I have no no need for them.
Marcel
- MPWXmlKit has done this for over 5 years
2004-05-22 15:26:38 jimmy_z
Thanks for the posting.
Using Offset/length approach, one usually loads entire doc in memory undecoded, and doesn't use byte stream to access the document.
Hope using offset/length can complement DOM/SAX in some situations.
- MPWXmlKit has done this for over 5 years
2004-05-23 13:47:05 mweiher
"Hope using offset/length can complement DOM/SAX in some situations."
Hmm...you can actually implement a SAX compatible interface using offset/length (except that you deliver objects for character data, not character arrays). And a DOM on top of that.
Cheers,
Marcel
- MPWXmlKit has done this for over 5 years
- MPWXmlKit has done this for over 5 years
- not new... or usable
2004-05-20 13:35:38 Richard Hough
I used an XML parser, called the Generic Restricted XML parser or something, several years back that did this. I did not find it very useful.
As other posters pointed out, it does not handle encodings or character entities. It does not do normalization, which we needed. It is also inefficient in dealing with large files which you only need a small portion of. These are the majority of XML files in my experience.
The suggestion to do such transformations "on the fly" is a poor one since it requires the processing to be done every time the element is used, rather than just once when the file is loaded. It will also not work if the element refers to external files, such as with XIncludes.
- not new... or usable
2004-05-20 14:09:14 jimmy_z
Thanks for the question.
For a network device, loading XML in a byte array is usually needed for store and forward applications, and is less CPU-intensive than parsing in general.
Decoding on the fly is usually pretty simple, partly because most characters are ASCII ( I could be wrong, but that is my experience thus far)
To test equality of two UCS2 characters
String s1;
String s2;
if (s1.charAt(i) == s2.charAt(k))
{}
the code for testing the equality in "non-extractive" parsing is
String s1;
byte[] xml;
if (s1.charAt(i) == xml[k])
{
}
- not new... or usable
2004-05-21 02:41:48 Brian Ewins
I'll bet you 1 that doesn't work. Nah, I'm feeling flush, lets make it £1.
- not new... or usable
2004-05-21 11:24:49 jimmy_z
Hi, should work whether it is UTF-8 encoded
or a character reference. I am not sure the other one.
Hope I answered your question.
thanks
Jimmy
- not new... or usable
2004-05-23 02:46:20 Brian Ewins
"the code for testing the equality in "non-extractive" parsing is
String s1;
byte[] xml;
if (s1.charAt(i) == xml[k])"..." should work whether it is UTF-8 encoded or a character reference. "
Nope. Here's code to test it in java, the comparisons at the end are variations your theme.
public class EuroTest {
public static void main(String[] argv) throws UnsupportedEncodingException {
byte[] utf8b = new byte[] {(byte) 0xe2, (byte) 0x82, (byte) 0xAC};
byte[] latin9b = new byte[] {(byte) 0xA4};
String utf8 = new String(utf8b, 0, utf8b.length, "utf-8");
String latin9 = new String(latin9b, 0, latin9b.length, "iso-8859-15");
System.out.println(latin9 + "=" + utf8 + " " + latin9.equals(utf8));
printComparison(latin9.charAt(0), utf8.charAt(0));
printComparison(latin9.charAt(0), latin9b[0]);
printComparison(latin9.charAt(0), utf8b[0]);
printComparison(utf8.charAt(0), latin9b[0]);
printComparison(utf8.charAt(0), utf8b[0]);
}
private static void printComparison(char c1, char c2) {
System.out.println(Integer.toHexString(c1) + "=" + Integer.toHexString(c2) + " ? " + (c1 == c2));
}
private static void printComparison(char c1, byte c2) {
System.out.println(Integer.toHexString(c1) + "=" + Integer.toHexString(0xFF & c2) + " ? " + (c1 == c2));
}
}
The output:
= true
20ac=20ac ? true
20ac=a4 ? false
20ac=e2 ? false
20ac=a4 ? false
20ac=e2 ? false
in other words, a character-to-character comparison works, but a character-to-byte comparison doesn't. This is the case for any non-ascii character encoded in UTF-8, but its just the tip of the iceberg in character comparison problems, since UTF-8 permits multiple encodings of the same character.
In your defence, a byte-to-char comparison isn't necessary for your non-extractive parser; but I've already pointed out other problems with that.
- not new... or usable
2004-05-23 11:42:10 jimmy_z
Sorry I misintepreted the previous message.
for iso-8859-1
the code for testing equality
is
if ( s1.charAt(i)== (0xff & xml[k]) ){
}
for handling character reference,
the comparison function simply implements the
behavior of treating "" as an integer value 128.
Thanks
- not new... or usable
- not new... or usable
- not new... or usable
- not new... or usable
- not new... or usable
- Many misunderstandings...
2004-05-20 08:22:25 Brian Ewins
The author mischaracterises Pull parsers, they do pretty much exactly what is described in this article - they don't "extract" strings at all unless you ask them to. Using StAX, exactly the desired pointers can be fetched like so:
XMLInputFactory f = XMLInputFactory.newInstance();
XMLEventReader r = f.createXMLEventReader(...);
while(r.hasNext()) {
XMLEvent e = r.nextEvent();
Location loc = e.getLocation();
// now have systemid, offset
}
However the problem isn't anything like as simple as described in this article anyway: parsing an xml document involves resolving entity references - which can synthesise markup, include other documents, and so on. The use of defaults values in both XML Schema and DTDs also means that the PSVI (the "deserialized" document data) contains items that have /no/ location in the original document. Pointers into the document just aren't enough, and /may not even be possible/.
In any case, some of the code in the article is wrong: ne_parseInt(String s1, int offset, int length) still requires extraction in java (strings always clone the content they are constructed with) - ne_parseInt(char[] s1, int offset, int length) would have been closer to the article's intent, but is still both inefficient and wrong: on large files this requires loading the entire document into memory instead of just "mmap"ping it (using nio in java), plus it treats xml as character data when in fact it's binary data (think about the byte order mark, the 'encoding="..."' attribute in the "xml" PI, character entities, etc: reading an xml file as an array of characters does not tell you the characters in the PSVI without more work); plus it ignores the fact that a field which is apparently an integer could look like "1<!-- 2 -->¬hing;<![CDATA[3]]>" (intended to be parsed as "13"), so pointing at ranges of chars from the original document for parsing numbers may not work at all.
The upshot is that what this article describes is only useful in a restricted subset of XML, and only when you care about the serialization of the XML not its meaning. The only apps I can think of that behave like this are editors? If its just about preserving things like comments, the XMLBeans api looks better to me; if its about performance, then StAX is better designed to understand all XML.
- Many misunderstandings...
2004-05-22 13:42:49 mweiher
You are not abstracting. You can hide an optimized implementation behind an interface that looks just like a regular string. In fact, this is exactly what MPWXmlKit does. To the client, it just provides NSString-compatible objects. Inside they are highly optimized and just point inisde the original data, but if something happens that can't be handled that way, it can resort to returning a more "normal" NSString.
Polymorphism is a wonderful thing ;-)
Marcel
- Many misunderstandings...
2004-05-23 03:14:53 Brian Ewins
Unfortunately in this case, its not me thats not abstracting - its Sun. Yeah, sure you can do all kinds of funky things behind an interface, but String is a final class in java (ie you can't implement your own or subclass it, it just is what it is; hey it wasn't my idea!).
ne_parseInt(String s1, int offset, int length) still isn't the signature to replace "Java's parseInt", if you want to avoid extraction. ne_parseInt(CharSequence s1, int offset, int length), possibly.
- Many misunderstandings...
2004-05-23 10:57:14 mweiher
So Java is broken...tell me something I don't know ;-)
MPWXmlKit isn't implemented in Java, and the examples posted in the article aren't Java either...so how does Java enter into the equation?
Marcel
- Many misunderstandings...
2004-05-23 11:41:15 Brian Ewins
It enters because that's what Jimmy was writing about - a java replacement for a java API. Quoting the article:
"Yet most string-to-data conversion macros or functions, e.g. atoi, atof and Java's parseInt assume tokens in the "extractive" sense. To support the new "non-extractive" tokenization, one can create a mirror set of functions, e.g. ne_atoi, ne_atof and ne_parseInt (ne stands for non-extractive). "
thats why I put quotes around "Java's parseInt" in the previous reply - I'd spotted your stuff was in Objective-C, I know you can do better :)
In java even the supposedly high-performance 'standard' APIs (like SAX) use 'String' everywhere, and so can't be zero-copy like Jimmy's proposal; we have to resort to other hacks - eg I wrote a SAX parser once that used a ternary tree-based StringPool for element/attribute names, it massively reduced the number of strings being created - most parsers use java's String.intern() to pool strings instead, which creates masses of garbage to collect.
Anyway I think I've written enough replies to this article... back to work...
- Many misunderstandings...
2004-05-23 13:37:45 mweiher
"It enters because that's what Jimmy was writing about - a java replacement for a java API. Quoting the article:"
Huh?? He talks about XML/string parsing in general, not about Java in particular. Certainly the references to atoi(), lex, Macros and C are a pretty strong hint that it's not just Java we're talking about here...
But not really important, I think we understand each other :-))
Cheers,
Marcel
- Many misunderstandings...
2004-06-22 12:57:15 jimmy_z
Marcel,
We (XimpleWare) just released our XML processing software and it is at vtd-xml.sf.net. I would like to personally invite you to take a look. Your input and suggestions are very welcome.
Cheers
Jimmy
- Many misunderstandings...
- Many misunderstandings...
- Many misunderstandings...
- Many misunderstandings...
- Many misunderstandings...
- Many misunderstandings...
2004-05-20 11:35:21 jimmy_z
Thanks for posting the question.
I am not particularly familiar with StAX, so there might be ways to refer to tokens after the StAX parser consumes the entire document.
For "non-extractive" parsing, encoded characters can be decoded on the fly when compared against a Java string.
Entities (especially built-in ones) can also be resolved on the fly during comparison.
For "ne_parseInt(String s1, int offset, int length)," one may get the reference to the character array internally or retrieve individual character by the member methods charAt(int).
- Many misunderstandings...
- How you deal with encoding
2004-05-20 02:38:26 Jirka Kosek
I see some problems with you proposal. The one which concerns me most is encoding related.
XML document can be represented in many possible encodings (UTF-8, UTF-16, ISO-8859-1, ...). If you map this document directly into memory without some sort of encoding normalization (which is done in the most today's parsers) you will be forced to manually encode/decode all strings which are read/write from/into document.
Do you have any idea how to deal with this problem.
- How you deal with encoding
2004-06-22 17:40:18 jimmy_z
Jirka,
We (Ximpleware) recently released our software (in Java) under GPL. I would like to personally invite you to visit the project web site (http://vtd-xml.sf.net). Also your suggestions and feedback are very welcome.
Cheers,
Jimmy
- How you deal with encoding
2004-05-20 10:52:57 jimmy_z
Thanks for posting this question.
One way to deal with character encoding is to build "intelligence" into directly various non-extractive string comparison functions.
Most people are used to UCS-2 string representation in their code. So a "non-extractive" comparison function needs to compare UTF-8 tokens (or UTF-16) against UCS-2 strings.
In addition,it may also resolve entity references on the fly during the comparison.
Same thing applies to text to numerical data conversion as well. An non-extractive version of "parseInt" needs to convert a UTF-8 (or UTF-16) token into an integer without "extracting" it out of the source document.
Hope I answered your question.
- How you deal with encoding
