Non-Extractive Parsing for XML

May 19, 2004

Jimmy Zhang


Text processing is one of the most common tasks in application development. Whether it is a Java Servlet or a VOIP application, the conversion from a raw text-based input message to a machine-readable internal representation almost always requires parsing (or tokenization), which, in its current form, refers to a process of extracting tokens (the smallest units of relevant textual information) and storing them in null-terminated memory blocks, also known as strings. Over the years, people have invented various automation techniques and tools, e.g. regular expressions and Lex, to reduce the complexity of manual parsing. Proven both useful and stable, those techniques and tools have stood the test of time. As a result, the current framework of text processing is generally considered to be fairly well-established.

In this article, I propose an alternative to the existing "extractive" style of parsing that sits at the lowest layer of the text processing "stack." Furthermore, I intend to show that an arguably small perspective change on the representation of a token can lead to some interesting properties not readily available from today's XML processing techniques. In fact, applying this technique to XML processing can yield some of the advantages being promoted by "binary XML" advocates while still retaining the XML 1.x textual format.

Text Processing: A Non-Extractive Style

As the logical first step, people use a piece of code called a "tokenizer" to split input text into little tokens. Under the traditional text processing framework, the tokenization step usually changes the storage location of the token content. In C, one usually performs the following steps to tokenize:

  1. Detect the token length (n) in the source document.

  2. Dynamically allocate a small memory block of size n+1.

  3. Copy the content of the token from the source document to the allocated memory block and terminate the token string with a null character.

  4. (Typically) discard the source document.
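As a rough sketch, the four steps above might look like this in C (extract_token is my own name for the helper):

```c
#include <stdlib.h>
#include <string.h>

/* "Extractive" tokenization: given a token's position in the source
   document, copy it into its own null-terminated memory block. */
char *extract_token(const char *src, size_t offset, size_t n)
{
    char *token = malloc(n + 1);     /* step 2: allocate n+1 bytes  */
    if (token == NULL)
        return NULL;
    memcpy(token, src + offset, n);  /* step 3: copy the content... */
    token[n] = '\0';                 /* ...and null-terminate it    */
    return token;  /* step 4: the source may now be discarded       */
}
```

Step 1 (detecting the token length) is assumed to have happened before the call; after step 4, each token lives in its own heap block, independent of the source document.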

Consider the following snippet of an HTTP header as an example.

Accept: */*

Accept-Language: en-us

Connection: Keep-Alive

Host: localhost

Referer: http://localhost/links.asp

After the tokenization step, we have created ten null-terminated string objects, each representing a field in this HTTP header fragment. From this point on, subsequent post-processing steps requiring read access treat those tokens as the atomic units of information storage.


However, we would like to observe that, with some help from pointer arithmetic, the aforementioned tokenization step isn't the only way to gain read access to the content of a token. Rather than extracting tokens out of the source document, one can instead record, for each token, a pair of integers representing its starting offset and length as it appears in the source document, while keeping the source document in memory.

Using the same HTTP header fragment, the new output, consisting of the source document and an array of integers, is shown below. As an example, we use the first pair of integers (0, 6) to describe the token "Accept", indicating that it starts at offset 0 of the source document, and is 6 characters in length.

Accept: */*

Accept-Language: en-us

Connection: Keep-Alive

Host: localhost

Referer: http://localhost/links.asp

         0,    6

         9,    3

        13,    15

        30,    5

        36,    10

        48,    10

        59,    4

        65,    9

        75,    7

        84,    26
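As a minimal sketch in C (the struct and function names are my own), each record is held as a pair of integers while the source document stays in a single buffer:

```c
#include <stdio.h>

/* Non-extractive tokenization: a token "lives" in the source buffer
   and is described by an offset/length pair instead of its own copy. */
struct token {
    int offset;  /* starting offset in the source document */
    int length;  /* token length in characters             */
};

/* Print a token in place, using pointer arithmetic rather than a
   null-terminated copy. */
void print_token(const char *sm, struct token t)
{
    printf("%.*s\n", t.length, sm + t.offset);
}
```

For instance, given the header fragment above in sm, print_token(sm, (struct token){0, 6}) prints "Accept" without ever copying it out of the source document.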


This "non-extractive" style of tokenization has some interesting properties:

  1. One can remove, add or update any token without reserializing the entire file. Using the traditional "extractive" style of parsing, one may need to perform numerous string concatenations in order to compose the updated document. The new "non-extractive" style allows one to directly copy the content from unmodified regions. In the example above, if one wants to replace the token "localhost" whose offset and length values are 65 and 9 respectively, one simply:

    a) copies the content from the beginning of the document through offset 64 (the bytes preceding the token),

    b) appends the new token content, then

    c) completes the operation by appending the remaining "unchanged" content in the source document starting from offset value of 74 (65+9).

    When there are a large number of tokens in the source document, the performance improvement resulting from this non-extractive style of parsing can be quite significant compared to the extractive style.

  2. One can remove, add or update a group of adjacent tokens without reserializing the entire file. Similar to #1, one can treat a group of adjacent tokens as a single large token by making calculations based on the starting offset and length values of the first and last token of the group.

  3. One can address any fragment of the source message with a pair of integers (offset + length). Because a fragment of a source message consists of a group of adjacent tokens, one can address, or even extract, the fragment by calculating its starting offset and length (as in #2).

  4. With #3, one can efficiently pull out and splice together fragments from multiple source messages to compose new messages. In other words, one can efficiently compose a new document by appending fragments of several documents, as long as each document is tokenized "non-extractively."

  5. The original text message is fully preserved, and dealing with whitespace is easier. Because one maintains the original document in memory, no information is lost after the "non-extractive" tokenization step. To remove leading or trailing whitespace, one simply recalculates the starting offset and length of a "trimmed" token.

  6. One can write the pair of integers for every token in the source message into a binary file that can be persisted on disk. To understand this property, observe that "extractive" tokenization produces absolute pointers that are difficult to persist, whereas "non-extractive" tokenization uses "soft" pointers (starting offsets) and relies on runtime pointer arithmetic to access the token content.

  7. With #6, pre-parsing a text message, or "parse once, use many times," becomes possible. When both the source message and the binary file (containing the offset and length of every token) are loaded, one doesn't have to parse again each time the source message is processed in a read-only manner. The immediate benefit is that one can generate the binary file once and save it along with the source document prior to any ensuing post-processing.
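Steps a) through c) of property #1 above can be sketched in C as follows (the function name and memory handling are my own; a production version would need bounds checking):

```c
#include <stdlib.h>
#include <string.h>

/* Replace one token in the source document without reserializing the
   unmodified regions: copy the prefix, append the new content, then
   append the suffix that follows the old token. Returns a newly
   allocated document; the caller frees it. */
char *replace_token(const char *src, size_t src_len,
                    size_t tok_offset, size_t tok_len,
                    const char *new_tok)
{
    size_t new_len = strlen(new_tok);
    char *out = malloc(src_len - tok_len + new_len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, src, tok_offset);                  /* a) prefix       */
    memcpy(out + tok_offset, new_tok, new_len);    /* b) new token    */
    memcpy(out + tok_offset + new_len,             /* c) suffix       */
           src + tok_offset + tok_len,
           src_len - tok_offset - tok_len);
    out[src_len - tok_len + new_len] = '\0';
    return out;
}
```

Only the new token's bytes are "composed"; the prefix and suffix are bulk copies of untouched regions of the source document.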

Does It Work?

Let's examine how non-extractive parsing works in practice. As an example, consider how to compare a token with a known string value. C's <string.h> library provides the function strncmp() (shown below), which handles this directly.

#include <string.h>

int strncmp(const char *s1, const char *s2, size_t n); 

Suppose sm points to the source message and s1 points to the known string value to compare against; one simply sets s2 to (sm + starting_offset) and n to the length of the token. In other words, library support for comparing both "extractive" tokens (i.e. strcmp) and "non-extractive" tokens (i.e. strncmp) has existed for a very long time. Yet most string-to-data conversion macros and functions, e.g. atoi, atof and Java's parseInt, assume tokens in the "extractive" sense. To support the new "non-extractive" tokenization, one can create a mirror set of functions, e.g. ne_atoi, ne_atof and ne_parseInt ("ne" stands for non-extractive). The function signatures of the two sets are compared in the table below.

Macros/Functions ("Extractive")   Macros/Functions (Non-Extractive)
atoi(char* s1)                    ne_atoi(char* sm, int offset, int length)
atof(char* s1)                    ne_atof(char* sm, int offset, int length)
parseInt(String s1)               ne_parseInt(String sm, int offset, int length)
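As an illustration, a hypothetical ne_atoi might be implemented along these lines (decimal integers only, with no overflow or error handling, for brevity):

```c
/* A hypothetical non-extractive atoi: convert the token at the given
   offset/length in the source message without copying it out first.
   Handles an optional leading minus sign and decimal digits only. */
int ne_atoi(const char *sm, int offset, int length)
{
    const char *p = sm + offset;
    int result = 0, sign = 1, i = 0;

    if (length > 0 && p[0] == '-') {
        sign = -1;
        i = 1;
    }
    for (; i < length; i++)
        result = result * 10 + (p[i] - '0');
    return sign * result;
}
```

Note that the function never touches bytes outside the token's range, so no null terminator is needed.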

Implication for XML Processing

Because XML is a text-based markup language, a "non-extractive" style of XML processing will possess the following properties:

  1. One can remove, add, or update any token in an XML document without reserializing the entire document. Notice that this follows directly from the first property of "non-extractive" parsing. Using current "extractive" XML processors, such as DOM, SAX or pull parsers, a simple update can require that the entire document be taken apart into tokens and then put back together. When most of the document isn't relevant to the modification, this redundant reserialization incurs significant performance overhead.

  2. One can remove, add or update any element without reserializing the entire document each time. An element in an XML document consists of a group of adjacent tokens: the start and end tags, text content, attributes and child elements. Applying "non-extractive" tokenization allows one, as in #1, to avoid the reserialization of irrelevant parts of the document associated with "extractive" tokenization. This, again, buys performance.

  3. One can address any fragment of an XML document with a pair of integers (offset + length). Those two integers can be calculated from the offsets and lengths of the first and last tokens in the fragment. The fragment "descriptor" (consisting of those two integers) is also persistent, meaning that one can save it on disk for later use.

  4. With #3, one can efficiently pull out and splice together fragments from multiple XML messages to compose a new XML document. Given a fragment "descriptor" for each document of interest, one can efficiently compose a new XML document by concatenating those fragments like buffers.

  5. The original XML message is fully preserved, and dealing with whitespace is easier. Whitespace handling has been a source of controversy among "extractive" style XML processor implementations. A non-extractive XML processor has an option that extractive ones don't: because it maintains the source document in memory, it can recover insignificant whitespace even in the worst case.

  6. One can write a pair of integers for every token in the XML document into a binary file that can be persisted on disk. In other words, one now has the option to persist the "parsed state" on disk.

  7. With #6, pre-parsing an XML message, or "parse once, use many times," becomes possible. When one loads both the XML document and the binary file into memory, one can process the XML document directly. If all subsequent processing involves only read access, then the document is effectively parsed once and used many times. An XML document shipped along with its parsed state can help reduce the processing overhead incurred at network hops and intermediaries.
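Properties #6 and #7 above can be sketched in C: write the offset/length pairs to a binary file, then reload them later instead of re-parsing (the file layout and function names are my own invention, not a standard format):

```c
#include <stdio.h>

/* Token record: a pair of integers per token. */
struct token {
    int offset;
    int length;
};

/* Persist the "parsed state" so the document need not be re-parsed.
   Returns 0 on success, -1 on failure. */
int save_tokens(const char *path, const struct token *toks, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t written = fwrite(toks, sizeof(struct token), n, f);
    fclose(f);
    return written == n ? 0 : -1;
}

/* Reload the records; together with the original document this
   enables "parse once, use many times". Returns the record count. */
size_t load_tokens(const char *path, struct token *toks, size_t max)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return 0;
    size_t n = fread(toks, sizeof(struct token), max, f);
    fclose(f);
    return n;
}
```

A reader that finds the binary file next to the XML document can skip parsing entirely and jump straight to token access.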

In addition, one can combine some of the above properties to create new properties and applications. In his article, Steven J. Vaughan-Nichols wrote that "... Yet another problem is that there's no way to pull out a single string of data, say the contents of a field, from an XML document without having to retrieve and parse the entire document." By combining #1, #6 and #7, one arrives at a good solution to this problem.

"Binary-Enhanced" XML

There has been ongoing debate on the topic of binary XML, as manifested by the binary XML workshop held by the W3C last September. The poor performance of XML processing is often cited as one of the motivations behind the recent effort to find an optimized infoset serialization format. One way to look at the "parse once, use many times" property afforded by a non-extractive parsing style is that it can provide pre-parsed XML that remains backward compatible with XML--in other words, one loses none of the advantages inherent to XML, e.g. flexibility and human readability. As an example, an XML search engine using "non-extractive" pre-parsed XML could allow users to query a large set of XML documents across the Internet using XPath at significantly better throughput than today's XML processing technology, even higher than many proposed binary XML formats.


I have proposed a new "non-extractive" style of tokenization for text processing. I've also outlined a few of its properties in the context of XML processing. As XML finds more uses as an open document and data encoding format, people will increasingly demand more efficient and sophisticated XML processing capabilities. Although the basic concept of non-extractive parsing is simple, it can lead to a new direction in thinking about XML processing.