Sign In/My Account | View Cart  
advertisement


Listen Print Discuss

Introduction

Text processing is one of the most common tasks in application development. Whether it is a Java Servlet or a VOIP application, the conversion from a raw text-based input message to a machine-readable internal representation almost always requires parsing (or tokenization), which, in its current form, refers to a process of extracting tokens (the smallest unit of relevent textual information) and storing them in null-terminated memory blocks, also known as strings. Over the years, people have invented various automation techniques and tools, e.g. regular expression and Lex, to reduce the complexity of manual parsing. Proven both useful and stable, those techniques and tools have stood the test of time. As a result, the current framework of text processing is generally considered to be fairly well-established.

In this article, I propose an alternative to existing "extractive" style of parsing that sits at the lowest layer of the text processing "stack." Furthermore, I intend to show that an arguably small perspective change on the representation of a token can lead to some interesting properties not readily available from today's XML processing techniques. In fact, applying this technique to XML processing can yield some of the advantages being promoted by "binary XML" advocates while still retaining the XML 1.x textual format.

Text Processing: A Non-extractive style

Related Reading

Learning XML
By Erik T. Ray

As the logical first step, people use a piece of code called a "tokenizer" to split input text into little tokens. Under the traditional text processing framework, the tokenization step usually involves the change of storage location of the token content. One usually performs the following steps to achieve the purpose of tokenization in C:

  1. Detect the token length (n) in the source document

  2. Dynamically allocate a small memory block of the size n+1.

  3. Copy the content of the token from the source document to the allocated memory block and terminate the token string with a null character

  4. (Typically) discard the source document

Consider the following snippet of an HTTP header as an example.

Accept: */*
Accept-Language: en-us
Connection: Keep-Alive
Host: localhost
Referer: http://localhost/links.asp

After the tokenization step, we have created ten null-terminated string objects, each representing a field in this HTTP header fragment. From this point on, subsequent post processing steps requiring read access would assume those tokens to be the atomic information storage units.

Accept\0"
"*/*\0"
"Accept-Language\0"
"en-us\0"
"Connection\0"
"Keep-Alive\0"
"Host\0"
"localhost\0"
"Referer\0"
"http://localhost/links.asp\0"

However, we would like to observe that, with some help from pointer arithmetic, the aforementioned tokenization step isn't necessarily the only way for one to gain read-access to the content of a token. Rather than extracting tokens out of the source document, one can instead choose to record a pair of integers representing the starting offset value and length for each token as appeared in the source document, while maintaining the source document in memory.

Using the same HTTP header fragment, the new output, consisting of the source document and an array of integers, is shown below. As an example, we use the first pair of integers (0, 6) to describe the token "Accept", indicating that it starts at offset 0 of the source document, and is 6 characters in length.

Accept: */*
Accept-Language: en-us
Connection: Keep-Alive
Host: localhost
Referer: http://localhost/links.asp

         0,    6
         9,    3
        13,    15
        30,    5
        36,    10
        48,    10
        59,    4
        65,    9
        75,    7
        84,    25
                     

This "non-extractive" style of tokenization has some interesting properties:

  1. One can remove, add or update any token without reserializing the entire file. Using the traditional "extractive" style of parsing, one may need to perform numerous string concatenations in order to compose the updated document. The new "non-extractive" style allows one to directly copy the content from unmodified regions. In the example above, if one wants to replace the token "localhost" whose offset and length values are 65 and 7 respectively, one simply:

    a) copies the content from the beginning to offset value of 64,

    b) appends the new token content, then

    c) completes the operation by appending the remaining "unchanged" content in the source document starting from offset value of 74 (65+9).

    When there are a large number of tokens in the source document, the performance improvement resulting from this non-extractive style of parsing can be quite significant as compared to the extractive style.

  2. One can remove, add or update a group of adjacent tokens without reserializing the entire file. Similar to #1, one can treat a group of adjacent tokens as a single large token by making calculations based on the starting offset and length values of the first and last token of the group.

  3. One can address any fragment of the source message by a pair of integer (offset + length). Because a fragment in a source message consists of a group of adjacent tokens, one can address, or even take out, the fragment from the source message by calculating, the starting offset and length of the fragment (as in #2).

  4. With #3, one can efficiently pull out and splice together fragments from multiple source messages to compose new messages. An interesting application of #3 is that one can efficiently compose a new document by appending together fragments of several documents, as long as each document is tokenized "non-extractively."

  5. The original text messages is fully- preserved and dealing with whitespace is easier. Because one maintains the original document in memory, no information is lost after the "non-extractive" tokenization step. To remove the leading or trailing whitespace, one simply recalculates the starting offset and length of a "trimmed" token.

  6. One can write a pair of integers for every token in the source message into a binary file that can be persisted on disk. To understand this property, observe that "extractive" tokenization requires the use of absolute pointers that are difficult to be made persistent, whereas the "non-extractive" tokenization uses "soft" pointers (starting offset) and relies on runtime pointer arithmetic to gain access to the token content.

  7. With #6, pre-parsing a text message or "parse once, use many times" become possible. When you load both the source message and the binary file (containing offset and length of tokens), you don't have to parse every time you process the source message in a read-only manner. Its immediate benefit is that one can generate the binary file and save it along with the source document prior to any ensuing post-processing.

Does It Work?

Let's examine how non-extractive parsing works in practice. As an example, consider how to compare a token with a known string value. In C's <string.h> library (shown below), we can straightforwardly use the function "strncmp()" for that purpose.

#include <string.h>
int strncmp(const char *s1, const char *s2, size_t n); 

Suppose sm points to the source message and s1 points to a known string value to be compared against, one simply sets the value s2 to the value (sm + starting_offset), and n to the length of the token. In other words, the support of comparison for both "extractive" tokenization (i.e. strcmp) and "non-extractive" tokenization (i.e. strncmp) actually exists since very early. Yet most string-to-data conversion macros or functions, e.g. atoi, atof and Java's parseInt assume tokens in the "extractive" sense. To support the new "non-extractive" tokenization, one can create a mirror set of functions, e.g. ne_atoi, ne_atof and ne_parseInt (ne stands for non-extractive). We compare function signatures between those two functions sets in the table below.

Macros/Functions to work with "Extractive" Tokenization Macros/Functions to work with Non-Extractive Tokenization
atoi(char* s1) ne_atoi(char* sm, int offset, int length)
atof(char* s1) ne_atoi(char* sm, int offset, int length)
parseInt(String s1) ne_parseInt(String s1, int offset, int length)

Implication for XML Processing

Because XML is a text-based markup language, a "non-extractive" style of XML processing will possess the following properties:

  1. One can remove, add, or update any token in an XML document without reserializing the entire document. Notice that this comes straight from the first property for "non-extractive" parsing. Using current "extractive" XML processors, such as DOM, SAX or PULL, a simple update can potentially require that the entire document be taken apart into tokens before putting those tokens back. When most part of the document isn't relevant to the modification, redundant re-serialization can incur significant performance overhead.

  2. One can remove, add or update any element without reserializing the whole thing every time. An element in an XML document consists of a group of adjacent tokens, such as the starting/ending tags, its text content, attributes and child elements. Applying "non-extractive" tokenization allows one to, similar to #1, avoid reserialization on irrelevant parts of the document associated with "extractive" tokenization. This, again, buys performance.

  3. One can address any fragments of an XML document by a pair of integers (offset + length). Those two integers can be calculated from the offset and length of both the first and the last tokens in the fragment. Also the fragment "descriptor" (consisting of those two integers) is persistent, meaning that one can potentially save the descriptor on disk for later use.

  4. With # 3, one can efficiently pull out and splice together fragments from multiple XML messages to compose new XML document. Assuming we have a fragment "descriptor" for each document of interest, we can efficiently compose a new XML document by concatenating those fragments like buffers.

  5. The original XML messages is fully-preserved and dealing with whitespace is easier. Whitespace handling has been source of controversy among various "extractive" style XML processor implementations. A non-extractive XML processor has the option that extractive ones don't have: Because a non-extractive XML processor maintains the source document in memory, it allows one to recover insignificant whitespace in the worst case.

  6. One can write a pair of integers for every token in the XML document into a binary file that can be persisted on disk. In other words, one now has the option to persist the "parsed state" on disk.

  7. With #6, preparsing an XML message or "parse once, use many times" becomes possible. When one loads in memory both the XML document and the binary, he can directly process the XML document. If any subsequent processing only involves read access, then one can potentially parse the document once and use it many times. Imagine an XML document tagged along with the parsed state can potentially help reduce the processing overhead incurred by network hops/intermediaries.

In addition, one can combine some of the above properties to create new properties and applications. In his article, Steven J. Vaughan-Nichols wrote that "... Yet another problem is that there's no way to pull out a single string of data, say the contents of a field, from an XML document without having to retrieve and parse the entire document." By combining #1, #6 and #7, one comes up a pretty good solution to the problem.

"Binary-Enhanced" XML

There has been ongoing debate on the topic of binary XML as manifested by the binary XML workshop held by W3C last September. Poor performance of XML processing is often mentioned as one of the reasons behind the recent effort to look for an optimized infoset serialization format. One way to look at the "parse once, use many times" property afforded by a non-extractive parsing style is that it can potentially provide a pre-parsed XML that is also backward compatible with XML--in other words, one loses no advantages inherent to XML, e.g. flexibility and human readability. As an example, an XML search engine using "non-extractive" style pre-parsed XML could allow users to query a large set of XML documents across the Internet using XPath at significantly better throughput than today's XML processing technology, even higher than many proposed binary XML formats.

Summary

I have proposed a new "non-extractive" style of tokenization for text processing purposes. I've also outlined a few of its properties in the context of XML processing. As XML finds more uses as an open document/data encoding format, people will increasingly demand more efficient and sophisticated XML processing capabilities. Although the basic concept of non-extractive parsing is simple, it can potentially lead to a new direction in thinking of XML processing.


Comment on this articleWhat's your opinion on this technique?
(* You must be a
member of XML.com to use this feature.)
Comment on this Article


Titles Only Full Threads Newest First
  • Working Example
    2005-03-14 03:46:48 Julian Turner [Reply]

    I independently happened upon non-extractive thinking, when my web hosting company did not provide an XML parser component.


    I wrote my own parse, and mindful of the need to avoid server overheads, came up with the idea of a simple string based XML parser in Javascript, where effectively the XML is retained as a single string, and each "node" you create identifies a start and end point in the string.


    I.e. the document is not parsed. What happens is that if you call say "firstChild" the method will use string functions to find the start and end of the the node, and then return a node object. So node objects are only created as you need them.


    Because Javascript does not support pointers, the main processing overhead is the need to update these start and end points (if you are doing editing).


    If you are doing simple reading, then the only overhead is storing the XML string in memory, which is considerably less demanding than parsing the document.


    RegExp patterns are used to implement simple node searching.


    See my web site www.baconbutty.com which is based on this + HTTPRequest.











  • Buffering document?
    2004-06-20 03:26:52 olegtkachenko [Reply]

    Well, another problem is potentially huge memory consumption in streaming scenarios - when input document is represented as forward-only I/O stream.
    While SAX and XmlReader in .NET handle it very well, NE parser would need to buffer the entire XML document.


  • BEWARE: DANGEROUS ERRORS IN ARTICLE!
    2004-05-31 07:33:34 OskariOlematon [Reply]

    Hi,


    I would like to point out that the coding examples in the article contain at least two very elemental but dangerous C library usage errors. Please do something about them so that such mistakes are not copied by other inexperienced C programmers. The errors are related to C library functions strncmp() and atoi().


    strncmp() is a very dangerous function which is better avoided if possible. If not possible, then it should be wrapped to make it easier to use. The error this article propagates is the failure to understand that even a partial match is considered a match. As an example, the following comparison reports the compared strings as equal:


    strncmp("tokenNOT!", "...token..." + 3, 5)


    This is exactly how the article proposes to compare a known string to a token. In this case they are erroneously reported as equal. Please don't do this! One must consider the whole known string.


    The other error is to advocate the use of atoi(). atoi() should not be used because it offers no error detection what so ever. Use strtol() or something else instead.


    These kinds of errors a the worst because they are very difficult to find in testing. They can go hiding for years and then emerge to bring your software down once it has been installed at hundreds of client sites. I sincerely wish the article was edited to correct these errors.


    Best Regards,
    Semi

  • Wrong approach
    2004-05-28 16:00:36 Michael Maron [Reply]

    One simple question: why would anybody want to fight with basic ideas of XML in the first place? As far as C++ and especially C are concerned, the answer is obvious: working with XML in C/C++ world is not particularly convenient, so one can look for simpler alternatives to standard parsers - like one considered in this article.


    From the other side, regular expressions actually can provide a reasonable alternative to full-scale parsing. Unfortunately, regexps are not common for C/C++ as well, so this is hardly a way out.


    But in Perl and in Java regexps are ceratinly very useful for simple XML manipualtions. For example, some XML converters can produce non-valid XML documents which can be transformed to valid ones using Perl or Java regexps.


    Another serious alternative to XML DOM and SAX is JDBC in Java.

  • MPWXmlKit has done this for over 5 years
    2004-05-22 13:34:36 mweiher [Reply]

    Along with a very fast low-level scanner and special optimizations for CDATA sections, this allowed a parser with a pretty high-level interface (all data provided as Objective-C objects) to compare favorably in speed with the fastest plain C parsers.


    The primary reason I wanted this was in order to be able to do partial processing, dealing only with certain sections and effectively + correctly ignoring the rest with both very high speed and bit-by-bit fidelity.


    The encoding issues can be dealt with very elegantly the same way the NSString class cluster deals with them: specific subclasses can handle different encoding (styles), and remember what their encoding is. Strings with compatible/identical encodings can just ignore the issue, incompatible strings need to go through a canonical encoding (unicode).


    MPWXmlKit can be downloaded from www.metaobject.com.


    Keep in mind though, that this is mostly an internal parser trying to deal with specialized requirements, it doesn't have all the features todays parser have, because I have no no need for them.


    Marcel


  • not new... or usable
    2004-05-20 13:35:38 Richard Hough [Reply]

    I used an XML parser, called the Generic Restricted XML parser or something, several years back that did this. I did not find it very useful.


    As other posters pointed out, it does not handle encodings or character entities. It does not do normalization, which we needed. It is also inefficient in dealing with large files which you only need a small portion of. These are the majority of XML files in my experience.


    The suggestion to do such transformations "on the fly" is a poor one since it requires the processing to be done every time the element is used, rather than just once when the file is loaded. It will also not work if the element refers to external files, such as with XIncludes.

  • Many misunderstandings...
    2004-05-20 08:22:25 Brian Ewins [Reply]

    The author mischaracterises Pull parsers, they do pretty much exactly what is described in this article - they don't "extract" strings at all unless you ask them to. Using StAX, exactly the desired pointers can be fetched like so:


    XMLInputFactory f = XMLInputFactory.newInstance();
    XMLEventReader r = f.createXMLEventReader(...);
    while(r.hasNext()) {
    XMLEvent e = r.nextEvent();
    Location loc = e.getLocation();
    // now have systemid, offset
    }


    However the problem isn't anything like as simple as described in this article anyway: parsing an xml document involves resolving entity references - which can synthesise markup, include other documents, and so on. The use of defaults values in both XML Schema and DTDs also means that the PSVI (the "deserialized" document data) contains items that have /no/ location in the original document. Pointers into the document just aren't enough, and /may not even be possible/.


    In any case, some of the code in the article is wrong: ne_parseInt(String s1, int offset, int length) still requires extraction in java (strings always clone the content they are constructed with) - ne_parseInt(char[] s1, int offset, int length) would have been closer to the article's intent, but is still both inefficient and wrong: on large files this requires loading the entire document into memory instead of just "mmap"ping it (using nio in java), plus it treats xml as character data when in fact it's binary data (think about the byte order mark, the 'encoding="..."' attribute in the "xml" PI, character entities, etc: reading an xml file as an array of characters does not tell you the characters in the PSVI without more work); plus it ignores the fact that a field which is apparently an integer could look like "1<!-- 2 -->¬hing;<![CDATA[3]]>" (intended to be parsed as "13"), so pointing at ranges of chars from the original document for parsing numbers may not work at all.


    The upshot is that what this article describes is only useful in a restricted subset of XML, and only when you care about the serialization of the XML not its meaning. The only apps I can think of that behave like this are editors? If its just about preserving things like comments, the XMLBeans api looks better to me; if its about performance, then StAX is better designed to understand all XML.




  • How you deal with encoding
    2004-05-20 02:38:26 Jirka Kosek [Reply]

    I see some problems with you proposal. The one which concerns me most is encoding related.


    XML document can be represented in many possible encodings (UTF-8, UTF-16, ISO-8859-1, ...). If you map this document directly into memory without some sort of encoding normalization (which is done in the most today's parsers) you will be forced to manually encode/decode all strings which are read/write from/into document.


    Do you have any idea how to deal with this problem.