XML Parser Benchmarks: Part 1
Five years after the introduction of SOAP 1.0, XML parsing is still the main bottleneck in web service performance. In search of components for a high-performance web service security solution, we benchmarked various XML parsers in Java and C. These benchmarks cover event-driven parser models like SAX and StAX, object model parsers like DOM, and newer kinds of XML parsers like Apache's AXIOM, which builds only part of the document tree in memory.
Our intention was to find the right components for our high-performance web service security gateway, so that it could run on a small dedicated appliance. The limited resources of such a device brought the C tests into the game, since the Java virtual machine alone already needs a lot of memory. Object model parsers are the most important parser type in the context of web service security because they can be used to alter an XML document in memory. In this first part of a two-part series, we present our benchmark results for event-driven parsers like SAX and StAX, because these are used by the object model parsers and therefore determine their performance to a large extent. First, we will give you a quick overview of the XML parser jungle.
Recap of XML Parser Types
Generally, there are two types of XML parsers. First are the push- and pull-parsers that simply read an XML document and return the data and structure of the document (e.g., SAX and StAX). Both are event-driven parsers because they return events that the developer has to handle. Push-parser implementations like SAX (Simple API for XML) deliver the data of the whole document in one stream and cannot be stopped, short of a workaround such as throwing an exception from the handler in Java. Pull-parsers, on the other hand, only return data when they are asked to read the next node in a document. StAX (Streaming API for XML) is a pull-parser specification for Java defined in JSR 173.
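To make the difference between the two event-driven models concrete, here is a minimal sketch that uses only the SAX and StAX APIs shipped with the JDK. The class name, the method names, and the sample document are hypothetical and chosen for illustration; this is not code from our benchmark suite.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class, not part of XMLTest.
class PushVsPullSketch {

    // Push model (SAX): the parser drives and reports every element
    // to the handler until the end of the document.
    static int countElementsWithSax(String xml) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                count[0]++;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    // Pull model (StAX): the caller asks for each event and simply stops
    // asking once the target element is found. Returns the number of start
    // tags pulled so far, or -1 if the target never appears.
    static int elementsPulledUntil(String xml, String targetName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                int pulled = 0;
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        pulled++;
                        if (targetName.equals(reader.getLocalName())) {
                            return pulled;
                        }
                    }
                }
                return -1;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<order><id>7</id><item/><item/></order>";
        System.out.println(countElementsWithSax(xml));      // 4
        System.out.println(elementsPulledUntil(xml, "id")); // 2
    }
}
```

The SAX handler sees all four elements whether it wants them or not, while the StAX loop returns after the second start tag and never asks for the rest of the document.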
The second type of XML parsers are object model parsers (e.g., DOM and Apache AXIOM), which not only read the data but also construct an in-memory representation of the document that can be altered. Since DOM parsers mostly use SAX parsers to read in documents, and a SAX parser always delivers the whole document, the object model is always built completely. This limits performance if only data at the beginning of a document needs to be read and altered. Newer approaches like Apache's AXIOM use StAX pull-parser implementations to overcome this limitation. AXIOM builds the tree representation of a document only up to the last node that was requested. Therefore, it does not need to read the complete document.
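As an illustration of what "altering a document in memory" means with an object model parser, the following sketch uses the DOM API bundled with the JDK to parse a document, insert a new first child under the root, and serialize the result. The class, method, and element names are hypothetical. Note that the whole tree is built before the first node can be touched, which is exactly the limitation described above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical illustration class, not part of the benchmarks.
class DomAlterSketch {

    // Parses the document into a complete DOM tree, inserts a new element
    // as the first child of the root, and returns the serialized result.
    static String insertFirstChild(String xml, String childName, String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            Element root = doc.getDocumentElement();
            Element child = doc.createElement(childName);
            child.setTextContent(text);
            // insertBefore appends if the root has no children yet.
            root.insertBefore(child, root.getFirstChild());

            // An identity transform is the standard JAXP way to serialize DOM.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Sketch only: collapse the various checked JAXP exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                insertFirstChild("<order><item/></order>", "token", "abc"));
    }
}
```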
In this first part of the series we will talk about the performance of the reading parsers. Since these parsers are used by the object model parsers to read in the data, we can already make assumptions about the performance of the corresponding object model parsers.
The Tested Event-Driven Parsers
- LIBXML2 Stream Pull-Parser + SAX-like 2.6.27 (C): LIBXML2 is a C library that provides several APIs for XML processing and manipulation. Besides a DOM-like implementation it also provides a streaming pull-parser and a SAX-like interface. The latter is used to read in the data for the DOM-like parser.
- Java 1.5 Default SAX (Java): This parser is the default SAX parser in Java 1.5.
- Woodstox StAX Pull-Parser 3.1.0 (Java): Woodstox is a JSR173 conforming StAX parser implementation. It was created by the open source community Codehaus and is tightly coupled with its SOAP engine, XFire.
- Sun SJSXP StAX Pull-Parser 1.0-b26 (Java): The SJSXP is Sun's implementation of the JSR173 StAX specification. It is shipped with the Java 6 SDK.
- BEA StAX implementation 1.1.2 (Java): This is a JSR173 implementation by BEA.
- Javolution StAX-like implementation 4.0 (Java): Javolution is an open source project that aims at enhancing the performance of the Java base library. It provides a StAX-like XML parser that does not fully comply with JSR 173.
- Oracle StAX implementation XDK 10.1.0.1 (Java): A JSR 173 implementation by Oracle.
The Test Environment
The main benchmark tool that we used is a modified version of Sun's XMLTest. It lets you define test suites that describe which parsers are tested with which documents. On execution, it measures how many documents a parser processes in a specified period of time and calculates the throughput per second from that. Most of our modifications involved including the external C benchmarks in the tests. Those benchmarks were inspired by the xmlbench benchmark tool, which is licensed under the GNU General Public License. All C tests in this part of the series were compiled with the GNU GCC compiler. Each Java benchmark was executed with the -server option, which selects the HotSpot server VM and its more aggressive optimizations. All tests were run on a Fujitsu Siemens S Series notebook with a 1.70 GHz Intel processor and 1 GB RAM.
The Benchmark Execution
The aim of our benchmarks was to measure how many documents a parser can process in a given time. Processing means that the parser walks through the whole document and counts the number of elements, the number of attributes, and the length of the text content. This way we were able to verify that each parser performs the same walk through the document. We measured the throughput for 15 seconds, after a 5-second warm-up phase, for each parser. XMLTest presents the results as bar charts with the throughput per second on the Y axis and the different parsers on the X axis. The XML documents we used are the purchase order XML files that are provided with XMLTest. These documents have a maximum depth of 6 and an almost equal number of elements and attributes.
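The measurement procedure described above can be sketched roughly as follows. This is not the XMLTest code. It is a simplified, single-threaded approximation that uses the JDK's StAX implementation, and all class and method names are hypothetical. Durations are parameters, so the 5-second warm-up and 15-second measurement would be passed as 5000 and 15000 milliseconds.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class, not part of XMLTest.
class ThroughputSketch {

    // Create the factory once; looking it up for every document would
    // distort the measurement.
    static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    static final class Counts {
        int elements;
        int attributes;
        long textLength;
    }

    // Walks the whole document and counts elements, attributes,
    // and the total length of character data.
    static Counts walk(String xml) {
        Counts c = new Counts();
        try {
            XMLStreamReader r = FACTORY.createXMLStreamReader(new StringReader(xml));
            try {
                while (r.hasNext()) {
                    switch (r.next()) {
                        case XMLStreamConstants.START_ELEMENT:
                            c.elements++;
                            c.attributes += r.getAttributeCount();
                            break;
                        case XMLStreamConstants.CHARACTERS:
                        case XMLStreamConstants.CDATA:
                            c.textLength += r.getTextLength();
                            break;
                        default:
                            break;
                    }
                }
            } finally {
                r.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return c;
    }

    // Parses the document repeatedly until the time budget is used up
    // and returns the number of documents processed (at least one).
    static int runFor(String xml, long millis) {
        long deadline = System.nanoTime() + millis * 1_000_000L;
        int documents = 0;
        do {
            walk(xml);
            documents++;
        } while (System.nanoTime() < deadline);
        return documents;
    }

    // Warm-up run (result discarded), then a timed run.
    static double throughputPerSecond(String xml, long warmUpMillis, long measureMillis) {
        runFor(xml, warmUpMillis);
        long start = System.nanoTime();
        int documents = runFor(xml, measureMillis);
        double seconds = (System.nanoTime() - start) / 1e9;
        return documents / seconds;
    }

    public static void main(String[] args) {
        String xml = "<po id=\"1\" date=\"x\"><item sku=\"a\">pen</item>"
                + "<item sku=\"b\">ink</item></po>";
        System.out.println(throughputPerSecond(xml, 5000, 15000) + " documents/s");
    }
}
```

Comparing the `Counts` values across parsers is what allows a check that every parser really visited the same nodes. The warm-up run gives the JIT compiler time to optimize the hot parsing loop before the timed run starts.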
The Event Parser Benchmarks
First, we will show you the benchmark results for the push-parsers and one StAX implementation.
Figure 1: Benchmark results for event-driven XML parsers and small documents
Figure 2: Benchmark results for event-driven XML parsers and large documents
In Figures 1 and 2 you can see that the LIBXML2 SAX-like parser (red) does much better than all other implementations. This implies that the LIBXML2 object model parser has an advantage over the other implementations because it uses the LIBXML2 SAX-like parser to read in the documents. Unfortunately, the LIBXML2 SAX-like parser has a very complex interface, and, as with most C XML parsers, memory management demands a great deal of attention. In second place is the Woodstox StAX implementation (yellow). The LIBXML2 stream pull-parser (blue) and the Java 1.5 default SAX implementation (green) show almost even results.