XML Parser Benchmarks: Part 2

May 16, 2007

In part 1 of this series we showed you the results of our event-driven parser benchmarks. Those benchmarks showed that the LIBXML2 SAX-like parser in C is superior to the other tested parsers. The two Java pull-parser implementations, Javolution and Woodstox, followed in second place.

In this part of the series we show how the object model parsers performed in our tests. Object model parsers use event parsers to read in the data. These benchmarks were of special interest for our high performance web service security gateway, because most web services security operations require that at least the header of a SOAP message is read and altered. This in-memory altering can only be done by object model parsers such as DOM implementations. The results for the AXIOM implementations are also very interesting in this context. They use a pull-parser to build the in-memory representation of an XML document only up to the last node that needs to be read or altered. This has the advantage that the whole document does not need to be read into memory.
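
To make the "read and alter the header" use case concrete, here is a minimal sketch using the JDK's built-in DOM and identity transformer. It is not the code used in our benchmarks. The class name SoapHeaderEdit, the method addHeaderEntry, the demo prefix and the urn:example:demo namespace are all hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class SoapHeaderEdit {
    // SOAP 1.1 envelope namespace
    static final String SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/";

    // Builds the whole message as a DOM tree, appends one element to the
    // SOAP header, and serializes the altered tree back to a string.
    static String addHeaderEntry(String soapXml, String localName, String text) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(soapXml)));

        Element header = (Element) doc.getElementsByTagNameNS(SOAP_NS, "Header").item(0);
        if (header == null) {
            throw new IllegalArgumentException("message has no SOAP header");
        }

        // Hypothetical header entry in a made-up namespace
        Element entry = doc.createElementNS("urn:example:demo", "demo:" + localName);
        entry.setTextContent(text);
        header.appendChild(entry);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String message =
            "<soap:Envelope xmlns:soap=\"" + SOAP_NS + "\">"
          + "<soap:Header/><soap:Body><payload/></soap:Body></soap:Envelope>";
        System.out.println(addHeaderEntry(message, "Processed", "yes"));
    }
}
```

Note that the DOM builder reads the complete message, body included, before the header can be touched. That cost is what the deferred building in AXIOM is meant to avoid.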

The test setup is the same as in Part 1 of this series, except that the AXIOM benchmark in C was compiled with the Microsoft C/C++ compiler. For each parser we measured the throughput in documents per second.
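
As an illustration of what "documents per second" means for an object model parser, the following sketch parses the same in-memory document repeatedly with the default JDK DOM parser, walks the complete tree each time, and divides the number of documents by the elapsed time. It is a simplified assumption of such a harness, not the tool we used; the names ThroughputSketch, walk and documentsPerSecond are hypothetical, and a real measurement would also need JVM warm-up runs.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class ThroughputSketch {
    // Visits the given node and all of its descendants; returns the node count.
    static int walk(Node node) {
        int count = 1;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += walk(child);
        }
        return count;
    }

    // Parses and fully walks the document `iterations` times and returns
    // the measured throughput in documents per second.
    static double documentsPerSecond(byte[] xml, int iterations) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        long visited = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            Document doc = builder.parse(new ByteArrayInputStream(xml));
            visited += walk(doc.getDocumentElement());
        }
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        if (visited == 0) {
            // Using the walk result keeps the traversal from being optimized away
            throw new IllegalStateException("no nodes visited");
        }
        return iterations * 1e9 / elapsedNanos;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<root><item>1</item><item>2</item></root>".getBytes(StandardCharsets.UTF_8);
        System.out.println(documentsPerSecond(xml, 10000) + " documents/s");
    }
}
```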

The following list shows all tested object model parsers.

The Tested Object Model Parsers

LIBXML2 Tree 2.6.27 (C)
LIBXML2 Tree is a DOM-like XML parser. It uses the LIBXML2 SAX-like implementation to read in the XML data.
Java 1.5 Default DOM (Java)
The default DOM implementation in Java 1.5. Uses the default SAX implementation to read in the documents.
Apache AXIOM Java 1.1.2, C 0.96 (Java and C)
AXIOM is an XML object model by Apache. It was developed for Apache's web service engine AXIS2, but it is being developed further as a separate project. Currently there is a Java and a C version of the parser. The Java version uses the Woodstox StAX parser to read in the documents. The C version uses the LIBXML2 stream pull-parser. As already mentioned, AXIOM has the advantage of building the document tree in memory only up to the last node whose data is needed. The whole tree therefore only has to be built when data at the end of the document must be read or altered. The C implementation is currently at version 0.96 and therefore cannot be considered fully stable.
DOM4J 1.6.1 (Java)
DOM4J is an object model parser whose API was specially built for convenient use in the object oriented context of Java.
JDOM 1.0 (Java)
Like DOM4J, this parser was built out of the need for an API that is more convenient to use in an object-oriented context than the W3C DOM specification.
Oracle XDK DOM implementation (C)
This parser from the Oracle XDK (XML Developer's Kit) implements the W3C DOM specification.

Object Model Parser Benchmarks

The following benchmarks show the results for the tested parsers that build a document model in memory. In these benchmarks AXIOM cannot exploit its advantage, because in all tests the whole document is processed.
Figure 1: Benchmark results for the object model parsers for small documents

Figure 1 shows that LIBXML2 is much faster than all other implementations for these three small document sizes. The two AXIOM parsers perform well for very small documents, since they do not seem to have the same overhead that the DOM parsers show. The Java 1.5 default DOM parser is the fastest of the three Java DOM parsers, closely followed by JDOM and dom4j. The Oracle DOM parser seems to have a significant overhead for each document it reads, since it shows the worst performance for small documents.

Figure 2: Benchmark results for object model parsers with medium-sized documents

In the next benchmark for medium-sized documents (Figure 2) LIBXML2 is still ahead of the others for documents up to 455 KB. The Oracle DOM implementation does better as the documents get bigger and catches up to LIBXML2 for documents around 455 KB in size. Both AXIOM implementations do worse with increasing document size. Of the three Java object model parsers the Java 1.5 default DOM parser is always ahead of dom4j, and dom4j always ahead of JDOM.

Figure 3: Benchmark results for object model parsers with large documents

Figure 3 reveals that the AXIOM implementations do significantly worse than all other implementations for large documents. For the 4 MB document the performance of the C implementation of AXIOM drops sharply. LIBXML2 loses its leading position for these document sizes and is overtaken by Java 1.5 DOM, the Oracle parser and dom4j for the 4 MB files.

Partial Document Parsing Benchmark

In the previous benchmarks we tested a complete walk through the documents, in which the AXIOM implementations could not exploit their advantage of building the object tree only up to the last requested node. In the following benchmarks we requested only the first 67 elements of each document. This corresponds, for example, to the use case of checking only the header of a SOAP message.
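
The idea behind reading only the first elements can be sketched with the StAX pull API that ships with the JDK (javax.xml.stream). This is not the AXIOM API and not our benchmark code. It only illustrates that a pull-parser lets the caller stop asking for events once enough elements have been seen, so the rest of the document is never processed. The names PartialReadSketch and firstElements are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class PartialReadSketch {
    // Returns the local names of at most `limit` elements in document order.
    // The reader is not advanced past the last requested element.
    static List<String> firstElements(String xml, int limit) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        try {
            while (names.size() < limit && reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Envelope><Header><Security/></Header><Body><data/></Body></Envelope>";
        // Request only the first 3 elements; Body and data are never pulled
        System.out.println(firstElements(xml, 3));
    }
}
```

A DOM parser given the same input would have to build nodes for the whole document before the first element could be inspected.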

Figure 4: Benchmark results for the reading of only the first 67 elements in small documents

In Figure 4 we can see that the AXIOM implementations cannot exploit their advantage for small documents up to a size of 5 KB. From the 13.5 KB files on, both implementations beat LIBXML2 and Java DOM.

Figure 5: Benchmark results for the reading of only the first 67 elements in large documents

In Figure 5 you can see that the two AXIOM implementations show the same performance for all document sizes, which is expected since they only need to read in the first 67 elements. The other parsers perform worse with growing document size, because they need to build the whole document tree before they can walk through the elements.

Conclusions

Based on the benchmarks presented above, LIBXML2 can be considered the overall performance winner among the object model parsers. It not only performs much better than all other parsers on documents up to 500 KB in size, but also beats the two AXIOM implementations for documents up to 5 KB when only the first part of the document is read. It does especially well for very small documents of about 1 KB, where it is up to 10 times faster than the other implementations. For really big documents above 500 KB, the default Java 1.5 DOM parser and the Oracle DOM parser in C are alternatives.
But as the partial document parsing benchmarks show, it is advisable to evaluate which XML processing use cases you will perform most often. If you find that in most cases you only need to alter parts at the beginning of an XML document, you should consider using the Java AXIOM implementation. Because the AXIOM implementation in C is only at version 0.96 and shows a significant performance drop for large documents, we recommend waiting for future releases of that parser. dom4j does slightly worse than the Java 1.5 default DOM implementation, but has a more convenient API.
Of course, development time also plays a significant role in deciding which parser to choose. With all tested C parsers you have to be very careful not to produce memory leaks, which slows down development. On the other hand, the JDOM and dom4j APIs in particular are very convenient to use.

Together with other benchmarks we performed on security operations such as encryption and signature, the benchmarks in this article made us confident in using the LIBXML2 parser in C, together with C security libraries, for our high performance web service security stack. The C libraries also use less memory than a full-fledged JVM, which is an advantage on the small security appliances that we want to use.

Additional Resources

Java Web services, Part 2: Digging into Axis2: AXIOM by Dennis Sosnoski
Sun's XMLTest XML parser benchmark tool
xmlbench, an XML parser benchmark tool for C parsers