Constructing the XML Parser Benchmark

May 5, 1999

This section describes in detail how I went about testing parser performance in C, Java, Perl, and Python on a Linux system. If you are so inclined, run the test yourself on your own system.

The humble XMLstats program

I described this program in my September article on this site. It produces a top-down statistical report on the elements in an XML document: the number of occurrences; a breakdown of the number of children, parents, and attributes; and character count. I think this is a good exercise for a parser since:

It does something useful.
It processes every start tag and piece of text.
It doesn't do too much work beyond parsing.

In any case, it was at hand and I didn't have to invent some other kind of made-up application.
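
To give a feel for the kind of tallying the report needs, here is a minimal sketch in Python using the standard library's Expat binding. It is not the XMLstats source; the names ElementStats and collect_stats are made up for illustration, and the character count covers only text directly inside each element.

```python
import xml.parsers.expat
from collections import defaultdict


class ElementStats:
    """Per-element tallies (hypothetical structure, for illustration only)."""

    def __init__(self):
        self.count = 0                    # occurrences of this element
        self.chars = 0                    # characters of directly contained text
        self.attrs = defaultdict(int)     # attribute name -> times seen
        self.children = defaultdict(int)  # child element name -> times seen
        self.parents = defaultdict(int)   # parent element name -> times seen


def collect_stats(data):
    """Parse an XML document and return {element name: ElementStats}."""
    stats = defaultdict(ElementStats)
    stack = []  # names of the currently open elements

    def start(name, attrs):
        info = stats[name]
        info.count += 1
        for attr_name in attrs:
            info.attrs[attr_name] += 1
        if stack:
            parent = stack[-1]
            info.parents[parent] += 1
            stats[parent].children[name] += 1
        stack.append(name)

    def end(name):
        stack.pop()

    def chars(text):
        # Text arrives as a Python str, so len() counts characters, not bytes.
        if stack:
            stats[stack[-1]].chars += len(text)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(data, True)
    return stats
```

Every start tag and every piece of text passes through one of the three handlers, which is what makes this a reasonable workload for exercising a parser.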

The main work this program has to do outside of parsing is to order the elements top-down while accounting for the fact that element containment graphs may have cycles.
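
One way to get a top-down ordering that survives cycles is a depth-first walk from the root that remembers which elements it has already listed. The sketch below is an assumption about how this could be done, not a transcription of XMLstats; the function name top_down_order and the shape of the children mapping are hypothetical.

```python
def top_down_order(children, root):
    """Return the element names reachable from root, each one listed
    before the children first discovered through it.

    children maps an element name to the set of names it can contain.
    """
    order = []
    seen = set()

    def visit(name):
        if name in seen:
            return  # already listed; this also breaks cycles like p -> em -> p
        seen.add(name)
        order.append(name)
        for child in sorted(children.get(name, ())):
            visit(child)

    visit(root)
    return order


containment = {
    "spec": {"header", "body"},
    "header": {"p"},
    "body": {"p"},
    "p": {"em"},
    "em": {"p"},  # em may contain p again, so the graph has a cycle
}
print(top_down_order(containment, "spec"))
# prints: ['spec', 'body', 'p', 'em', 'header']
```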

Test environment

I ran these tests on my Gateway Solo 5150 laptop with Red Hat 5.2 Linux installed. This is a Pentium II machine with 64 megabytes of RAM. /proc/cpuinfo reports that it tests at 232.65 BogoMIPS.

The C compiler is GCC 2.7.2.3, the one that came with Red Hat 5.2. I'm using the pre-release version 1.2 Java Development Kit from Blackdown. The Perl version is 5.005_02, and Python 1.5.1 is installed on my machine.

Test documents

All the test documents are descended from REC-xml-19980210.xml, the XML version of the XML specification. The only change made to produce REC.xml was the removal of the system identifier, spec.dtd, from the DOCTYPE declaration. Some of the parsers wanted that file to be there if you declared it, even in non-validating mode.

The other documents are mechanically expanded versions of REC.xml. In the case of med.xml and big.xml, the contents of the root element were simply repeated, 8 and 32 times respectively. Of course, this would make them invalid even without an element model, since the ID attributes are repeated (assuming attributes named "id" are meant to be ID attributes). But they're still well-formed.

In the case of chrmed.xml and chrbig.xml, just the text contents were repeated, 8 and 32 times respectively. This was accomplished with the use of the scale.pl Perl script. Because of the way these were generated, they have no prolog and entities in the original document are pre-expanded.
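
The idea behind the text-only expansion can be sketched in Python as follows. This is not scale.pl; the function name scale_text is hypothetical, and the sketch assumes it is enough to re-emit start tags, end tags, and character data. Because only those three event types are written out, the prolog is dropped and entity references in text come out expanded, which matches the properties noted above.

```python
import xml.parsers.expat
from xml.sax.saxutils import escape, quoteattr


def scale_text(data, factor):
    """Return a copy of the document in which every run of character
    data is repeated `factor` times; the markup is emitted once."""
    out = []

    def start(name, attrs):
        attr_text = "".join(
            " %s=%s" % (key, quoteattr(value)) for key, value in attrs.items()
        )
        out.append("<%s%s>" % (name, attr_text))

    def end(name):
        out.append("</%s>" % name)

    def chars(text):
        # Re-escape &, < and > so the output stays well-formed.
        out.append(escape(text) * factor)

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver contiguous text in one piece where possible
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(data, True)
    return "".join(out)
```

For example, scale_text('<a x="1">hi<b>yo</b></a>', 2) returns '<a x="1">hihi<b>yoyo</b></a>'.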

Test document statistics

	REC	chrmed	med	chrbig	big
size (bytes)	159339	893821	1264240	3417181	5052472
markup density	34%	6%	33%	2%	33%
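
The table does not say how markup density was computed. Assuming it means the share of a file's bytes that are not character data, a rough Python sketch looks like this. The function name markup_density is hypothetical, and the input is assumed to be UTF-8 encoded bytes.

```python
import xml.parsers.expat


def markup_density(data):
    """Fraction of the document's bytes that are not character data.

    Assumes `data` is UTF-8 encoded bytes. An entity reference such as
    &amp; counts as one byte of text and four bytes of markup here.
    """
    text_bytes = 0

    def chars(text):
        nonlocal text_bytes
        text_bytes += len(text.encode("utf-8"))

    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chars
    parser.Parse(data, True)
    return (len(data) - text_bytes) / len(data)
```

Under this reading, repeating only the text of a document pushes the density down sharply, while repeating the whole root content leaves it nearly unchanged, which is the pattern the table shows.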

It would have been interesting to see how a document closer to RDF would have fared, but I ran out of time. I'm hoping that this article will instigate other benchmarks that look at things like RDF.

Methodology and results

The performance of an implementation on each test case was measured using the Unix time command, with output sent to the /dev/null data sink. Each case was measured three times and the results were averaged. Measurements didn't start until each implementation had had a chance to position itself into the physical memory working set.

This command delivers three timing numbers: processor seconds allocated to the process (user time), operating system processor seconds spent in service of the process (sys time), and actual elapsed time (real time).

To avoid errors due to hand transcription, the entire test process was automated using this Perl script. While this script was running, no other activity was demanded from my laptop.
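
For readers who would rather not use Perl, the measurement procedure can be approximated in Python. This is a sketch under stated assumptions, not the script used for these results: the function name time_command is hypothetical, and a single unmeasured warm-up run stands in for letting the program settle into memory. It relies on os.times(), which reports child process times on Unix systems.

```python
import os
import subprocess
import time


def time_command(argv, runs=3):
    """Run argv once as a warm-up, then `runs` more times, and return
    the mean user, sys, and real seconds per run."""

    def run_once():
        # Discard standard output, like redirecting to /dev/null.
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)

    run_once()  # warm-up run, not measured

    totals = {"user": 0.0, "sys": 0.0, "real": 0.0}
    for _ in range(runs):
        before = os.times()
        started = time.perf_counter()
        run_once()
        elapsed = time.perf_counter() - started
        after = os.times()
        totals["user"] += after.children_user - before.children_user
        totals["sys"] += after.children_system - before.children_system
        totals["real"] += elapsed
    return {key: value / runs for key, value in totals.items()}
```

A call such as time_command(["perl", "xmlstats.pl", "REC.xml"]) would return a dictionary with the three averaged numbers; the script name in that call is only a placeholder.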

XML parser performance statistics

The competitors

Four of the parsers are either written by or based on the work of James Clark. Clark wrote the Java XP parser and the C-Expat parser. Both the Perl and Python parsers used here are based on Expat.

Trying to get each implementation to produce exactly the same results exposed some bugs in the original XMLstats program. One of those bugs was that XMLstats was counting UTF-8 bytes instead of characters when it reported character statistics. Another was that it didn't count all the white space that was part of mixed content. So the XMLstats program here is not the same as the one in the XML::Parser distribution.
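
The difference between counting bytes and counting characters only shows up with non-ASCII text, which is why it is easy to miss. A short Python illustration (the sample string is arbitrary):

```python
text = "na\u00efve caf\u00e9"           # "naïve café", 10 characters

char_count = len(text)                   # counts characters
byte_count = len(text.encode("utf-8"))   # counts UTF-8 bytes

print(char_count, byte_count)            # prints: 10 12
```

The two accented letters each take two bytes in UTF-8, so a byte count overstates the character count by two here.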

C-Expat

While two of the other parsers in this test are built on top of James Clark's C-Expat parser, this example uses it directly. I'm using the test version of Expat, identified as "Version 19990307."

Although there is a hash table implementation in Expat, it's not part of the public interface. So I have included, in the Util directory, a hash implementation that I created several years ago. It draws heavily on Larry Wall's implementation of hashes in Perl. This is the first public distribution of this package.

RXP

RXP is a parser produced by Richard Tobin of the Language Technology Group at the University of Edinburgh. I'm using version 1.0, which I obtained from Richard's RXP web site.

RXP is stream-oriented, but instead of using callbacks, the application drives the main loop, asking the parser to give it the "next bit" it recognizes. Different flavors of bits are associated with different kinds of markup.
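
To illustrate the general style, in which the application's own loop asks for the next event rather than registering callbacks, here is a Python sketch using the standard library's XMLPullParser. It is only an analogy; it is not RXP's API, and the function name count_start_tags is hypothetical.

```python
from xml.etree.ElementTree import XMLPullParser


def count_start_tags(data):
    """Drive the event loop from application code and count start tags."""
    parser = XMLPullParser(events=("start", "end"))
    parser.feed(data)
    parser.close()

    counts = {}
    # The application, not the parser, decides when to take the next event.
    for event, element in parser.read_events():
        if event == "start":
            counts[element.tag] = counts.get(element.tag, 0) + 1
    return counts
```

With this arrangement the per-element state can live in ordinary local variables of the loop instead of being threaded through handler functions.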

RXP has a validating mode, which wasn't used in this test.

Java-XP

I'm using XP version 0.4. The latest version of XP is available from James Clark's FTP site.

Although XP provides a SAX interface, I chose to use the native XP interface, in which your handlers are supplied as method overrides on the base ApplicationImpl class. I use several utility classes, like TreeMap, that only come with JDK 1.2.

Java-XML4J

This is Version 2.0.2 of IBM's XML parser for Java. I got this from the CD that IBM distributed to attendees of Xtech99, but you should be able to download a copy from somewhere on their alphaWorks web site. This parser has a validating mode.

This uses a SAX interface. So if you want to look at another SAX-based Java parser, you can probably use this with minimal change.
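
The portability argument for SAX is that the handler knows nothing about the parser behind it. The sketch below shows the same idea with Python's xml.sax module rather than Java, to keep it short; it is not the benchmark's Java code, and the names ElementCounter and count_with_sax are hypothetical.

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Counts start tags and characters; independent of the parser used."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.chars = 0

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

    def characters(self, content):
        self.chars += len(content)


def count_with_sax(data):
    """Parse XML bytes with the default SAX parser and return the handler."""
    handler = ElementCounter()
    xml.sax.parseString(data, handler)
    return handler
```

Swapping in a different SAX parser means changing how the parser is created, while the handler class stays as it is.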

Perl

This uses XML::Parser version 2.23, which hasn't been released as I write this but may well be out by the time you read it. The latest version of XML::Parser may be obtained from CPAN.

XML::Parser is built on top of James Clark's Expat parser.

Python

I'm using Pyexpat, another parser based on the Expat parser. There are several XML parsers for Python, many of which you get from the XML package put together by the XML special interest group at Python.org. I am using the Pyexpat parser from the 0.5 distribution of that package.

Conclusion

I'm hoping that other folks will take the package and try it out in different environments. (Perhaps someone will run it on NT and create an implementation for Microsoft's parser.) Also I'm hoping that others will contribute other kinds of tests, which we can post on XML.com. I'd like to see a test that was closer to XML's use for data communication. I expect in such cases that the markup density would be very high.

I've provided the programs and data files used in this benchmark for anyone to download. (Xmlbench.tar.gz, 1.98 megabytes in size. It is a tar file archive compressed with GNU zip.)