A Performance Comparison of 6 Stream Oriented XML Parsers

Introduction

This article compares the performance of six implementations of a single program that processes XML. Each implementation uses a different XML parser, and four different languages are used: C, Java, Perl, and Python. Only a single parser is tested for Perl and Python, but two parsers each are tested for C and Java. All the parsers are stream oriented: an application must provide callbacks (or their equivalent) in order to receive information back from the parser. While some of them have a validating mode, all the parsers were run as non-validating parsers. When I say that a single program was implemented six times, I mean that each implementation produces (or should produce) exactly the same output for a given input document. As long as that constraint was met, I attempted to write each in the most efficient manner for the given language and parser.

Executive Summary

There aren't many surprises here. The C parsers (especially expat) are very fast, the script-language parsers are slow, and the Java parsers occupy a middle ground for larger documents. However, for smaller documents (less than 0.5 MB), the Perl and Python parsers are actually faster than either Java parser tested.

These tests only measure execution performance. Sometimes programmer performance is more important. I have no numbers, but I can report that for ease of implementation, the Perl and Python programs were easiest to write, the Java programs less so, and the C programs were the most difficult.

Full Disclosure

But first, let me come clean. I'm the maintainer of one of the parsers measured here, the perl module XML::Parser. I have an interest in making it look good. But I'm providing here everything I used to come up with my numbers. So you're welcome to download what I've got and try it out for yourself. Also, since I'm more experienced in Perl and C than Java and Python, gurus of those two languages may want to comb through the implementations written in them, checking for newbie mistakes.

What motivated me to run this experiment was a discussion on the performance (or lack thereof) of XML::Parser on the perl-xml mailing list. I asked the question, "How does XML::Parser compare to the competition?" Either I got no answers or the answers were comparing apples and oranges, in my opinion. (For instance, comparing XML::Parser to the Unix grep utility.) So I decided to take one of the sample programs contained in the XML::Parser distribution, xmlstats, and implement it using different parsers.

The humble xmlstats program

I described this program in my September article at this site. It produces a top down statistical report on the elements in an XML document: the number of occurrences, breakdown of the number of children, parents, and attributes, and character count. I think this is a good exercise for a parser since:

  1. it does something useful
  2. it processes every start tag and piece of text
  3. yet it doesn't do too much work beyond parsing.

In any case it was at hand and I didn't have to invent some other kind of made-up application.

The main work that this program has to do outside of parsing is to order the elements in a top down fashion, while accounting for the fact that the element containment graph may have cycles.
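One way to get a top down ordering that tolerates cycles is a breadth-first walk from the root element that visits each element name only once. The Python sketch below is an assumption about how this could be done, not the code used in xmlstats; the function name `top_down_order` and the `containment` mapping are hypothetical.

```python
from collections import deque

def top_down_order(root, children):
    """Return element names in top down order, visiting each name once.

    `children` maps an element name to the names seen directly inside it.
    Tracking visited names keeps the walk finite even when the containment
    graph has cycles (for example, a list item that may contain a list).
    """
    order = []
    seen = {root}
    queue = deque([root])
    while queue:
        name = queue.popleft()
        order.append(name)
        # Sort so the result does not depend on set iteration order.
        for child in sorted(children.get(name, ())):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order

# "p" may contain "list", whose "item" may again contain "p": a cycle.
containment = {
    "spec": {"div", "p"},
    "div": {"p", "list"},
    "p": {"list"},
    "list": {"item"},
    "item": {"p"},
}
print(top_down_order("spec", containment))
# prints ['spec', 'div', 'p', 'list', 'item']
```

Each element appears at the shallowest depth at which it was seen, and the `seen` set is what stops the walk from looping around the cycle.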

Test Environment

I ran these tests on my Gateway Solo 5150 laptop with Red Hat 5.2 Linux installed. This is a Pentium II machine with 64 MB of RAM. The cpuinfo reports that it tests at 232.65 bogomips.

The C compiler is gcc 2.7.2.3, the one that came with Red Hat 5.2. I'm using the pre-release version 1.2 Java Development Kit from Blackdown. The Perl version is 5.005_02, and Python 1.5.1 is installed on my machine.

Test Documents

All the test documents descended from REC-xml-19980210.xml, the XML version of the XML specification. The only change from it in REC.xml was the removal of the system identifier, spec.dtd, from the DOCTYPE declaration. Some of the parsers wanted it to be there if you declared it, even in non-validating mode.

The other documents are mechanically expanded versions of REC.xml. In the case of med.xml and big.xml, the contents of the root element were just repeated, 8 and 32 times respectively. Of course, this would make them invalid even without an element model since we've repeated id attributes. (Assuming attributes named "id" are meant to be ID attributes.) But they're still well-formed.

In the case of chrmed.xml and chrbig.xml, just the text contents were repeated, 8 and 32 times respectively. This was accomplished with the use of the scale.pl perl script. Because of the way these were generated, they have no prolog and entities in the original document are pre-expanded.

Test document statistics

                  REC       chrmed    med       chrbig    big
size (bytes)      159339    893821    1264240   3417181   5052472
markup density    34%       6%        33%       2%        33%

It would have been interesting to see how a document closer to RDF would have fared, but I ran out of time. I'm hoping that this article will instigate other benchmarks that look at things like RDF.

Methodology and Results

The performance of an implementation on each test case was measured using the Unix time command, with output being sent to the /dev/null data sink. Each case was measured 3 times and the average was taken. These measurements didn't start until after each implementation had a chance to position itself into the physical memory working set.

This command delivers 3 timing numbers: processor seconds allocated to the process (user time), operating system processor seconds spent in service of the process (sys time), and actual elapsed time (real time).
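The same three numbers can be gathered and averaged from a script. The sketch below is a Python illustration of the procedure described above, not the Perl script used for this article; the function name `time_command` and the sample command are assumptions, and it relies on the Unix-only `resource` module.

```python
import resource
import subprocess
import sys
import time

def time_command(argv, runs=3):
    """Average user, sys, and real seconds over `runs` executions of argv."""
    # One untimed run first, so the program has a chance to be loaded
    # into memory before the measurements start.
    subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)

    totals = [0.0, 0.0, 0.0]
    for _ in range(runs):
        before = resource.getrusage(resource.RUSAGE_CHILDREN)
        start = time.perf_counter()
        # Output is discarded, as with redirecting to /dev/null.
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
        real = time.perf_counter() - start
        after = resource.getrusage(resource.RUSAGE_CHILDREN)
        totals[0] += after.ru_utime - before.ru_utime   # user time
        totals[1] += after.ru_stime - before.ru_stime   # sys time
        totals[2] += real                               # real time
    return tuple(total / runs for total in totals)

# Hypothetical stand-in for one of the xmlstats implementations.
command = [sys.executable, "-c", "print(sum(range(100000)))"]
user, sys_time, real = time_command(command)
print(f"user={user:.3f} sys={sys_time:.3f} real={real:.3f}")
```

Taking the difference of `RUSAGE_CHILDREN` readings before and after each run isolates the processor time charged to that one child process.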

To avoid errors due to hand transcription, the entire test process was automated using this perl script. While this script was running, no other activity was demanded from my laptop.

XML Parser Performance Statistics (times in seconds)

           | REC                  | chrmed               | med                  | chrbig               | big
           |   user    sys   real |   user    sys   real |   user    sys   real |   user    sys   real |   user    sys   real
C-Expat    |  0.060  0.000  0.050 |  0.113  0.007  0.110 |  0.360  0.023  0.380 |  0.303  0.040  0.340 |  1.420  0.070  1.480
C-Rxp      |  0.093  0.017  0.100 |  0.297  0.030  0.320 |  0.707  0.037  0.740 |  0.947  0.123  1.060 |  2.733  0.207  2.937
Java-xp    |  2.220  0.183  2.400 |  2.497  0.200  2.693 |  4.490  0.230  4.770 |  3.783  0.220  4.010 | 12.367  0.220 12.587
Java-xml4j |  2.730  0.197  3.033 |  3.173  0.193  3.470 |  6.400  0.230  6.770 |  4.997  0.247  5.280 | 18.913  0.280 19.230
Perl       |  1.327  0.013  1.413 |  3.390  0.033  3.420 |  8.260  0.050  8.410 | 10.610  0.090 10.750 | 32.100  0.063 32.357
Python     |  1.627  0.023  1.650 |  4.777  0.027  4.797 | 12.140  0.047 12.183 | 15.853  0.043 15.893 | 48.217  0.093 48.473

The competitors

Four of the parsers are either written by or based on the work of James Clark. James wrote the Java xp parser and the C expat parser. Both the perl and python parsers used here are based on expat.

Trying to get each implementation to produce exactly the same results exposed some bugs in the original xmlstats program. One of those bugs was that xmlstats was counting UTF-8 bytes instead of characters when it reported character statistics. Another was that it didn't count all the whitespace that was part of mixed content. So the xmlstats program here is not the same as the one in the XML::Parser distribution.
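The byte-versus-character distinction is easy to see with any text that contains non-ASCII characters. The following Python sketch is only an illustration of the difference, not code from any of the xmlstats implementations; the helper `utf8_char_count` is hypothetical and shows the kind of counting a program has to do when it receives raw UTF-8 bytes.

```python
def utf8_char_count(data):
    """Count characters in UTF-8 bytes by skipping continuation bytes.

    In UTF-8, every byte of the form 10xxxxxx continues a multi-byte
    character, so only the other bytes start a new character.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)

text = "na\u00efve caf\u00e9"      # 10 characters, two of them non-ASCII
encoded = text.encode("utf-8")

print(len(text))                 # prints 10 (characters)
print(len(encoded))              # prints 12 (UTF-8 bytes)
print(utf8_char_count(encoded))  # prints 10
```

For pure ASCII input the two counts agree, which is why a bug like this can go unnoticed until the implementations are compared on a document with non-ASCII text.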

C-Expat

While two of the other parsers in this test are built on top of James Clark's expat parser, this example uses it directly. I'm using the test version of expat, identified as "Version 19990307".

Although there is a hashtable implementation in expat, it's not part of the public interface. So I have included, in the Util directory, a hash implementation that I created several years ago. It draws heavily on Larry Wall's implementation of hashes in Perl. This is the first public distribution of this package.

C-Rxp

Rxp is a parser produced by Richard Tobin of the Language Technology Group at the University of Edinburgh. I'm using version 1.0 that I obtained from Richard's RXP Web site.

Rxp is stream oriented, but instead of using callbacks, the application drives the main loop, asking the parser to give it the "next bit" it recognizes. Different flavors of bits are associated with different kinds of markup.
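To illustrate this application-driven style, here is a small Python sketch. It uses `xml.dom.pulldom` from Python's standard library as a stand-in; it is not Rxp's API, and the sample document is made up for the example.

```python
from xml.dom import pulldom

doc = "<spec><p>Hello <em>XML</em></p><p>Bye</p></spec>"

counts = {}
# The application drives the loop and asks for the next event,
# instead of registering handlers for the parser to call.
for event, node in pulldom.parseString(doc):
    if event == pulldom.START_ELEMENT:
        counts[node.tagName] = counts.get(node.tagName, 0) + 1

print(counts)
# prints {'spec': 1, 'p': 2, 'em': 1}
```

Because the loop belongs to the application, state such as `counts` can live in ordinary local variables rather than in data shared between callbacks.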

Rxp has a validating mode, which wasn't used in this test.

Java-xp

I'm using XP version 0.4. The latest version of XP is available from James Clark's ftp site.

Although xp provides a SAX interface, I chose to use the native xp interface, in which your handlers are supplied as method overrides on the base ApplicationImpl class. I use several utility classes, like TreeMap, that only come with JDK 1.2.

Java-xml4j

This is Version 2.0.2 of IBM's XML parser for Java. I got this off of the CD-ROM that IBM distributed to attendees of Xtech99, but you should be able to download a copy from somewhere on their Alphaworks website. This parser has a validating mode.

This uses a SAX interface. So if you want to look at another SAX based Java parser, you can probably use this with minimal change.

Perl

This is using XML::Parser version 2.23, which hasn't been released yet as I'm writing this, but may very well be released by the time you read this. The latest version of XML::Parser may be obtained from CPAN.

XML::Parser is built on top of James Clark's expat parser.

Python

I'm using Pyexpat, another parser based on the expat parser. There are several XML parsers for Python, many of which you get from the xml package put together by the XML special interest group at python.org. I am using the pyexpat parser from the 0.5 distribution of that package.
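For readers who have not seen the callback style, the sketch below shows how handlers are attached to an expat-based parser in Python. It uses `xml.parsers.expat`, the interface in Python's current standard library; the pyexpat interface in the 0.5 package may have differed, and the `Stats` class and `collect` function are hypothetical names that cover only a small part of what xmlstats reports.

```python
import xml.parsers.expat

class Stats:
    """Collect per-element occurrence and character counts."""

    def __init__(self):
        self.count = {}   # element name -> occurrences
        self.chars = {}   # element name -> characters directly inside it
        self.stack = []   # names of the currently open elements

    def start(self, name, attrs):
        self.count[name] = self.count.get(name, 0) + 1
        self.stack.append(name)

    def end(self, name):
        self.stack.pop()

    def text(self, data):
        # expat may deliver one run of text in several pieces,
        # so the lengths are accumulated rather than assigned.
        if self.stack:
            current = self.stack[-1]
            self.chars[current] = self.chars.get(current, 0) + len(data)

def collect(document):
    stats = Stats()
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = stats.start
    parser.EndElementHandler = stats.end
    parser.CharacterDataHandler = stats.text
    parser.Parse(document, True)   # True marks the final chunk of input
    return stats

stats = collect("<spec><p>Hello <em>XML</em></p><p>Bye</p></spec>")
print(stats.count)   # prints {'spec': 1, 'p': 2, 'em': 1}
print(stats.chars)   # prints {'p': 9, 'em': 3}
```

The stack of open elements is what lets the character handler attribute text to the element that directly contains it.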