This article compares the performance of six implementations of a single program that processes XML. Each implementation uses a different XML parser, and four different languages are used: C, Java, Perl, and Python. Only a single parser is tested for Perl and Python, but two parsers each are tested for C and Java. All the parsers are stream-oriented: an application must provide callbacks (or their equivalent) in order to receive information back from the parser. While some of them have a validating mode, all the parsers were run as non-validating parsers. When I say that a single program was implemented six times, I mean that each implementation produces (or should produce) exactly the same output for a given input document. As long as that constraint was met, I attempted to write each in the most efficient manner for the given language and parser.
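To make the callback style concrete, here is a minimal sketch using Python's standard `xml.parsers.expat` module. It is not one of the six benchmarked programs, and it is written for a current Python 3 rather than the Python 1.5.1 used in the tests; the function name `count_elements` is made up for illustration.

```python
import xml.parsers.expat
from collections import Counter


def count_elements(xml_bytes):
    """Count element occurrences using expat's callback interface."""
    counts = Counter()

    def start_element(name, attrs):
        # The parser calls this once per start tag as it streams the input.
        counts[name] += 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk
    return counts


print(count_elements(b"<doc><p>one</p><p>two</p></doc>"))
```

The application never sees a document tree. It only learns about the document through the handlers it registered, which is what keeps memory use low on large inputs.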
There aren't many surprises here. The C parsers (especially expat) are very fast, the script-language parsers are slow, and the Java parsers occupy a middle ground for larger documents. However, for smaller documents (less than 0.5 MB), the Perl and Python parsers are actually faster than either Java parser tested.
These tests only measure execution performance. Sometimes programmer performance is more important. I have no numbers, but I can report that for ease of implementation, the Perl and Python programs were easiest to write, the Java programs less so, and the C programs were the most difficult.
But first, let me come clean. I'm the maintainer of one of the parsers measured here, the Perl module XML::Parser. I have an interest in making it look good. But I'm providing here everything I used to come up with my numbers, so you're welcome to download what I've got and try it out for yourself. Also, since I'm more experienced in Perl and C than in Java and Python, gurus of those two languages may want to comb through the implementations written in them, checking for newbie mistakes.
What motivated me to run this experiment was a discussion on the performance (or lack thereof) of XML::Parser on the perl-xml mailing list. I asked the question, "How does XML::Parser compare to the competition?" Either I got no answers or the answers were comparing apples and oranges, in my opinion. (For instance, comparing XML::Parser to the Unix grep utility.) So I decided to take one of the sample programs contained in the XML::Parser distribution, xmlstats, and implement it using different parsers.
I described this program in my September article at this site. It produces a top down statistical report on the elements in an XML document: the number of occurrences, breakdown of the number of children, parents, and attributes, and character count. I think this is a good exercise for a parser since:
In any case it was at hand and I didn't have to invent some other kind of made-up application.
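The bookkeeping such a report needs can be sketched as follows. This is an illustrative Python approximation built on `xml.parsers.expat`, not the xmlstats source; the names `ElementStats` and `collect_stats` are hypothetical, and the real program's report format is not reproduced.

```python
import xml.parsers.expat
from collections import Counter, defaultdict


class ElementStats:
    """Per-element tallies, loosely modelled on what the report lists."""

    def __init__(self):
        self.occurrences = 0
        self.chars = 0
        self.parents = Counter()
        self.children = Counter()
        self.attributes = Counter()


def collect_stats(xml_bytes):
    stats = defaultdict(ElementStats)
    stack = []  # names of the currently open elements

    def start(name, attrs):
        info = stats[name]
        info.occurrences += 1
        info.attributes.update(attrs.keys())
        if stack:
            parent = stack[-1]
            info.parents[parent] += 1
            stats[parent].children[name] += 1
        stack.append(name)

    def end(name):
        stack.pop()

    def chars(data):
        if stack:
            # len() of a Python str counts characters, not UTF-8 bytes.
            stats[stack[-1]].chars += len(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_bytes, True)
    return stats
```

Every kind of event the parser reports (start tags, end tags, attributes, character data) feeds one of the tallies, so the workload exercises most of a parser's interface.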
The main work that this program has to do outside of parsing is to order the elements in a top-down fashion, while accounting for the fact that the element containment graph may have cycles.
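One way to produce such an ordering is a depth-first walk that remembers which elements it has already placed. The sketch below is an assumption about how this could be done, not a transcription of xmlstats; `top_down_order` and the sample element names are invented for the example.

```python
def top_down_order(root, children):
    """Return element names parent-first, visiting each name only once.

    `children` maps an element name to the set of element names that
    appear directly inside it. The `seen` set is what keeps a cycle
    (for example p -> list -> item -> p) from looping forever.
    """
    order = []
    seen = set()
    stack = [root]
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # already placed; this also breaks cycles
        seen.add(name)
        order.append(name)
        # Push in reverse so children are popped in alphabetical order.
        stack.extend(sorted(children.get(name, ()), reverse=True))
    return order


containment = {
    "doc": {"p"},
    "p": {"list"},
    "list": {"item"},
    "item": {"p"},  # closes the cycle
}
print(top_down_order("doc", containment))
```

With this approach an element is listed at the first place the walk reaches it, so recursive content models terminate instead of recursing without end.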
I ran these tests on my Gateway Solo 5150 laptop with Red Hat 5.2 Linux installed. This is a Pentium II machine with 64 MB of RAM. The cpuinfo reports that it tests at 232.65 bogomips.
The C compiler is gcc 2.7.2.3, the one that came with Red Hat 5.2. I'm using the pre-release version 1.2 Java Development Kit from Blackdown. The Perl version is 5.005_02, and Python 1.5.1 is installed on my machine.
All the test documents are descended from REC-xml-19980210.xml, the XML version of the XML specification. The only change made to produce REC.xml was the removal of the system identifier, spec.dtd, from the DOCTYPE declaration. Some of the parsers wanted the DTD to be there if you declared it, even in non-validating mode.
The other documents are mechanically expanded versions of REC.xml. In the case of med.xml and big.xml, the contents of the root element were just repeated, 8 and 32 times respectively. Of course, this would make them invalid even without an element model, since we've repeated id attributes (assuming attributes named "id" are meant to be ID attributes). But they're still well-formed.
In the case of chrmed.xml and chrbig.xml, just the text contents were repeated, 8 and 32 times respectively. This was accomplished with the scale.pl Perl script. Because of the way these were generated, they have no prolog, and entities in the original document are pre-expanded.
| | REC | chrmed | med | chrbig | big |
|---|---|---|---|---|---|
| size (bytes) | 159339 | 893821 | 1264240 | 3417181 | 5052472 |
| markup density | 34% | 6% | 33% | 2% | 33% |
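
The article does not say how markup density was computed. One plausible reading, assumed here, is the share of the file's bytes that are not character data. The Python sketch below estimates that figure with `xml.parsers.expat`; the function name `markup_density` is hypothetical, and the estimate will drift from the true value when the input is not UTF-8 or when entity and character references are expanded.

```python
import xml.parsers.expat


def markup_density(xml_bytes):
    """Estimate the fraction of the input that is markup rather than text.

    Assumes UTF-8 input, so re-encoding the reported character data
    gives back the number of bytes it occupied in the file.
    """
    text_bytes = 0

    def char_data(data):
        nonlocal text_bytes
        text_bytes += len(data.encode("utf-8"))

    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_bytes, True)
    return 1 - text_bytes / len(xml_bytes)


print(round(markup_density(b"<a>xx</a>"), 3))
```

Under this reading, repeating only the text (as in chrmed.xml and chrbig.xml) leaves the markup byte count unchanged while the total grows, which is consistent with the much lower densities in those two columns.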
It would have been interesting to see how a document closer to RDF would have fared, but I ran out of time. I'm hoping that this article will instigate other benchmarks that look at things like RDF.
The performance of an implementation on each test case was measured using the Unix time command, with output being sent to the /dev/null data sink. Each case was actually measured three times and the average was taken. These measurements didn't start until each implementation had had a chance to settle into the physical memory working set.
This command delivers three timing numbers: processor seconds allocated to the process (user time), operating system processor seconds spent in service of the process (sys time), and actual elapsed time (real time).
To avoid errors due to hand transcription, the entire test process was automated using this Perl script. While the script was running, no other activity was demanded from my laptop.
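The measurement procedure can be approximated in a few lines. The following is a Python sketch for Unix-like systems, not the Perl script used for the article; `time_command` and its parameters are invented names, and it relies on the standard `resource` module to read the child's user and sys time.

```python
import resource
import subprocess
import sys
import time


def time_command(argv, runs=3, warmup=1):
    """Average user, sys, and real seconds for a command over several runs."""
    for _ in range(warmup):
        # Untimed run so the program and its input are already in memory.
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)

    totals = {"user": 0.0, "sys": 0.0, "real": 0.0}
    for _ in range(runs):
        before = resource.getrusage(resource.RUSAGE_CHILDREN)
        start = time.perf_counter()
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
        elapsed = time.perf_counter() - start
        after = resource.getrusage(resource.RUSAGE_CHILDREN)
        totals["user"] += after.ru_utime - before.ru_utime
        totals["sys"] += after.ru_stime - before.ru_stime
        totals["real"] += elapsed
    return {key: value / runs for key, value in totals.items()}


# Example: time a trivial child process (a stand-in for a real test program).
print(time_command([sys.executable, "-c", "pass"]))
```

Sending the child's output to the null device, as the original tests did, keeps terminal or disk writes from inflating the real time.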
| Implementation (times in seconds) | REC user | REC sys | REC real | chrmed user | chrmed sys | chrmed real | med user | med sys | med real | chrbig user | chrbig sys | chrbig real | big user | big sys | big real |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C-Expat | 0.060 | 0.000 | 0.050 | 0.113 | 0.007 | 0.110 | 0.360 | 0.023 | 0.380 | 0.303 | 0.040 | 0.340 | 1.420 | 0.070 | 1.480 |
| C-Rxp | 0.093 | 0.017 | 0.100 | 0.297 | 0.030 | 0.320 | 0.707 | 0.037 | 0.740 | 0.947 | 0.123 | 1.060 | 2.733 | 0.207 | 2.937 |
| Java-xp | 2.220 | 0.183 | 2.400 | 2.497 | 0.200 | 2.693 | 4.490 | 0.230 | 4.770 | 3.783 | 0.220 | 4.010 | 12.367 | 0.220 | 12.587 |
| Java-xml4j | 2.730 | 0.197 | 3.033 | 3.173 | 0.193 | 3.470 | 6.400 | 0.230 | 6.770 | 4.997 | 0.247 | 5.280 | 18.913 | 0.280 | 19.230 |
| Perl | 1.327 | 0.013 | 1.413 | 3.390 | 0.033 | 3.420 | 8.260 | 0.050 | 8.410 | 10.610 | 0.090 | 10.750 | 32.100 | 0.063 | 32.357 |
| Python | 1.627 | 0.023 | 1.650 | 4.777 | 0.027 | 4.797 | 12.140 | 0.047 | 12.183 | 15.853 | 0.043 | 15.893 | 48.217 | 0.093 | 48.473 |
Four of the parsers are either written by or based on the work of James Clark. James wrote the Java xp parser and the C expat parser. Both the Perl and Python parsers used here are based on expat.
Trying to get each implementation to produce exactly the same results exposed some bugs in the original xmlstats program. One of those bugs was that xmlstats was counting UTF-8 bytes instead of characters when it reported character statistics. Another was that it didn't count all the whitespace that was part of mixed content. So the xmlstats program here is not the same as the one in the XML::Parser distribution.
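The byte-versus-character discrepancy is easy to reproduce. In this small Python illustration (not taken from any of the test programs), a four-character word occupies five bytes in UTF-8 because the accented letter needs two:

```python
text = "caf\u00e9"  # "café": four characters, the last one is U+00E9
utf8 = text.encode("utf-8")

print(len(text))  # 4 characters
print(len(utf8))  # 5 bytes, since U+00E9 encodes as two bytes in UTF-8
```

A program that tallies the length of the raw byte buffers handed over by the parser will therefore overcount whenever the text contains non-ASCII characters.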
While two of the other parsers in this test are built on top of James Clark's expat parser, the C-Expat implementation uses it directly. I'm using the test version of expat, identified as "Version 19990307".
Although there is a hashtable implementation in expat, it's not part of the public interface. So I have included in the Util directory a hash implementation that I created several years ago. It draws heavily on Larry Wall's implementation of hashes in Perl. This is the first public distribution of this package.
Rxp is a parser produced by Richard Tobin of Language Technology Group at the University of Edinburgh. I'm using version 1.0 that I obtained from Richard's RXP Web site.
Rxp is stream-oriented, but instead of using callbacks, the application drives the main loop, asking the parser to give it the "next bit" it recognizes. Different flavors of bits are associated with different kinds of markup.
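The contrast with callbacks is easier to see in code. The sketch below does not use Rxp or imitate its C API; it uses Python's standard `xml.dom.pulldom` module as a stand-in for the same pull style, and `count_elements_pull` is a made-up name.

```python
from xml.dom import pulldom


def count_elements_pull(xml_text):
    """Count start tags by pulling events from the parser in our own loop."""
    counts = {}
    events = pulldom.parseString(xml_text)
    for event, node in events:
        # The application asks for the next item and decides what to do
        # with it, instead of registering handlers for the parser to call.
        if event == pulldom.START_ELEMENT:
            counts[node.tagName] = counts.get(node.tagName, 0) + 1
    return counts


print(count_elements_pull("<doc><p>one</p><p>two</p></doc>"))
```

Because the loop belongs to the application, state can live in ordinary local variables rather than being shared between separate handler functions.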
Rxp has a validating mode, which wasn't used in this test.
I'm using XP version 0.4. The latest version of XP is available from James Clark's ftp site.
Although xp provides a SAX interface, I chose to use the native xp interface, in which your handlers are supplied as method overrides on the base ApplicationImpl class. I use several utility classes, like TreeMap, that only come with JDK 1.2.
The Java-xml4j implementation uses version 2.0.2 of IBM's XML parser for Java. I got it from the CD-ROM that IBM distributed to attendees of Xtech99, but you should be able to download a copy from somewhere on their alphaWorks website. This parser has a validating mode.
This implementation uses the SAX interface, so if you want to look at another SAX-based Java parser, you can probably use it with minimal change.
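The portability that SAX offers can be illustrated with a sketch. The example below uses Python's `xml.sax` module rather than the Java program discussed here; `ElementCounter` and `count_with_sax` are hypothetical names. The handler depends only on the SAX `ContentHandler` interface, so a different SAX parser can be passed in without touching it.

```python
import io
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Counts start tags; it relies only on the SAX handler interface."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def count_with_sax(xml_bytes, parser=None):
    # Any SAX-compliant reader may be supplied; the handler stays the same.
    parser = parser or xml.sax.make_parser()
    handler = ElementCounter()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.counts


print(count_with_sax(b"<doc><p>one</p><p>two</p></doc>"))
```

Only the line that creates the parser is specific to a particular implementation, which is the property that makes swapping SAX parsers a small change.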
The Perl implementation uses XML::Parser version 2.23, which hasn't been released yet as I write this but may well be released by the time you read it. The latest version of XML::Parser may be obtained from CPAN.
XML::Parser is built on top of James Clark's expat parser.
I'm using Pyexpat, another parser based on the expat parser. There are several XML parsers for Python, many of which you get from the xml package put together by the XML special interest group at python.org. I am using the pyexpat parser from the 0.5 distribution of that package.