XSLT Processor Benchmarks
Overview
XSLTMark is a benchmark for the comprehensive measurement of XSLT processor performance. It consists of forty test cases designed to assess important functional areas of an XSLT processor. The latest release, version 2.0, has been used to assess ten different processors. This article describes the benchmark methodology and provides a brief overview of the results.
Important Update (2001/04/04). Since the first publication of this article, errors in the methodology have been discovered. Although this doesn't affect the "headline news," it is significant. For full details please be sure to read the complete explanation in the "Results" section below. The performance chart has been updated to reflect the corrections.
XML performance and XSLT
The performance of XML processing in general is of considerable concern to customers and engineers alike. With more and more XML-encoded data being transmitted and processed, the ability to both predict and improve XML performance is critical to delivering scalable and reliable solutions. While XSLT is a big part of delivering on the overall value proposition of XML (by allowing XML-to-XML data interchange and XML-to-HTML content presentation), it also presents the greatest performance challenge. Early anecdotal evidence showed wide disparities in real-life results, and no comprehensive benchmark tools were available for more systematic assessment and comparison. We first created and started using XSLTMark internally at DataPower in mid-2000 and released it publicly in November. With a well-balanced test base it is possible to make predictions about likely application behavior and to select the best processor for a given application. Some XSLT engines have made considerable improvements over the past six months, but it is also clear that further performance gains will be required to support the growth of XML.
Scoring System
We considered many possible scoring systems for measuring XSLT performance. A survey of existing non-XML benchmark platforms revealed that many benchmarks use abstract, unit-less scores to rate performance. These scores are often composed of weighted averages of separate benchmark components that would not otherwise be aggregated. While the abstract scoring method is excellent for relative studies of performance, it lacks the value of a "real-world" number in a standard unit.
Most of the other efforts to assess XSLT performance have centered around execution time measurements for a small number of test cases (where lower scores are better). These numbers are again great for relative comparison, but they are hard to assess in an absolute sense (i.e., in relation to other types of computer processing).
For these reasons, and because XML is increasingly becoming part of the network, XSLTMark uses kilobytes per second as its overall score, where the kilobyte figure is the average of the input and output document sizes. This provides a score that is tied to both the document size and the time expended on processing. The variations between scores for different test cases are then attributable to the complexity and specifics of the processing performed by the stylesheet and to the structure of the input document. By examining the detailed data for the individual cases, a great deal of additional knowledge can be gleaned. We conducted some preliminary tests to obtain nodes-per-second measurements, but in the end we settled on kilobytes per second as the best way to characterize real-world performance.
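To make the arithmetic concrete, the per-test score amounts to something like the following (a minimal sketch in Java; the method name and the 1024-byte kilobyte are our illustration, not XSLTMark's actual source):

    // Throughput score for one test case, as described above: the
    // average of input and output size (in KB) divided by elapsed time.
    // Illustrative sketch only, not XSLTMark's actual code.
    static double kbPerSecond(long inputBytes, long outputBytes, double seconds) {
        double averageKb = ((inputBytes + outputBytes) / 2.0) / 1024.0;
        return averageKb / seconds;
    }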
The first two releases of XSLTMark reported as their total score an aggregate KB/s measurement, computed from the total execution time and the total kilobytes processed. We like this calculation because it strictly measures the overall performance of the processor on a very broad range of tasks. It is important to understand that this aggregate score gives more weight to computationally intensive test cases: since the score is based on the total execution time, the "slower" test cases have a greater effect on the score. This contrasts with an arithmetic mean of individual test case scores, which is weighted in favor of "faster" test cases.
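The weighting difference can be seen in a short sketch (variable names are our own; XSLTMark's implementation may differ):

    // Aggregate score: total kilobytes over total seconds. A slow test
    // case contributes a large share of the denominator, so it pulls
    // the score down more than a fast case pulls it up.
    static double aggregateScore(double[] kb, double[] seconds) {
        double totalKb = 0, totalSec = 0;
        for (int i = 0; i < kb.length; i++) {
            totalKb += kb[i];
            totalSec += seconds[i];
        }
        return totalKb / totalSec;
    }

    // Arithmetic mean of per-test scores: one very fast test case can
    // dominate the sum, weighting the result toward "faster" cases.
    static double arithmeticMean(double[] kb, double[] seconds) {
        double sum = 0;
        for (int i = 0; i < kb.length; i++) {
            sum += kb[i] / seconds[i];
        }
        return sum / kb.length;
    }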
XSLTMark 2.0 introduces a geometric mean score in addition to the aggregate score. We include this measurement because it provides an average of test case scores in a manner that is not weighted by the qualities of individual test cases. Specifically, scaling the throughput of a single test case results in a scaling of the geometric mean by a factor that does not depend on which test case is scaled.
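A sketch of the geometric mean (again illustrative rather than XSLTMark's actual source); note that doubling any single score multiplies the result by the same factor, 2^(1/n), whichever test case it was:

    // Geometric mean of per-test KB/s scores, computed in log space
    // to avoid overflow from multiplying many values together.
    static double geometricMean(double[] scores) {
        double logSum = 0;
        for (double s : scores) {
            logSum += Math.log(s);
        }
        return Math.exp(logSum / scores.length);
    }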
In order to support both C/C++ and Java processors, XSLTMark uses wall-clock time (elapsed real-world time rather than CPU seconds), as obtained by gettimeofday() or Java's System.currentTimeMillis(). This means that benchmarking must occur on an unloaded system, and tests should execute a sufficient number of iterations to avoid real-time clock granularity and interrupt effects. Considerable time was invested in ensuring that this approach produced precise and accurate measurements.
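In a Java harness the approach looks roughly like this (runTestCase is a hypothetical stand-in for a single transformation run; the iteration count would be tuned per test):

    // Wall-clock timing over many iterations, so that clock granularity
    // and stray interrupts are amortized away. Illustrative sketch only.
    static double secondsPerRun(Runnable runTestCase, int iterations) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            runTestCase.run();
        }
        long elapsedMs = System.currentTimeMillis() - start;
        return (elapsedMs / 1000.0) / iterations;
    }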
Result Verification
The spotty compliance of many XSLT processors meant that we had to spend considerable time manually verifying the output of various tests. DataPower's internal projects also required that results be verified, so basic compliance checking was built into XSLTMark early on. The intent is not to provide a compliance test suite; although XSLTMark is comprehensive in its functional-area coverage and presents a balanced performance assessment, it is not comprehensive enough for a full compliance suite. We look forward to the compliance efforts of OASIS and the W3C. XSLTMark's compliance testing exists to ensure that largely incomplete processors do not receive unfairly high scores. This is especially important because implementing many parts of the XSLT specification correctly carries a certain performance penalty. Often a processor that does well on a subset of cases but fails many others will be considerably slower by the time it achieves full compliance. (This was the case for Transformiix, Mozilla's XSLT processor, which has made great progress in compliance but at a cost to performance.)
Result verification is achieved by normalizing the output using DataPower's "dgnorm" (for "normalizer") tool. This simple C program is a SAX processor that removes insignificant whitespace, handles HTML peculiarities, sorts attributes alphabetically, and does some other processing to make the output of XSLTMark stylesheets directly amenable to "diff" and byte-wise comparison. After normalization, a simple comparison of a reference result and the output is performed. (Purists correctly protest that dgnorm is not a general XML normalizer; it is only suitable for normalizing the results of XSLTMark test cases.)
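dgnorm itself is a C program and is not reproduced here, but the core normalize-then-diff idea can be sketched with a SAX handler (Java, illustrative only; the class and method names are our own):

    import java.util.TreeMap;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // Emits a flattened form of a document with attributes sorted
    // alphabetically and whitespace-only text dropped, so that two
    // results can be compared byte-for-byte. dgnorm handles more
    // (e.g. HTML output peculiarities) than this sketch does.
    class NormalizingHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            out.append('<').append(qName);
            TreeMap<String, String> sorted = new TreeMap<>();
            for (int i = 0; i < atts.getLength(); i++) {
                sorted.put(atts.getQName(i), atts.getValue(i));
            }
            sorted.forEach((name, value) -> out.append(' ').append(name)
                    .append("=\"").append(value).append('"'));
            out.append('>');
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                out.append(text);
            }
        }
    }

Running the reference result and the processor's output through such a handler and comparing the two buffers is, in essence, what the verification step does.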
It should be noted that sometimes there is more than one correct result, in which case it is still necessary to verify all "CHK OUTPUT" lines to make sure that they reflect a real compliance problem. This is why some benchmark results have a few manually corrected scores. Previous XSLTMark releases triggered comments from a number of prominent XSLT implementers, and some of the thorny compliance ambiguities have been resolved in the current version. The "number" test case was difficult to assess due to ambiguity in the XSLT specification and widespread disagreement among processors, so we omitted the associated reference file; hence the "NO REFERENCE" found in the detailed results.