Summary: What The Tests Show

September 15, 1999

Contents

• Part 1: Conformance Testing for XML Processors
• Part 2: Background on Conformance Testing
• Part 3: Non-Validating XML Processors
• Part 4: Validating XML Processors
• Part 5: Summary

Most of these conformance tests (the XMLTEST subset) have been available for quite some time. So you might expect that at this stage of the XML life cycle most processors would conform quite well to the specification. Surprisingly, that is not the case. Few processors passed all the XMLTEST cases, much less the whole OASIS suite. The class median on this open book test was about eighty percent, which suggests that many implementors just haven't applied any real effort to getting conformance right.

Common Types of Errors

A few types of errors are common in today's XML processors:

Sloppy programming is sometimes visible, where one or two basic errors cause many problems parsing documents. One could expect that of early stage implementations, but those are not the ones exhibiting such problems most strongly.
Many processors reject legal documents. This can be extremely troublesome in terms of interoperability.
- Several have problems handling UTF-16 encodings. Support for such encodings is available from the Java platform; the problems shown by the processors may relate to faulty detection of documents using such sixteen bit characters.
- A number had problems handling element and tag names with characters outside the ASCII range. This was evident even though this version of the OASIS tests doesn't fully cover those two pages of the XML specification; thorough coverage could involve up to 128,000 tests (one for each of 64,000 Unicode values in initial and subsequent characters, used in various types of XML identifiers).
- Handling of XML characters that require use of Unicode surrogate pairs is as yet inconsistent. Given the lack of any current real-world need for such characters, this is not a surprise -- only conformance testing is currently likely to turn up such problems. This is expected to change in the next few years, however.
Many processors accept documents they should reject:
- They do not appear to pay attention to rules that restrict what characters may exist in content, or in PUBLIC identifiers. While that latter is probably just a common oversight, the former might be an attempt at false efficiency. (I have disabled such checks in some processors, and have found the difference in speed to be hard to observe.)
- Parameter entity handling is often troublesome. This is an area where the XML specification does not completely describe the intended behavior, while the errata have not yet identified which parts of a DTD are not subject to parameter entity expansion, so some confusion here is to be expected.
With respect to validating processors, it was an unpleasant shock to discover that only one of them actually meets a basic requirement of the XML specification, that validity errors be continuable.
The validity constraints related to nesting of parameter entities get inconsistent treatment. One non-validating processor treats them as well formedness constraints; one validating one ignores them entirely. As a systems designer, I see no benefit to the confusion that could arise from expecting validating and non-validating processors to differ here -- and see a drawback, in terms of code complexity (which in this case requires tracking information which is otherwise irrelevant). It'd be preferable to have these two constraints be gone, or be well formedness constraints.
Output tests caused many trouble reports. It can be tricky to ensure that line ends and whitespace are handled correctly, though normalization of attribute values and PUBLIC identifiers should be easier.

SAX conformance was generally reasonable, although a few of the processors (or their drivers) had problems with error reporting or otherwise providing the correct data to the application. The fact that a less conformant SAX processor can be substituted for a more conformant one is truly advantageous when conformance is a goal.

Some Observations

Some XML processors are part of libraries where they are tightly integrated with packages for other XML software components, such as XSL/T, XPath, and of course DOM. Such integration can preclude customers from using conformant processors, since such libraries are often set up to rely on proprietary processor APIs rather than using only standard SAX functionality. It would be better to avoid that sort of coupling, which is mostly necessary (for now) when the contents of a DTD need to be accessed.

It is an interesting reflection on the current debate about "Open Source" versus proprietary software that while some of the most conformant processors are commercially controlled and offered without an Open Source license, equally conformant Open Source processors have been available for much longer. The exception is in the case of validating processors; no Open Source validating XML processors have yet been provided.

The case of Ælfred is also instructive. That processor made some specific tradeoffs against conformance, in favor of small size. Given that, it is curious that it is more or less the median processor represented in this testing. None of the other processors claim to have made such tradoffs; why is Ælfred not relatively worse?

I feel I should touch on performance impacts here. While it is surely possible to speed up by a few instructions here and there by being loose with conformance, it is also true that the most highly conformant XML processors are also among the fastest ones. There is no clear need to trade off those aspects of quality against each other.

To my understanding, none of the processors described here have significantly improved their conformance levels since their initial releases. I would much like to see this dynamic be changed!

Conclusion

I hope that both users and providers of XML processors will take these testing results to heart. Users should build systems using only conformant processors, and should ask their providers to address all conformance shortcomings in their software. Those providers should be addressing such problems proactively, and fixing customer-reported conformance bugs. This is particularly important for ``integrated'' packages, where using higher level APIs imposes the use of a particular processor that may not be very conformant. Such higher level APIs could generally be provided using any processor supporting the SAX API.

Widespread and close conformance to core web standards, such as XML, will help the web community avoid recreating the pain coming from inconsistent support of standards like CSS 1 and HTML 4 in today's systems. That chaos is an ongoing drain on the budgets of web site providers, and causes a lower quality web experience than users deserve. It is in everyone's interest to keep that from repeating with XML. Only now, near the beginning of the XML adoption curve, is there a real chance to prevent such problems from taking root.