Non-Validating XML Processors

September 15, 1999

Contents

• Part 1: Conformance Testing for XML Processors
• Part 2: Background on Conformance Testing
• Part 3: Non-Validating XML Processors
• Part 4: Validating XML Processors
• Part 5: Summary

The bulk of the XML processors tested were non-validating ones. For applications where enforcing the rules in a document type declaration is not necessary, a wide variety of choices and design tradeoffs is available. The Open Source parsers are invariably non-validating.

This table provides an alphabetical quick reference to the results of the analysis for non-validating processors:

Processor Name and Version	Passed Tests	Rating	Summary
Ælfred 1.2a (July 2, 1998)	865		If you want to trade off some correctness to get a very small parser, look at this one.
DataChannel XML Java Parser (April 15, 1999)	327		Don't even consider using this package until its bugs are fixed.
IBM XML4j 2.0.15 (August 30, 1999)	832		Two notable bugs are the root cause of most of the errors detected in this processor.
Lark 1.0beta (January 5, 1998)	923		More of the current generation of processors should be as conformant as this one!
Oracle XML Parser 2.0.0.2 (August 11, 1999)	904		This new entry on the processor scene is quite promising, despite rough edges where it should learn from its validating sibling.
Silfide XML Parser (SXP) 0.88 (July 25, 1999)	731		Better choices are available for standalone XML processors.
Sun "Java Project X" TR2 (May 21, 1999)	1065		No conformance violations detected.
XP 0.5beta (January 2, 1999)	1050		This processor is all but completely conformant.

More detailed discussion of each processor is below, in alphabetical order, with links to the complete testing reports.

Ælfred

Processor Name:	Ælfred
Version:	1.2a (July 2, 1998)
Type:	Non-Validating
DOM Bundled:	No
Size of JAR File:	65 KBytes with SAX (uncompressed)
Download From:	http://www.microstar.com/aelfred.html

This processor is uniquely lightweight; at about 33 KBytes of JAR file size (compressed, and including the SAX interfaces), it was designed for downloading in applets and explicitly traded off conformance for size. While it has not been updated in some time, it is still widely used.

Rating:
Full Test Results:	report-aelf.html
Raw Results:	Passed 865 (of 1065)
Adjusted Results:	Passed 865

This processor rejects a certain number of documents it shouldn't, and isn't clear about why it did so. (Good diagnostics would have cost space, which this processor chose not to spend that way.) There seems to be a pattern where the processor expects a quoted string of some kind, and is surprised by what it found instead.

There are cases where it's clear why those documents were rejected. For example, syntax that looks like a parameter entity reference but is found inside of a comment should be ignored, but isn't. The XML spec itself uses such constructs in its DTD, but its errata haven't yet been updated to address the issue of exactly where parameter entities get expanded and where they don't. (Were there not the example of the XML spec itself, and feedback from the XML editors on this issue, it would seem that this processor was in compliance.)

Character references that would expand to Unicode surrogate pairs are inappropriately rejected. Nobody has any real reason to use such pairs yet, so in practice this isn't a problem. The data provided as output is also incorrect in some cases.
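To make the surrogate pair issue concrete, here is a minimal sketch in present-day Java of expanding a character reference. It is not taken from any processor reviewed here; the class and method names are hypothetical, and it skips the check that the result is a legal XML character.

```java
// Hypothetical helper for illustration only.
final class CharRefSketch {
    /** Expands a reference such as "&#x10000;" or "&#65;" into Java chars. */
    static String expand(String ref) {
        if (!ref.startsWith("&#") || !ref.endsWith(";")) {
            throw new IllegalArgumentException("not a character reference: " + ref);
        }
        String digits = ref.substring(2, ref.length() - 1);
        int codePoint = digits.startsWith("x")
                ? Integer.parseInt(digits.substring(1), 16)   // hexadecimal form
                : Integer.parseInt(digits, 10);               // decimal form
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a Unicode code point: " + ref);
        }
        // One char for code points up to U+FFFF, a high/low surrogate
        // pair for U+10000 and above.
        return new String(Character.toChars(codePoint));
    }
}
```

A reference to any character above U+FFFF therefore yields two Java chars, which is the case this processor rejects.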

The bulk of this processor's nonconformance lies in the fact that it consciously avoids checking for certain errors, to reduce size and to some extent to increase speed. For example, characters are rarely checked for being in the right ranges, saving code in several locations both to make those checks and to report the associated errors. Certain syntax rules, like "--" being illegal in comments and "]]>" being illegal in normal text, are also ignored.
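The checks being skipped are individually small. As a rough sketch (hypothetical names, present-day Java, not code from Ælfred or any other processor), the character range test from the Char production of XML 1.0 and the comment rule look like this:

```java
// Hypothetical helper for illustration only.
final class XmlCharSketch {
    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isXmlChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            || (c >= 0x10000 && c <= 0x10FFFF);
    }

    /** True if the text between "<!--" and "-->" is legal comment content. */
    static boolean isLegalCommentBody(String body) {
        // "--" may not appear inside a comment, and the body may not end
        // with "-" (that would produce "--->").
        if (body.contains("--") || body.endsWith("-")) {
            return false;
        }
        return body.codePoints().allMatch(XmlCharSketch::isXmlChar);
    }
}
```

Each such test has to be repeated wherever character data is scanned and paired with an error report, which is where the size cost comes from.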

In short, most of the time if you feed this processor a legal XML document it will parse it without needing many resources. But if you feed it illegal XML, it won't be good about telling you that anything was wrong, or exactly what was wrong.

DataChannel XML Java Parser

Processor Name:	DataChannel XML Java Parser
Version:	(April 15, 1999)
Type:	Non-Validating
DOM Bundled:	Yes
Size of JAR File:	448 KBytes (uncompressed)
Download From:	http://xdev.datachannel.com/

This parser is part of a package developed with the assistance of Microsoft, providing a Java implementation of much of the XML manipulation functionality in Internet Explorer 5. While it is freely available, support (such as bug fixes) must be paid for. Validation is available in the package, but not through the SAX API.

Rating:
Full Test Results:	report-dcxjp.html
Raw Results:	Passed 627 (of 1065)
Adjusted Results:	Passed 327

This SAX parser is not currently usable. It rejected almost all documents, due to a simple bug that no other parser has needed to stumble over.

These rejections include all of the direct failures, as well as the huge number of "false passes" on the negative tests ... almost three hundred were caused by this error alone. (Few other parsers had that many failures at all; none had as many "false passes".) The exception is a TokenizerException like the following:

Unrecognized token following a '<!' sequence!
(line 1, position 4, file:/db/xml/xmlconf/xmltest/valid/sa/093.xml)

For the record, the document in question is this innocuous snippet, primarily useful as an output test (the content of the <doc> element has CRLF and CR line ends, which should normalize to three LF characters):

<!DOCTYPE doc [
<!ELEMENT doc (#PCDATA)>
]>
<doc>


</doc>
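The line-end normalization that this output test exercises can be sketched as follows. This is an illustration in present-day Java with hypothetical names, and the input in the usage note is an assumed mix of CRLF and CR, not the literal bytes of the test file.

```java
// Hypothetical helper for illustration only.
final class LineEndSketch {
    /** Replaces each CRLF pair, and each CR not followed by LF, with a single LF. */
    static String normalize(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\r') {
                out.append('\n');
                // Swallow the LF of a CRLF pair so it is not counted twice.
                if (i + 1 < raw.length() && raw.charAt(i + 1) == '\n') {
                    i++;
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, normalize("\r\n\r\r\n") returns a string of exactly three LF characters.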

Many of the documents which this processor accepted were documents which contained illegal XML characters, and so they should have caused fatal errors to be reported.
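The difference between the raw and adjusted results comes from reading the diagnostics: a negative test counts as a raw pass whenever the document is rejected, but it is only a real pass if it was rejected for the right reason. The sketch below shows the mechanical half of that process. It assumes the JAXP SAXParserFactory API found in current Java releases (it postdates the processors reviewed here), and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical harness fragment for illustration only.
final class NegativeTestSketch {
    /** Returns the fatal-error message, or null if the document was accepted. */
    static String fatalErrorFor(String xml) {
        SAXParser parser;
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false);            // non-validating mode
            parser = factory.newSAXParser();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
        try {
            // DefaultHandler rethrows fatal errors, so parse() fails on them.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXException e) {
            return String.valueOf(e.getMessage());
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Raw result: positive tests must be accepted, negative tests rejected. */
    static boolean rawPass(String xml, boolean wellFormed) {
        return (fatalErrorFor(xml) == null) == wellFormed;
    }
}
```

A reviewer would then compare each returned message against the error the test case was written to trigger. Rejections for an unrelated reason, such as the TokenizerException above, are the "false passes" that the adjusted results subtract.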

Speaking as a systems developer, it's hard for me to believe that this package was released without anyone knowing about these bugs, and harder still to understand why they weren't fixed in the months since it was first released. If DataChannel wasn't using the XMLTEST cases all along, it should have been. In any case, they were definitively informed about this bug in the first week of August, and it remains unfixed at this writing.

IBM XML4j

Processor Name:	IBM XML4j
Version:	2.0.15 (August 30, 1999)
Type:	Non-Validating
DOM Bundled:	Yes
Size of JAR File:	722 KBytes (uncompressed)
Download From:	http://www.alphaworks.ibm.com/

IBM's package includes several processor configurations, including validating and DOM-oriented parsers, and it works well with other XML software provided by the company. It gets regular updates. As "alphaworks" software, it has no guarantees. Commercial usage permission can be granted.

Rating:
Full Test Results:	report-xml4j-nv.html
Raw Results:	Passed 902 (of 1065)
Adjusted Results:	Passed 832

What sticks out the most about this processor is that just two clear cases of internal errors seem to dominate the test failures, making it reject many well-formed documents which it should have accepted. These also mask other errors that the processor should have reported. (The same symptoms exist in the validating processor, which shares the same core engine.)

Thankfully, those internal errors don't show up often enough to keep this processor from correctly handling the bulk of the test suite. If they were fixed, this processor might do quite well on a conformance evaluation, rather than being below the median.

Beyond those two bugs, a few other problems also turned up. This processor seems to have some problems reading UTF-16 text. In some cases it rejects XML characters that it should accept. That's significant since the result was rejecting many of the XML documents which used non-English characters.

What were those bugs? Well over half of the falsely rejected documents (and significant numbers of the incorrectly rejected ones) are cases where the processor:

• expects an end tag inappropriately ... often labeled as "null", a strong indication of an internal error (null pointer) particularly since that tag name wasn't used; or labeled with the target of a processing instruction, another such indication.
• inappropriately reports a recursive entity expansion ... it appears that this diagnostic is produced in cases of correct entity use, as well as situations that don't involve entities at all.

These were reported to IBM when this processor was first released, and it is not clear whether any of the subsequent releases have reduced the frequency of these false errors.

Lark

Processor Name:	Lark
Version:	1.0beta (January 5, 1998)
Type:	Non-Validating
DOM Bundled:	No
Size of JAR File:	135 KBytes (uncompressed, with SAX classes)
Download From:	http://www.textuality.com/Lark/

Lark is one of the older XML processors still in use. It was written by Tim Bray, one of the editors of the XML specification, in conjunction with that specification, partly to establish that the specification was in fact implementable. It is not actively being maintained.

Rating:
Full Test Results:	report-lark.html
Raw Results:	Passed 923 (of 1065)
Adjusted Results:	Passed 923

This processor rejects a few too many documents which it should accept, and doesn't produce the correct output in a number of cases. However, it is quite good at rejecting malformed documents for the correct reasons.

Quite a lot of the documents that this processor rejects have XML declarations which aren't quite what it expects, in some cases seemingly due to having standalone declarations. Others use some name characters which aren't accepted. There appears to be a declaration ordering constraint imposed by the processor, and difficulties handling conditional sections. Character references that expand to surrogate pairs are not accepted.

Oracle XML Parser

Processor Name:	Oracle XML Parser
Version:	2.0.0.2 (August 11, 1999)
Type:	Non-Validating
DOM Bundled:	Yes
Size of JAR File:	556 KBytes (uncompressed)
Download From:	http://www.oracle.com/xml/

Oracle has a new version of their package, which appears promising. It includes XSL/T support, and has a compact API that supports both validating and non-validating processing. At this time, this implementation is not licensed for commercial use.

As this article went to press, a new version of the Oracle XML Parser for Java, v2.0.2, was released. In addition to some bug fixes in the XML processor, which do not seem to affect the overall ratings for these processors, this version includes support for the August XSL/T working draft.

Rating:
Full Test Results:	report-oracle-nv.html
Raw Results:	Passed 904 (of 1065)
Adjusted Results:	Passed 904

This processor was quite good about not rejecting documents it should have accepted, but needs some work yet on reporting the correct data and on rejecting some illegal documents. Its diagnostics made the task of analyzing its test results easy; I was able to analyze the negative test results much more thoroughly than for most other processors. That will in turn make life easier for the users of applications built with this processor.

With respect to the output from this processor, there were a handful of cases where incorrect data was reported. From analyzing a subset of these cases, I noticed:

• Second declarations of attributes not being ignored;
• Incorrect whitespace treatment in attribute values and in entity expansions;
• Misinterpretation of multibyte UTF-8 input characters;
• SAX DTDHandler callbacks not being invoked.

There appear to be some problems with character set handling. Unicode surrogate pairs are not handled correctly, and some text encoded in UTF-16 was incorrectly rejected. Another issue with character handling is that some characters which should cause fatal errors (such as form feeds, misplaced byte order marks, and some characters in PUBLIC identifiers) are permitted.
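As one example of the kind of check involved, the characters allowed in a PUBLIC identifier are limited by the PubidChar production of XML 1.0. The sketch below is an illustration in present-day Java with hypothetical names; it is not code from the Oracle parser.

```java
// Hypothetical helper for illustration only.
final class PubidSketch {
    // Punctuation permitted by the PubidChar production of XML 1.0.
    private static final String PUNCTUATION = "-'()+,./:=?;!*#@$_%";

    static boolean isPubidChar(char c) {
        return c == 0x20 || c == 0xD || c == 0xA
            || (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || PUNCTUATION.indexOf(c) >= 0;
    }

    /**
     * Checks the text between the quotes of a public identifier literal.
     * (An apostrophe is only usable when the literal is delimited by
     * double quotes; this sketch does not look at the delimiters.)
     */
    static boolean isLegalPublicId(String content) {
        for (int i = 0; i < content.length(); i++) {
            if (!isPubidChar(content.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

A conformant processor reports a fatal error when a public identifier contains anything outside this set, such as "{" or "&".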

Perhaps the most worrisome case of wrongly accepting a document was accepting a document which omitted an end tag. This processor even accepts SGML tag minimization and exception specifications in its element type declarations. Even if this were intentional, it is a substantial bug to enable this by default on a processor calling itself an "XML" processor. The acceptance of such SGML syntax is one of the more notable patterns of errors in this processor. It is also puzzling since its validating sibling handled such syntax correctly (rejecting it with fatal errors).

There were a variety of cases where array indexing exceptions were reported, or where certain syntax was incorrectly accepted. Many of those are attributable to this being an early release.

There are some issues with the reporting of errors through SAX; in many cases, the processor doesn't pass the correct exception object through, but instead substitutes a different one. This can affect application code.

Silfide XML Parser (SXP)

Processor Name:	Silfide XML Parser (SXP)
Version:	0.88 (July 25, 1999)
Type:	Non-Validating
DOM Bundled:	Yes
Size of JAR File:	279 KBytes (or 595 KBytes uncompressed)
Download From:	http://www.loria.fr/projects/XSilfide/

SXP is part of the "Silfide" project, a client/server based environment for distributing language resources built with XML and Java. It is currently a prototype, and is not available for commercial use. Silfide incorporates a 100% Pure Java web server, and SXP implements early drafts of XPointer and XLink.

Rating:
Full Test Results:	report-sxp.html
Raw Results:	Passed 761 (of 1065)
Adjusted Results:	Passed 731

This component of the Silfide system appears not to have received as much attention as other parts of it. Also, it thrashes on systems with only 64 MBytes of physical memory.

Support for UTF-16 and UTF-8 encodings is not strong; the encodings don't appear to be autodetected, so their data is handled incorrectly. While many other errors are visible, the diagnostics are not often clear about what was expected.
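Autodetection of these encodings needs only the first few bytes of the entity. The sketch below follows the spirit of the non-normative autodetection appendix of the XML specification. It is an illustration in present-day Java with hypothetical names, it ignores UCS-4 and EBCDIC, and a real processor would go on to read the encoding declaration, which can refine the guess.

```java
// Hypothetical helper for illustration only.
final class EncodingSniffSketch {
    /** Guesses the encoding family from the first bytes of an XML entity. */
    static String guessEncoding(byte[] b) {
        int b0 = at(b, 0), b1 = at(b, 1), b2 = at(b, 2), b3 = at(b, 3);
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";                // byte order mark
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";                // byte order mark
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";     // UTF-8 signature
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) return "UTF-16BE"; // "<?" without a mark
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) return "UTF-16LE"; // "<?" without a mark
        // Anything else is treated as UTF-8 (or an ASCII-compatible encoding
        // named by the encoding declaration).
        return "UTF-8";
    }

    /** Unsigned byte at index i, or -1 if the input is shorter than that. */
    private static int at(byte[] b, int i) {
        return i < b.length ? (b[i] & 0xFF) : -1;
    }
}
```

The result would typically be used to choose the decoder before any markup is read, so that UTF-16 data is never misread as single-byte text.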

Output tests seem to fail in large part because the processor reports character data outside the context of an element, where only markup exists.

A number of illegal characters were accepted, including malformed surrogate pairs and out-of-range characters, which should have been rejected.

Sun "Java Project X"

Processor Name:	Sun "Java Project X"
Version:	TR2 (May 21, 1999)
Type:	Non-Validating
DOM Bundled:	Yes
Size of JAR File:	132 KBytes (or 246 KBytes uncompressed)
Download From:	http://java.sun.com/products/xml

This package includes validating and non-validating parsers, which may optionally be connected to a DOM implementation. Sun plans to turn this processor into the reference implementation of a Java Standard Extension for XML, with APIs that are yet to be specified (beyond inclusion of SAX and DOM). Commercial use is permitted, and this package has been relatively stable for some time.

Rating:
Full Test Results:	report-sun-nv.html
Raw Results:	Passed 1065 (of 1065)
Adjusted Results:	Passed 1065

This processor reported no conformance errors. That was a design goal of the processor.

I analyzed the negative results when I worked at Sun, and believe that every diagnostic reports the correct error. (This is the only parser that I can report I have carefully examined for that issue.)

XP

Processor Name:	XP
Version:	0.5beta (January 2, 1999)
Type:	Non-Validating
DOM Bundled:	No
Size of JAR File:	166 KBytes (uncompressed, without SAX classes)
Download From:	http://www.jclark.com/xml/xp/

This processor was written by James Clark, who served as technical lead for the XML spec and as editor of the XSL/T specification. (He's also written the most widely used SGML implementation, and done many other things in this community.) It is available for commercial use.

Rating:
Full Test Results:	report-xp.html
Raw Results:	Passed 1050 (of 1065)
Adjusted Results:	Passed 1050

This is one of the most conformant XML processors available. In contrast to the variety of problems shown by most other processors, this test suite identifies only three categories of specification violation in XP:

• Violations of two validity constraints relating to nesting of parameter entities are treated as fatal errors, instead of nonfatal ones as a validating processor would, or even as non-errors as most non-validating processors do. The author of this processor has communicated to me that he does not see fixing these as necessary, and I can concur. (These account for the bulk of the reported errors.)
• Treatment of parameter entities does not strictly match what is expected by the test suite. These relate to two interpretation questions which have been raised to the W3C but not answered through the errata process for the XML specification.
• Some output violations relate to reporting notation declarations and normalizing attributes.

In short: a reasonable design choice (based on what I think of as a specification issue), some specification issues, and what appear to be two minor problems. Wouldn't it be nice if all software were this close to its specification!