Validating XML Processors

September 15, 1999

Contents

• Part 1: Conformance Testing for XML Processors
• Part 2: Background on Conformance Testing
• Part 3: Non-Validating XML Processors
• Part 4: Validating XML Processors
• Part 5: Summary

Not many validating XML processors are available at this time, and most of them are available with a non-validating sibling. The suppliers are all commercial; there are no Open Source validating processors supporting the SAX API, so far as I am currently aware. Applications needing to enforce document type declarations do have options available in other programming languages. Notably, C/C++ packages are freely available, sometimes with SGML support.

This table provides an alphabetical quick reference to the results of the analysis for validating processors:

Processor Name and Version	Passed Tests	Rating	Summary
IBM XML4j 2.0.15 (August 30, 1999)	832		This has the problems of its non-validating sibling, and does not permit processing to continue after validity errors.
Microsoft MSXML in Java JVM 5.00.3186 (August 24, 1999)	615		It's curious that this was bundled into Microsoft's Java VM without fixing its well-known conformance bugs. Avoid using it.
Oracle XML Parser 2.0.0.2 (August 11, 1999)	871		If this just permitted processing to continue after validity errors, it would be a top contender.
Sun ``Java Project X'' TR2 (May 21, 1999)	1065		No conformance violations detected.

More detailed discussion of each processor is below, in alphabetical order, with links to the complete testing reports.

IBM XML4j

Processor Name:	IBM XML4j
Version:	2.0.15 (August 30, 1999)
Type:	Validating
DOM Bundled:	Yes
Size of JAR File:	722 KBytes (uncompressed)
Download From:	http://www.alphaworks.ibm.com/

This is the validating version of IBM's processor. See the coverage of the non-validating processor for more details.

Full Test Results:	report-xml4j-val.html
Raw Results:	Passed 902 (of 1065)
Adjusted Results:	Passed 832

The same problems that show up in the non-validating processor also show up in the validating one ... in fact, the processor appears to be doing exactly the same thing in both cases! (I confess this discovery was quite a surprise to me; it may be that this version of the IBM processor is a regression from earlier releases in this respect.)

No validity errors were reported as such; all the invalid documents caused incorrect reports of fatal errors.

Microsoft MSXML in Java

Processor Name:	Microsoft MSXML
Version:	JVM 5.00.3186 (August 24, 1999)
Type:	Validating
DOM Bundled:	No
Size of JAR File:	N/A (bundled with JVM)
Download From:	http://www.microsoft.com/java/

Note that although this parser was originally called MSXML, Microsoft currently uses that term exclusively for its IE5 COM parser ("MSXML.DLL"). The more recent name for the Java parser is the "Microsoft XML Parser in Java". Please do not interpret these results as reflecting conformance for the C parser found in the Internet Explorer 5 web browser.

The MSXML package was originally intended to provide XML support for Internet Explorer 4 users. It was recently bundled with Microsoft's latest version (build 3186) of their Java Virtual Machine and SDK 3.2. A SAX driver is separately available. None of the standard programming interfaces (SAX, DOM) are bundled.

Full Test Results:	report-msxml.html
Raw Results:	Passed 648 (of 1065)
Adjusted Results:	Passed 615

This processor needs a separate SAX driver, since Microsoft has not yet offered support for the SAX API. The processor rejects a substantial number of documents that it should accept, producing fatal errors:

• The processor misreports validity errors as fatal errors.
• It ignores many validity errors.
• A wide range of common grammatical constructs are rejected, such as
- character and entity references in attribute values
- some processing instructions
- attribute names such as "xml:space"
- conditional sections
• Not all XML name characters are accepted; for example, some Japanese characters were rejected in names.
• In valid documents, some declarations were both ignored and reported as missing (validity violations, misreported as fatal errors). These may reflect inappropriate expectations of a particular declaration ordering.

In addition, this processor has entered infinite loops when asked to parse some documents. This has been observed with UTF-16 input text (which sometimes produces less drastic errors) as well as with some numeric character references. Such errors are quite dangerous.

Many output tests failed; more than seems usual.

As for documents which should have been rejected but were in fact accepted, there were many of those also:

• Various illegal processing instructions were accepted, including the <?XML ...?> style ones, facilitating islands of Microsoft-only "pseudo-XML" which have for a long time been troublesome.
• Many characters disallowed by XML were accepted by this processor, such as many control codes and the 'escape' character. This includes references to such characters, even when they could not be represented in Unicode (even with surrogate pairs), and UTF-8 encodings of such unrepresentable characters.
• Characters that should have been disallowed in PUBLIC identifiers were allowed.
• SGML style comments were accepted.
• Text with embedded ']]>' was accepted.
• Constructs like <element att="1" att="2"/> were not rejected.
• Illegal DTD syntax was accepted.
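
Several of the items above are easy to check against any SAX-capable processor. The following is only a minimal sketch, not the test harness used for this analysis: it assumes the SAX2 and JAXP classes shipped with a current JDK (`SAXParserFactory`, `XMLReader`, `DefaultHandler`), and the class name `WellFormednessProbe` and its sample documents are hypothetical. It feeds a few of the illegal constructs listed above to the parser and records whether a fatal error was reported.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether the JDK's default SAX parser
// treats a piece of text as not well-formed (i.e. raises a fatal error).
final class WellFormednessProbe {

    static boolean isRejected(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        final boolean[] sawFatal = { false };

        reader.setErrorHandler(new DefaultHandler() {
            @Override
            public void fatalError(SAXParseException e) throws SAXParseException {
                sawFatal[0] = true;   // remember that the parser objected
                throw e;              // stop parsing, as SAX permits
            }
        });

        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // Expected for malformed input; the flag records the outcome.
        }
        return sawFatal[0];
    }

    public static void main(String[] args) throws Exception {
        String[] samples = {
            "<element att=\"1\" att=\"2\"/>",   // duplicate attribute
            "<e>text with ]]> embedded</e>",    // ']]>' in character data
            "<e>&#27;</e>",                     // reference to the 'escape' character
            "<e att=\"1\"/>"                    // well-formed control case
        };
        for (String s : samples) {
            System.out.println((isRejected(s) ? "rejected: " : "accepted: ") + s);
        }
    }
}
```

A conforming processor should reject the first three samples and accept the last one; a processor with the problems described above would accept some of the first three.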

Support for multiple text encodings seems weak; documents declared as being encoded in "UTF-16" were inappropriately rejected. Japanese encodings were neither rejected nor handled consistently.

This test harness shows that when a SAX ErrorHandler callback is used to report an exception, that exception will not be passed back to the application through the Parser.parse() call. This appears to be a driver issue with a simple fix.
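
The behavior the harness expects can be sketched as follows. This is an illustration under stated assumptions rather than the harness itself: it uses the SAX2 `XMLReader` and JAXP classes bundled with a current JDK instead of the SAX 1.0 `Parser` interface discussed here, and the class name `ErrorPropagationCheck` and its sample document are hypothetical. The handler throws from its `error()` callback, and the check is whether that exception comes back out of `parse()`.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical check: does an exception thrown by the ErrorHandler
// reach the application through parse()?
final class ErrorPropagationCheck {

    // Invalid (but well-formed) document: an EMPTY element with content.
    static final String INVALID =
        "<!DOCTYPE doc [ <!ELEMENT doc EMPTY> ]>\n<doc>text</doc>";

    static boolean propagates(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);                 // ask for DTD validation
        XMLReader reader = factory.newSAXParser().getXMLReader();
        final boolean[] handlerThrew = { false };

        reader.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) throws SAXParseException {
                handlerThrew[0] = true;
                throw e;      // the application chooses to abort on validity errors
            }
        });

        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // The exception made it back to the caller of parse().
            return handlerThrew[0];
        }
        return false;         // parse() returned normally: nothing propagated
    }

    public static void main(String[] args) throws Exception {
        System.out.println("exception propagated: " + propagates(INVALID));
    }
}
```

A driver with the problem described above would let `parse()` return normally even though the handler threw.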

Oracle XML Parser

Processor Name:	Oracle XML Parser
Version:	2.0.0.2 (August 11, 1999)
Type:	Validating
DOM Bundled:	Yes
Size of JAR File:	556 KBytes (uncompressed)
Download From:	http://www.oracle.com/xml/

This is the validating mode of Oracle's new processor. See the coverage of the non-validating processor for more details.

Full Test Results:	report-oracle-val.html
Raw Results:	Passed 871 (of 1065)
Adjusted Results:	Passed 871

This processor failed to accept a few valid documents, and there were some difficulties handling NMTOKENS attribute lists and handling a mixed content specification.

This shared some of the problems that its non-validating sibling had with reading UTF-16 and multibyte UTF-8 characters. Similarly, it also had problems with names which actually tried to exercise the variety of name characters permitted by the XML specification.

The output was much more correct than the output from its non-validating sibling. That's a bit puzzling, but it does suggest that the core engine needs only minor tweaks to make sure they're both equally correct.

Also of note is the fact that this validating processor accepted none of the SGML-isms that the non-validating one allowed in its input DTD syntax. Again, the non-validating processor should be acting more like the validating one.

Other than rejecting documents it shouldn't, the problems with this processor mostly related to validity violations that were not reported at all, including:

• Improper nesting of PEs in declarations (neither of the nesting constraints seems to be tested).
• Documents improperly declared as standalone.
• Several cases of content model violations (such as for an incorrect number of elements, and some cases of data inside an EMPTY element).
• Illegal attribute value defaults were accepted for ENTITY/ENTITIES attributes.
• A reference to an undeclared notation was ignored.

Internal errors were reported in various cases when validating illegal documents. These included array bounds exceptions (e.g. with NMTOKENS attributes) and null pointer exceptions working with IDREF/IDREFS values.

Significantly, none of the validity errors were continuable; some were correctly reported as non-fatal through SAX, but then the processor refused to continue processing. (As noted above, some were incorrectly reported as fatal errors in the first place.)
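
To make "continuable" concrete: in SAX, a validity error arrives through the `error()` callback, and if the handler returns without throwing, the processor is expected to keep going to the end of the document. The sketch below illustrates that expectation; it is not the harness used here. It assumes the SAX2 and JAXP classes bundled with a current JDK, and the class names `ContinuableValidation` and `Collector` and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical check: collect validity errors without aborting the parse.
final class ContinuableValidation {

    // Well-formed but invalid: a missing required attribute, data inside
    // an EMPTY element, and an undeclared element.
    static final String INVALID =
        "<!DOCTYPE doc [\n"
        + "  <!ELEMENT doc (item)*>\n"
        + "  <!ELEMENT item EMPTY>\n"
        + "  <!ATTLIST item id ID #REQUIRED>\n"
        + "]>\n"
        + "<doc><item/><item id=\"a\">text</item><bogus/></doc>";

    static final class Collector extends DefaultHandler {
        final List<String> validityErrors = new ArrayList<>();
        boolean reachedEnd = false;

        @Override
        public void error(SAXParseException e) {
            // Record and return normally, so the parser may continue.
            validityErrors.add(e.getLineNumber() + ": " + e.getMessage());
        }

        @Override
        public void endDocument() {
            reachedEnd = true;   // only seen if parsing ran to completion
        }
    }

    static Collector validate(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        Collector collector = new Collector();
        reader.setContentHandler(collector);
        reader.setErrorHandler(collector);
        reader.parse(new InputSource(new StringReader(xml)));
        return collector;
    }

    public static void main(String[] args) throws Exception {
        Collector c = validate(INVALID);
        c.validityErrors.forEach(System.out::println);
        System.out.println("reached end of document: " + c.reachedEnd);
    }
}
```

A processor that stops at the first validity error would report at most one error here and never call `endDocument()`, which is what makes a single pass over a document much less useful for authors fixing several mistakes at once.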

Sun ``Java Project X''

Processor Name:	Sun ``Java Project X''
Version:	TR2 (May 21, 1999)
Type:	Validating
DOM Bundled:	Yes
Size of JAR File:	132 KBytes (or 246 KBytes uncompressed)
Download From:	http://java.sun.com/products/xml

This is the validating mode of the non-validating processor presented elsewhere.

Full Test Results:	report-sun-val.html
Raw Results:	Passed 1065 (of 1065)
Adjusted Results:	Passed 1065

This processor reported no conformance errors. That was a design goal of the processor.

I analysed the negative results when I worked at Sun, and believe that every diagnostic reports the correct error. (This is the only parser that I can report I have carefully examined for that issue.)