Validating XML Processors
September 15, 1999
Not many validating XML processors are available at this time, and most of them are available with a non-validating sibling. The suppliers are all commercial; there are no Open Source validating processors supporting the SAX API, so far as I am currently aware. Applications needing to enforce document type declarations do have options available in other programming languages. Notably, C/C++ packages are freely available, sometimes with SGML support.
This table provides an alphabetical quick reference to the results of the analysis for validating processors:
|Processor||Version||Summary|
|IBM XML4j||2.0.15 (August 30, 1999)||This has the problems of its non-validating sibling, and does not permit processing to continue after validity errors.|
|Microsoft MSXML in Java||JVM 5.00.3186 (August 24, 1999)||It's curious that this was bundled into Microsoft's Java VM without fixing its well-known conformance bugs. Avoid using it.|
|Oracle XML Parser||22.214.171.124 (August 11, 1999)||If this just permitted processing to continue after validity errors, it would be a top contender.|
|Sun "Java Project X"||TR2 (May 21, 1999)||No conformance violations detected.|
More detailed discussion of each processor is below, in alphabetical order, with links to the complete testing reports.
|Processor Name:||IBM XML4j|
|Version:||2.0.15 (August 30, 1999)|
|Size of JAR File:||722 KBytes (uncompressed)|
This is the validating version of IBM's processor. See the coverage of the non-validating processor for more details.
|Full Test Results:||report-xml4j-val.html|
|Raw Results:||Passed 902 (of 1065)|
|Adjusted Results:||Passed 832|
The same problems that show up in the non-validating processor also show up in the validating one ... in fact, the processor appears to be doing exactly the same thing in both cases! (I confess this discovery was quite a surprise to me; it may be that this version of the IBM processor is a regression from earlier releases in this respect.)
No validity errors were reported as such; all the invalid documents caused incorrect reports of fatal errors.
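To make the distinction concrete, here is a minimal sketch of how a SAX application can tell the two kinds of report apart. It assumes the JAXP `SAXParserFactory` and the SAX2 `DefaultHandler` rather than the SAX 1.0 interfaces used for these tests; the `ErrorClassifier` class and the sample documents are hypothetical illustrations, not part of the test harness. A conforming validating processor should deliver a validity error through `error()` and reserve `fatalError()` for well-formedness violations.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: counts how a parser classifies the problems it reports. */
class ErrorClassifier extends DefaultHandler {
    int validityErrors = 0;
    int fatalErrors = 0;

    @Override
    public void error(SAXParseException e) {
        // Recoverable errors, which is where validity errors belong.
        validityErrors++;
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        // Well-formedness violations; parsing cannot meaningfully continue.
        fatalErrors++;
        throw e;
    }

    static ErrorClassifier classify(String xml) {
        ErrorClassifier handler = new ErrorClassifier();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // ask for a validating parser
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXParseException e) {
            // Rethrown by fatalError() above, so it has already been counted.
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        // Well-formed but invalid: character data inside an EMPTY element.
        String invalid = "<!DOCTYPE doc [<!ELEMENT doc EMPTY>]><doc>text</doc>";
        // Not well-formed: mismatched end tag.
        String malformed = "<!DOCTYPE doc [<!ELEMENT doc (#PCDATA)>]><doc>text</dok>";

        ErrorClassifier a = classify(invalid);
        System.out.println("invalid:   errors=" + a.validityErrors + " fatal=" + a.fatalErrors);
        ErrorClassifier b = classify(malformed);
        System.out.println("malformed: errors=" + b.validityErrors + " fatal=" + b.fatalErrors);
    }
}
```

A processor behaving as described above would show a non-zero fatal count for the first document, where only validity errors are expected.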
|Processor Name:||Microsoft MSXML|
|Version:||JVM 5.00.3186 (August 24, 1999)|
|Size of JAR File:||N/A (bundled with JVM)|
Note that although this parser was originally called MSXML, Microsoft currently uses that term exclusively for its IE5 COM parser ("MSXML.DLL"). The more recent name for the Java parser is the "Microsoft XML Parser in Java". Please do not interpret these results as reflecting conformance for the C parser found in the Internet Explorer 5 web browser.
The MSXML package was originally intended to provide XML support for Internet Explorer 4 users. It was recently bundled with Microsoft's latest version (build 3186) of their Java Virtual Machine and SDK 3.2. A SAX driver is separately available. None of the standard programming interfaces (SAX, DOM) are bundled.
|Full Test Results:||report-msxml.html|
|Raw Results:||Passed 648 (of 1065)|
|Adjusted Results:||Passed 615|
This processor needs a separate SAX driver, since Microsoft has not yet offered support for the SAX API. The processor rejects a substantial number of documents that it should accept, producing fatal errors:
- The processor misreports validity errors as fatal errors.
- It ignores many validity errors.
- A wide range of common grammatical constructs are rejected, such as
  - character and entity references in attribute values
  - some processing instructions
  - attribute names such as "xml:space"
  - conditional sections
- Not all XML name characters are accepted; for example, some Japanese characters were rejected in names.
- In valid documents, some declarations were both ignored and reported as missing (validity violations, misreported as fatal errors). The processor may be inappropriately expecting a particular declaration ordering.
In addition, this processor has entered infinite loops when asked to parse some documents. This has been observed with UTF-16 input text (which sometimes produces less drastic errors) as well as with some numeric character references. Such errors are quite dangerous.
Many output tests failed; more than seems usual.
As for documents which should have been rejected but were in fact accepted, there were many of those also:
- Various illegal processing instructions were accepted, including the <?XML ...?> style ones, facilitating islands of Microsoft-only "pseudo-XML" which have for a long time been troublesome;
- Many characters disallowed by XML were accepted by this processor, such as many control codes and the 'escape' character. This includes references to such characters, even when they could not be represented in Unicode (even with surrogate pairs), and UTF-8 encodings of such unrepresentable characters;
- Characters that should have been disallowed in PUBLIC identifiers were allowed;
- SGML style comments were accepted;
- Text with embedded ']]>' was accepted;
- Constructs like <element att="1" att="2"/> were not rejected;
- Illegal DTD syntax was accepted.
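Several of the cases in this list are easy to spot-check against any SAX parser. The following is a rough sketch, assuming the JAXP `SAXParserFactory` API (not the harness used for these tests); the `RejectionCheck` class and its sample documents are hypothetical. Each document is not well-formed, so a conforming processor must report a fatal error for it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical spot check: does the parser reject documents that are not well-formed? */
class RejectionCheck {
    static boolean isRejected(String xml) {
        try {
            // DefaultHandler.fatalError() rethrows, so a fatal error aborts parse().
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return false; // accepted without complaint
        } catch (SAXParseException e) {
            return true;  // rejected with a fatal error
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String[] mustReject = {
            "<element att=\"1\" att=\"2\"/>",   // duplicate attribute
            "<doc>a ]]> b</doc>",               // ']]>' embedded in text
            "<?XML version=\"1.0\"?><doc/>",    // reserved processing instruction target
            "<doc>&#x1B;</doc>"                 // reference to the 'escape' character
        };
        for (String xml : mustReject) {
            System.out.println((isRejected(xml) ? "rejected: " : "ACCEPTED: ") + xml);
        }
    }
}
```

Any line printed as ACCEPTED would indicate the kind of conformance bug described in this list.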
Support for multiple text encodings seems weak; documents declared as being encoded in "UTF-16" were inappropriately rejected. Japanese encodings were neither rejected nor handled consistently.
This test harness shows that when a SAX ErrorHandler callback is used to report an exception, that exception will not be passed back to the application through the Parser.parse() call. This appears to be a driver issue with a simple fix.
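The expected behavior can be sketched as follows, assuming the JAXP `SAXParserFactory` and SAX2 `DefaultHandler` rather than the SAX 1.0 `Parser` interface and the Microsoft driver discussed here. The `ErrorPropagation` class and its marker string are hypothetical. The handler escalates a validity error by throwing from `error()`, and the check confirms that the same failure surfaces from `parse()`.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical check: does an exception thrown by the ErrorHandler reach the caller? */
class ErrorPropagation {
    static final String MARKER = "validity error escalated";

    static boolean reachesCaller(String xml) {
        DefaultHandler strict = new DefaultHandler() {
            @Override
            public void error(SAXParseException e) throws SAXException {
                // The application decides that validity errors are not acceptable.
                throw new SAXException(MARKER + ": " + e.getMessage(), e);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), strict);
            return false; // parse() returned normally: nothing was thrown, or it was swallowed
        } catch (SAXException e) {
            // Only count the exception this handler raised.
            return e.getMessage() != null && e.getMessage().startsWith(MARKER);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String invalid = "<!DOCTYPE doc [<!ELEMENT doc EMPTY>]><doc>text</doc>";
        System.out.println("exception reached caller: " + reachesCaller(invalid));
    }
}
```

With a driver that has the problem described above, `reachesCaller` would return false for an invalid document even though the handler threw.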
|Processor Name:||Oracle XML Parser|
|Version:||126.96.36.199 (August 11, 1999)|
|Size of JAR File:||556 KBytes (uncompressed)|
This is the validating mode of Oracle's new processor. See the coverage of that non-validating processor.
|Full Test Results:||report-oracle-val.html|
|Raw Results:||Passed 871 (of 1065)|
|Adjusted Results:||Passed 871|
This processor failed to accept a few valid documents, and it had some difficulty handling NMTOKENS attribute lists and a mixed content specification.
This shared some of the problems that its non-validating sibling had with reading UTF-16 and multibyte UTF-8 characters. Similarly, it also had problems with names which actually tried to exercise the variety of name characters permitted by the XML specifications.
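A quick way to exercise both problem areas at once is to parse a small document whose element name uses non-ASCII name characters, encoded once as UTF-16 and once as UTF-8. This is only a sketch under assumptions: it uses the JAXP `SAXParserFactory` API, and the `EncodingCheck` class and the sample name are hypothetical, not drawn from the test suite.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical check: are encodings and non-ASCII names read back correctly? */
class EncodingCheck {
    /** Parses the encoded document and returns the name of its root element. */
    static String rootName(byte[] document) {
        final String[] root = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (root[0] == null) {
                    root[0] = qName; // remember only the first (root) element
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(document), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return root[0];
    }

    public static void main(String[] args) {
        String name = "\u65e5\u672c\u8a9e"; // three ideographic name characters
        String utf16 = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><" + name + "/>";
        String utf8 = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><" + name + "/>";

        // Java's UTF-16 charset writes a byte order mark, as XML requires for UTF-16.
        System.out.println(name.equals(rootName(utf16.getBytes(StandardCharsets.UTF_16))));
        System.out.println(name.equals(rootName(utf8.getBytes(StandardCharsets.UTF_8))));
    }
}
```

A processor with the problems noted here would either throw on such input or return a different name.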
The output was much more correct than the output from its non-validating sibling. That's a bit puzzling, but it does suggest that the core engine needs only minor tweaks to make sure they're both equally correct.
Also of note is the fact that this validating processor accepted none of the SGML-isms that the non-validating one allowed in its input DTD syntax. Again, the non-validating processor should be acting more like the validating one.
Other than rejecting documents it shouldn't, the problems with this processor mostly related to validity violations that were not reported at all, including:
- Improper nesting of PEs in declarations (neither of the nesting constraints seems to be tested);
- Documents improperly declared as standalone;
- Several cases of content model violations (such as for an incorrect number of elements, and some cases of data inside an EMPTY element);
- Illegal attribute value defaults were accepted for ENTITY/ENTITIES attributes;
- A reference to an undeclared notation was ignored.
Internal errors were reported in various cases when validating illegal documents. These included array bounds exceptions (e.g. with NMTOKENS attributes) and null pointer exceptions working with IDREF/IDREFS values.
Significantly, none of the validity errors were continuable; some were correctly reported as non-fatal through SAX, but then the processor refused to continue processing. (As noted above, some were incorrectly reported as fatal errors in the first place.)
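What "continuable" means in practice can be sketched like this, assuming the JAXP `SAXParserFactory` and SAX2 `DefaultHandler` instead of the SAX 1.0 interfaces used in these tests; the `ContinuingValidator` class and its sample document are hypothetical. The handler records each validity error and simply returns, which tells the processor to keep going, so one pass can collect every validity error and still reach the end of the document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects validity errors and lets parsing continue. */
class ContinuingValidator extends DefaultHandler {
    final List<String> validityErrors = new ArrayList<>();
    boolean reachedEnd = false;

    @Override
    public void error(SAXParseException e) {
        // Returning normally (rather than throwing) asks the parser to continue.
        validityErrors.add("line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void endDocument() {
        reachedEnd = true;
    }

    static ContinuingValidator validate(String xml) {
        ContinuingValidator handler = new ContinuingValidator();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        // Both <item> elements violate their EMPTY content model.
        String xml = "<!DOCTYPE doc [<!ELEMENT doc (item)*><!ELEMENT item EMPTY>]>"
                + "<doc><item>one</item><item>two</item></doc>";
        ContinuingValidator result = validate(xml);
        for (String message : result.validityErrors) {
            System.out.println(message);
        }
        System.out.println("reached end of document: " + result.reachedEnd);
    }
}
```

A processor that refuses to continue would stop after the first report, leaving the second violation undiscovered and `reachedEnd` false.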
|Processor Name:||Sun "Java Project X"|
|Version:||TR2 (May 21, 1999)|
|Size of JAR File:||132 KBytes (or 246 KBytes uncompressed)|
This is the validating mode of the processor whose non-validating mode is presented elsewhere.
|Full Test Results:||report-sun-val.html|
|Raw Results:||Passed 1065 (of 1065)|
|Adjusted Results:||Passed 1065|
This processor reported no conformance errors. That was a design goal of the processor.
I analysed the negative results when I worked at Sun, and believe that every diagnostic reports the correct error. (This is the only parser that I can report I have carefully examined for that issue.)