Microsoft XML Parser Conformance
Last September, David Brownell conducted a review of XML parsers for XML.com, testing them for conformance to the XML 1.0 specification. In this follow-up article, he tests Microsoft's MSXML.DLL parser, as found in Internet Explorer 5. Unlike the previously tested parsers, the Microsoft parser does not provide the SAX interface used in the earlier testing procedure. As a result of collaboration with Microsoft, the author constructed a JavaScript DOM-based test harness. The results of the tests gave the Microsoft parser a "pretty good" rating, in the top 25% for conformance. They did, however, reveal a serious flaw with DTD handling and validation, for which Brownell presents a workaround.
In my earlier conformance review for Java XML Processors, I evaluated a dozen XML processors written in Java and using the SAX API. Feedback I got from that article was generally positive, and several readers suggested I provide a corresponding evaluation of another widely available XML processor: the Microsoft XML Parser (MSXML.DLL), which is the one bundled with the Internet Explorer 5 web browser. This article provides such an evaluation.
Some readers were also confused about Microsoft's Java XML processor, called "MSXML" in that earlier review. Briefly, Microsoft has had several implementations of XML processor technology. While today one tends to only hear about the latest version of such technologies, they have all been called "MSXML," or "MS XML," in common usage, by numerous people, including some Microsoft staff. Since the Java processor hasn't been updated in well over a year, some confusion seems inevitable. The Java processor was formally called the Microsoft XML Parser for Java. I hope that helps to clarify the distinctions between the various packages; the details of the two reviews should also help.
The version of the Microsoft XML (MSXML) processor reviewed here is the one that has been bundled with Microsoft's Internet Explorer 5.0 web browser. It can be accessed as "MSXML.DLL," and can be redistributed with other software, as part of Win32 applications. Since it provides a COM API, it can be used from JavaScript, C/C++, Visual Basic, and other COM-aware programming languages. It can even be used from Java, but for most Java developers, that support is not particularly useful since it requires using Microsoft's JVM, and does not support the standard SAX or W3C DOM APIs (org.w3c.dom.*).
I encourage you to read my earlier article for more background on testing XML conformance. Briefly, there are several kinds of tests, which are supported by test cases (not yet in final form) collected and organized by a joint OASIS/NIST working group. These tests need to be run through a test harness using some particular API to access the XML processor under test. The earlier review used SAX as that API, but that would not work for the MSXML.DLL processor, so a new harness was needed. The harness produces a testing report. This article includes the raw test reports, which are in an HTML format that should be easy to use.
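As a rough illustration of what such a harness does, the sketch below tallies pass and fail counts for a list of test cases. It is not the harness distributed with this article: the test-type names (`valid`, `invalid`, `not-wf`), the record shape, and the `parse` callback are assumptions made for the example.

```javascript
// Hypothetical sketch of a conformance tally; not the harness from this article.
// Each test case is { uri, type }, where type is "valid", "invalid" or "not-wf".
// `parse(uri)` is an assumed callback that returns true if the processor
// accepted the document and false if it reported an error.
function shouldAccept(type, validating) {
    if (type == "not-wf") return false;         // must always be rejected
    if (type == "invalid") return !validating;  // rejected only when validating
    return true;                                // "valid" documents must be accepted
}

function runTests(cases, parse, validating) {
    var report = { passed: 0, failed: [] };
    for (var i = 0; i < cases.length; i++) {
        var accepted = parse(cases[i].uri);
        if (accepted == shouldAccept(cases[i].type, validating)) {
            report.passed++;
        } else {
            report.failed.push(cases[i].uri);
        }
    }
    return report;
}
```

A real harness also has to compare the data reported for accepted documents and inspect the diagnostics for rejected ones, which this sketch leaves out.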
I was pleased to receive queries from Chris Lovett, a Program Manager in the XML Group at Microsoft, about those test cases. After some email back and forth, I had a basic JScript test harness in my mailbox, which was good, since I usually stick to Java, and it's always a lot easier to improve something that already works! That version has been substantially enhanced, and you can see the reports it now generates in the review below, or run the tests yourself and see what turns up on your own system.
As before, that test harness is provided here as an Open Source tool for general use. In this case, I've put it under the GNU General Public License. I hope the various DOM portability issues will get resolved so that the same code can be used with the XML processors in Mozilla (in some beta version soon) and in Internet Explorer.
Also as before, I'd like to emphasize that these reports are in no way official. They don't represent anyone's opinion but my own.
You may recall comments in the earlier review about problems using DOM as a standard XML processor API. Those still hold true. This harness had to use Microsoft-proprietary APIs to acquire a DOM Document object, to populate it with the contents of an XML file, and to detect and report parsing errors. I still remain hopeful that those issues, shared by all bindings of DOM, can be fixed in some upcoming version of the DOM API so that applications using DOM can use any vendor's implementation, in the same way that SAX currently provides an OS-independent API.
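To make the three proprietary steps concrete, here is a minimal sketch of acquiring a document, loading it, and checking for a parse error. The `createDocument` parameter is a hypothetical factory introduced for this example; under Internet Explorer 5 or Windows Script Host it would be expected to return `new ActiveXObject("Microsoft.XMLDOM")`. The `async`, `load`, and `parseError` members are the MSXML extensions as I understand them, not part of the W3C DOM.

```javascript
// Sketch only: load a document through an MSXML-style DOM and report the result.
// `createDocument` is a hypothetical factory; with MSXML it would be
//     function () { return new ActiveXObject("Microsoft.XMLDOM"); }
function loadAndReport(createDocument, uri) {
    var doc = createDocument();   // proprietary step 1: acquire a Document
    doc.async = false;            // make load() return only after parsing ends
    doc.load(uri);                // proprietary step 2: populate it from a URI
    var err = doc.parseError;     // proprietary step 3: detect parsing errors
    if (err.errorCode != 0) {
        return { ok: false, message: err.reason, line: err.line, column: err.linepos };
    }
    return { ok: true, root: doc.documentElement };
}
```

Passing the factory in as a parameter keeps the vendor-specific call in one place, which is roughly what a portable DOM bootstrap API would have to standardize.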
In order to ensure that these results can be accurately compared against those in the earlier review, I did two things:
Note that the source code distributed with the earlier review describes how the July version of that test database needed to be patched.
This table provides a quick reference to the results of the testing:
| Processor Name and Version | Passed Tests | Rating (Out of 5) | Summary |
| MSXML.DLL (non-validating) 5.00.2314.1000 | 931 | 4 | Overall this processor is above average, though some of its problems have a broad impact. In addition to a variety of problems which should be readily fixed, it (wrongly) tests validity constraints in many cases. |
| MSXML.DLL (default mode) 5.00.2314.1000 | 895 | 3 | Since it accepts documents as "valid" that don't even have a DTD, all applications need to apply a workaround. |
More detailed analysis of each processor mode can be found in the following sections, with links to the complete testing reports.
| Processor Name: | MSXML.DLL (non-validating) |
| Version: | 5.00.2314.1000 |
| Type: | Non-Validating |
| DOM Bundled: | Yes |
| Size: | 490 KB |
| Download From: | http://www.microsoft.com/xml/ |
This is the processor which is bundled with the Internet Explorer 5 Web browser. As a COM component, it may be used from JavaScript, C/C++, Visual Basic, and other programming languages. The processor is only accessible through an extended DOM API; JavaScript programmers have access to most of the W3C DOM Level 1 functionality.
| Rating: | 4 (out of 5) |
| Full Test Results: | msxml-nv.html |
| Raw Results: | Passed 931 (of 1067) |
| Adjusted Results: | Passed 931 |
Most of the time I found the diagnostics to be quite comprehensible; this is valuable to anyone trying to use them. I probably looked at about half the negative test results, and while I found a misleading diagnostic, I didn't notice any indications of significant problems there. I'll be optimistic and assume that the other half of those diagnostics check out as well, so that the raw score is accurate.
There are cases where this processor is rejecting documents which it should clearly be accepting. The processor:
When the MSXML.DLL processor accepts documents, it isn't always reporting the correct information to applications. Such problems can in some cases be quite significant:
Although they were not reported by this test suite, and do not show up in the statistics above, I will mention two other known problems with this processor, since they prevented this processor from working with XML documents I happen to have found "in the wild," on the Web.
In summary, most of the problems of the non-validating mode parser are revealed in these positive tests, and involve either reporting the wrong data (usually whitespace issues) or inappropriately performing certain validity checks. However, that evaluation is "by volume, not weight," and some of the other issues may need some attention in your system designs.
There weren't many obvious failures here:
Accepting illegal characters is likely to cause the most interoperability problems of those failures.
There are cases where the MSXML.DLL processor raises issues that the OASIS/NIST tests should address, in some cases by changing the tests:
It is interesting that the first issue above, regarding the constraint on unused general entities to be well-formed, may be coupled to the use of DOM as the processor API in this case. DOM permits, but thankfully does not require, much information to be exposed. Many DOM implementations use that flexibility to avoid exposing the contents of entities, among other facilities. Only DOM implementations, or similar APIs, that expose such contents appear to get any benefit from having such a well-formedness constraint.
Some of the DOM operations used to turn the MSXML.DLL processor's DOM output into something that could be examined for correctness had an unanticipated side effect. They identified problems in the DOM implementation that had been hooked up to the underlying processor. These need to be worked around, otherwise exceptions, reflecting internal errors of some kind, are thrown by some DOM operations:
In addition, I noticed that in this DOM, the SYSTEM identifiers found in Entity nodes are not resolved. Several other DOM implementations provide such IDs in fully resolved form, making less work for applications that need to use such URIs. The DOM specification should probably make both available because neither approach can address all problems.
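When a DOM hands back an unresolved SYSTEM identifier, the application has to resolve it against the base URI itself. The sketch below shows one way to do that, assuming a current JavaScript runtime that provides the standard `URL` class (which the JScript engine in Internet Explorer 5 does not); `resolveSystemId` and its parameter names are hypothetical.

```javascript
// Sketch: resolve an Entity node's SYSTEM identifier against a base URI.
// Assumes the standard URL class is available (current browsers, Node.js).
// `baseUri` would typically be the URI of the document or external entity
// in which the identifier was declared.
function resolveSystemId(systemId, baseUri) {
    return new URL(systemId, baseUri).href;
}
```

An identifier that is already absolute is returned unchanged, so the function can be applied to every Entity node without first checking its form.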
The online MSDN documentation for this DOM was incorrect when I looked at it, though I understand that will be fixed. The reason is worrisome: when looking at this documentation with Netscape Communicator, I was served pages which didn't list a number of important standard methods for the NamedNodeList objects (such as the item method). I'm told that if it's read using Internet Explorer 5, and with use of ActiveX controls enabled, the content is correct. Since I disable use of ActiveX controls because of their security problems, accurate system documentation was unavailable to me.
As noted earlier, DOM still needs some work before it can truly be an implementation-independent API. This includes having ways to hook a DOM up to an XML processor (parsing document text into a DOM tree), and setting options for validation, whitespace handling, and the use of various types of nodes in the resulting tree.
| Processor Name: | MSXML.DLL (default mode) |
| Version: | 5.00.2314.1000 |
| Type: | Validating |
| DOM Bundled: | Yes |
| Size: | 490 KB |
| Download From: | http://www.microsoft.com/xml/ |
This is the validating mode of the parser which is bundled with the Internet Explorer 5 web browser. See the coverage of the non-validating mode for basic information.
| Rating: | 3 (out of 5) |
| Full Test Results: | msxml-val.html |
| Raw Results: | Passed 895 (of 1067) |
| Adjusted Results: | Passed 895 |
Unlike the situation with some other "dual mode" parsers, the MSXML.DLL processor does not do a complete personality switch, so this description builds heavily on the coverage of the non-validating mode, focusing only on what changes when validation is enabled.
This worked basically like the non-validating mode, with the only new problem being that the parser complained when given certain entity expansions: it didn't use the elements found in those entities when checking whether the content model for the parent element was satisfied.
The parser called into question one additional pair of test cases. Specifically, it rejected a CDATA usage which has recently been deemed illegal. Presumably, after this erratum to the XML specification is published, these test cases will be recategorized.
Parser output was like that for the invalid documents; notably, it doesn't report whitespace or normalize attributes correctly.
One basic issue to note here is that because of its API, this parser is structurally prevented from continuing after reporting a validity error. The API only allows reporting fatal errors. This may not affect conformance (the "at user option" requirement in the XML specification does not seem to require that the option should affect only one error at a time), but it does constrain the use of this API for detecting and correcting multiple validity errors.
The following validity errors were not detected:
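To make XML validation work correctly, your code to load an XML document should always look something like this (intended to work correctly even if you're not validating):
// assumed setup: create the MSXML document and parse synchronously,
// so that parseError is filled in by the time load() returns
var doc = new ActiveXObject("Microsoft.XMLDOM");
doc.async = false;
doc.load(uri);
if (doc.validateOnParse
        && doc.parseError.errorCode == 0
        && doc.doctype == null) {
    // no DTD was found, so validity errors went unreported;
    // treat the document as invalid
} else if (doc.parseError.errorCode != 0) {
    // error reported in parseError object
}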
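The same check can be wrapped in a small function so that every caller applies the workaround in the same way. This is only a sketch: `classifyLoad` and the strings it returns are names invented for the example, and it assumes `doc` has already been loaded synchronously.

```javascript
// Sketch: classify the outcome of doc.load() using the check from the text.
// `doc` is any object exposing the validateOnParse, parseError and doctype
// properties used above; the returned strings are illustrative names.
function classifyLoad(doc) {
    if (doc.parseError.errorCode != 0) {
        return "error";           // details are in doc.parseError
    }
    if (doc.validateOnParse && doc.doctype == null) {
        return "not-validated";   // no DTD, so nothing was actually checked
    }
    return "ok";
}
```

A caller that requires valid documents would reject anything other than `"ok"`; a caller that has turned validation off never sees `"not-validated"`.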
I would expect validation to work exactly as defined in the XML 1.0 specification. Validation using any of the various schema systems now available (or being developed) is a separate issue, and merits separate APIs.
The non-validating mode of the MSXML.DLL processor, with whitespace handling set appropriately, is relatively conformant, although not without its problems. Certain familiar errors are also seen in this processor:
Both processor modes are in the top quartile of the ones tested in the earlier review, but are not the top rated ones. That gets this processor a "pretty good" rating in my book. Although I'm bothered by the validating mode needing an application level workaround, if you apply it, you'll find that nearly another fifty test cases will behave.
As more XML processors approach meaningful levels of conformance, it will be increasingly important to understand exactly which conformance errors show up in a given parser. The raw "passed tests" statistic, used to assign stars in this evaluation and the previous one, will always miss some important information. That's why I've tried, in both this review and the earlier one, to give a lot of analysis for the failure modes of the processors that have the best "passed tests" statistic. Since developers have many choices for their XML processors, it's important that those choices be well informed ones.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.