MSXML.DLL: Non-validating mode
November 17, 1999
Processor Name: MSXML.DLL (non-validating)
Full Test Results: msxml-nv.html
Raw Results: Passed 931 (of 1067)
Adjusted Results: Passed 931
Most of the time I found the diagnostics to be quite comprehensible; this is valuable to anyone trying to use them. I probably looked at about half the negative test results, and while I found a misleading diagnostic, I didn't notice any indications of significant problems there. I'll be optimistic and assume that the other half of those diagnostics check out as well, so that the raw score is accurate.
Problems Encountered Processing Legal Documents
There are cases where this processor is rejecting documents which it should clearly be accepting. The processor:
- Doesn't accept some XML 1.0 names using non-ASCII characters.
- Treats many validity errors as if they were well-formedness errors, by reporting them as fatal errors. Since the processor was told not to report such errors, it should ignore them rather than report them.
- The Notation Declared, No Duplicate Types, Unique Element Type, One ID Per Element, ID Attribute Default, Notation Attributes (at least one subclause), and Attribute Default Legal Validity Constraints (VCs) are treated like Well Formedness Constraints (WFCs).
- The Entity Declared VC is similarly treated. See my earlier review; this area of the specification is problematic, and as an implementer I have a hard time blaming anything except the spec for this problem. It would be better to have only the WFC, both for users and for implementers.
- The Proper Declaration/PE Nesting VC is another entry in this category. Again, see my earlier review, for much the same conclusion: it'd be better if this were a WFC, or if there were no constraint here at all.
- Treats conditional sections as if they were individual markup declarations for the purposes of testing parameter entity nesting. This is clearly contrary to the specification, even if it were appropriate to report violations of validity constraints when validation was not requested (see the previous issue).
- Doesn't know how to map character references to surrogate pairs, when that's needed.
- Expands PEs incorrectly inside internal entity declaration literals. (In that case they should not be padded with spaces.)
- Rejects documents conforming to the XML 1.0 specification that use colons in ways the XML namespaces specification does not permit. This is not optional; there is no "XML 1.0 mode."
- Rejects redefinition of the built-in entity lt ("<") using the exact declaration given as an example in the XML spec.
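On the surrogate-pair issue above: mapping a character reference beyond the Basic Multilingual Plane onto the two UTF-16 code units an API like DOM must expose is a simple, fully specified calculation. A sketch in Python (purely illustrative; the actual tests were not driven from Python):

```python
def to_surrogate_pair(code_point):
    """Map a supplementary-plane code point (U+10000..U+10FFFF) to its
    UTF-16 surrogate pair, per the standard Unicode algorithm."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # leading (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # trailing (low) surrogate
    return high, low

# A character reference such as &#x10400; should surface to the
# application as the pair D801 DC00.
```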
When the MSXML.DLL processor accepts documents, it isn't always reporting the correct information to applications. Such problems can in some cases be quite significant:
- Attribute values were not normalized according to the XML specification.
- Whitespace was not handled correctly, even when the processor was configured to preserve whitespace. By default this DOM acts as if it were an application applying xml:space='default' handling, rather than as if it were an XML processor.
- PUBLIC identifiers are not normalized according to the XML specification.
- With multiple declarations of an attribute, only the first one is supposed to matter; but the others showed up in the output.
- External entities with just a single character cause some tests to fail with normalization errors.
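For readers unfamiliar with the normalization rules involved in the first and third items of that list, here is a sketch in Python (illustrative only, and assuming entity and character references have already been expanded): attribute values have their whitespace characters replaced by spaces, with non-CDATA values further trimmed and collapsed, and public identifiers have whitespace runs collapsed to single spaces.

```python
import re

def normalize_attribute(value, cdata=True):
    """Sketch of XML 1.0 section 3.3.3 attribute-value normalization.
    Whitespace characters become spaces; for tokenized (non-CDATA)
    types, leading/trailing spaces are removed and runs collapse."""
    value = re.sub(r'[\t\n\r]', ' ', value)
    if not cdata:
        value = re.sub(r' +', ' ', value).strip(' ')
    return value

def normalize_public_id(pubid):
    """Public identifiers: strip leading/trailing whitespace and
    collapse internal whitespace runs to single spaces."""
    return re.sub(r'[ \t\r\n]+', ' ', pubid).strip(' ')
```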
Although they were not reported by this test suite, and do not show up in the statistics above, I will mention two other known problems with this processor, since they prevented this processor from working with XML documents I happen to have found "in the wild," on the Web.
- The processor adds a constraint that is found neither in the XML 1.0 specification nor in the XML Namespaces specification: Namespace declarations placed in a DTD are required to be declared as #FIXED.
- Most recently I happened across a document which used a reference to a Unicode character that was inappropriately rejected: U+FFFD. (U+FFFC was also rejected when I tried that one, suggesting that it wasn't just a ">" vs ">=" coding error.) In this case, it was easy enough to fix since it was defined in a DTD that I could change. However this will not always be the case.
In summary, most of the problems of the non-validating mode parser are revealed in these positive tests, and involve either reporting the wrong data (usually whitespace issues) or inappropriately performing certain validity checks. However, that evaluation is "by volume, not weight," and some of the other issues may need some attention in your system designs.
Problems Encountered Processing Malformed Documents
There weren't many obvious failures here:
- Accepts various illegal characters, such as control characters (including escape) in the 0x00 to 0x1F range. They are accepted both as literals and as character references, though in some cases the literals are rejected (as they always should be).
- PUBLIC ids with some illegal characters are accepted.
- Whitespace before an XML declaration is permitted.
- Permits illegal text declarations that are missing the mandatory encoding="..." pseudo-attribute.
- Unpaired Unicode surrogate characters are accepted, both as literals and as character references.
Accepting illegal characters is likely to cause the most interoperability problems of those failures.
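The Char production that governs these failures, and also the U+FFFD rejection noted earlier, is small enough to state in full. A sketch in Python (illustrative; code points, not encoded bytes):

```python
def is_xml_char(cp):
    """XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF]
    | [#xE000-#xFFFD] | [#x10000-#x10FFFF].  Other control
    characters and unpaired surrogates (D800-DFFF) are excluded,
    while U+FFFC and U+FFFD are perfectly legal."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)
```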
Problematic Test Cases
There are cases where the MSXML.DLL processor raises issues that the OASIS/NIST tests should address, in some cases by changing the tests:
- MSXML.DLL is unique among all XML processors I've seen in that it demands that general entities which are never used be well formed. One way to look at this is that it is reporting potential well-formedness errors, not actual errors. On the other hand, the XML specification does not distinguish between entities that are used and those that are not, so it is easily argued that the tests that expect these not to be reported are themselves in error. I confess to feeling this is a case where the XML specification needs clarification, particularly since I've seen no other processor that takes this interpretation.
- Uses the model of names, and name tokens, found in the XML Namespaces specification, rather than the XML 1.0 model. Conformance to the Namespaces specification is not defined in a way that a processor can be tested against, but such tests are desirable.
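The difference between the two name models is easy to see with colons. A sketch in Python, deliberately simplified to ASCII name characters (the real productions admit a far larger repertoire): an XML 1.0 Name may contain colons anywhere, while a Namespaces QName allows at most one, separating a non-empty prefix from a non-empty local part.

```python
import re

# Simplified ASCII approximations of the relevant productions.
_NAME = re.compile(r'[A-Za-z_:][A-Za-z0-9._:-]*$')     # XML 1.0 Name
_NCNAME = re.compile(r'[A-Za-z_][A-Za-z0-9._-]*$')     # Namespaces NCName

def is_name(s):
    """XML 1.0 Name: colons are ordinary name characters."""
    return bool(_NAME.match(s))

def is_qname(s):
    """Namespaces QName: at most one colon, splitting two NCNames."""
    parts = s.split(':')
    return (len(parts) <= 2
            and all(_NCNAME.match(p) for p in parts))
```

So a document using the element name a:b:c is legal XML 1.0 but would be rejected by a processor enforcing the Namespaces model, which is exactly the behavior at issue here.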
It is interesting that the first issue above, regarding the constraint on unused general entities to be well-formed, may be coupled to the use of DOM as the processor API in this case. DOM permits, but thankfully does not require, much information to be exposed. Many DOM implementations use that flexibility to avoid exposing the contents of entities, among other things. Only DOM implementations, or similar APIs, that expose such contents appear to get any benefit from having such a well-formedness constraint.
Some of the DOM operations used to turn the MSXML.DLL processor's DOM output into something that could be examined for correctness had an unanticipated side effect: they identified problems in the DOM implementation that had been hooked up to the underlying processor. These need to be worked around; otherwise some DOM operations throw exceptions reflecting internal errors of some kind:
- The DocumentType node has children, which it should not. These children must be explicitly ignored for many operations. This may be the reason that Document.getElementsByTagName returned some elements more than once when they came from external entities.
- In some cases the Element.normalize method throws an exception. This seems coupled to external entities with just one character, marking a line end.
- Text declarations (<?xml encoding='...'?>) at the beginning of external entities are exposed as if they were processing instructions. (Regardless of partial syntactic similarities, the XML spec is quite explicit that processing instructions do not use the name 'xml.') These need to be explicitly removed or ignored in certain cases.
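The workarounds for the first and third items amount to a filter applied while walking the tree. A sketch in Python against hypothetical stub nodes (the real harness, of course, talks to the COM DOM): skip DocumentType children entirely, and drop "processing instructions" whose target is 'xml', since those are really text declarations leaking through the API.

```python
# DOM node-type codes, per the DOM Level 1 specification.
DOCUMENT_TYPE_NODE = 10
PROCESSING_INSTRUCTION_NODE = 7

def visible_children(node):
    """Children of a node, with the two quirks filtered out."""
    if node.nodeType == DOCUMENT_TYPE_NODE:
        return []   # a DocumentType should expose no children at all
    return [child for child in node.childNodes
            if not (child.nodeType == PROCESSING_INSTRUCTION_NODE
                    and child.target == 'xml')]
```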
In addition, I noticed that in this DOM, the SYSTEM identifiers found in Entity nodes are not resolved. Several other DOM implementations provide such IDs in fully resolved form, making less work for applications that need to use such URIs. The DOM specification should probably make both available because neither approach can address all problems.
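When only the unresolved form is available, an application must do the resolution itself, against the base URI of the entity declaration. A sketch in Python (the URIs here are invented for illustration):

```python
from urllib.parse import urljoin

def resolve_system_id(base_uri, system_id):
    """Resolve a relative SYSTEM identifier against the base URI of
    the declaration it appeared in, as a processor would."""
    return urljoin(base_uri, system_id)

# e.g. a SYSTEM id "chapters/ch1.xml" declared in a DTD fetched from
# http://example.com/dtds/doc.dtd resolves within that directory.
```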
The online MSDN documentation for this DOM was incorrect when I looked at it, though I understand that will be fixed. The reason is worrisome: when looking at this documentation with Netscape Communicator, I was served pages which didn't list a number of important standard methods for the NodeList objects (such as the item method). I'm told that if it's read using Internet Explorer 5, and with use of ActiveX controls enabled, the content is correct. Since I disable use of ActiveX controls because of their security problems, accurate system documentation was unavailable to me.
As noted earlier, DOM still needs some work before it can truly be an implementation-independent API. This includes having ways to hook a DOM up to an XML processor (parsing document text into a DOM tree), and setting options for validation, whitespace handling, and use of various types of nodes in the resulting tree.