MSXML.DLL (non-validating)
| Processor Name: | MSXML.DLL (non-validating) |
| Version: | 5.00.2314.1000 |
| Type: | Non-Validating |
| DOM Bundled: | Yes |
| Size: | 490 KB |
| Download From: | http://www.microsoft.com/xml/ |
This is the processor which is bundled with the Internet Explorer 5 Web
browser. As a COM component, it may be used from JavaScript,
C/C++, Visual Basic, and other programming languages. The processor is
only accessible through an extended DOM API; JavaScript programmers
have access to most of the W3C DOM Level 1 functionality.
| Rating: |     |
| Full Test Results: | msxml-nv.html |
| Raw Results: | Passed 931 (of 1067) |
| Adjusted Results: | Passed 931 |
Most of the time I found the diagnostics to be quite comprehensible;
this is valuable to anyone trying to use them. I probably looked at
about half the negative test results, and while I found a misleading
diagnostic, I didn't notice any indications of significant problems
there. I'll be optimistic and assume that the other half of those
diagnostics check out as well, so that the raw score is accurate.
Problems Encountered Processing Legal Documents
There are cases where this processor is rejecting documents which it
should clearly be accepting. The processor:
- Doesn't accept some XML 1.0 names using non-ASCII characters.
- Treats many validity errors as if they were well-formedness errors,
by reporting them as fatal errors. Since the processor was told not to
report such errors, it should ignore them rather than report them.
  - The Notation Declared, No Duplicate Types,
    Unique Element Type, One ID Per Element,
    ID Attribute Default, Notation Attributes (at
    least one subclause), and Attribute Default Legal
    Validity Constraints (VCs) are treated like Well Formedness
    Constraints (WFCs).
  - The Entity Declared VC is similarly treated.
    See my earlier review; this area of the specification is
    problematic, and as an implementer I have a hard time blaming
    anything except the spec for this problem. It would be better
    to have only the WFC, both for users and for implementers.
  - The Proper Declaration/PE Nesting VC is another
    entry in this category. Again, see my earlier review, for
    much the same conclusion: it'd be better if this were a WFC,
    or if there were no constraint here at all.
- Treats conditional sections as if they were individual markup
declarations for the purposes of testing parameter entity nesting.
This is clearly contrary to the specification, and would be so even
if it were appropriate to report violations of validity constraints
when validation was not requested (see previous issue).
- Doesn't know how to map character references to surrogate
pairs, when that's needed.
- Expands PEs incorrectly inside internal entity declaration
literals. (In that case they should not be padded with spaces.)
- Rejects documents conforming to the XML 1.0 specification
that use colons in ways the XML namespaces specification does
not permit. This is not optional; there is no "XML 1.0 mode."
- Rejects redefinition of the built-in entity &lt; using
the exact declaration given as an example in the XML spec.
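The surrogate-pair mapping mentioned in the list above is simple arithmetic. As a minimal sketch, assuming a platform that stores text as UTF-16 code units, a character reference above U+FFFF could be mapped as follows. The function name `to_surrogate_pair` is purely illustrative and is not part of any processor's API.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low
```

A reference such as `&#x10000;` would then be delivered to the application as the two code units 0xD800 and 0xDC00.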
When the MSXML.DLL processor accepts documents, it isn't always
reporting the correct information to applications. Such problems
can in some cases be quite significant:
- Attribute values were not normalized according to the XML
specification.
- Whitespace was not handled correctly, even when the processor was
configured to preserve whitespace. By default this DOM acts as if
it were an application applying xml:space='default' handling,
rather than as if it were an XML processor.
- PUBLIC identifiers are not normalized according to the XML
specification.
- With multiple declarations of an attribute, only the first one
is supposed to matter; but the others showed up in the output.
- External entities with just a single character cause some
tests to fail with normalization errors.
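To make the attribute value normalization rule concrete, here is a small sketch using Python's standard `xml.etree.ElementTree` module (chosen only for illustration; it is not the processor under review). With no DTD, attributes are treated as CDATA: each literal tab or line break in the value becomes a single space, while a character reference to a line break is passed through unchanged.

```python
import xml.etree.ElementTree as ET

# "literal" holds a real tab and a real newline inside the attribute value;
# "referenced" holds a newline written as the character reference &#10;.
doc = ET.fromstring('<doc literal="a\tb\nc" referenced="a&#10;b"/>')

literal_value = doc.get("literal")        # whitespace replaced by spaces
referenced_value = doc.get("referenced")  # referenced newline preserved
```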
Although they were not reported by this test suite, and do
not show up in the statistics above, I will mention two other known problems
with this processor, since they prevented this processor from working with XML
documents I happen to have found "in the wild," on the Web.
- The processor adds a constraint that is found neither in the
XML 1.0 specification nor in the XML Namespaces specification:
Namespace declarations placed in a DTD are required to be declared
as #FIXED.
- Most recently I happened across a document which used a
reference to a Unicode character that was inappropriately
rejected: U+FFFD. (U+FFFC was also rejected when I tried that one,
suggesting that it wasn't just a ">" vs ">=" coding error.)
In this case, it was easy enough to fix since it was defined in
a DTD that I could change. However, this will not always be the case.
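For reference, the set of legal characters is given by the `Char` production in the XML 1.0 specification. The following sketch (the helper name `is_xml_char` is my own) transcribes that production directly; under it both U+FFFD and U+FFFC are legal, while U+FFFE and U+FFFF are not.

```python
def is_xml_char(code_point):
    """Return True if code_point matches the XML 1.0 'Char' production."""
    return (
        code_point in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= code_point <= 0xD7FF          # excludes the surrogate block
        or 0xE000 <= code_point <= 0xFFFD        # excludes U+FFFE and U+FFFF
        or 0x10000 <= code_point <= 0x10FFFF     # supplementary planes
    )
```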
In summary, most of the problems of the
non-validating mode parser are revealed in these positive tests,
and involve either reporting the wrong data (usually whitespace
issues) or inappropriately performing certain validity checks.
However, that evaluation is "by volume, not weight," and some of
the other issues may need some attention in your system
designs.
Problems Encountered Processing Malformed Documents
There weren't many obvious failures here:
- Accepts various illegal characters, such as control characters
in the 0x00 to 0x1F range and escapes. They are accepted both as
literals and as character references, though in some cases literals
may be rejected (as they should always be).
- PUBLIC ids with some illegal characters are accepted.
- Whitespace before an XML declaration is permitted.
- Permits illegal text declarations that omit the mandatory
encoding="..." declaration.
- Unpaired Unicode surrogate characters are accepted, both as
literals and as character references.
Accepting illegal characters is likely to cause the most
interoperability problems of those failures.
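As a point of comparison, several of the malformed inputs listed above can be tried against another parser. This sketch uses the expat-based parser behind Python's `xml.etree.ElementTree`; the helper `is_well_formed` and the sample documents are my own illustrations, not cases taken from the test suite.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text, False on a fatal error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

malformed = {
    "control character as a reference": "<doc>&#1;</doc>",
    "control character as a literal": "<doc>\x01</doc>",
    "whitespace before the XML declaration": ' <?xml version="1.0"?><doc/>',
    "unpaired surrogate as a reference": "<doc>&#xD800;</doc>",
}

# Each entry should map to False: all of these are fatal errors.
results = {label: is_well_formed(text) for label, text in malformed.items()}
```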
Problematic Test Cases
There are cases where the MSXML.DLL processor raises issues that the
OASIS/NIST tests should address, in some cases by changing the
tests:
- MSXML.DLL is unique among all XML processors I've seen in that it demands that
general entities that are never used be well formed. One way to look
at this is that it is reporting potential well-formedness errors,
not actual errors. On the other hand, the XML specification does not
distinguish between entities that are used and those that are not,
so it is easily argued that the tests that expect these not to be
reported are themselves in error. I confess to feeling this is a case
where the XML specification needs clarification, particularly since
I've seen no other processor that takes this interpretation.
- Uses the model of names, and name tokens, found in the XML
Namespaces specification, rather than the XML 1.0 model.
Conformance for the namespace specification is not defined in a
way that a processor can be tested for conformance, but such tests
are desirable.
It is interesting that the first issue above, regarding the
constraint on unused general entities to be well-formed, may be coupled
to the use of DOM as the processor API in this case. DOM permits, but
thankfully does not require, much information to be exposed.
Many DOM implementations use that flexibility to avoid exposing
the contents of entities, among other facilities. Only DOM
implementations, or similar APIs, that expose such contents
appear to get any benefit from having such a well-formedness
constraint.
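The more common interpretation can be illustrated with a short sketch, again using Python's expat-based `xml.etree.ElementTree` rather than the processor under review. The entity name `e` and the two sample documents are invented for the example. The entity's replacement text is not well formed, but that only matters once the entity is referenced.

```python
import xml.etree.ElementTree as ET

# The same declaration appears in both documents; only the second uses it.
UNUSED = '<!DOCTYPE doc [<!ENTITY e "<unclosed>">]><doc/>'
USED = '<!DOCTYPE doc [<!ENTITY e "<unclosed>">]><doc>&e;</doc>'

def parses(text):
    """Return True if the document is accepted, False on a fatal error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

unused_ok = parses(UNUSED)   # entity never referenced: accepted
used_ok = parses(USED)       # entity referenced: rejected
```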
DOM Conformance
Some of the DOM operations used to turn the MSXML.DLL processor's DOM
output into something that could be examined for correctness had
an unanticipated side effect. They identified problems in the
DOM implementation that had been hooked up to the underlying
processor. These need to be worked around; otherwise
exceptions, reflecting internal errors of some kind, are thrown
by some DOM operations:
- The DocumentType node has children, which it should not. These
children must be explicitly ignored for many operations. This may be
the reason that Document.getElementsByTagName returned some
elements more than once when they came from external entities.
- In some cases the Element.normalize method throws
an exception. This seems coupled to external entities with just
one character, marking a line end.
- Text declarations (<?xml encoding='...'?>)
at the beginning of external entities are exposed as if they were
processing instructions. (Regardless of partial syntactic similarities,
the XML spec is quite explicit that processing instructions do not
use the name 'xml.') These need to be explicitly removed or ignored
in certain cases.
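The kind of workaround described above can be sketched against any W3C-style DOM. The example below uses Python's `xml.dom.minidom` as a stand-in, since the MSXML COM interfaces are not shown here; the function name `strip_text_declarations` is hypothetical. It removes processing-instruction nodes whose target is `xml` and deliberately does not descend into the DocumentType node.

```python
from xml.dom import Node, minidom

def strip_text_declarations(node):
    """Remove PI nodes with target 'xml'; skip DocumentType subtrees."""
    for child in list(node.childNodes):   # copy: the list is modified below
        if (child.nodeType == Node.PROCESSING_INSTRUCTION_NODE
                and child.target == "xml"):
            node.removeChild(child)
        elif child.nodeType != Node.DOCUMENT_TYPE_NODE:
            strip_text_declarations(child)

# Simulate a tree in which a text declaration leaked through as a PI.
doc = minidom.parseString("<doc><part/></doc>")
part = doc.documentElement.firstChild
part.appendChild(doc.createProcessingInstruction("xml", "encoding='UTF-8'"))
part.appendChild(doc.createProcessingInstruction("app", "keep"))

strip_text_declarations(doc)
remaining = [child.target for child in part.childNodes]
```

Only the genuine processing instruction (`app`) is left in `remaining` after the call.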
In addition, I noticed that in this DOM, the SYSTEM
identifiers found in Entity nodes are not resolved. Several
other DOM implementations provide such IDs in fully resolved
form, making less work for applications that need to use such
URIs. The DOM specification should probably make both
available because neither approach can address all problems.
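When a DOM hands back unresolved SYSTEM identifiers, the application has to resolve them against the base URI of the declaring entity itself. A minimal sketch using Python's standard `urllib.parse.urljoin`; the function name and the example.org URIs are invented for illustration.

```python
from urllib.parse import urljoin

def resolve_system_id(base_uri, system_id):
    """Resolve a possibly relative SYSTEM identifier against base_uri."""
    return urljoin(base_uri, system_id)

base = "http://example.org/docs/book.xml"
relative = resolve_system_id(base, "chapters/ch1.xml")
parent = resolve_system_id(base, "../dtd/book.dtd")
absolute = resolve_system_id(base, "http://example.com/other.ent")
```

Keeping the original identifier alongside the resolved one gives the application both forms, which is the situation the paragraph above argues for.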
The online MSDN documentation for this DOM was incorrect when I looked
at it, though I understand that will be fixed. The reason is worrisome:
when looking at this documentation with Netscape Communicator, I was
served pages which didn't list a number of important standard
methods for the NamedNodeList objects (such as
the item method). I'm told that if it's read using Internet
Explorer 5, and with use of ActiveX controls enabled, the content is
correct. Since I disable use of ActiveX controls because of their
security problems, accurate system documentation was unavailable to me.
As noted earlier, DOM still needs some work before it can
truly be an implementation-independent API. This includes
having ways to hook a DOM up to an XML processor (parsing
document text into a DOM tree), and setting options for
validation, whitespace handling, and use of various types of
nodes in the resulting tree.