Being Too Generous
September 19, 2001
This week the XML-Deviant looks at some recent community criticism over the XML support in Internet Explorer, which has been resolved with some promising feedback from Microsoft.
SGML on the Web But Not in the Browser
Despite its many and varied successes, XML has still not achieved its aim of being "SGML on the Web", at least not within the most popular viewport of the Web, the browser. HTML is still the Web's lingua franca, despite the desire of many in the XML community to see it deprecated in favor of XHTML or CSS-styled XML documents. In other environments XML has been a runaway success, yet it is still having trouble gaining a foothold in user agents. Arguably RSS is the most successful XML format being displayed to users, and processing of even popular formats like SVG is handed off by browsers to optional plug-ins rather than being natively supported.
There are a few reasons for this. The XML community has not made the effort it could to convince the web development community of the advantages of XML, leading to an image problem. Strong disagreements over the relative merits of XSLT and CSS have also displayed a lack of common vision for the role of XML in client-side document styling. But there can be little doubt that the lack of good XML/XSLT/CSS support in recent browsers is the root cause of the problem, which is ironic since the browser was instrumental in getting pointy-bracket parsers on millions of desktops around the world.
Of course the situation is not completely bleak. XML processing capabilities are appearing in both major browsers. The added irony is that Internet Explorer appears to be leading the way, despite the fact that the most widely regarded tools in the XML toolkit are open source, and despite the MSXML parser's baroque installation modes ("side by side", "replace mode").
But, as recent discussion has shown, some matters are much worse.
Providing an update on Internet Explorer 6 for readers of his Cafe con Leche site recently, Elliotte Harold lambasted Microsoft over some serious flaws in its XML support, comments that Roger Costello brought to the attention of the XML-DEV community.
Leaving aside the issues of CSS support and the correct MIME type for XSL documents, the discussion highlighted two conformance issues with IE's XML parsing. First, it accepts control characters which the XML specification says are illegal in XML documents -- e.g. U+0005 (Control-E) -- thereby breaking basic well-formedness constraints. Second, it rejects any document which contains Unicode Plane 1 characters.
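To make the two issues concrete, here is a minimal sketch of the character rule in question, based on the Char production of the XML 1.0 specification. The function names is_xml_char and first_illegal_char are hypothetical helpers invented for this illustration; they are not part of MSXML or any other parser.

```python
def is_xml_char(code_point: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    return (
        code_point in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= code_point <= 0xD7FF        # most of the Basic Multilingual Plane
        or 0xE000 <= code_point <= 0xFFFD      # BMP above the surrogate block
        or 0x10000 <= code_point <= 0x10FFFF   # Plane 1 and all higher planes
    )


def first_illegal_char(text: str):
    """Return the index of the first character XML 1.0 forbids, or None."""
    for index, character in enumerate(text):
        if not is_xml_char(ord(character)):
            return index
    return None


# Control-E (U+0005) is forbidden, so a conformant parser must reject it.
print(is_xml_char(0x05))
# U+1D49C, a Plane 1 mathematical letter, is allowed and must be accepted.
print(is_xml_char(0x1D49C))
```

By this rule the behavior described above is backwards on both counts: the control character should be a fatal error, and the Plane 1 character should pass.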
The response to these conformance issues was strong; perhaps none more so than Tim Bray's.
Let's see, on one hand Ballmer is on stages everywhere saying that XML is the core framework for e-business, and at the same time the IE group is shipping product in a mode that undermines one of the core principles that makes XML usable - that data is either WF or it's crap. The next step will be to make end-tags optional because that will be more comfortable for the FrontPage users out there. Feh. Really disgusting.
David Carlisle also stressed the significance of the 'Plane 1' problem.
This bug (not allowing plane 1 characters) is _very_ serious. Also, calling it a bug implies it was done accidentally which does not appear to be the case. You don't even need to be explicitly using these characters for it to bite you. For example _every_ valid MathML document is rejected by IE6 (as it reports a fatal error on the MathML DTD).
It rapidly became clear that the issue wasn't with the MSXML parser itself, but in the way it's used by Internet Explorer. Evidence for this came from both a Microsoft newsgroup discussion and a confirmation by Michael Rys.
MSXML 3.0 as it is used in IE 6.0 is running in a backwards compatibility mode that replicates the previous errors so that client applications that upgrade to 6.0 will not fail. If you use the MSXML 3.0 component directly in your programming and not on the client via IE 6.0, you are getting 100% compliance...
This fact is worth stressing: the IE developers deliberately chose to cripple the XML parsing in Internet Explorer. This subverts the concept of the "draconian parse" which is a central precept of XML, and it opens the door to the kind of problems that we've seen with HTML. The problem is exacerbated because developers often use IE to display XML files: other browsers have yet to catch up, and IE has a friendly default tree view for XML documents. As a result, all manner of XML documents may be presumed to be well-formed simply because IE displays them.
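Rather than trusting a browser's rendering, a document can be checked with a parser that does enforce the draconian rule. The following is a minimal sketch using the expat-based parser bundled with Python's standard library; the helper name is_well_formed is an assumption made for this example, and the sketch says nothing about how MSXML behaves internally.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: bytes) -> bool:
    """Return True only if the parser accepts the whole document.

    Any well-formedness error is fatal: the parser raises ParseError
    and no partial tree is returned.
    """
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# A Plane 1 character (U+1D49C) is legal XML and should be accepted.
plane_one = "<doc>\U0001D49C</doc>".encode("utf-8")
# A raw Control-E (U+0005) is illegal and should be rejected.
control_e = b"<doc>\x05</doc>"

print(is_well_formed(plane_one))
print(is_well_formed(control_e))
```

A check of this kind, run before a document is published or passed to a stylesheet, avoids relying on the fact that a tolerant viewer happened to display it.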
Fixing the Problem
Joshua Allen defended this decision, asserting that a browser should display files to the user wherever possible. That's its core function, and is consistent with the principle of "being generous in what you accept".
I think that its getting on thin ice to say authoritatively that the browser should refuse to show the users those files. Yes, we can all agree that "data is either WF or it's crap." But it is kind of a stretch to say that "it's crap == don't show the file." Nothing in 1.0 spec says that the system should do everything in it's power to prevent the user from ever looking at the content of the file if the file has a WF violation. Maybe IE could put a status-bar message that said "BTW, this file is not really XML, it is crap". But IE is capable of displaying all sorts of files, not just XML files, so what is wrong with IE displaying the file that the user asked it to display?
Not surprisingly, this did little to dampen the flames of debate. Most of the concerns were not over IE displaying a file that is not well-formed, but that it does so silently and happily allows that file to be fed into subsequent XML processing, such as applying a stylesheet. The focus of the debate quickly shifted to suggestions on how the situation could be resolved. For example, Eric Bohlman suggested that backwards compatibility should be a patch, and not the default.
I *can* understand Microsoft's concern that they not just abruptly break something that to their customers appears to be "working." Maybe the solution is for them to offer a time-limited "bug-compatibility patch" that would extend the "tolerant" behavior long enough, and only long enough, for everybody to fix their broken systems. Customers would have to do something explicit to activate the patch, and the activation process would have to warn them that they could *not* count on this behavior in the future.
Expressing similar opinions, Benjamin Franz asserted that Microsoft should properly inform its customers about the problem and produce tools to rectify it, in order to avoid repeating the problems associated with HTML.
Sometimes you have to take the hit on the chin and say to your customers "This is a bug. We know you may have stored data in form X (and here is a tool to help _filter_ the problematic data for you before you deliver it to a client if you absolutely cannot repair your database), but it has to be changed because that behavior was a bug and your XML _will not_ interoperate with other people if it is not fixed now. And the more data you store like this the worse it will get."
"Bugwards" compatibility is precisely why HTML ended up in the mess it was in a couple of years ago (and still is in to a large extent). Each new browser had to exactly replicate the _known to be wrong_ behavior of the previous one. Because rather than fix the problem when there were only a few _thousand_ web browsers installed, it was left to fester until their were _tens of millions_ installed.
Don Park also highlighted Microsoft's responsibility.
IMHO, grabbing 85% of the browser marketshare comes with certain responsibility to the public...
The promising conclusion to this debate is that, following the storm of feedback, Joshua Allen was able to show that Microsoft was listening and to go on record with resolutions to the problems.
Q) IE rejects characters above 0x10000.
A) This is just plain bug, and we are going to fix this.
Q) IE doesn't crash on control characters.
A) We are planning to still allow these to be displayed, but flag them as "not well-formed" using Julian's or a similar style sheet, so that the user knows there is a problem.
The posting indicated that Microsoft is committed to achieving both these goals, but without any clear schedule beyond an "MSXML SP3" release target.
It remains for the community to monitor progress and ensure that appropriate pressure is applied so that the work is completed successfully. It's unfortunately easy to become complacent and allow seemingly minor conformance issues to slip by, especially if they save one time on the task at hand. The ultimate cost, though, is to the common good, with the lowest common denominator getting a notch lower every time.