Profiling and Parsers

November 22, 2000

Leigh Dodds

XML-DEV has recently discussed the development of subsets of the XML 1.0 Recommendation, the impact that these may have on interoperability, and related issues.

Profiling XML

Starting off a conversation on this contentious topic, Michael Champion reported on the experience of an XML novice developing a validating XML parser:

I had the opportunity to observe a project in which an XML novice built successive iterations of an XML system. MinML (Common XML without even attributes or mixed content) only took a few days to implement (and out of the box, with no optimization, it could parse "MinML" as fast as expat could). The second iteration with Common XML support took a couple of weeks. The third iteration to fully support well-formed XML 1.0 took maybe a month. The fourth iteration to add namespaces took maybe another month. The fifth iteration to support validation took several months.

Champion makes the point that unless there is a clear business case for achieving full conformance, the costs may outweigh the benefits. Not surprisingly, these comments caused a degree of consternation among XML-DEVers who have long been promoting conformance and interoperability. Later clarifying his position, Champion confirmed he was suggesting an "exploration" of potential XML subsets, rather than arguing that interoperability was not a worthwhile goal.

I think it's quite important to encourage exploration of the "space" described by the various XML Recommendations to find those subsets that provide the best interoperability as well as the best reliability, time to market, performance, etc. Responsible vendors of XML tools have little choice in the matter -- we should and must support the Recommendations as they stand now. But this is not to say that the XML Industry as a whole should accept them as "cast in concrete" standards like screw sizes or light bulb socket threads...I'm simply advocating flexibility and experimentation in the XML specifications until we better hear what the voice of experience has to say.

Responding to Champion's initial message, Rob Lugt dismissed the experience report, noting that the complexity of XML 1.0 does not appear to be a barrier to entry for serious tool vendors.

I agree it is not trivial to implement a full XML parser, but the experience of a novice individual is irrelevant. The Full XML 1.0 recommendation, complex as it is, does not appear to be acting as a barrier to entry for serious tool vendors.

David Megginson also observed that there has been little barrier to entry for individuals either.

Of the first five XML parsers I remember, four were written by individuals (Norbert Mikula, Tim Bray, James Clark, and me), but it was the one from the serious tool vendor (Microsoft) that had serious conformance issues at first. It took me only a few evenings to have a usable beta version of AElfred -- most of the rest of the time was spent optimizing performance and adding support for extra character encodings.

(It should be noted that the individuals cited could hardly be put in the same class as an XML novice!)

One XML subset which will be well-known to XML-Deviant readers is Common XML. Common XML is a product of SML-DEV, a mailing list which was formed after a series of volatile debates on XML-DEV nearly a year ago. Common XML has been covered in a previous Deviant column, "Filling In The Gaps".

In recent discussions, Common XML seems to have acquired support among some XML-DEV contributors, and some have even suggested giving it a more formal status. The other product of SML-DEV, MinML, has less support; even on SML-DEV there is some debate about its usefulness.

Citing the 80/20 rule of software engineering, Clark Evans, a regular SML-DEV contributor, suggested that 20% of the effort can yield 80% of the functionality. Evans advocated formalizing these subsets to give implementers achievable milestones.

...[D]oes it not make sense to try and identify the 20%, give it a name, like "Common XML". So that as vendors, lone hackers, etc., implement W3C specifications they have a better chance of implementing their first pass in a way which will *maximize* interoperability in our less-than-perfect world?

Stating that this is already common practice among standards bodies, Len Bullard suggested that there are few dangers involved if conformance is carefully tested.

Isn't that typically called profiling and a standard operating procedure for many standards organizations? The citations have to be clean and if possible, conformance tests provided. There is no real danger as long as the propers are followed.

The continuing discussion highlighted the fact that the ISO 8879 SGML standard provides mechanisms for defining profiles. This prompted Michael Champion to wonder whether such a feature would be useful within XML.

One issue that generated a lot of traffic on this list a year ago was whether XML needed a similar mechanism with which one could define a "profile"... that constrained the types of markup to be used in a class of XML applications... Would the people who so vigorously oppose defining "subsets of XML" in the name of interoperability be averse to adding a mechanism like this in a future version of XML?

Profiling Parsers

The ability to define an XML profile, whether formally specified or not, leads naturally to the development of profiled parsers optimized for a particular XML subset or a given schema. Tom Passin noted that this is not a new idea.

Isn't it true that, in SGML, the DTD with its regular grammar is (can be used) to create a parser specialized for the particular DTD - perhaps even on the fly when the document is read? Yet xml seems to have been designed to avoid the need for a customized parser.
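To make the idea of a profile concrete, here is a minimal sketch in Python of a check for whether a document stays inside a MinML-like subset. It assumes only the description quoted earlier (no attributes, no mixed content); the actual MinML definition may impose other restrictions, and the function name profile_violations is hypothetical. Note that the sketch relies on a full XML parser to read the document, so it illustrates a profile check rather than a profiled parser.

```python
import xml.etree.ElementTree as ET


def profile_violations(xml_text):
    """Return reasons the document falls outside the assumed profile.

    Assumed profile (taken from the description quoted in this column,
    not from any formal specification): no attributes, no mixed content.
    """
    root = ET.fromstring(xml_text)  # a full XML parser does the real parsing
    problems = []
    for elem in root.iter():
        if elem.attrib:
            problems.append(f"<{elem.tag}> has attributes")
        children = list(elem)
        if children:
            # Mixed content: non-whitespace text alongside child elements.
            texts = [elem.text] + [child.tail for child in children]
            if any(t and t.strip() for t in texts):
                problems.append(f"<{elem.tag}> has mixed content")
    return problems


if __name__ == "__main__":
    print(profile_violations("<order><item>widget</item><qty>2</qty></order>"))
    print(profile_violations('<order id="7">rush <item>widget</item></order>'))
```

A check like this could run at the boundary of a system that exchanges only profiled documents, so that anything outside the subset is rejected before it reaches a smaller, specialized parser.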

The SML-DEV activity has already produced several parsers, including Don Park's MinML parser, Min. However, this is where we enter the most contentious territory, namely, parser conformance and its effects on interoperability. Michael Brennan was among those arguing for clear labeling of parser capabilities.

Does it make sense to write specialized parsers that only deal with a specific DTD/schema? Certainly it does. If I have a need to deal with SVG in a program, I am going to try to find an SVG parser before I search for an XML parser, because an SVG parser will probably give me much greater value...But if you are going to write an SVG parser, than call it a "SVG parser", not an XML parser! ... For that matter, if you write a MinML parser, than call it a MinML parser and not an XML parser. If you do that, then you've got no argument from me.

Gavin Thomas Nicol agreed, stating that users of profiled parsers cannot expect to be fully interoperable with others.

Interoperability comes from standards conformance. People that do not implement the standard should not claim to do so, and should not expect to be fully interoperable with people that do.

Rick Jelliffe suggested that developers should boycott parsers which do not properly advertise their conformance.

... [D]evelopers should boycott parsers that call themselves XML but only implement a subset except for specific-purpose systems: so you it is fine to make a subset parser (e.g. for SOAP) and say "this is a parser for a subset of XML" but it is not fine to say "this is an XML parser".

Nicol later noted that, despite XML's relative immaturity, conformance among XML parsers is encouragingly good.

... XML conformance, for a relatively young standard is generally *excellent*. The fact that XML parsers are a commodity (thereby meaning that many people use the same parser), means that interoperability is even better than one might otherwise get.

Asserting that using profiled parsers within specific application domains does not adversely affect interoperability, Seairth Jacobs observed that

... In the end, deciding what subset of parser should be used for development is every bit as important as deciding what subset of XML should be used for DTD or Schema definition.

It's the specific requirements of a particular application domain that will drive parser selection. Not all domains are created equal, and in some cases (on PDA platforms, for example) speed, footprint, and optimization are critical requirements. For developers working on those platforms, a profiled parser might be the optimal choice. The same applies in situations where the XML being exchanged is rigorously controlled and regulated. In all other domains the "be generous in what you accept..." rule applies, and this requires a fully conformant parser.