Time to Refactor XML?

February 21, 2001

Leigh Dodds

The W3C has been particularly busy over the last few weeks, releasing a flurry of new Working Drafts. While welcoming this progress, some members of XML-DEV have expressed concern over the new direction that these specifications have taken.

Intertwined Specifications

A succession of new Working Drafts has appeared on the W3C Technical Reports page. The list includes requirements documents for XSLT 2.0, XPath 2.0, and XML Query; a data model and an algebraic description for XML Query; and a resurrection of the XML Fragment Interchange specification.

The most striking aspect of these specifications is not their sudden appearance but, rather, their mutual interdependence:

  • XSLT 2.0 must support XML Schema datatypes
  • XPath 2.0 must support the regular expressions defined in XML Schema datatypes, as well as the XML Schema datatypes
  • XML Query and XPath 2.0 will share a common data model
  • XML Query may itself use XML Fragments
  • XML Query must support XML Schema datatypes
  • Both XPath and XML Query must be modeled around the Infoset, and particularly the "Post Schema Validation Infoset"
  • XML Schema itself depends on XPath to define constraints
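
The last of these dependencies is already visible in the drafts: XML Schema's identity constraints select their nodes using a restricted subset of XPath. A minimal sketch (illustrative only; the element names are invented, and the xs prefix is assumed bound to the XML Schema namespace of the draft in use):

```xml
<!-- Illustrative schema fragment: an identity constraint whose
     selector and field are expressed as restricted XPath expressions. -->
<xs:element name="orders">
  <xs:key name="orderId">
    <!-- XPath picks out the nodes the constraint applies to -->
    <xs:selector xpath="order"/>
    <!-- ...and the value that must be unique among them -->
    <xs:field xpath="@id"/>
  </xs:key>
</xs:element>
```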

As this list shows, dependencies on the XML Schema datatypes and the Post Schema Validation Infoset are particularly prominent. This has produced a few furrowed brows on the XML-DEV mailing list.

Mixing Spaghetti?

Simon St. Laurent was the first to express concern at the number of dependencies between specifications and particularly the reliance on XML Schemas.

XSLT 2.0 processors will need an understanding of XML Schema datatypes, while XPath 2.0 processors will need to implement the regular expression language specified in XML Schemas. XQuery builds on all of these, using the strategy pioneered by Quilt. The current draft of XML Schemas requires schema processors to understand XPath as well.
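
The regular expression language in question surfaces in XML Schema datatypes as the pattern facet. A minimal sketch (the type name is invented for illustration; the xs prefix is assumed bound to the XML Schema namespace):

```xml
<!-- Illustrative only: constraining a string type with the
     XML Schema regular expression language via the pattern facet. -->
<xs:simpleType name="partNumber">
  <xs:restriction base="xs:string">
    <!-- Two uppercase letters, a hyphen, then four digits, e.g. "AB-1234" -->
    <xs:pattern value="[A-Z]{2}-\d{4}"/>
  </xs:restriction>
</xs:simpleType>
```

An XPath 2.0 processor supporting this language would need a regular expression engine with these semantics, over and above its path-evaluation machinery.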

Once, long ago, I wanted the W3C to make sure its specs coordinated and made some kind of coherent sense. That never really happened, but now we seem to be moving toward a jungle of intertwined specs, with complexity increasing despite/because of reuse.

The initial response to St. Laurent's comments was mixed. Ben Trafford believed that, while a roadmap would certainly be useful, the additional complexity is required for certain applications.

... the W3C ought to publish a roadmap of interdependencies between the specifications, and try to minimize the dependencies as much as possible. However, two things are clear from implementation:

  1. Lots of people use plain XML for their tasks, without ever dealing with the more complicated stuff, except for maybe DOM.
  2. People who want more oomph out of XML than the base spec provides will pay for it. How? Because more potent applications of XML require more processing information than the base spec provides. We need Infoset. We need XPath. We need some sort of schema that's more powerful than DTDs.

Len Bullard believed that, given the experience of the HyTime standard, complexity should have been expected. In his view the real issue is the initial hype associated with XML, rather than the direction that new developments are heading.

Hard problems; heavy solutions. I only got mad when the XML spec was tossed to the floor asserting it would be easy after that. Anyone with experience knew it would only get harder and more complex. Easier than SGML/HyTime? Maybe, but you'll have to ask the implementors of those systems about that. All I know is a lot of what I see here looks a lot like what I saw there. The names were changed, the concepts "reified" but overall, the same stuff. The big differences are well-formedness and namespaces. Those really are simplifications of the original concepts. As James pointed out recently, SGML conflated parsing and validation. Separating those has been enormously useful.

But easy? That IS how XML was sold, I agree. Caveat emptor.

Jonathan Robie, co-author of Quilt, which has played a significant role in guiding the development of XML Query, fully supported the rationalization that the new specifications promise.

Well, I'm not sure that reuse really does increase complexity. Wouldn't it be more complex and confusing if each specification used a different data model and type system, or if path expressions had different meanings in XPath and in XQuery?

Or is your concern that the type system of XML Schema adds complexity to the rest of the W3C standards that support it? Are you saying that other W3C specs should not support schema?

The reliance on XML Schemas is at odds with the recent "Schemarama" discussions which promoted the idea of a plurality of schema validation languages, with no single language being perfect. The adoption of TREX by OASIS certainly lends support to this view.

Kimbro Staken was among the first to support Simon St. Laurent's concerns. Staken felt uneasy at the speed with which new specifications are being built, in some cases on incomplete foundations.

... [P]art of the problem is that everything is so new. These new specs are building on the (somewhat) older specs (many that aren't even complete) as if they're solid, well-tested, and proven commodities. The reality is this isn't the case, and far too many of the specs coming out of the W3C amount to completely new technologies, and they're not being adequately proven through implementation before being built upon. Did XQuery use W3C XML Schema because it is solid and well proven? No, they used it because it is "W3C" XML Schema... Reuse is clearly a good thing, but what is being reused must itself be stable and good, otherwise you just get rubbish.

Staken's comments echo "The Rush to Standardize" discussion covered by the Deviant last October.

Tim Bray said that disdain for the 80/20 rule lies behind the complexity surge.

One man's "intertwined" is another's "consistent". I think a lack of interconnectedness would be a bigger problem.

The real problem, the one that I think is actually causing Simon's pain, is that these things are all too big, too complicated, and have a contemptuous disdain for trying to hit 80/20 points. Missing the point of the Web, I call this. Mind you, having been through the closing months of getting XML 1.0 finished when it was already too big and every power player and their dog wanted to get just one more feature in, I can see how it happens. Doesn't mean it's right.

A layered approach to specification design has long been a favorite topic of XML-DEV. Eric van der Vlist, reiterating this point, noted that failure to layer specs cleanly can cause significant problems.

A problem with intertwined specs is that if we don't want them to become like a plate of spaghetti, we need to define a clean layered structure first, and probably define a complete datatype system rather than trying to reuse one that has been defined for what is IMHO a specific purpose and has had to drop many features to deliver in a timely fashion.

Another problem is that we are creating a chain, and if we do so, the strength of the chain will be the strength of its weakest link.

Rick Jelliffe was particularly vocal in his criticism of the current situation. Initially Jelliffe noted that the spiral of complexity followed by "subsetting" should not be surprising.

... [J]ust as SGML was too complex for WWW applications so we created a subset, and XPath is too complex for some (streaming) applications so we will make a down-reference-only subset (you didn't hear that from me), so XML Schemas will be too complex for some things and in time there will be a subset of it (in fact, we already have that with ISO RELAX and OASIS TREX). This natural rhythm should be unsurprising, though it may make us sea-sick.

Jelliffe's real concern was the prospect of the Post-Schema-Validation Infoset (PSVI), a concept introduced in the XML Schemas specification, serving as the foundation for further standardization work. He's not hopeful about the prospects for change.

I do not believe it is possible to stop PSVI-based specs at W3C, nor do I believe that would be remotely desirable: there are popular and useful fat applications (of course I mean database management systems) that will be well-served by the PSVI, and it will open the door for lots of useful innovation and press releases. Lots of people are interested in using XML as a framework not for document exchange but for databases, and why not?

Jelliffe further claimed that when standards are based on PSVI they are no longer XML standards.

This utterly changes XML: it is the infoset road already embarked on by XPointers (allowing ranges, a thing that cannot be serialized, though their new solution of corrupting the data by providing spurious containers inline is a good one.) I also agree with Simon's naming point: a specification that works on the infoset and produces an unserializable result, or that works on something that is unserializable should not be called "XML blah blah".

I would prefer the Post-Schema Validation Infoset to have some completely different name, such as the W3C Typed Tree Information Set; then XQuery can be called the Typed Tree Query Language, and XPath 2 can be the Typed Tree Path language, etc. At a certain point, it stops being XML, and should be given a separate name. (And does anyone else find it strange that W3C is creating database-supporting specs while OASIS is doing lightweight schemas for the WWW?)

PSVI is a significant component in its own right, and its full impact will only become clear as the specifications built upon it mature.

Refactoring XML

One way to tackle this issue would be to refactor these specifications: identify their interdependencies, and factor shared components out into specifications in their own right. There is some precedent for this. The XSL specification originally included transformation and path expression details, which were later factored out to form XSLT and XPath. This resulted in earlier delivery for those specifications (XSL-FO is still languishing in development), and many developers have been able to put them to good use.
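
XPath's value as a standalone specification is easy to demonstrate: path expressions can be evaluated entirely outside an XSLT context. A minimal sketch using Python's standard library ElementTree module, which implements a limited subset of XPath (the document and titles here are invented for illustration):

```python
# A sketch of standalone XPath-style addressing, using Python's stdlib
# ElementTree module (which supports only a limited XPath subset).
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<catalog>"
    "<book id='b1'><title>XML in a Nutshell</title></book>"
    "<book id='b2'><title>Learning XSLT</title></book>"
    "</catalog>"
)

# Select every title element beneath a book child of the root.
titles = [t.text for t in doc.findall("book/title")]
print(titles)  # → ['XML in a Nutshell', 'Learning XSLT']

# Simple attribute predicates are also supported.
second = doc.find("book[@id='b2']/title")
print(second.text)  # → Learning XSLT
```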

Might the XSL example be employed elsewhere? Could the current crop of specifications be refactored to separate dependencies? Simon St. Laurent said that the regular expression language defined in XML Schemas would benefit from this treatment.

I'd suggest that the regular expressions language which is currently part of XML Schemas be pulled from that spec much as XPath was pulled from XSLT, and given a separate publication for easier reuse. Right now it looks like XML Schemas depends on XPath which will depend on XML Schemas - not exactly a friendly foundation for development.

St. Laurent also suggested that all specifications should allow hooks for different types of schema language.

Given the heavy dependencies on XML Schema datatypes, Charles Reitzel urged that the specification should be pushed forward as a priority. He outlined a layered model as an example of the benefits this might bring.

Without having read the details of how XSLT, XQuery, XPath and XSchema will all inter-connect, it also strikes me as basic that XML Schema Data Types be released ASAP and that the developer community be given 6 months to put it through its paces before loading it down any further.

Separating data types from structures allows XPath to be layered in between: XSchema Structures => XPath => XSchema Data Types. It also allows alternate schema/rule languages to be defined at any level: over XPath, Data Types or standalone.

Refactoring and iteration have become common features of many development methodologies, Extreme Programming among them. Acknowledging that it's hard to get things right the first time, and allowing requirements to change, is fundamental to any complex development process, including the XML standards process that many are keen to see take shape.

Perhaps it's time to take a break from weaving the Web, to do a little refactoring of its components, before moving to the next development cycle?