Clark Challenges the XML Community

December 19, 2001

Edd Dumbill

Delivering the opening keynote at the IDEAlliance XML 2001 Conference in Orlando, Florida, James Clark described five challenges faced by the XML community. Just before delivering his speech, Clark was deservedly honored by the conference with the "XML Cup" for his long-standing contributions to the world of XML.

Though at the center of the development of XML 1.0 and XSLT, and much SGML technology before that, Clark has recently become an increasingly dissenting voice at the World Wide Web Consortium. He used his speech to set out his concerns about the position that the guardians of XML now find themselves in.

Making Progress While Keeping XML Simple

The first challenge that Clark identified was the need to make progress in XML technologies without compromising XML's essential strengths. He praised the diverse application of XML across documents, data, remote procedure calls, and the Web, and noted that XML 1.0 avoids forcing information into a single category. The key to this strength, said Clark, is that XML "doesn't do very much. It's just syntax." XML is sufficiently unconstrained to be used to represent many kinds of information, but sufficiently constrained to enable the development of useful generic applications and infrastructure.

Lauren Wood, chair of XML 2001, and James Clark

Yet this simplicity, which facilitates XML's wide arena of application, isn't enough. Various applications, required by software vendors and others, need more infrastructure in order to make progress. The key point, according to Clark, was that vendors' interests are often domain-specific and need to be implemented on top of the general infrastructure, rather than being lumped in with general-purpose functionality.

Members of the audience familiar with Clark's views might have surmised that he had a definite example in mind as he explained the need for appropriate layering, and indeed he went on to highlight W3C XML Schema as an example of poor layering. Focusing on W3C XML Schema's datatype system, Clark observed that standardizing some primitive datatypes was good, but that users often needed primitive types specific to their target application. He observed, "If I was ruler of the universe, I wouldn't have schema structures depending on datatypes. Rather, I'd have multiple plug-in sets of datatypes."
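
The layering Clark describes can be illustrated with a small sketch. It is not taken from his talk or from any schema language: the registry, the `urn:example:...` library names, and the `is_valid` helper are all hypothetical. The point is only that the structural layer looks datatypes up by name and never hard-codes a particular set of primitives.

```python
import re
from datetime import date

# Hypothetical registry: library URI -> {type name -> validity check}.
DATATYPE_LIBRARIES = {}


def register_library(uri, types):
    """Plug a set of datatypes in under a library identifier."""
    DATATYPE_LIBRARIES[uri] = types


def is_valid(library_uri, type_name, text):
    """The structural layer only asks: is this text valid for this type?"""
    check = DATATYPE_LIBRARIES[library_uri][type_name]
    return check(text)


def _is_iso_date(text):
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


# A general-purpose library of common primitives.
register_library("urn:example:basic", {
    "integer": lambda s: re.fullmatch(r"[+-]?\d+", s) is not None,
    "date": _is_iso_date,
})

# A domain-specific library, added without touching the structural layer.
register_library("urn:example:chemistry", {
    "ph": lambda s: re.fullmatch(r"\d+(\.\d+)?", s) is not None
    and 0 <= float(s) <= 14,
})

print(is_valid("urn:example:basic", "integer", "42"))
print(is_valid("urn:example:chemistry", "ph", "7.4"))
```

A vendor with application-specific types would register another library rather than extend the core set.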

Don't Neglect the Foundations

Clark told the audience that much more now depends on XML than its creators envisaged, because XML is finding many uses for which it was not designed. One of the key concerns at the time of XML's design, SGML compatibility, is now of much lesser importance. With the aim of cleaning up the foundation of XML, Clark proclaimed that we "should be free to stab the SGML community, what's left of it, in the back."

Now that the core XML standards--XML 1.0, XML Namespaces, XML Infoset, and XML Base--are complete, Clark described things that should be done to correct the errors made in the development of the core, and especially to redress the integration and layering of the core specifications.

On the Infoset, an "extremely important" specification, Clark recommended that it be integrated into XML 1.0 itself. He said the same for XML Namespaces, which had been layered on top of XML 1.0 but ought, in fact, to be part of the core XML specification. Even more radically, Clark declared that DTDs were "basically one big mess," and he proceeded to outline how they could be mostly replaced by new technologies for validation, infoset augmentation (e.g., attribute defaulting), and inclusion. However, a difficult problem remains in the form of character entities, which Clark observed were a "really knotty problem for which I have no solution."

From Clark's viewpoint, it is not hard to imagine that the recently released working draft of XML 1.1 must seem like a major missed opportunity for change. According to Clark, there are major problems with the core XML standards: They are poorly structured, incompletely specified, too hard to understand, and internally conceptually inconsistent. His vision of a much-improved XML 2.0 included adding in XML Namespaces, XML Base, and the Infoset, while subtracting DTDs and dealing with the problem of character entities.

Controlling the Processing Pipeline

Under the heading of "filling in the missing pieces," Clark outlined what he saw as the major omission from the current clutch of core XML standards: the ability to associate processing with an XML document. Such processing includes, but is not limited to, parsing, validation, XInclude processing, and XSLT processing. Currently, there is no way to associate sequences of processing with a document. For instance, there's no way one can indicate that a document must be validated before processing XInclude inclusions.
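
To make the ordering problem concrete, here is a minimal sketch in Python of a pipeline whose sequence is stated explicitly, outside the document. It illustrates the gap Clark describes, not a mechanism he proposed. The `validate` step is a hypothetical stand-in for real validation, and the sample document, the `body.xml` fragment, and the function names are invented for the example. XInclude expansion uses the standard library's `xml.etree.ElementInclude`.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

DOC = """<report xmlns:xi="http://www.w3.org/2001/XInclude">
  <title>Status</title>
  <xi:include href="body.xml"/>
</report>"""

# In-memory stand-in for external resources (hypothetical content).
FRAGMENTS = {"body.xml": "<body>All systems nominal.</body>"}


def parse_document(text):
    return ET.fromstring(text)


def validate(root):
    # Hypothetical check standing in for DTD or schema validation.
    if root.find("title") is None:
        raise ValueError("missing <title>")
    return root


def load_fragment(href, parse, encoding=None):
    # Loader callback used by ElementInclude instead of reading files.
    if parse == "xml":
        return ET.fromstring(FRAGMENTS[href])
    return FRAGMENTS[href]


def expand_includes(root):
    ElementInclude.include(root, loader=load_fragment)
    return root


def run_pipeline(text, steps):
    """Apply each processing step in the order given."""
    result = text
    for step in steps:
        result = step(result)
    return result


# The order lives here, in application code, not in the document:
# validation runs before XInclude processing.
PIPELINE = [parse_document, validate, expand_includes]
root = run_pipeline(DOC, PIPELINE)
print([child.tag for child in root])
```

Reordering the `PIPELINE` list changes the processing, which is exactly the kind of information a document currently has no standard way to carry.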

Existing techniques for indicating processing include the DOCTYPE declaration, the standalone pseudo-attribute on the XML declaration, the xsi:schemaLocation attribute, and the xml-stylesheet processing instruction, each of which takes a different approach. Clark stressed that we need a general-purpose, extensible solution to the problem, one that meets the needs of both the consumer of a document and its producer. Additionally, the mechanism should not be limited to the instance document: Processing should be determined by information both internal and external to the document being processed.

Improving XML Processing

Moving on to the conventional ways XML is processed in programs, Clark protested that the current widespread APIs (SAX and DOM) make processing XML either too hard or too error-prone. He observed that these first-generation APIs now lag behind recent W3C Recommendations: Namespace support was "grafted on," and they are misaligned with the XML Infoset.

Echoing sentiments recently expressed in this publication, Clark said that SAX, though efficient, was very hard to use, and that DOM had obvious limitations due to the requirement that the document being processed be in memory. He suggested that what was needed was a standard "pull API," one that efficiently allowed random access to XML documents. Clark praised the XML APIs from Microsoft's C#/.NET platform in this regard, adding that Java could learn much from .NET: "Just because it comes from Microsoft, it's not necessarily bad."
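
For readers unfamiliar with the pull style, the following sketch shows the basic shape: the application asks the parser for events when it is ready for them, rather than registering callbacks as with SAX or loading the whole tree as with DOM. It uses Python's standard `xml.etree.ElementTree.XMLPullParser` purely as a stand-in. It is not the .NET API Clark praised, and the `orders` document and `order_totals` function are invented for illustration.

```python
import xml.etree.ElementTree as ET

FEED = "<orders><order id='1'>12.50</order><order id='2'>7.25</order></orders>"


def order_totals(chunks):
    """Read order amounts from a document that arrives in pieces."""
    parser = ET.XMLPullParser(events=("end",))
    totals = {}
    for chunk in chunks:
        parser.feed(chunk)
        # The application pulls whatever events are ready; nothing is
        # pushed at it through callbacks.
        for event, elem in parser.read_events():
            if elem.tag == "order":
                totals[elem.get("id")] = float(elem.text)
                elem.clear()  # drop content we no longer need
    parser.close()
    return totals


# Simulate a document arriving in 16-character chunks.
chunks = [FEED[i:i + 16] for i in range(0, len(FEED), 16)]
totals = order_totals(chunks)
print(totals)
```

Control flow stays in ordinary application code, which is much of what makes the style easier to use than SAX.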

Another approach to using XML in applications is data binding, where there is a mapping from the XML document to programmatic data structures. Most existing solutions to this problem use annotations on a schema document. Clark acknowledged this as one approach, but observed that it probably works at the wrong level of abstraction for effective data modeling. He described two other approaches he thought promising: a code-centric approach, where program classes are annotated to indicate mappings to XML; and a modeling approach, where a higher-level representation such as UML is used to drive both schema and code generation.
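
As a rough illustration of the code-centric idea, the sketch below annotates the fields of a Python dataclass with their XML mapping and derives both directions of the binding from those annotations. This is an assumption-laden toy, not a description of any tool Clark mentioned: the `Book` class, the `xml_field` helper, and the metadata keys are all hypothetical.

```python
from dataclasses import dataclass, field, fields
import xml.etree.ElementTree as ET


def xml_field(kind, convert=str):
    """Annotate a field as an XML 'attribute' or child 'element'."""
    return field(metadata={"xml_kind": kind, "convert": convert})


@dataclass
class Book:
    isbn: str = xml_field("attribute")
    title: str = xml_field("element")
    pages: int = xml_field("element", convert=int)


def to_xml(obj):
    """Serialize an annotated object, driven only by its field metadata."""
    root = ET.Element(type(obj).__name__.lower())
    for f in fields(obj):
        value = str(getattr(obj, f.name))
        if f.metadata["xml_kind"] == "attribute":
            root.set(f.name, value)
        else:
            ET.SubElement(root, f.name).text = value
    return root


def from_xml(cls, root):
    """Rebuild an object from XML using the same annotations."""
    kwargs = {}
    for f in fields(cls):
        if f.metadata["xml_kind"] == "attribute":
            raw = root.get(f.name)
        else:
            raw = root.findtext(f.name)
        kwargs[f.name] = f.metadata["convert"](raw)
    return cls(**kwargs)


book = Book(isbn="x-1", title="Example", pages=120)
element = to_xml(book)
print(ET.tostring(element, encoding="unicode"))
print(from_xml(Book, element) == book)
```

The mapping lives next to the program's own types, so the class remains the single source of truth and no separate schema annotations are needed.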

Also in this section of his speech, and to wry grins from the audience, Clark said that he thought XSLT (of which he was the mastermind) was rather overused in areas for which it was not designed. He added that XSLT was not designed as a general-purpose programming language; it lacks, for example, type safety.

Avoiding Premature Standardization

In the closing section of his talk, Clark moved on to what too often seems to be the central engine of the XML world, and certainly was a major theme throughout the conference: standards. Taking a pragmatic stance, he told the audience that a standard is not always a good thing.

Clark said that standards, which are often of great benefit in areas that are well explored, have the potential to stifle innovation in areas where the subject matter is not well understood and is still developing. He noted, too, that such standards can be anti-competitive, preventing better ideas from emerging and taking hold. This is compounded by the fact that standards are often accepted uncritically by vendors and developers.

The sheer success of XML as a standard has led to a tendency to attempt to standardize many things that do not require it. Clark said that standards are required for data interchange, but not necessarily for processing models. Where processing is concerned, open source software can play as much of a role in preventing vendor lock-in as standards.

A Significant Speech

Although there were those new to XML in the audience who didn't quite appreciate what Clark had to say, his speech was a very important one for the XML community. It summed up many of the issues expressed within the community over the last two years. In a sense, we were hearing little that was totally new; but the important thing was that it came from Clark, who showed that he was willing to throw away some of his own work and ideas in order to make XML better.

It seems unlikely that the increasingly conservative W3C will adopt Clark's more radical suggestions. However, the speech had an energizing effect on many who heard it, highlighting as it did the potential for grassroots community members to change and improve XML--with Clark leading by example.