Processing Model Considered Essential

March 13, 2002

This week's XML-Deviant takes a step backwards in an attempt to foreground an issue that has been behind several recent debates in the XML community, namely, the lack of a processing model for XML.

Historical Context

It's historical fact that the syntax of XML was defined before its data model, the XML Information Set (Infoset). While this contributed to the speed of delivery of the XML specification, it also led to a number of subsequent problems, most notably the discontinuities between the DOM and XPath, which define different tree models for XML documents.

Looking at the plethora of additional specifications that have been subsequently produced, it is useful to characterize their functionality as specific manipulations on an infoset. For example, XInclude augments an infoset, XSLT transforms an infoset, and schema validation annotates an infoset with type and validity information.

While valid in an abstract sense, this perspective is missing a statement of the possible orderings of these operations. Do certain operations need to be performed before others? Must entities be resolved before XSLT processing? Must one canonicalize a document before generating its signature? How does one specify the order of operations to be carried out on a document? How do I state that I want to do a schema validation only after I've carried out all inclusions? Or vice versa?
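To make the ordering question concrete, here is a minimal Python sketch that treats inclusion and validation as two steps applied to the same tree and shows that swapping them changes the outcome. It uses the standard library's `xml.etree.ElementInclude`; everything else is an assumption made for illustration: the `RESOURCES` dictionary stands in for external files, and `validate` is a toy rule (a book must contain a chapter), not real schema validation.

```python
import copy
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory "files" standing in for external resources.
RESOURCES = {
    "chapter1.xml": "<chapter><title>Introduction</title></chapter>",
}

DOC = (
    '<book xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="chapter1.xml"/>'
    "</book>"
)


def load(href, parse, encoding=None):
    """Loader for ElementInclude that reads from RESOURCES instead of disk."""
    return ET.fromstring(RESOURCES[href])


def include(root):
    """Augment the tree: replace each xi:include with the referenced content."""
    root = copy.deepcopy(root)
    ElementInclude.include(root, loader=load)
    return root


def validate(root):
    """Toy stand-in for schema validation: a book needs a chapter child."""
    if root.find("chapter") is None:
        raise ValueError("book has no chapter")
    return root


def outcome(steps):
    """Run the steps in the given order and report whether the document passed."""
    root = ET.fromstring(DOC)
    try:
        for step in steps:
            root = step(root)
        return "valid"
    except ValueError as exc:
        return f"invalid: {exc}"


print(outcome([include, validate]))
print(outcome([validate, include]))
```

The same document passes when inclusion runs first and fails when validation runs first. Nothing in the document itself says which order was intended, which is exactly the gap described above.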

The W3C held an XML Processing Model Workshop in July, 2001, to begin discussing these issues. The scope of the workshop explains that

...the relationship among...W3C specifications is unspecified -- in particular the sequence in which the infoset-to-infoset transformations may or must be performed.

There is also no specification, and currently no W3C work item, for specifying how an author or application programmer could specify an order to the various transformations. Thus it appears there is a missing specification in the XML activity.

The document continues by noting some potential outcomes for the workshop, including "creating a new Working Group to address the topic of the XML processing model, adding the problem to the charter of an existing Working Group, tasking the W3C Technical Architecture Board to consider the question, and others."

Unfortunately, seven months later, there's been little in the way of visible progress. The position papers submitted by the workshop attendees are for W3C member consumption only, despite the stated desire to "raise awareness of the issues". Luckily, however, a brief time spent googling yields three papers that have not been locked away from the public eye. Reviewing these papers is a useful exercise; they provide an interesting perspective from which to revisit a number of debates.

XML Infoset: Primary or Secondary?

The first paper of note was submitted by Eric Miller and Dan Brickley. The paper defines the position of RDF in relation to XML and the XML Processing Model. RDF has its own triple-based data model and uses the XML infoset as a means of generating an XML serialization of that model. RDF processors also consume XML infosets to extract information from them.

What's interesting is the positioning of the two data models: for RDF applications, the XML data model is of secondary importance. There is a separate primary data model upon which RDF applications operate. The same is true for Topic Maps and other areas. In fact this view, that the XML infoset is merely a means to an end, appears to lie behind much of the work to map XML into relational or object-oriented models. It's this other model that is the primary influence.

For others it is the XML syntax or, only a small step removed, the XML infoset which is of primary importance. This group is interested in defining tools that work as closely with the XML markup as possible. Whether the infoset exposes all the information they might want is a separate concern: the infoset (or an equivalent model) is primary.

How many of the apparent divisions in the XML community are really divisions between these alternate viewpoints?

Mike Champion suggested recently on XML-DEV that

...XML has proved the concept that a standardized, text based meta-syntax can do the job in principle. That brings in people who have no vested interest in backwards-compatibility with all sorts of legacy stuff, don't care about the "way it is done in XML," but need to get the job done going forward.

For some, "getting the job done" means making an existing application, with its own processing model, able to understand and create XML. From this perspective, anything complex that XML puts in the way is not only frustrating, but also unnecessary. This view is no more or less valid than any other, but it is a perspective that should be understood and properly addressed.

Pipelines and Dispatching

The second paper, "Distributed XML Processing Models", by Mark Nottingham, opens with a succinct statement of the wide-reaching issues:

The increasing number and complexity of XML-related specifications (e.g., Namespaces, XSLT, Schema, XInclude, XBase) as well as inherent functions of XML (entity resolution and validation) have created the need for an XML processing model, in order to disambiguate the order and depth of processing when applying these mechanisms.

Nottingham identifies three issues that need to be considered when defining the processing model. First, how do we know which processor(s) to apply to a given document? Second, in what order should the processors be applied? Third, at what level of granularity does processing take place (i.e., the whole document or just parts of it)? Nottingham also attempts to "identify similarities between the issues of XML processing and Distributed Web Processing."

These issues will be familiar to regular XML-Deviant readers. How one should discover resources associated with a document, including processing descriptions, and whether one dispatches to processes based on media type, document type, or namespaces have been topics covered in two recent articles: "TAG: Managing the Complex Web" and "Document Associations".
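As a rough illustration of one of those options, the following Python sketch dispatches on the namespace of the document's root element. The two namespace URIs are the real XSLT and XML Schema namespaces, but the processor labels and the `HANDLERS` table are hypothetical, and a real dispatcher would also have to weigh media type and document type.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from root-element namespace to a processor label.
HANDLERS = {
    "http://www.w3.org/1999/XSL/Transform": "xslt-processor",
    "http://www.w3.org/2001/XMLSchema": "schema-processor",
}


def root_namespace(xml_text):
    """Return the namespace URI of the root element, or '' if it has none."""
    tag = ET.fromstring(xml_text).tag  # ElementTree spells this '{uri}local'
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return ""


def dispatch(xml_text, default="generic-xml-processor"):
    """Choose a processor label based only on the root element's namespace."""
    return HANDLERS.get(root_namespace(xml_text), default)


stylesheet = (
    '<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" '
    'version="1.0"/>'
)
print(dispatch(stylesheet))
print(dispatch("<note/>"))
```

Root-namespace dispatch answers only the first of Nottingham's questions, and only for whole documents. It says nothing about ordering or about documents that mix several namespaces.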

Pipelining is de rigueur at the moment, with efforts like XPipe, DSDL, and the recent XML-Pipeline W3C submission all taking the same fundamental approach to these problems. We can also perhaps begin to see that the root cause of recent debates is the imprecise definition of the XML processing model.

Ordering and Dependencies

The last paper was written by Dan Connolly. He briefly reviews the issues surrounding the late formulation of the infoset and asserts that the processing model "should, ideally, address outstanding issues regarding the ordering and dependencies among XML Base, XInclude, and XML Schema". Connolly also suggests that XML Schema should be dependent on the completion of inclusion processing:

...It seems most straightforward, technically, to put XInclude processing "before" the rest of XML Schema validation, much the way XML 1.0 entity resolution goes "before" content model validation. This seems to make XInclude deployment dependent on a revision of the XML Schema specification.

This last point is interesting as it echoes a recent thread on XML-DEV in which Simon St. Laurent raised concerns about the relationships between XInclude and Canonical XML. St. Laurent also observed that the UDDI project has defined its own Schema Centric Canonicalization specification which attempts to address issues with other canonicalization mechanisms that "significantly limit their utility in many XML applications, particularly those which validate and process XML data according to the rules of and flexibilities afforded by XML Schema."

There isn't space to cover the subsequent discussion in full here; however, I have summarized aspects of it in the XML-DEV weblog. It's worth noting that much of the disagreement revolved around the differences between inclusions achieved by entities and those achieved by XInclude. This led Eric van der Vlist to make a plea for a processing model that would allow the ordering of inclusions to be explicitly defined:

I want to stress the need of a processing model definition for XML and be able to define if for a specific application I want XInclude (or external parsed entities) to be resolved before or after the c14n transformation.

Exactly like I want to be able to say if I want to apply XInclude before or after processing a schema.

I don't know if it's likely to happen, but I believe that this is a requirement if we want to move forward with the increasing complexity of XML processing.
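The choice van der Vlist describes has a visible effect on anything computed from the canonical form. The Python sketch below hashes a document once with the inclusion left unresolved and once after resolving it. It is an illustration under stated assumptions: `ET.canonicalize` in the standard library implements C14N 2.0 rather than the Canonical XML 1.0 under discussion at the time, a SHA-256 digest stands in for a full XML Signature, and `PARTS`, `ORDER`, and the helper names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical included resource, held in memory instead of on disk.
PARTS = {"terms.xml": "<terms>Net 30 days</terms>"}

ORDER = (
    '<order xmlns:xi="http://www.w3.org/2001/XInclude" id="42">'
    '<xi:include href="terms.xml"/>'
    "</order>"
)


def load(href, parse, encoding=None):
    """Loader for ElementInclude that reads from PARTS."""
    return ET.fromstring(PARTS[href])


def expand(xml_text):
    """Resolve xi:include elements and return the resulting document text."""
    root = ET.fromstring(xml_text)
    ElementInclude.include(root, loader=load)
    return ET.tostring(root, encoding="unicode")


def digest(xml_text):
    """Canonicalize (C14N 2.0) and hash; a stand-in for signing."""
    canonical = ET.canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonicalize first: the digest covers the xi:include reference itself.
digest_of_reference = digest(ORDER)
# Include first: the digest covers the content that was pulled in.
digest_of_content = digest(expand(ORDER))

print(digest_of_reference == digest_of_content)
```

The two digests differ, and they protect different things: the first is unaffected if `terms.xml` later changes, while the second changes with it. Which behavior is wanted depends on the application, which is why an explicit statement of ordering is being asked for.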

Van der Vlist also requested that the TAG consider the fact that while the Post-Schema-Validation Infoset is becoming a key aspect of many specifications, it is currently not directly exposed by applications, nor does it have a defined interchange format. Van der Vlist believed that this could lead to monolithic applications, rather than the modular design that would fall out of a well-defined processing framework.

In another recent XML-DEV discussion, concerning labeling instance documents with XML Schema type definitions, Joe English made some comments about infoset annotation that also promote a modular approach to processors:

The general idea of Infoset augmentation is I think very useful, but I'm starting to think that doing it as part of validation is not a good idea. Schema languages for validation should be as powerful (and expensive) as needed to express a wide range of constraints, but schema languages for augmentation should be as simple (and cheap) as possible.

RELAX-NG and Schematron are at the right level of complexity for validation languages; they're very expressive, but they don't add anything to the infoset other than "OK/not OK"

...W3C XML Schema tries to do both at the same time, and both sides suffer...

Again, regular readers will recognize a familiar ring to these comments. We can look back over several years' worth of simplification, layering, and refactoring debates and see the desire for a clearer processing model at the core of each of them. Tim Bray's skunkworks XML 2.0 proposal is the most recent embodiment of this.

Conclusions

While defining a processing model for XML won't magically end all debates overnight, it certainly seems that many disagreements may have at their core a different understanding or viewpoint about what that processing model actually should be. Seeing it clearly stated will certainly provide a common frame of reference, if not an organizing principle, for both specifications and applications.

This is particularly important for those who don't routinely spend their time immersed in pointy brackets. A processing model will give a conceptual framework for understanding the functionality provided by individual specifications, and it will also allow competing specifications to be properly judged on their own merits.

Creating an XML application should be like creating a mosaic: piecing together simple, well-defined pieces to create a whole. The complexity and richness should arise from how that whole is constructed. Individual pieces that don't fit should be clipped accordingly.

It's time for the W3C to organize its output around a consistent processing model. A processing model is not merely desirable, it's essential.