Investigating the Infoset

August 2, 2000

Leigh Dodds

What is the XML Infoset specification? What purpose does it serve? These are some of the questions that have been discussed on XML-DEV this week. The XML-Deviant was there to record the answers.

The Infoset

The latest draft of the XML Information Set ("Infoset") specification was published this week, providing an update to the previous December 1999 draft. The Infoset is one of those specifications frequently mentioned, but rarely discussed in detail. Paul Abrahams, no doubt voicing the thoughts of many other developers, wondered what the purpose of the Infoset was:

What is the purpose of the XML Infoset? Is it mainly intended to enlighten implementors about what the abstract structure of an XML document is, or does it have some other less obvious uses?

The resulting discussion provided a useful primer for developers interested in learning more about the Infoset specification.

Jonathan Borden described the Infoset as an abstract model of the data in an XML document:

XML is a serialization of a logical document structure defined by the XML Infoset.

Martin Gudgin echoed this view, saying that the abstract model separates applications from the syntax:

To me the Infoset defines what XML is in the abstract. XML 1.0 + namespaces is just one possible serialization syntax. I expect there will be others in time. Likewise SAX and DOM are two possible reflections of the Infoset. Maybe other APIs will be developed over time. The Infoset, being abstract, shields me from the details of the serialization syntax which to me is a big win. If I find ( or write ) a parser that supports a binary form of XML but still conforms to the Infoset I don't need to change any of my application code but I can get all the benefits ( probably size and speed ) of the new serialization syntax.

The Infoset, then, is a data model that describes the important properties of a well-formed XML document. The model describes the result of parsing an XML document, and it is this model that XML APIs manipulate. This view puts the XML data model first and the syntax second.
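
A small sketch can make the point concrete. The example below uses Python's standard-library ElementTree parser (my choice for illustration, not a tool named in the discussion) to show that two lexically different serializations -- different attribute quoting, attribute order, and empty-element form -- expose the same information once parsed.

```python
# A minimal sketch, assuming Python's stdlib ElementTree as the parser:
# lexically different documents can carry the same infoset-level information.
import xml.etree.ElementTree as ET

# Two serializations of the same information set: attribute quoting,
# attribute order, and empty-element form all differ.
docs = [
    '<doc a="1" b="2"><empty/></doc>',
    "<doc b='2' a='1'><empty></empty></doc>",
]

parsed = [ET.fromstring(d) for d in docs]

for root in parsed:
    # The parsed view exposes the same element name, attributes, and
    # children regardless of the lexical form above.
    print(root.tag, sorted(root.attrib.items()), [c.tag for c in root])
```

Both iterations print identical results: once past the parser, the lexical differences are gone.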

Summarizing responses from several contributors, Paul Abrahams asked a further series of questions:

... doesn't the XML spec itself define well-formedness satisfactorily?...

... Viewed as an elegant description of the information contained in an XML document, the Infoset makes sense. But unlike the other XML specs, its normative effect is unclear. If I'm implementing an XML-related processor of any variety, what does the Infoset require me to do that I would not have to do if the Infoset never existed?

Michael Champion offered an explanation of how the Infoset refines the definition of well-formedness given in the XML specification:

[The Infoset] answers questions that are irrelevant when XML is viewed as a syntax, but quite important to users of the DOM, XPath, XSL, etc. that operate on some representation of a more abstract parsed XML document. For example, the XML spec says that "<empty></empty>" and "<empty/>" are both well formed XML elements, but nothing about whether they are equivalent. Infoset says ... that they are.

Champion also provided an example of the type of question that the Infoset answers for application designers:

So, one fairly practical normative question it *does* answer would be: 'My application would like to treat "<empty></empty>" as signifying "data with the value NULL" and "<empty/>" as signifying "no data". Can I do this in an environment where the XML will be processed by various tools that implement the XML specs but that I do not control?' The answer, for better or worse, is NO - an XML processor is under no obligation to preserve this distinction. That answer comes from the Infoset ... not the XML spec, the DOM, XSLT, etc.
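
Champion's point is easy to verify. The sketch below (again using Python's standard-library ElementTree, my choice rather than anything named in the thread) parses both forms and shows that nothing in the result records which lexical form appeared in the source.

```python
# A hedged illustration, assuming Python's stdlib ElementTree:
# after parsing, "<empty></empty>" and "<empty/>" are indistinguishable.
import xml.etree.ElementTree as ET

a = ET.fromstring('<root><empty></empty></root>')
b = ET.fromstring('<root><empty/></root>')

for root in (a, b):
    child = root[0]
    # Both forms yield an element with the same name, no children, and
    # no text content; the original lexical form is not preserved.
    print(child.tag, len(child), child.text)
```

An application that hangs meaning on the distinction is relying on information that, per the Infoset, a conforming processor need not deliver.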

The Infoset is therefore a normalized data model that irons out variations in syntax, to provide a foundation upon which XML applications and processors can be built.

Syntax versus Model

In some ways, the Infoset poses a chicken-and-egg problem. If the data model is more important than the syntax, then why (and how) was XML specified before its data model was defined? Michael Champion admitted that the lack of a data model made the DOM Level 1 specification harder to produce:

The lack of an Infoset certainly made it much harder to invent the Level 1 DOM; it simply was not clear (and was highly contentious) whether expanded entity references remained in the XML document tree or not... and how mixed content would be represented in the tree.

Jonathan Borden believed that specifying the DOM was only possible because of prior work on SGML:

True, the DOM spec was written prior to the Infoset spec, but I think that the only reason this was possible is because of all the work on groves and property sets that had already been done for SGML, so the people who devised the DOM already had a pretty good idea of what the Infoset would look like.

Tim Bray disputed the relative importance of the XML model over its syntax, claiming that standardized syntax is how interoperability is really achieved:

XML took a lot of static in its early days because it was "just syntax" - there are certainly a lot of people who want to think only in terms of object models (groves, DOMs, whatever) and see the syntax as disposable fluff. Me, I think syntax is crucial. Because describing data structures in a straightforward, interoperable way is really hard to get right and very often fails. At the end of the day, if you really want to interoperate, you have to describe the bits on the wire. That's what XML does.

Think of it another way... a promise like "my implementation of SQL (or posix, or DOM, or XLib) will interoperate with yours" is really hard to keep. A promise like "I'll ship you well-formed XML docs containing only the following tags and attributes" is remarkably, dramatically, repeatably more plausible in the real world.

This is a debate that recurs often when attempting to define markup languages: do you begin with a model and then define a syntax, or build a model that describes the syntax? It's not a debate that is likely to be resolved anytime soon, if at all. The important point is that you cannot focus on one aspect--model or syntax--to the exclusion of the other. The Infoset is therefore an important step in the further development of the XML family of specifications.

The 80/20 Split

Another of those recurring debates concerning the details of a specification is the "80/20 Split." It's impossible for a single specification to address all possible requirements, and so compromises have to be made. Disputes arise from opinions on where that split needs to be made, and which compromises are tolerable. Inevitably, a similar debate has revolved around the details of the Infoset data model.

Whilst praising the intent of the Infoset, Michael Kay asserted that the specification makes too many compromises:

Personally, I don't have any problems identifying the need for the Infoset: I've seen so many people try to attach meaning to lexical distinctions that should not carry meaning that I yearn for an authority I can point to when telling them they're wrong.

But the problem with the Infoset as currently defined is that it has had to make too many compromises. Creating a common abstraction with the constraint that XML, XML Namespaces, the DOM, and XPath should all conform with it is, I think, a requirement that has proved impossible to satisfy.

Some developers expressed concerns over the information that the Infoset does not model -- i.e., the information on the wrong side of the 80/20 split. Indeed, Simon St. Laurent advocated extending the model to cover all available information, with the option of defining subsets as a later effort:

I'd suggest that the Infoset's designers build for a wider XML-using audience than the particular one they have envisioned, and then describe a subset and perhaps the processing that takes information from XML syntax to parser output.

While support for this suggestion was forthcoming from several contributors, many were happy with how the model had been defined. Joe English observed that a subset is still useful for many applications:

Having a canonical "subsetted" model like the Infoset is very important to tool-builders, spec writers, and schema designers though. Without it, it's all too easy to design an application that relies on properties of the input document that most tools consider accidental syntactic properties; then documents built in conformance with that application can't be processed with those tools. This has happened to me a couple of times when dealing with SGML.

Sean McGrath saw this as unacceptable--syntactic differences may be important for some applications:

But distinctions that are irrelevant for some applications are not irrelevant for others. This is the nub of the problem. The Infoset throws certain things away. In so doing, it creates problems for certain types of XML processing applications.

Eric Bohlman highlighted one class of applications that the Infoset doesn't support:

Of course, there are always going to be certain applications that really have to work with the lexical details of the syntactic instance rather than its Infoset; these are editor-type applications that need to preserve aspects of the lexical (physical) structure of the original document.

Trying to defuse the arguments, Rick Jelliffe attempted to further clarify the purpose of the Infoset specification. Jelliffe described the Infoset as defining a policy that other W3C specifications will follow:

The Infoset is aimed at XML specifications and software in general. It is not its intent to state all the information that anyone could encode in their document. I would say that in particular it is setting a policy that W3C XML specs should not operate as if the formatting of the XML markup was significant.

This is not a new issue: I remember it being discussed 3 years ago or so. It is good for XML editors to regenerate edited documents with the original formatting of the markup. That is why it is useful if SAX reports rather than collapses whitespace, and why a DOM implementation for an interactive editor should subclass the W3C DOM to provide this information. That is their Infoset, but it is not the one that W3C Working Groups should start from.
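
The editor problem Jelliffe describes can be demonstrated with a simple round trip. The sketch below (using Python's standard-library minidom, my choice for illustration) parses a document and reserializes it: the lexical details an editor would want to preserve are normalized away.

```python
# A small sketch, assuming Python's stdlib minidom: a generic
# parse-and-reserialize round trip normalizes lexical detail, which is
# why editor-type applications need a richer, subclassed model.
from xml.dom.minidom import parseString

# The source uses extra whitespace around '=' and the two-tag empty form.
original = '<doc  a = "1"><empty></empty></doc>'

roundtrip = parseString(original).documentElement.toxml()

print(original)
print(roundtrip)  # the spacing and the empty-element form are normalized
```

The round-tripped markup is equivalent at the infoset level, but an editor that promised to hand back the user's original formatting could not keep that promise this way.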

These latter points from Bohlman and Jelliffe are important because they highlight that the Infoset fails to support only a small subset of XML applications: a hit rate well above eighty percent.

For the majority of XML developers, the Infoset will serve as a useful adjunct to the XML specification: complementing the syntax to build an interoperable data model upon which XML processors can be layered.