
XML Versus the Infoset

November 20, 2002

Rich Salz

Caveat Lector

In my prior columns I've examined specific protocols and pieces of the web services infrastructure, including WSDL, SOAP attachments, and so on. I've picked them up, looked all around, and tried to figure out when and how to use them. More importantly, I've tried to uncover some of the dark corners and trade-offs associated with these technologies. While there's certainly room for disagreement about some of the suggestions, I think it's clear that there's a technical basis behind all the commentary.

But in this column I venture a bit further afield; it's more like an op-ed than straight reportage: less fact, more opinion, if you will.

XML ::= 'X' 'M' 'L'

According to XML's formal definition, it's a syntax for writing structured content. The heart of the specification is a set of 83 syntax rules, written in an extended Backus-Naur Form (BNF). BNF was created by Backus & Naur to describe the syntax of Algol around 1960. (For some background, read the nice description of BNF's history.)

In order to make BNF grammars shorter and more powerful, it's common to extend them with facilities for character ranges, regexp-style patterns, repetition operators, etc. Twenty-five years ago the IETF devised Augmented BNF to describe email headers in RFC 733 and its successor RFC 822. In 1997 it was separated into its own definition, available as RFC 2234. The other common BNF extension is the EBNF used to describe XML. Ideally the two will evolve into a single common form, although we'd need a new name: perhaps Super BNF?

As is appropriate for a syntax specification, the EBNF rules in the XML specification are very precise. They specify -- down to the byte -- what XML is. For example, look at the XML declaration:

<?xml version="1.0" encoding="utf-8"?>

Actually, that's too much. Let's look at just the encoding attribute. In prose, we'd say something like this: The word "encoding", followed by an equal sign and then an identifier, in quotes, that names the encoding.

In the XML EBNF, this takes four rules, which are formally known as "productions", to specify. Production 80 looks like this:


    [80]

    EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )

The names S, Eq, and EncName are non-terminals, which means they're defined elsewhere. Eq refers to production 25, which defines it as an equals sign with optional whitespace before and after:


    [25]

    Eq ::= S? '=' S?

As with regular expressions, the question mark marks the preceding item as optional: zero or one instance. Production 3 defines the space characters -- a new one is added in XML 1.1 -- and production 81 defines EncName, the encoding name:


    [3]

    S ::= (#x20 | #x9 | #xD | #xA)+

    [81]

    EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
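To see how mechanical these rules are, here is a minimal sketch that collapses productions 3, 25, 80, and 81 into one regular expression. The function name parse_encoding_decl is made up for illustration, and the sketch only recognizes the EncodingDecl fragment, not a whole XML declaration.

```python
import re

# Production 3:  S ::= (#x20 | #x9 | #xD | #xA)+
S = r"[\x20\x09\x0D\x0A]+"
# Production 25: Eq ::= S? '=' S?
EQ = "(?:" + S + ")?=(?:" + S + ")?"
# Production 81: EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
ENC_NAME = r"[A-Za-z][A-Za-z0-9._-]*"
# Production 80: EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")
ENCODING_DECL = re.compile(
    S + "encoding" + EQ + "(?:\"(" + ENC_NAME + ")\"|'(" + ENC_NAME + ")')"
)


def parse_encoding_decl(text):
    """Return the encoding name if text is exactly one EncodingDecl, else None."""
    match = ENCODING_DECL.fullmatch(text)
    if match is None:
        return None
    # Group 1 is the double-quoted form, group 2 the single-quoted form.
    return match.group(1) or match.group(2)


print(parse_encoding_decl(' encoding="utf-8"'))         # utf-8
print(parse_encoding_decl(" encoding = 'ISO-8859-1'"))  # ISO-8859-1
print(parse_encoding_decl(' encoding="8bit"'))          # None: must start with a letter
```

Note that the leading whitespace is required by production 80 itself, and that the two quote styles cannot be mixed because each alternative carries its own closing quote.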

EBNF is reasonably simple, very precise, and extremely low-level. Parser writers and compiler jocks like it because it makes (part of) their job fairly simple and error-free. Because it's so low-level it can become quite tedious, requiring you to maintain a lot of context in your head. For example, XML Namespaces adds 18 additional productions, all in order to, according to the XML specification, "[assign] a meaning to names containing colon characters." And "meaning" there means semantics, while EBNF is all about syntax.

There are things that EBNF can't do. For example, it can't say that an element's start and end tags must nest properly, that elements can't overlap. That part of the specification is left to the 37 constraints in the document. Together with the EBNF, they form a very strict, precise definition of a very flexible interchange format.
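A small sketch of the nesting point, using Python's standard ElementTree parser; the helper name is_well_formed is hypothetical. In the overlapping document every individual tag is a perfectly good start or end tag; it is the well-formedness constraint on matching element types, not the grammar for the tags themselves, that rules it out.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Report whether the parser accepts the document as well-formed XML."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


nested = "<a><b>text</b></a>"
overlapping = "<a><b>text</a></b>"  # each tag is fine; the pairing is not

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```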

Although corporate politics often get in the way, distributed computing wonks really want universal connectivity. At the time XML was being defined, we had a common network protocol, TCP/IP. We had a common request-response protocol that -- because it was designed to support gatewaying -- we could easily hijack: HTTP. The next step up the ladder was a universal data interchange format or network syntax. It shouldn't be surprising that XML was seen as a solution to the archipelago of data islands composed of ONC XDR, DCE/DCOM NDR, CORBA IIOP, Java RMI, and so on.

For distributed computing, which is most commonly about "bytes on the wire," the XML specification is all that's needed.

The [information set]

The XML Information Set (a.k.a. "the Infoset") is a very interesting document. According to its abstract, it provides "a set of definitions for use in other specifications that need to refer to the information in an XML document."

For example, SOAP 1.1 is based on the XML specification. Section 4.2.3 describes the mustUnderstand attribute as follows:

The SOAP mustUnderstand global attribute can be used to indicate whether a header entry is mandatory or optional for the recipient to process...The value of the mustUnderstand attribute is either "1" or "0". The absence of the SOAP mustUnderstand attribute is semantically equivalent to its presence with the value "0".

But SOAP 1.2 is based on the Infoset. Section 5.2.3 includes the following description of mustUnderstand:

The SOAP mustUnderstand attribute information item is used to indicate whether the processing of a SOAP header block is mandatory or optional...

The mustUnderstand attribute information item has the following Infoset properties:

  • A [local name] of mustUnderstand.
  • A [namespace name] of "http://www.w3.org/2002/06/soap-envelope".
  • A [specified] property with a value of "true".

The type of the mustUnderstand attribute information item is boolean in the namespace "http://www.w3.org/2001/XMLSchema". Omitting this attribute information item is defined as being semantically equivalent to including it with a value of "false" or "0".

What's changed? First, the new text is almost twice as long. It's also very precise and formal, even turgid. Reading SOAP 1.1 was almost fun; reading SOAP 1.2 will be a job. And did you notice that the attribute is now in the SOAP namespace, or did that get lost in the verbiage?
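To make the namespace point concrete, here is a rough sketch of what a receiver might do with the SOAP 1.2 wording, written against Python's ElementTree rather than any real SOAP toolkit. The function name must_understand and the urn:example header block are invented for illustration; the namespace URI is the one quoted above.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://www.w3.org/2002/06/soap-envelope"  # namespace quoted in the text


def must_understand(header_block):
    """Return the effective mustUnderstand value of a header block element."""
    # ElementTree keys namespaced attributes as "{namespace}localname".
    value = header_block.get("{%s}mustUnderstand" % SOAP_ENV)
    if value is None:
        return False  # omitted: equivalent to "false" or "0"
    value = value.strip()
    if value in ("true", "1"):
        return True
    if value in ("false", "0"):
        return False
    raise ValueError("not an xsd:boolean: %r" % value)


qualified = ET.fromstring(
    '<t:Tx xmlns:t="urn:example" xmlns:env="%s" env:mustUnderstand="1"/>' % SOAP_ENV
)
unqualified = ET.fromstring('<t:Tx xmlns:t="urn:example" mustUnderstand="1"/>')

print(must_understand(qualified))    # True
print(must_understand(unqualified))  # False: attribute is not in the SOAP namespace
```

Under this reading, an attribute that merely happens to be called mustUnderstand, without the SOAP namespace, counts as absent.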

The terminology introduced by the Infoset is so cumbersome that it seems impossible to write clear and lucid text using it, a problem compounded by the awkward square-bracket syntax it uses. As a dictionary -- something that provides definitions -- it earns poor marks.

And that's especially true when you compare the Infoset with the precision of the XML specification, in which light the Infoset seems a rather loose document indeed. For example, the legal set of XML element names is precisely defined by a couple-dozen EBNF productions in the XML 1.0 and namespaces documents. The Infoset just says:

[local name] The local part of the element-type name. This does not include any namespace prefix or the following colon.

If the Infoset used a formal notation like ASN.1, then we might have something like


    Element ::= SEQUENCE {
        namespaceName  URIType OPTIONAL,
        localName      NameType,
        prefix         NameType OPTIONAL,
        children       SEQUENCE OF NodeType
    }

The ASN.1 keywords are in uppercase: SEQUENCE introduces a structure, SEQUENCE OF indicates an array, and OPTIONAL has the obvious meaning. By leaving the type declarations -- NameType, NodeType, etc. -- undefined, an ASN.1 Infoset could allow separate documents to define the formal XML serialization or optimizations for native datatypes and local processing.

The Infoset document is not a complete list of information items, just those "of expected usefulness in future specifications." XML Schema, for example, adds data types and additional child nodes if defaults were applied. Unfortunately, there doesn't seem to be a single comprehensive list of them. Those of us sending bits on the wire are often dismayed to read formal specifications built on other property lists that "we thought might be useful."

Some are also discomfited by the fact that it is possible to create an XML Infoset that can't be serialized as XML. For example, it is possible to create an Infoset by means other than parsing XML. Since there are no constraints on the [local name] information item of an element, a local processor could create an item with the value "hello world". The developer would then have a legitimate point that their application is XML-compliant, even though it is impossible to create an XML instance of the document they created.
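This is easy to reproduce with an in-memory tree API. The sketch below uses Python's ElementTree as a stand-in for "a local processor"; it is not a formal Infoset implementation, and the helper name round_trips is made up. The point is only that the tree accepts a name the XML grammar would never let through.

```python
import xml.etree.ElementTree as ET

# Build the tree directly, without parsing; the name is not checked
# against the XML Name production.
element = ET.Element("hello world")
serialized = ET.tostring(element)


def round_trips(data):
    """Report whether the serialized bytes can be parsed back as XML."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


print(serialized)               # b'<hello world />'
print(round_trips(serialized))  # False
```

The serializer happily writes the bytes out; it is only on the way back in, where the XML specification applies, that anything objects.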

Finally, let's look at the issue which drove all of this home for me. Cryptography is all about manipulating bytes; it's impossible to sign an Infoset as such, because whatever gets signed must, at some point, be concretely represented. SOAP 1.2 currently allows processors a great deal of flexibility about whitespace between elements in the SOAP header; after all, in the SOAP processing model that whitespace is just a set of irrelevant Infoset items.

If an entity can modify this whitespace, it's impossible to reliably sign a SOAP header as a whole: you have to create a signature that signs each individual header element instead. That isn't the same thing, because you then need to synthesize an additional signed datum in order to prevent someone from adding a new element which isn't signed.
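A rough illustration of why the bytes matter, with a SHA-256 digest standing in for the hashing step of a real signature. The Header, a, and b element names are invented, and no actual XML Signature or SOAP processing is attempted; the helpers element_view and digest are likewise hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET

# Two serializations that differ only in whitespace between header elements.
compact = b"<Header><a>1</a><b>2</b></Header>"
indented = b"<Header>\n  <a>1</a>\n  <b>2</b>\n</Header>"


def element_view(data):
    """List (tag, text) for the child elements, ignoring whitespace between them."""
    root = ET.fromstring(data)
    return [(child.tag, child.text) for child in root]


def digest(data):
    """Hash the exact bytes, as a signature algorithm would."""
    return hashlib.sha256(data).hexdigest()


same_elements = element_view(compact) == element_view(indented)
same_digest = digest(compact) == digest(indented)

print(same_elements)  # True: the processing model sees the same header blocks
print(same_digest)    # False: a signature over the bytes no longer verifies
```

Any scheme that wants both properties has to agree on one byte representation first, which is what a canonicalization step is for.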

In all fairness, the SOAP WG is working to address this by pointing out the problems and resurrecting a SOAP Canonicalization scheme I circulated last year. But we should be concerned that this issue has shown itself only now, while the SOAP specification is in last call.

I can only wonder what will happen the next time XML and its Infoset come into conflict.