Can XML Be The Same After W3C XML Schema?

June 19, 2002

Eric van der Vlist

Eric van der Vlist, author of O'Reilly's XML Schema: The W3C's Object-Oriented Descriptions for XML, reflects on the way this important specification changes the way we approach XML.

Introduction

Observation bias is a well-known phenomenon in a number of disciplines: physicists, physicians and even marketers know that objects are changed by the simple fact that we observe them. Psychologists acknowledge that schemas bias us to see things in certain ways.

One of the unexpected effects of W3C XML Schema has been to show that these phenomena apply to XML. The "observation" of an XML document made by a schema processor changes the document, and schemas bias our perception of XML documents. This prompted me to ask the question: can XML be the same after W3C XML Schema?

What's Different About W3C XML Schema?

The first question to ask is why W3C XML Schema is different. Why am I asking this question about W3C XML Schema and not about DTDs, Schematron, or RELAX NG?

The short answer is datatypes and object orientation. These two aspects of W3C XML Schema are tightly coupled. Datatypes are to W3C XML Schema what classes are to object-oriented programming languages. Both promote a categorization of information into classes and subclasses, analogous to the taxonomies biologists use to classify species.

Although this process of classification or derivation seems natural, it is not universal and is much less visible in other schema languages.

To reuse the metaphor of species, a rule-based language such as Schematron does not attempt to put a sticker on a species, but rather gives a set of rules defining whether an animal belongs to a set of "valid" animals ("the set of animals having four legs and able to run at least 50 km/h").

Grammar-based languages, including RELAX NG and DTDs, describe the patterns used to build the animal ("an animal made of a body, a neck, a head, a tail, and four legs").

What makes W3C XML Schema different, and more likely to generate schema bias, is its ability to derive types from other types ("this is an animal / chordata / mammalia / carnivora / felidae / acinonyx / jubatus").
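To make the derivation idea concrete, here is a minimal sketch in Python. The schema and its type names (speed, landSpeed, cheetahSpeed) are invented for illustration, and the code only reads the base attributes of restrictions with the standard library's XML parser. It is not a schema processor and it validates nothing.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: each simple type restricts the one defined before it.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="speed">
    <xs:restriction base="xs:decimal">
      <xs:minInclusive value="0"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="landSpeed">
    <xs:restriction base="speed">
      <xs:maxInclusive value="150"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="cheetahSpeed">
    <xs:restriction base="landSpeed">
      <xs:minInclusive value="50"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""


def base_types(schema_text):
    """Map each named simple type to the base type it restricts."""
    root = ET.fromstring(schema_text)
    bases = {}
    for simple_type in root.findall(f"{{{XS}}}simpleType"):
        restriction = simple_type.find(f"{{{XS}}}restriction")
        if restriction is not None:
            bases[simple_type.get("name")] = restriction.get("base")
    return bases


def derivation_chain(type_name, bases):
    """Follow base references upward until a type not defined in this schema."""
    chain = [type_name]
    while chain[-1] in bases:
        chain.append(bases[chain[-1]])
    return chain


BASES = base_types(SCHEMA)
CHAIN = derivation_chain("cheetahSpeed", BASES)
# Print from the most general type down to the most specific one.
print(" / ".join(reversed(CHAIN)))
```

The chain printed at the end plays the same role as the taxonomy in the metaphor: knowing that a value is a cheetahSpeed also tells us it is a landSpeed, a speed, and ultimately an xs:decimal.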

Why Use Classification Schemes?

Is such a classification scheme useful? Biologists and object-oriented programmers seem to think so, and we must acknowledge that using hierarchical schemas offers lots of advantages. Classification and object-orientation are useful ways to leverage what we know at a general level to a more specific level. If I know that a cheetah is a mammal, I can infer further information about a cheetah -- for example, that it is warm-blooded and that female cheetahs nurse their young -- which I don't need to formalize specifically for the cheetah. I can infer things about cheetahs by virtue of knowing that a cheetah is an instance of a more general class and by knowing some things about the general class, mammal.

A similar principle applies to object-oriented programming and XML. Knowing that an element or an attribute has a certain type may give me information, which I otherwise don't need to formalize explicitly, and it allows me to use algorithmic processes which apply to this type.

This is in fact the big promise of both object orientation and W3C XML Schema. If, instead of writing documentation and processes for each element and attribute (i.e., for each object), we are able to write documentation and processes for each type (i.e., each class of objects), and if each of our types is used to describe several elements and attributes, we may hope for gains in productivity.

Working? Or Dangerous?

Years of experience with object-oriented programming have given mixed results. On the bright side, many libraries are available for the various object-oriented languages, reducing the costs of development. On the dark side, reusing components is not that natural for developers, and it is not always obvious how to turn existing components into those which can be usefully reused in a particular context.

It remains to be seen how this will work for W3C XML Schema, and whether people will use those datatypes in practice. However, there is a significant difference between using an object-oriented programming language and using W3C XML Schema, one which may turn out to have an important effect on the success of the technology. While object-orientation is built into object-oriented programming languages, W3C XML Schema attempts to layer some object-orientation on top of XML documents, which are not natively object-oriented.

The benefit of building reusable type libraries is potentially huge in terms of interoperability and reusability; the life of developers would be so much easier if we had a universal type library to describe a name or an address. But there is a potential risk that must be highlighted before we follow this road.

In 20,000 Leagues Under the Sea, Jules Verne describes Conseil, a biologist who knows exactly the classification of any species of fish, yet isn't able to recognize any of them in real life, and a fisherman, Ned Land, who is able to recognize every fish, but unable to classify any of them.

Applying this to XML and W3C XML Schema, I see a danger of creating two distinct and potentially incompatible types of XML applications: those which, like Ned Land, identify elements and attributes by their instances and those which, like Conseil, identify them by their datatypes. How you see XML in future may well be colored by whether you use W3C XML Schema or not.

Can XML be the same after W3C XML Schema?

That W3C XML Schema introduces observation bias seems obvious. Conseil and Ned Land do not see the same fish when they look at the same fish, and, similarly, the fact of assigning datatypes to elements and attributes changes the way we look at them. They are no longer simply syntactic constructs but are colored by the information we have about their datatypes.

For the datatyping to be effective, the schema bias (i.e., the extra information) needs to be expressed and passed to applications. This is the purpose of the mysterious Post Schema Validation Infoset (PSVI), which, although not formally defined, is sent to W3C XML Schema-aware applications by a W3C XML Schema processor.

The transformation of the XML Infoset (before validation) into a PSVI (after validation) can be seen as the observation bias induced by W3C XML Schema. Its adoption by XPath 2.0, XSLT 2.0, and XQuery 1.0 is a good indication that, for better or for worse, XML cannot be the same after W3C XML Schema.
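The difference between the two views can be sketched with a toy example in Python. This is only an illustration of the idea, not a PSVI implementation. Python's standard library has no W3C XML Schema processor, so a hand-written table (the hypothetical DECLARATIONS below) stands in for the schema, and the document and element names are invented.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical instance document.
DOCUMENT = (
    "<animal><name>cheetah</name><legs>4</legs>"
    "<topSpeed>112.5</topSpeed></animal>"
)

# Hypothetical stand-in for a schema: element name -> (type name, converter).
DECLARATIONS = {
    "name": ("xs:string", str),
    "legs": ("xs:integer", int),
    "topSpeed": ("xs:decimal", Decimal),
}


def annotate(root, declarations):
    """Return {element name: (type name, typed value)} for declared elements."""
    annotations = {}
    for element in root.iter():
        if element.tag in declarations:
            type_name, convert = declarations[element.tag]
            annotations[element.tag] = (type_name, convert(element.text))
    return annotations


ROOT = ET.fromstring(DOCUMENT)

# Before "observation": elements are found by name and their content is text.
BEFORE = {child.tag: child.text for child in ROOT}

# After "observation": the same elements carry a type name and a typed value.
AFTER = annotate(ROOT, DECLARATIONS)

# A type-driven application selects by datatype, whatever the element is called.
NUMERIC = sorted(
    tag for tag, (type_name, _) in AFTER.items()
    if type_name in ("xs:integer", "xs:decimal")
)

print(BEFORE["legs"], AFTER["legs"], NUMERIC)
```

An application written against BEFORE behaves like Ned Land and recognizes legs by its name. One written against AFTER behaves like Conseil and only cares that legs and topSpeed are numeric.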

For Better Or For Worse?

Like it or not, most of us will have to use W3C XML Schema sooner or later, and it's up to us to use it for the better and not for the worse. My approach to doing so -- which has also been my guideline while writing a book on W3C XML Schema for O'Reilly -- is to make a critical analysis of the features of the language, not taking anything for granted, and trying to see the consequences and pitfalls behind the wording of the W3C XML Schema Recommendation.

I am convinced that this is the only useful and practical way to approach this highly intrusive specification, and the purpose of my book is to guide the reader as safely as possible through this tour.

An Unexpected Pearl

My reward for digging into the W3C XML Schema Recommendation has been to discover an unexpected pearl far away from the limelight: W3C XML Schema is exceedingly good at associating metadata with elements or attributes.

Technically, this is done through "foreign namespace attributes" or xs:annotation elements, and I've created some exciting examples associating Dublin Core elements, SVG graphics, SAF elements, Schematron rules, XSLT snippets, or RDDL descriptions in-line or as references using these features.
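As a rough sketch of what this looks like in practice, the Python snippet below reads a small schema and collects the metadata attached to one element declaration. The schema, the doc:status attribute, and the http://example.org/doc namespace are assumptions made up for this example. The Dublin Core namespace is the real one, and xs:annotation and xs:appinfo are the standard containers for application information.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
DC = "http://purl.org/dc/elements/1.1/"
DOC = "http://example.org/doc"  # hypothetical namespace for a foreign attribute

# Hypothetical schema carrying metadata on an element declaration.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:doc="http://example.org/doc">
  <xs:element name="title" type="xs:string" doc:status="stable">
    <xs:annotation>
      <xs:appinfo>
        <dc:creator>Example Author</dc:creator>
        <dc:description>Title of the book</dc:description>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>
</xs:schema>
"""


def element_metadata(schema_text):
    """Collect foreign attributes and appinfo children per element declaration."""
    root = ET.fromstring(schema_text)
    appinfo_path = f"{{{XS}}}annotation/{{{XS}}}appinfo/*"
    result = {}
    for declaration in root.iter(f"{{{XS}}}element"):
        # ElementTree writes namespaced attribute names as "{uri}local", so a
        # leading brace marks an attribute from a foreign namespace.
        foreign = {
            name: value
            for name, value in declaration.attrib.items()
            if name.startswith("{")
        }
        appinfo = {
            child.tag: child.text for child in declaration.findall(appinfo_path)
        }
        result[declaration.get("name")] = {"foreign": foreign, "appinfo": appinfo}
    return result


META = element_metadata(SCHEMA)
print(META["title"]["foreign"])
print(META["title"]["appinfo"][f"{{{DC}}}creator"])
```

Because the metadata sits on the declaration of title itself, a documentation tool can look it up by element name without having to interpret rules or patterns.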

These features can improve the documentation of XML vocabularies by providing a way to attach such information to elements, attributes, or datatypes within their context. In this domain W3C XML Schema exceeds what RELAX NG or Schematron can achieve.

Schematron is about rules, and RELAX NG is about patterns; neither of them describes elements or attributes as such. Schematron can define rules to be checked in the context of an element, and RELAX NG can describe a pattern containing a single element, but W3C XML Schema is the only one which can describe elements and attributes.

As long as validation is your primary concern, this may not make much difference, but for attaching metadata to elements and attributes, a language which describes elements and attributes seems to be a better fit.