Eric van der Vlist on W3C XML Schema
May 15, 2002
Eric van der Vlist, a regular contributor to XML.com, has just completed writing XML Schema: The W3C's Object-Oriented Descriptions for XML for O'Reilly, to be published in June 2002. In this interview he explains the importance of XML schema languages, and his motivations for writing the book.
O'Reilly: Why have you chosen this subject?
vdV: Because I think that the XML schema languages in general, and W3C XML Schema in particular, are the hot topics of the moment: being at the same time essential and potentially dangerous for XML. I thought that an objective book needed to be written, which would be a kind of map to W3C XML Schema, showing clearly not only the features and goodies but also the pitfalls of this specification.
O'Reilly: Essential and dangerous? Isn't this judgment rather excessive?
vdV: No, the lack of XML schema languages is simply not economically acceptable! To understand the XML documents it receives as input, an application must be able to expect that they follow some kind of structure. Formalizing this structure as "XML schemas" enables all kinds of productivity, quality and performance improvements by automating tasks such as validation, code generation, data binding, documentation and query optimization.
O'Reilly: If XML schema languages are essential, why are they dangerous?
vdV: XML was born as a simplification of SGML, to make it usable on the Web. The most effective part of this simplification is the rejection of a mandatory DTD (which is a type of XML schema language). The current tendency to systematically use a schema and to build new base specifications such as XPath and XSLT 2.0 on top of W3C XML Schema can thus be considered a regression; was it necessary to remove the SGML DTD to impose W3C XML Schema, which is not really a simple specification?
O'Reilly: But still, there is an XML DTD just as there is an SGML DTD.
vdV: Yes, but the XML DTD is simplified and, more importantly, optional: an XML parser does not need a DTD to be able to understand the structure (or infoset) of XML documents, while an SGML parser needs a DTD just to be able to parse the document and understand its structure. For XML, a DTD (or any schema language) is just an optional external source of information, which may help to validate the document and understand its content (as opposed to its structure).
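To illustrate the point that an XML parser can recover a document's structure without any DTD, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The sample document and the helper name `outline` are assumptions made up for this illustration; they do not come from the interview.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: no DOCTYPE declaration, no schema of any kind.
DOC = "<order id='42'><item qty='2'>pencil</item><item qty='1'>eraser</item></order>"


def outline(xml_text):
    """Return (depth, tag, attributes) for every element, in document order."""
    root = ET.fromstring(xml_text)  # parsing succeeds without any DTD
    rows = []

    def walk(elem, depth):
        rows.append((depth, elem.tag, dict(elem.attrib)))
        for child in elem:
            walk(child, depth + 1)

    walk(root, 0)
    return rows


STRUCTURE = outline(DOC)
for depth, tag, attrs in STRUCTURE:
    print("  " * depth + tag, attrs)
```

The parser knows which elements nest inside which, and which attributes they carry. What it cannot tell without a schema is whether `qty` is supposed to be a number or whether an `order` is allowed to be empty.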
O'Reilly: Why was it necessary to create a new schema language when we already had DTDs?
vdV: The XML DTD was specified in the XML 1.0 recommendation, published before Namespaces in XML 1.0. The XML DTD ignores the notion of namespaces and lacks the flexibility necessary to support them in a simple way. The XML DTD is also a descendant of the SGML DTD, which had been designed for document-oriented applications, and lacks a complete type system -- a requirement for data-oriented applications.
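The namespace problem can be sketched in a few lines of Python. A DTD declares elements by their literal name as written in the document, prefix included, whereas namespace-aware processing compares the namespace URI plus the local name. The two sample documents and the helper names below are hypothetical, and the regular expression is only a crude stand-in for "the name as a DTD would see it".

```python
import re
import xml.etree.ElementTree as ET

# Two hypothetical documents using different prefixes for the same namespace.
A = '<a:book xmlns:a="http://example.org/ns/book"/>'
B = '<b:book xmlns:b="http://example.org/ns/book"/>'


def expanded_name(xml_text):
    """Namespace-aware name of the root element, in '{uri}local' form."""
    return ET.fromstring(xml_text).tag


def literal_name(xml_text):
    """Root element name exactly as written, prefix included."""
    return re.match(r"<([^\s/>]+)", xml_text).group(1)


SAME_EXPANDED = expanded_name(A) == expanded_name(B)
SAME_LITERAL = literal_name(A) == literal_name(B)
print(SAME_EXPANDED, SAME_LITERAL)
```

Both documents contain the same element as far as Namespaces in XML is concerned, yet a DTD written for `a:book` would not accept `b:book`.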
The W3C had the choice between updating the specification of the DTD or creating a new specification; it chose to start anew. I guess that the interoperability issues linked with any modification of the XML 1.0 recommendation have influenced this decision: it is often easier to create a new standard than to update an existing one, especially when it's a successful one!
O'Reilly: So we now have the choice between two schema languages, W3C XML Schema and XML DTD?
vdV: No, we have a much wider choice. We can of course use XML DTDs or W3C XML Schema, but we can also continue to use SGML DTDs (more powerful than XML ones), the "ancestors" of W3C XML Schema developed as interim solutions before W3C XML Schema was published (XDR and SOX are the main precursor languages) or use "new" schema languages such as Schematron or RELAX NG -- or even ASN.1, whose use for validating XML documents was recently published as an official specification.
O'Reilly: Isn't there a risk of creating a lot of confusion with this diversity?
vdV: Yes, of course. On the other hand, would it be a good thing, or even possible, to impose a single XML schema language? Being external to the XML documents, a schema is somewhat comparable to a program, and trying to impose a single schema language would probably be as ridiculous as trying to impose a single programming language. These languages are different, they work at different levels and are adapted to different kinds of applications.
O'Reilly: But how can we find our way?
vdV: By reading comparisons between schema languages, such as the articles I have written, and also by following the work done by the DSDL ISO project, whose goal is to classify the XML schema languages, to define processing models relying on them and also to study how the XML DTD may be used with namespaces.
O'Reilly: ISO? But isn't it the W3C that specifies XML?
vdV: W3C has published most, if not all, base XML specifications, but it doesn't own XML. There are other consortiums and organizations, such as IETF and OASIS, which publish specifications related to XML, and one shouldn't forget the ISO, which is the only organization allowed to publish official international standards.
O'Reilly: Still, why isn't this project led by the W3C?
vdV: Ask the W3C. I've noticed that the W3C tends to promote its own specifications and is not very interested in external work. In this case, the W3C is showing a worrying trend to promote exclusive usage of W3C XML Schema, which is a new type of "vendor lock-in".
O'Reilly: Why would someone want to use a language other than W3C XML Schema?
vdV: DSDL proposes a classification of schema languages in three categories:
- Rule-based languages (such as Schematron), defining the rules to be followed by a class of XML documents.
- Grammar-based languages (such as RELAX NG), defining the structure of a class of XML documents as a grammar or a set of patterns.
- Object-oriented languages (disclaimer: I am the editor of this section of the DSDL work), describing a class of XML documents as object-oriented structures facilitating the mapping between XML documents and object-oriented applications.
This classification shows that the XML schema languages are very different and could be considered more complementary than competing. If we had to define these schema languages from scratch today, with the experience we have acquired and putting aside any political considerations, I think that we could even define them as layers: a rule-based language would be the foundation of a grammar-based language, on top of which an object-oriented language could be defined.
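The three categories can be sketched side by side in Python. This is a loose illustration of the three approaches under assumed names, not Schematron, RELAX NG or W3C XML Schema themselves: the sample document, the `RULES` list, the `GRAMMAR` table and the `Order`/`Item` classes are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical sample document used by all three views.
DOC = "<order id='42'><item qty='2'>pencil</item><item qty='1'>eraser</item></order>"

# Rule-based view: independent assertions the document must satisfy.
RULES = [
    ("order must have an id",
     lambda root: root.get("id") is not None),
    ("every item must have a numeric qty",
     lambda root: all((i.get("qty") or "").isdigit() for i in root.iter("item"))),
]


def check_rules(root):
    """Return the messages of all rules that the document breaks."""
    return [message for message, holds in RULES if not holds(root)]


# Grammar-based view: one content model per element, written here as a
# regular expression over the space-terminated names of its children.
GRAMMAR = {"order": r"(item )+", "item": r""}


def check_grammar(elem):
    """True if elem and all its descendants match their content models."""
    model = GRAMMAR.get(elem.tag)
    children = "".join(child.tag + " " for child in elem)
    if model is None or re.fullmatch(model, children) is None:
        return False
    return all(check_grammar(child) for child in elem)


# Object-oriented view: the document described as typed objects.
@dataclass
class Item:
    name: str
    qty: int


@dataclass
class Order:
    id: int
    items: list


def bind(root):
    """Map an <order> element onto Order and Item objects."""
    items = [Item(name=i.text, qty=int(i.get("qty"))) for i in root.iter("item")]
    return Order(id=int(root.get("id")), items=items)


ROOT = ET.fromstring(DOC)
RULE_ERRORS = check_rules(ROOT)
GRAMMAR_OK = check_grammar(ROOT)
ORDER = bind(ROOT)
print(RULE_ERRORS, GRAMMAR_OK, ORDER)
```

Each view answers a different question: the rules report which constraints are broken, the grammar says whether the tree has the right shape, and the binding turns a document into objects an application can use directly.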
O'Reilly: Why is your book only about W3C XML Schema, then?
vdV: I think that most of my readers will be primarily (if not only) interested in W3C XML Schema, and I did not want to add the weight of a detailed description of other schema languages. I am also convinced that W3C XML Schema is the language that, before any other, needs as many books as possible.
First, because I think that the W3C XML Schema recommendation is the most complex specification ever published by the W3C. The technology itself is complex, and the specification has been written in a way that makes it very difficult to read. Also, I think that many experts lack the objectivity necessary to show the limitations and pitfalls of the technology.
O'Reilly: You are a native French speaker. Why did you write this book in English?
vdV: To make it immediately available to the largest group of readers, and also because it has proved easier to organize translations from English to French than from French to English.
O'Reilly: And you preferred not to translate your own book?
vdV: The decision was difficult, but in the end I preferred to ask Jean-Jacques Thomasson, who has done a very good job translating W3C recommendations for XMLfr, for two reasons: we have been able to work in parallel, and the book will be published in French at the same time as its original English version. After more than a full year of work on the subject, I also wanted to move on to new projects.
O'Reilly: New projects?
vdV: I tend to find more interesting new projects than I have time to work on. For instance, a new site similar to XMLfr, my participation in DSDL, an ambitious project involving a namespace search engine, and at least two new book projects.