Comparing XML Schema Languages
This article explains what an XML schema language is and which features the different schema languages possess. It also documents the development of the major schema language families -- DTDs, W3C XML Schema, and RELAX NG -- and compares the features of DTDs, W3C XML Schema, RELAX NG, Schematron, and Examplotron.
In ordinary English, a schema is defined as "an outline or image universally applicable to a general conception, under which it is likely to be presented to the mind; as, five dots in a line are a schema of the number five; a preceding and succeeding event are a schema of cause and effect" (Websters).
The English language definition of schema does not really apply to XML schema languages. Most schema languages are too complex to "present to the mind" (or to a program) the instance documents that they describe, and, more importantly and less subjectively, they often focus on defining validation rules rather than on modeling a class of documents.
All XML schema languages define transformations to apply to a class of instance documents, and XML schemas should be thought of as such transformations. They take instance documents as input and produce a validation report, which includes at least a return code reporting whether the document is valid and, optionally, a Post Schema Validation Infoset (PSVI) that updates the original document's infoset (the information obtained from the XML document by the parser) with additional information (default values, datatypes, etc.).
One important consequence of realizing that XML schemas define transformations is that one should consider general purpose transformation languages and APIs as alternatives when choosing a schema language.
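To make the "schema as transformation" view concrete, here is a minimal sketch written with a general purpose API (Python's standard xml.etree.ElementTree) rather than a schema language. The element and attribute names (order, item, id, currency) and the default value are assumptions invented for the illustration, and the augmented tree only loosely mimics what a PSVI carries.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: every <item> needs an "id" attribute, and a missing
# "currency" attribute defaults to "USD".
DEFAULTS = {"currency": "USD"}


def validate(xml_text):
    """Return (is_valid, errors, augmented_tree) for the given document."""
    root = ET.fromstring(xml_text)
    errors = []
    for item in root.iter("item"):
        if "id" not in item.attrib:
            errors.append("item without id")
        # Add default values to the tree, loosely mimicking a PSVI.
        for name, value in DEFAULTS.items():
            item.attrib.setdefault(name, value)
    return (not errors, errors, root)


valid, errors, tree = validate('<order><item id="a1"/><item/></order>')
```

The function takes an instance document as input and returns both a validity flag with its report and an updated tree, which is the shape of transformation described above.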
Before we dive into the features of XML schema languages, I'd like to step back and look at the downsides of the use of any schema language.
One of the key strengths of XML, sometimes called "late binding," is the decoupling of the writer and the reader of an XML document: this gives the reader the ability to have its own interpretation and understanding of the document. By being more prescriptive about the way to interpret a document, XML schema languages reduce the possibility of erroneous interpretation but also create the possibility of unexpectedly adding "value" to the document by creating interpretations not apparent from an examination of the document itself.
Furthermore, modeling an XML tree is very complex, and the schema languages often make a judgment on "good" and "bad" practices in order to limit their complexity and consequent validation processing times. Such limitations also reduce the set of possibilities offered to XML designers. Reducing the set of possibilities offered by a still relatively young technology, that is, premature optimization, is a risk, since these "good" or "bad" practices are still ill-defined and rapidly evolving.
The many advantages of using and widely distributing XML schemas must be balanced against the risk of narrowing the flexibility and extensibility of XML.
A document conforming to a particular schema is said to be valid, and the process of checking that conformance is called validation. We can differentiate between at least four levels of validation enabled by schema languages:
The validation of the markup -- controlling the structure of a document.
The validation of the content of individual leaf nodes (datatyping).
The validation of integrity, i.e. of the links between nodes within a document or between documents.
Any other tests (often called "business rules").
Markup and datatype validation are the most powerful levels (or the most dangerous, since they often imply a kind of modeling that limits the diversity of markup and datatypes). Link validation, especially between different documents, is poorly covered by the current schema languages.
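The four levels listed above can be sketched in a few lines of Python. The document vocabulary (order, customer, line, qty) and the individual rules are hypothetical and chosen only to show one check per level; no schema language discussed in this article is being reproduced here.

```python
import re
import xml.etree.ElementTree as ET

DOC = """<order>
  <customer id="c1"/>
  <line customer="c1" qty="2"/>
  <line customer="c9" qty="x"/>
</order>"""


def check(xml_text):
    root = ET.fromstring(xml_text)
    problems = []

    # 1. Markup: only the expected child elements may appear under the root.
    for child in root:
        if child.tag not in {"customer", "line"}:
            problems.append(f"structure: unexpected <{child.tag}>")

    # 2. Datatypes: qty must look like a non-negative integer.
    for line in root.iter("line"):
        if not re.fullmatch(r"\d+", line.get("qty", "")):
            problems.append(f"datatype: bad qty {line.get('qty')!r}")

    # 3. Integrity: each line must reference an existing customer id.
    ids = {c.get("id") for c in root.iter("customer")}
    for line in root.iter("line"):
        if line.get("customer") not in ids:
            problems.append(f"integrity: unknown customer {line.get('customer')!r}")

    # 4. Business rule: an order needs at least one line.
    if root.find("line") is None:
        problems.append("rule: order has no lines")

    return problems


problems = check(DOC)
```

In this sample the second line element fails both the datatype check and the integrity check, while the structure and business rule checks pass.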
The complete list of markup schema languages is long and would need to include languages developed for SGML to be complete. The list which I propose below is not exhaustive, and it includes only the major proposals that have influenced the schema languages covered in this article.
The DTD Family
A simplified version of SGML DTDs was introduced in the XML 1.0 Recommendation (XML). Even though a DTD is not mandatory for an application to read and understand an XML document, many developers recommend writing DTDs for your XML applications.
The W3C XML Schema Family
The W3C XML Schema Working Group received many proposals contributed as notes:
XML-Data, submitted as a note (XML-Data) in January 1998 by Microsoft, DataChannel, Arbortext, Inso Corporation, and University of Edinburgh, included most of the basic concepts developed by W3C XML Schema. Although the details were not fully developed, the note covered a lot of ground which has been kept out of W3C XML Schema, such as internal and external entity definitions and the mapping to RDF (Resource Description Framework) and OOP structures.
XML-Data-Reduced (XDR), submitted in July 1998 (XDR) by Microsoft and University of Edinburgh, was presented to "refine and subset those ideas down to a more manageable size in order to allow faster progress toward adopting a new schema language for XML" (mappings were left out). XDR has been implemented by Microsoft and used by the BizTalk framework.
DCD (Document Content Description for XML), also submitted in July 1998 (DCD) by Textuality, Microsoft, and IBM, was a "subset of the XML-Data Submission (XML-Data) and expresses it in a way which is consistent with the ongoing W3C RDF (Resource Description Framework) effort". Mapping considerations were left out, but the language took care to be consistent with RDF through features such as "Interchangeability of Elements and Attributes."
SOX (Schema for Object-Oriented XML) was developed by Veo Systems/Commerce One and submitted as a note in September 1998; a second version was submitted in July 1999 (SOX) as "informed by the XML 1.0 specification as well as the XML-Data submission (XML-Data), the Document Content Description submission (DCD) and the EXPRESS language reference manual (ISO-10303-11)". SOX was heavily influenced by OOP language design and included concepts of interface and implementation, but it was also influenced by DTDs and included support for "parameters". SOX has been widely used by Commerce One.
DDML (Document Definition Markup Language or XSchema) was the "result of contributions from a large number of people on the XML-Dev mailing list, coordinated by a smaller group of editors" (Ronald Bourret, John Cowan, Ingo Macherius, and Simon St. Laurent) and was submitted as a note in January 1999 (DDML). Its purpose was to "encode the logical (as opposed to physical) content of DTDs in an XML document". Great attention was paid to the definition of the conversions back and forth between DTDs and DDML, and the document also included an "experimental" chapter proposing "Inline DDML elements". DDML made a clear distinction between structures and data and left datatypes out.
W3C XML Schema, published as a Recommendation in May 2001 (XMLS0, XMLS1, XMLS2), acknowledges the influence of DCD, DDML, SOX, XML-Data, and XDR in its list of references. It appears to have picked pieces from each of these proposals, but it is also a compromise between them. The main sponsors of the two languages still actively used and developed (Microsoft for XDR and Commerce One for SOX) have both announced that they will support W3C XML Schema for their new developments, and W3C XML Schema should become the only surviving member of this family.
The RELAX NG Family
First published in March 2000 as a Japanese ISO Standard Technical Report written by Murata Makoto, Regular Language description for XML Core (RELAX) (RLX) is both simple ("Tired of complicated specifications? You just RELAX !") and built on a solid mathematical foundation (the adaptation of the hedge automata theory to XML trees). It was approved as an ISO/IEC Technical Report in May 2001.
XDuce (XDUCE) was first announced in March 2000: "XDuce ('transduce') is a typed programming language that is specifically designed for processing XML data. One can read an XML document as an XDuce value, extract information from it or convert it to another format, and write out the result value as an XML document". Although its purpose is not to be a schema language, its typing system has influenced the schema languages.
Published by James Clark in January 2001, TREX (Tree Regular Expressions for XML) (TREX) is "basically the type system of XDuce with an XML syntax and with a bunch of additional features". The names and content model of the elements used to define the tree patterns of a TREX schema have been carefully chosen, and TREX schemas are usually as easy to read as a plain text description. The simplicity of the structure of the language also allows the resurrection of a consistent treatment between elements and attributes, a feature lost since DCD.
Announced in May 2001, RELAX NG (RELAX New Generation) is the result of a merger of RELAX and TREX, developed by an OASIS TC (RNG) and coedited by James Clark and Murata Makoto: "The key features of RELAX NG are that it is simple, easy to learn, uses XML syntax, does not change the information set of an XML document, supports XML namespaces, treats attributes uniformly with elements so far as possible, has unrestricted support for unordered content, has unrestricted support for mixed content, has a solid theoretical basis, and can partner with a separate datatyping language (such as W3C XML Schema Datatypes)". RELAX NG is now an official specification of the OASIS RELAX NG Technical Committee and will probably progress to become an ISO/IEC TR.
Atypical among schema languages, Schematron (SCH) was first proposed in September 1999 by Rick Jelliffe of the Academia Sinica Computing Centre and defines validation rules using XPath expressions.
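The rule-based approach can be imitated in a few lines of Python. This is only a sketch of the idea, not Schematron syntax: each rule pairs a context path with a test path and a message, the rules themselves (about hypothetical book, title, and author elements) are invented, and the paths use the limited XPath subset supported by xml.etree.ElementTree rather than full XPath.

```python
import xml.etree.ElementTree as ET

# Each hypothetical rule: (context path, test path, message).
RULES = [
    (".//book", "title", "a book must have a title"),
    (".//book", "author", "a book must have at least one author"),
]


def run_rules(xml_text, rules=RULES):
    """Return the messages of every rule that fails on the document."""
    root = ET.fromstring(xml_text)
    failures = []
    for context, test, message in rules:
        for node in root.findall(context):
            # The assertion holds when the test path selects something.
            if node.find(test) is None:
                failures.append(message)
    return failures


failures = run_rules(
    "<library><book><title>T</title></book>"
    "<book><title>U</title><author>A</author></book></library>"
)
```

Here the first book has no author, so one failure message is reported; nothing is said about the overall structure of the document, which is what sets this style apart from grammar-based languages.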
Starting from the observation that instance documents are usually much easier to understand than the schemas which are describing them, and that schema languages often need to give examples of instance documents to help human readers to understand their syntax, Examplotron (EG) was proposed in March 2001 by Eric van der Vlist to define "schemas by example" using sample instance documents as actual schemas.
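A toy version of the "schema by example" idea is sketched below. It is deliberately stricter and simpler than any real tool: it assumes that an instance is acceptable only when its element names, attribute names, and nesting exactly mirror the example, which is my own simplification and not a description of Examplotron's actual matching rules. The person, name, and email elements are hypothetical.

```python
import xml.etree.ElementTree as ET


def shape(element):
    """Describe an element as its tag, attribute names, and child shapes."""
    return (
        element.tag,
        tuple(sorted(element.attrib)),
        tuple(shape(child) for child in element),
    )


def matches_example(example_text, instance_text):
    """True when the instance has exactly the same shape as the example."""
    return shape(ET.fromstring(example_text)) == shape(ET.fromstring(instance_text))


EXAMPLE = '<person id="p1"><name>Ada</name><email>ada@example.org</email></person>'

ok = matches_example(
    EXAMPLE,
    '<person id="p2"><name>Bob</name><email>bob@example.org</email></person>',
)
bad = matches_example(EXAMPLE, '<person id="p3"><name>Eve</name></person>')
```

The sample document itself plays the role of the schema: text and attribute values are free to vary, while the structure shown in the example is what gets enforced.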