Comparing XML Schema Languages

December 12, 2001

Eric van der Vlist

Table of Contents

What is an XML Schema Language?

XML Schema Languages Considered Harmful?

What is Validation?

A Short History of XML Schema Languages

Our Sample Application

DTDs

W3C XML Schema Definition Language

RELAX NG

Schematron

Examplotron

Mix and Match

Comparisons

References

This article explains what an XML schema language is and which features the different schema languages possess. It also documents the development of the major schema language families -- DTDs, W3C XML Schema, and RELAX NG -- and compares the features of DTDs, W3C XML Schema, RELAX NG, Schematron, and Examplotron.

What is an XML Schema Language?

In ordinary English, a schema is defined as "an outline or image universally applicable to a general conception, under which it is likely to be presented to the mind; as, five dots in a line are a schema of the number five; a preceding and succeeding event are a schema of cause and effect" (Webster's).

The English language definition of schema does not really apply to XML schema languages. Most of the schema languages are too complex to "present to the mind" or to a program the instance documents that they describe, and, more importantly and less subjectively, they often focus on defining validation rules more than on modeling a class of documents.

All XML schema languages define transformations to apply to a class of instance documents, and XML schemas are best thought of as such transformations. They take an instance document as input and produce a validation report, which includes at least a return code reporting whether the document is valid and, optionally, a Post Schema Validation Infoset (PSVI) that updates the original document's infoset (the information obtained from the XML document by the parser) with additional information (default values, datatypes, etc.).

One important consequence of realizing that XML schemas define transformations is that one should consider general purpose transformation languages and APIs as alternatives when choosing a schema language.
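To make the "schema as transformation" idea concrete, here is a minimal sketch written with a general purpose API (Python's standard xml.etree.ElementTree) rather than a schema language. It is only an illustration: the names ValidationReport and validate_library, the two structural checks, and the defaulted lang attribute are all assumptions invented for this example and are not taken from any schema language or specification.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class ValidationReport:
    valid: bool                                  # the "return code"
    errors: list = field(default_factory=list)   # human-readable messages
    psvi: ET.Element = None                      # the augmented tree


def validate_library(xml_text, default_lang="en"):
    """Toy 'schema': turn an instance document into a validation report."""
    root = ET.fromstring(xml_text)
    errors = []
    if root.tag != "library":
        errors.append("document element should be 'library'")
    books = root.findall("book")
    if not books:
        errors.append("library should contain at least one book")
    for book in books:
        if book.find("isbn") is None:
            errors.append("book %r has no isbn" % book.get("id"))
        # PSVI-style augmentation: supply a default value the document
        # did not state (the 'lang' attribute is hypothetical).
        book.attrib.setdefault("lang", default_lang)
    return ValidationReport(valid=not errors, errors=errors, psvi=root)


report = validate_library(
    '<library><book id="_0836217462"><isbn>0836217462</isbn></book></library>')
print(report.valid, report.psvi.find("book").get("lang"))
```

The input goes in, and a report plus an enriched tree comes out; whether the rules are written in a schema language or in code like this is a choice of notation.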

XML Schema Languages Considered Harmful?

Before we dive into the features of XML schema languages, I'd like to step back and look at the downsides of the use of any schema language.

One of the key strengths of XML, sometimes called "late binding," is the decoupling of the writer and the reader of an XML document: this gives the reader the ability to have its own interpretation and understanding of the document. By being more prescriptive about the way to interpret a document, XML schema languages reduce the possibility of erroneous interpretation but also create the possibility of unexpectedly adding "value" to the document by creating interpretations not apparent from an examination of the document itself.

Furthermore, modeling an XML tree is very complex, and the schema languages often make a judgment on "good" and "bad" practices in order to limit their complexity and consequent validation processing times. Such limitations also reduce the set of possibilities offered to XML designers. Reducing the set of possibilities offered by a still relatively young technology, that is, premature optimization, is a risk, since these "good" or "bad" practices are still ill-defined and rapidly evolving.

The many advantages of using and widely distributing XML schemas must be balanced against the risk of narrowing the flexibility and extensibility of XML.

What is Validation?

A document conforming to a particular schema is said to be valid, and the process of checking that conformance is called validation. We can differentiate between at least four levels of validation enabled by schema languages:

  • The validation of the markup -- controlling the structure of a document.

  • The validation of the content of individual leaf nodes (datatyping).

  • The validation of integrity, i.e. of the links between nodes within a document or between documents.

  • Any other tests (often called "business rules").

Markup and datatype validation are the most powerful levels (or the most dangerous, since they often imply a kind of modeling that limits the diversity of the markup and datatypes). Link validation, especially between different documents, is poorly covered by the current schema languages.
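The four levels can be sketched as four independent checks over a small library document. This is only an illustration using Python's standard library: the function names, the shortened document, and the ISBN pattern (nine digits followed by a digit or X) are assumptions made for the example, and the id rule anticipates a business rule used later in this article.

```python
import re
import xml.etree.ElementTree as ET

DOC = """<library>
  <book id="_0836217462">
    <isbn>0836217462</isbn>
    <title>Being a Dog Is a Full-Time Job</title>
    <character-ref id="Snoopy"/>
    <character-ref id="Lucy"/>
  </book>
  <character id="Snoopy"><name>Snoopy</name></character>
</library>"""


def check_structure(root):
    # Level 1, markup: every book starts with an isbn followed by a title.
    return [f"book {b.get('id')}: expected isbn, title"
            for b in root.iter("book")
            if [c.tag for c in b][:2] != ["isbn", "title"]]


def check_datatypes(root):
    # Level 2, leaf content: the ISBN pattern is an assumption.
    return [f"bad isbn {e.text!r}" for e in root.iter("isbn")
            if not re.fullmatch(r"\d{9}[\dX]", e.text or "")]


def check_integrity(root):
    # Level 3, links: each *-ref must point at an id declared elsewhere.
    ids = {e.get("id") for e in root.iter()
           if not e.tag.endswith("-ref") and e.get("id")}
    return [f"dangling {e.tag} -> {e.get('id')}" for e in root.iter()
            if e.tag.endswith("-ref") and e.get("id") not in ids]


def check_rules(root):
    # Level 4, business rule: a book id is '_' followed by its ISBN.
    return [f"book {b.get('id')}: id not derived from isbn"
            for b in root.iter("book")
            if b.get("id") != "_" + (b.findtext("isbn") or "")]


root = ET.fromstring(DOC)
for check in (check_structure, check_datatypes, check_integrity, check_rules):
    print(check.__name__, check(root))
```

In this shortened document only the integrity check reports a problem, because the reference to Lucy has no matching character element.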

A Short History of XML Schema Languages

The complete list of markup schema languages is long and would need to include languages developed for SGML to be complete. The list which I propose below is not exhaustive, and it includes only the major proposals that have influenced the schema languages covered in this article.

The DTD Family

A simplified version of SGML DTDs was introduced in the XML 1.0 Recommendation (XML). Even though a DTD is not mandatory for an application to read and understand an XML document, many developers recommend writing DTDs for your XML applications.

The W3C XML Schema Family

The W3C XML Schema Working Group received many proposals contributed as notes:

  1. XML-Data, submitted as a note (XML-Data) in January 1998 by Microsoft, DataChannel, Arbortext, Inso Corporation, and University of Edinburgh, included most of the basic concepts developed by W3C XML Schema. Although the details were not fully developed, the note covered a lot of ground which has been kept out of W3C XML Schema, such as internal and external entity definitions and the mapping to RDF (Resource Description Framework) and OOP structures.

  2. XML-Data-Reduced (XDR), submitted in July 1998 (XDR) by Microsoft and University of Edinburgh, was presented to "refine and subset those ideas down to a more manageable size in order to allow faster progress toward adopting a new schema language for XML" (mappings were left out). XDR has been implemented by Microsoft and used by the BizTalk framework.

  3. DCD (Document Content Description for XML), also submitted in July 1998 (DCD) by Textuality, Microsoft, and IBM, was a "subset of the XML-Data Submission (XML-Data) and expresses it in a way which is consistent with the ongoing W3C RDF (Resource Description Framework) effort". Mapping considerations were left out, but the language took care to be consistent with RDF through features such as "Interchangeability of Elements and Attributes."

  4. SOX (Schema for Object-Oriented XML) was developed by Veo Systems/Commerce One and submitted as a note in September 1998; a second version, submitted in July 1999 (SOX), was "informed by the XML 1.0 specification as well as the XML-Data submission (XML-Data), the Document Content Description submission (DCD) and the EXPRESS language reference manual (ISO-10303-11)". SOX was heavily influenced by OOP language design and included concepts of interface and implementation, but it was also influenced by DTDs and included support for "parameters". SOX has been widely used by Commerce One.

  5. DDML (Document Definition Markup Language or XSchema) was the "result of contributions from a large number of people on the XML-Dev mailing list, coordinated by a smaller group of editors" (Ronald Bourret, John Cowan, Ingo Macherius, and Simon St. Laurent) and was submitted as a note in January 1999 (DDML). Its purpose was to "encode the logical (as opposed to physical) content of DTDs in an XML document". Great attention was paid to the definition of the conversions back and forth between DTDs and DDML, and the document also included an "experimental" chapter proposing "Inline DDML elements". DDML made a clear distinction between structures and data and left datatypes out.

  6. W3C XML Schema, published as a Recommendation in May 2001 (XMLS0, XMLS1, XMLS2) acknowledges the influence of DCD, DDML, SOX, XML-Data, and XDR in its list of references and appears to have picked pieces from each of these proposals but is also a compromise between them. The main sponsors of the two languages still actively used and developed (Microsoft for XDR and Commerce One for SOX) have both announced that they would support W3C XML Schema for their new developments, and W3C XML Schema should become the only surviving member of this family.

The RELAX NG Family

  1. First published in March 2000 as a Japanese Industrial Standard (JIS) Technical Report written by Murata Makoto, Regular Language description for XML Core (RELAX) (RLX) is both simple ("Tired of complicated specifications? You just RELAX !") and built on a solid mathematical foundation (the adaptation of the hedge automata theory to XML trees). It was approved as an ISO/IEC Technical Report in May 2001.

  2. XDuce (XDUCE) was first announced in March 2000: "XDuce ('transduce') is a typed programming language that is specifically designed for processing XML data. One can read an XML document as an XDuce value, extract information from it or convert it to another format, and write out the result value as an XML document". Although its purpose is not to be a schema language, its typing system has influenced the schema languages.

  3. Published by James Clark in January 2001, TREX (Tree Regular Expressions for XML) (TREX) is "basically the type system of XDuce with an XML syntax and with a bunch of additional features". The names and content models of the elements used to define the tree patterns of a TREX schema have been carefully chosen, and TREX schemas are usually as easy to read as a plain text description. The simplicity of the language's structure also restores a consistent treatment of elements and attributes, a feature lost since DCD.

  4. Announced in May 2001, RELAX NG (RELAX New Generation) is the result of a merger of RELAX and TREX, developed by an OASIS TC (RNG) and coedited by James Clark and Murata Makoto: "The key features of RELAX NG are that it is simple, easy to learn, uses XML syntax, does not change the information set of an XML document, supports XML namespaces, treats attributes uniformly with elements so far as possible, has unrestricted support for unordered content, has unrestricted support for mixed content, has a solid theoretical basis, and can partner with a separate datatyping language (such W3C XML Schema Datatypes)". RELAX NG is now an official specification of the OASIS RELAX NG Technical Committee and will probably progress to become an ISO/IEC TR.

Schematron

Atypical among schema languages, Schematron (SCH) was first proposed in September 1999 by Rick Jelliffe of the Academia Sinica Computing Centre and defines validation rules using XPath expressions.

Examplotron

Starting from the observation that instance documents are usually much easier to understand than the schemas that describe them, and that schema language specifications often need to give examples of instance documents to help human readers understand their syntax, Examplotron (EG) was proposed in March 2001 by Eric van der Vlist to define "schemas by example" using sample instance documents as actual schemas.

Our Sample Application

In the remainder of this article, I will be using the following simple library application to illustrate the use of the various schema languages.


<?xml version="1.0" encoding="utf-8"?>

<library>

  <book id="_0836217462">

    <isbn>0836217462</isbn>

    <title>Being a Dog Is a Full-Time Job</title>

    <author-ref id="Charles-M.-Schulz"/>

    <character-ref id="Peppermint-Patty"/>

    <character-ref id="Snoopy"/>

    <character-ref id="Schroeder"/>

    <character-ref id="Lucy"/>

  </book>

  <book id="_0805033106">

    <isbn>0805033106</isbn>

    <title>Peanuts Every Sunday </title>

    <author-ref id="Charles-M.-Schulz"/>

    <character-ref id="Sally-Brown"/>

    <character-ref id="Snoopy"/>

    <character-ref id="Linus"/>

    <character-ref id="Snoopy"/>

  </book>

  <author id="Charles-M.-Schulz">

    <name>Charles M. Schulz</name>

    <nickName>SPARKY</nickName>

    <born>November 26, 1922</born>

    <dead>February 12, 2000</dead>

  </author>

  <character id="Peppermint-Patty">

    <name>Peppermint Patty</name>

    <since>Aug. 22, 1966</since>

    <qualification>bold, brash and tomboyish</qualification>

  </character>

  <character id="Snoopy">

    <name>Snoopy</name>

    <since>October 4, 1950</since>

    <qualification>extroverted beagle</qualification>

  </character>

  <character id="Schroeder">

    <name>Schroeder</name>

    <since>May 30, 1951</since>

    <qualification>

      brought classical music to the Peanuts strip

    </qualification>

  </character>

  <character id="Lucy">

    <name>Lucy</name>

    <since>March 3, 1952</since>

    <qualification>bossy, crabby and selfish</qualification>

  </character>

  <character id="Sally-Brown">

    <name>Sally Brown</name>

    <since>Aug, 22, 1960</since>

    <qualification>always looks for the easy way out</qualification>

  </character>

  <character id="Linus">

    <name>Linus</name>

    <since>Sept. 19, 1952</since>

    <qualification>the intellectual of the gang</qualification>

  </character>

</library>

DTDs

Overview

Author: W3C
Status: Recommendation ("embedded" in XML 1.0)
Location: http://www.w3.org/TR/REC-xml
PSVI: Yes
Structures: Yes
Datatypes: Yes (weak)
Integrity: Yes (internal through ID/IDREF/IDREFS attributes)
Rules: No
Vendor support: Excellent
Miscellaneous: Non-XML syntax; no support for namespaces. Schema definition is only one of the features of DTDs.

Inherited from SGML, the XML DTD is the most widely deployed means of defining an XML schema. Defined in the XML 1.0 Recommendation, DTD does not support namespaces, which were specified later. This, together with the fact that its datatype system is weak and only applies to attributes, is one of the main motivations for the W3C to develop a new schema language.

The DTD for our sample could be


<?xml version="1.0" encoding="UTF-8"?>

<!ELEMENT author (name, nickName, born, dead)>

<!ATTLIST author

  id ID #REQUIRED

>

<!ELEMENT author-ref EMPTY>

<!ATTLIST author-ref

  id IDREF #REQUIRED

>

<!ELEMENT book (isbn, title, author-ref*, character-ref*)>

<!ATTLIST book

  id ID #REQUIRED

>

<!ELEMENT born (#PCDATA)>

<!ELEMENT character (name, since, qualification)>

<!ATTLIST character

  id ID #REQUIRED

>

<!ELEMENT character-ref EMPTY>

<!ATTLIST character-ref

  id IDREF #REQUIRED

>

<!ELEMENT dead (#PCDATA)>

<!ELEMENT isbn (#PCDATA)>

<!ELEMENT library (book+, author*, character*)>

<!ELEMENT name (#PCDATA)>

<!ELEMENT nickName (#PCDATA)>

<!ELEMENT qualification (#PCDATA)>

<!ELEMENT since (#PCDATA)>

<!ELEMENT title (#PCDATA)>
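
The content models declared above are regular expressions over child element names. As a minimal sketch (not how a real DTD validator is built), the declaration for book can be mimicked with Python's re module. The helper name matches_book_model and the two small book fragments are made up for illustration, and the check ignores attributes and text content.

```python
import re
import xml.etree.ElementTree as ET

# <!ELEMENT book (isbn, title, author-ref*, character-ref*)> rewritten as a
# regular expression over the space-terminated names of the child elements.
BOOK_MODEL = re.compile(r"isbn title (author-ref )*(character-ref )*")


def matches_book_model(book):
    # Flatten the children into a string such as "isbn title author-ref ".
    children = "".join(child.tag + " " for child in book)
    return BOOK_MODEL.fullmatch(children) is not None


good = ET.fromstring(
    '<book id="b1"><isbn>0836217462</isbn><title>T</title>'
    '<author-ref id="a"/><character-ref id="c"/></book>')
bad = ET.fromstring(
    '<book id="b2"><title>T</title><isbn>0836217462</isbn></book>')

print(matches_book_model(good))  # True
print(matches_book_model(bad))   # False: title appears before isbn
```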

W3C XML Schema Definition Language

Overview

Author: W3C
Status: Recommendation
Location: http://www.w3.org/TR/xmlschema-0/
PSVI: Yes
Structures: Yes
Datatypes: Yes
Integrity: Yes (internal through ID/IDREF/IDREFS and xs:unique/xs:key/xs:keyref)
Rules: No
Vendor support: Potentially excellent but currently still immature.
Miscellaneous: Borrows many ideas from OOP design; considered complex; paranoid about determinism; part of the foundation of XML in the vision of the W3C.

W3C XML Schema was published by the W3C to provide an alternative to XML DTD that supports namespaces, facilitates the design of open and extensible vocabularies, and meets the requirement of data-oriented applications for a richer datatyping system. It does so by borrowing many features from OOP languages, and hence the fit with the tree structure of XML documents is sometimes difficult to make. It is generally considered complex, partly because of the number of features, and partly because of the style of the recommendation, which describes the validation process more than the modeling features.

W3C XML Schema is a strongly typed schema language that eliminates any non-deterministic design from the described markup to ensure that there is no ambiguity in the determination of the datatypes and that the validation can be made by a finite state machine.

A W3C XML Schema schema for our sample could be


<?xml version="1.0" encoding="UTF-8"?>

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="library">

    <xs:complexType>

      <xs:sequence>

        <xs:element name="book" maxOccurs="unbounded">

          <xs:complexType>

            <xs:sequence>

              <xs:element name="isbn" type="xs:string"/>

              <xs:element name="title" type="xs:string"/>

              <xs:element name="author-ref" minOccurs="0" maxOccurs="unbounded">

                <xs:complexType>

                  <xs:attribute name="id" type="xs:IDREF" use="required"/>

                </xs:complexType>

              </xs:element>

              <xs:element name="character-ref" minOccurs="0" maxOccurs="unbounded">

                <xs:complexType>

                  <xs:attribute name="id" type="xs:IDREF" use="required"/>

                </xs:complexType>

              </xs:element>

            </xs:sequence>

            <xs:attribute name="id" type="xs:ID" use="required"/>

          </xs:complexType>

        </xs:element>

        <xs:element name="author" minOccurs="0" maxOccurs="unbounded">

          <xs:complexType>

            <xs:sequence>

              <xs:element ref="name"/>

              <xs:element name="nickName" type="xs:string"/>

              <xs:element name="born" type="xs:string"/>

              <xs:element name="dead" type="xs:string"/>

            </xs:sequence>

            <xs:attribute name="id" type="xs:ID" use="required"/>

          </xs:complexType>

        </xs:element>

        <xs:element name="character" minOccurs="0" maxOccurs="unbounded">

          <xs:complexType>

            <xs:sequence>

              <xs:element ref="name"/>

              <xs:element name="since" type="xs:string"/>

              <xs:element name="qualification" type="xs:string"/>

            </xs:sequence>

            <xs:attribute name="id" type="xs:ID" use="required"/>

          </xs:complexType>

        </xs:element>

      </xs:sequence>

    </xs:complexType>

  </xs:element>

  <xs:element name="name" type="xs:string"/>

</xs:schema>

RELAX NG

Overview

Author: OASIS and possibly ISO
Status: OASIS RELAX NG Committee Specification
Location: http://relaxng.org/
PSVI: No
Structures: Yes
Datatypes: No, but a modular mechanism has been defined to plug in datatype systems (W3C XML Schema part 2 and others if needed).
Integrity: No (except through ID/IDREF/IDREFS features of a datatype system)
Rules: No
Vendor support: To be seen.
Miscellaneous: Result of the merger of RELAX and TREX; might become an ISO TR. Strong mathematical grounding. Alternate non-XML syntax proposed by James Clark.

Its editors (James Clark and Murata Makoto) define RELAX NG as "the next generation schema language for XML: clean, simple and powerful". RELAX NG appears to be closer to a description of the instance documents in ordinary English and simpler than W3C XML Schema, to which it might become a serious alternative.

Many constraints, especially those which are on the fringe of non-deterministic models, can be expressed by RELAX NG and not by W3C XML Schema. Some combinations in document structures forbidden by W3C XML Schema can be described by RELAX NG.

Even though RELAX NG seems to be technically superior to W3C XML Schema, support by software vendors and XML developers is uncertain now that W3C XML Schema is a Recommendation.

A RELAX NG schema for our sample could be


<?xml version="1.0" encoding="UTF-8"?>

<grammar 

   datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes" 

   xmlns="http://relaxng.org/ns/structure/1.0">

  <start>

    <choice>

      <ref name="library"/>

    </choice>

  </start>

  <define name="library">

    <element name="library">

      <oneOrMore>

        <ref name="book"/>

      </oneOrMore>

      <zeroOrMore>

        <ref name="author"/>

      </zeroOrMore>

      <zeroOrMore>

        <ref name="character"/>

      </zeroOrMore>

    </element>

  </define>

  <define name="author">

    <element name="author">

      <attribute name="id">

        <data type="ID"/>

      </attribute>

      <element name="name">

        <text/>

      </element>

      <element name="nickName">

        <text/>

      </element>

      <element name="born">

        <text/>

      </element>

      <element name="dead">

        <text/>

      </element>

    </element>

  </define>

  <define name="book">

    <element name="book">

      <ref name="id-attribute"/>

      <ref name="isbn"/>

      <ref name="title"/>

      <zeroOrMore>

        <element name="author-ref">

          <attribute name="id">

            <data type="IDREF"/>

          </attribute>

          <empty/>

        </element>

      </zeroOrMore>

      <zeroOrMore>

        <element name="character-ref">

          <attribute name="id">

            <data type="IDREF"/>

          </attribute>

          <empty/>

        </element>

      </zeroOrMore>

    </element>

  </define>

  <define name="id-attribute" >

    <attribute name="id">

      <data type="ID"/>

    </attribute>

  </define>

  <define name="character">

    <element name="character">

      <ref name="id-attribute"/>

      <ref name="name"/>

      <ref name="since"/>

      <ref name="qualification"/>

    </element>

  </define>

  <define name="isbn">

    <element name="isbn">

      <text/>

    </element>

  </define>

  <define name="name">

    <element name="name">

      <text/>

    </element>

  </define>

  <define name="nickName">

    <element name="nickName">

      <text/>

    </element>

  </define>

  <define name="qualification">

    <element name="qualification">

      <text/>

    </element>

  </define>

  <define name="since">

    <element name="since">

      <text/>

    </element>

  </define>

  <define name="title">

    <element name="title">

      <text/>

    </element>

  </define>

</grammar>

Schematron

Overview

Author: Rick Jelliffe and other contributors.
Status: Unofficial
Location: http://www.ascc.net/xml/schematron/
PSVI: No (not directly)
Structures: No (not directly)
Datatypes: No (not directly)
Integrity: No (not directly)
Rules: Yes, through XPath expressions
Vendor support: Low
Miscellaneous: Pure rule expression.

Schematron is an XPath/XSLT-based language for defining context dependent rules. Schematron doesn't directly support structure or datatype validation, but a schema author may write rules which implement these structure and datatype checks. To write a full schema with Schematron, the author needs to take care to include all the rules needed to qualify the structure of the document.

A partial Schematron schema for our sample could be


<?xml version="1.0" encoding="UTF-8"?>

<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">

  <sch:title>Schematron Schema for library</sch:title>

  <sch:pattern>

    <sch:rule context="/">

      <sch:assert test="library">

        The document element should be "library".

        </sch:assert>

    </sch:rule>

    <sch:rule context="/library">

      <sch:assert test="book">

        There should be at least a book!

        </sch:assert>

      <sch:assert test="not(@*)">

        No attribute for library, please!

        </sch:assert>

    </sch:rule>

    <sch:rule context="/library/book">

      <sch:assert test="not(following-sibling::book/@id=@id)">

        Duplicated ID for this book.

        </sch:assert>

      <sch:assert test="@id=concat('_', isbn)">

        The id should be derived from the ISBN.

        </sch:assert>

    </sch:rule>

    <sch:rule context="/library/*">

      <sch:assert test="name()='book' or name()='author' or name()='character'">

        This element shouldn't be here...

        </sch:assert>

    </sch:rule>

  </sch:pattern>

</sch:schema>
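
The context/assert/message shape of these rules can also be mimicked in a general purpose language. The sketch below is an assumption-laden illustration, not an implementation of Schematron: the names RULES and run_rules are invented, the tests are plain Python callables because the standard ElementTree module supports only a small XPath subset, every rule is applied to every matching node (Schematron applies only the first matching rule of a pattern), and the duplicate check flags each book sharing an id rather than only the earlier one.

```python
import xml.etree.ElementTree as ET

# (context path relative to the document element, test, message) triples.
RULES = [
    ("book",
     lambda book, root: book.get("id") == "_" + (book.findtext("isbn") or ""),
     "The id should be derived from the ISBN."),
    ("book",
     lambda book, root: [b.get("id") for b in root.findall("book")]
     .count(book.get("id")) == 1,
     "Duplicated ID for this book."),
    ("*",
     lambda el, root: el.tag in ("book", "author", "character"),
     "This element shouldn't be here..."),
]


def run_rules(xml_text):
    """Return (tag, id, message) for every assertion that fails."""
    root = ET.fromstring(xml_text)
    failures = []
    for context, test, message in RULES:
        for node in root.findall(context):
            if not test(node, root):
                failures.append((node.tag, node.get("id"), message))
    return failures


doc = ('<library><book id="_1"><isbn>1</isbn></book>'
       '<book id="x"><isbn>2</isbn></book><shelf/></library>')
for failure in run_rules(doc):
    print(failure)
```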

Examplotron

Overview

Author: Eric van der Vlist
Status: Unofficial
Location: http://examplotron.org/
PSVI: No (not yet)
Structures: Yes
Datatypes: No (not directly)
Integrity: No (not directly)
Rules: Yes, through XPath expressions
Vendor support: None
Miscellaneous: Schema by example (a sample document is a schema) with rules checking (syntax borrowed from Schematron).

Examplotron is an experiment to define a schema language based on sample trees, not unlike early proposals for XPath. An Examplotron schema for our sample could be:


<?xml version="1.0" encoding="UTF-8"?>

<library xmlns:eg="http://examplotron.org/0/">

  <book id="_0836217462" 

     eg:occurs="+" 

     eg:assert="not(following-sibling::book/@id=@id) and @id=concat('_', isbn)">

    <isbn>0836217462</isbn>

    <title>Being a Dog Is a Full-Time Job</title>

    <author-ref id="Charles-M.-Schulz"  eg:occurs="*"/>

    <character-ref id="Peppermint-Patty"  eg:occurs="*"/>

  </book>

  <author id="Charles-M.-Schulz" eg:occurs="*">

    <name>Charles M. Schulz</name>

    <nickName>SPARKY</nickName>

    <born>November 26, 1922</born>

    <dead>February 12, 2000</dead>

  </author>

  <character id="Peppermint-Patty" eg:occurs="*">

    <name>Peppermint Patty</name>

    <since>Aug. 22, 1966</since>

    <qualification>bold, brash and tomboyish</qualification>

  </character>

</library>

Mix and Match

We have seen that the features of some of these languages are more complementary than overlapping, and there is room for interesting combinations, especially with Schematron and the structure and datatype-based languages.

Some early implementations are available which support the embedding of Schematron rules in xs:annotation/xs:appinfo W3C XML Schema elements. The combination of W3C XML Schema and Schematron enables the use of each language for the purpose for which it was designed: structure and datatype validation for W3C XML Schema, and rules for Schematron. The power of the rules expressed with Schematron can also compensate for the weaknesses of W3C XML Schema.

Discussions are also underway to embed Schematron rules in RELAX NG schemas. This would then lead to a possible combination of RELAX NG for the structure, W3C XML Schema part 2 for the datatypes and Schematron for the rules, which would certainly demonstrate the extensibility of XML applications.

Comparisons

To wrap up, I will summarize the pros and cons of each language.

Tool support (as of today)

  1. Best: DTD

  2. Most promising: W3C XML Schema

  3. Challenger: RELAX NG

  4. Niche: Schematron and Examplotron

Features

  1. Structures: DTD, W3C XML Schema, RELAX NG, Examplotron.

  2. Datatype: W3C XML Schema

  3. Integrity: W3C XML Schema, Schematron, Examplotron

  4. Rules: Schematron, Examplotron

Flexibility (ability to describe a wide range of structures)

  1. Most flexible: Schematron (but everything needs to be defined "by hand").

  2. Most flexible structure-based language: RELAX NG.

  3. Challenger: Examplotron

  4. Behind: W3C XML Schema

  5. Least flexible: DTD (lack of namespace support)

So What?

There is currently no perfect XML schema language. Fortunately, there are a number of good choices, each with strengths and weaknesses, and these choices can be combined. Your job may be as simple as picking the right combination for your application.

References