XML.com 
 Published on XML.com http://www.xml.com/pub/a/1999/07/schemas/index.html

Understanding XML Schemas
By Norman Walsh
July 01, 1999

Editor's note: since the publication of this article the W3C has made significant progress on the XML Schema specification. For an updated reference please see Using W3C XML Schema, published on XML.com November 29, 2000.

Introduction

W3C's Schema Working Draft
6 May 1999

The Schema WD is published in two parts: Part 1: Structures and Part 2: Datatypes (more about each of these in a moment). Note, however, that the WG begins each of these documents with the forthright statement that they are expected to change in substantial ways. At this stage in the game, what's important is to understand the goals and motivations for XML Schemas. Don't sweat the details.

In May, the XML Schema Working Group (WG) published its first Working Draft (WD). Schemas will have a broad impact on the future of XML for two reasons: first, because they will define what it means for an XML document to be valid, and second, because they are a radical departure from Document Type Definitions (DTDs), the existing schema mechanism inherited from SGML.

In this article, I'll explore what schemas are, what validity means, how schemas differ from DTDs, and what new functionality will be gained from adopting them. I'll be using the XML Schemas WD from 6 May 1999 to frame the discussion and as the source for concrete examples.

The following sections cover specific topics in more detail. The sections are independent, so you can read them in whatever order suits you.

Schemas

A schema is a model for describing the structure of information. It's a term borrowed from the database world to describe the structure of data in relational tables. In the context of XML, a schema describes a model for a whole class of documents. The model describes the possible arrangement of tags and text in a valid document. A schema might also be viewed as an agreement on a common vocabulary for a particular application that involves exchanging documents.

Schemas may sound a little technical, but we use them to analyze the world around us. For example, suppose I ask you, "is this a valid postal address?"

<address>
<name>Namron H. Slaw</name>
<street>256 Eight Bit Lane</street>
<city>East Yahoo</city>
<state>MA</state>
<zip>12481-6326</zip>
</address>

Mentally, you compare the address presented with a schema that you have in your head for addresses. It probably goes something like this: a postal address consists of a person, possibly at a company or organization, one or more lines of street address, a city, a state or province, a postal code, and an optional country. So, yes, this address is valid.

In schemas, models are described in terms of constraints. A constraint defines what can appear in any given context. There are basically two kinds of constraints that you can give: content model constraints describe the order and nesting of elements, and datatype constraints describe valid units of data.

For example, a schema might describe a valid <address> with the content model constraint that it consist of a <name> element, followed by one or more <street> elements, followed by exactly one each of <city>, <state>, and <zip>. The content of a <zip> might have a further datatype constraint that it consist of either a sequence of exactly five digits or a sequence of exactly five digits, followed by a hyphen, followed by a sequence of exactly four digits. No other text is a valid ZIP code.

The purpose of a schema is to allow machine validation of document structure. Every specific, individual document which doesn't violate any of the constraints of the model is, by definition, valid according to that schema.

Using the schema described (informally) above, a parser would be able to detect that the following address is not valid:

<address>
<name>Namron H. Slaw</name>
<street>256 Eight Bit Lane</street>
<city>East Yahoo</city>
<state>MA</state>
<state>CT</state>
<zip>blue</zip>
</address>

It violates two constraints of our schema: it does not contain exactly one <state> and the ZIP code is not of the proper form. A formal definition of this schema for addresses is presented in the syntax section.
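The two kinds of constraint can be illustrated with a small hand-written checker. This is only a sketch in Python, not a schema processor: the names ADDRESS_MODEL, ZIP_PATTERN, and check_address are hypothetical, and the simple left-to-right matching works only because every element in this particular sequence has a distinct name.

```python
import re
import xml.etree.ElementTree as ET

# Content model from the informal schema in the text:
# (element name, minimum count, maximum count); None means "no upper bound".
ADDRESS_MODEL = [
    ("name", 1, 1),
    ("street", 1, None),
    ("city", 1, 1),
    ("state", 1, 1),
    ("zip", 1, 1),
]

# Datatype constraint: five digits, optionally followed by a hyphen and four digits.
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def check_address(xml_text):
    """Return a list of constraint violations; an empty list means valid."""
    address = ET.fromstring(xml_text)
    tags = [child.tag for child in address]
    errors = []
    pos = 0
    # Walk the children in order, counting how many match each model entry.
    for tag, min_count, max_count in ADDRESS_MODEL:
        count = 0
        while pos < len(tags) and tags[pos] == tag:
            count += 1
            pos += 1
        if count < min_count or (max_count is not None and count > max_count):
            errors.append(f"<{tag}> occurs {count} time(s)")
    if pos < len(tags):
        errors.append(f"unexpected <{tags[pos]}>")
    # Datatype check on the text content of <zip>, if one is present.
    zip_element = address.find("zip")
    if zip_element is not None and not ZIP_PATTERN.fullmatch(zip_element.text or ""):
        errors.append("<zip> is not a valid ZIP code")
    return errors
```

Applied to the second address above, this sketch reports one content model violation (two <state> elements) and one datatype violation (the ZIP code "blue"); applied to the first address, it reports none.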

The ability to test the validity of documents is going to be an important aspect of large web applications that are receiving and sending information to and from lots of sources. If you're receiving XML transactions over the web, you don't want to process the content into your database if it doesn't conform to the proper schema. The earlier and more easily you can catch this sort of error, the better off you'll be. (You wouldn't want to issue someone a refund check because you allowed them to order -4 hammers, would you?)

DTDs

XML inherited Document Type Definitions (DTDs) from SGML. DTDs are the schema mechanism for SGML. XML Schemas are the first widespread attempt to replace DTDs with something "better".

DTDs can be used to define content models (the valid order and nesting of elements) and, to a limited extent, the datatypes of attributes, but they have a number of obvious limitations:

XML Schemas overcome these limitations and are much more expressive than DTDs. The additional expressiveness will allow web applications to exchange XML data much more robustly without relying on ad hoc validation tools.

Although XML Schema is poised to replace DTDs, in the short term DTDs still have a number of advantages:

Warts and all, DTDs are well understood by a large community of SGML and XML programmers and consultants.

Features

XML Schemas offer a range of new features.

Validity

Saying that a document is "valid" means that it fits within the described model of a class of documents. There are many reasons why you might want to make sure your documents are valid:

Using a schema and a validating parser offers one standard way to test your documents. (Valid documents can still be semantically wrong: you can submit a purchase order that asks for a hundred boxes of staples when you meant to ask for ten, but checking validity catches a lot of "obvious" errors.)

Every document that you encounter can be classified in one of four ways:

  1. If it is not well-formed, it isn't XML.

  2. If an XML document does not identify a schema to which it claims to conform (and no schema can be inferred), then it is simply well-formed.

  3. If a schema is (or can be) associated with a document, and the document does not fit within the model described by that schema, it is well-formed but not valid.

  4. If a schema is (or can be) associated with a document, and the document does not violate any of the constraints of that schema, it is well-formed and valid.
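The four cases above can be sketched as a small Python function. This is an illustration only: classify and its validator parameter are hypothetical names, and the validator stands in for whatever schema check is available (it is simply a callable that returns True or False for a parsed document).

```python
import xml.etree.ElementTree as ET


def classify(xml_text, validator=None):
    """Place a document into one of the four categories described in the text."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Case 1: not well-formed, so not XML at all.
        return "not XML"
    if validator is None:
        # Case 2: no schema is associated with the document.
        return "well-formed"
    if validator(root):
        # Case 4: the document violates none of the schema's constraints.
        return "well-formed and valid"
    # Case 3: a schema applies, but the document does not fit its model.
    return "well-formed but not valid"
```

Note that well-formedness is checked first and unconditionally; validity is only a meaningful question once the parser has accepted the document.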

There are basically two kinds of validity which most people expect schemas to be able to test: the validity of content models and the validity of specific units of data.

Content Model Validity

Content model validity tests whether the order and nesting of tags is correct. Part 1 of the XML Schema WD defines how a schema indicates the correct order and nesting of elements.

An address, for example, might be defined as having a required <name> tag, one or more <street> tags, a required <city>, a required <state>, a required <zip>, and an optional <country> tag.

In XML Schema syntax, the content model of an address could be described like this:

<elementType name="address">
  <sequence>
    <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
    <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="country" minOccur="0" maxOccur="1"/>
  </sequence>
</elementType>

(For a description of the syntax of XML Schema, see the syntax section.)

If you encounter an address that doesn't meet these criteria, it isn't valid (according to the address schema).

Datatype Validity

Datatype validity is the ability to test whether specific units of information are of the correct type and fall within the specified legal values.

For example, if I am writing a schema for catalog order forms, I should be able to express the constraint that the quantity ordered is greater than zero. An order form isn't valid if the quantity of an item ordered is "-5" or "blue".
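As a rough illustration of what such a constraint checks, here is a minimal Python sketch. The function name is_valid_quantity is hypothetical, and the sketch assumes a quantity is written as plain ASCII digits with an optional sign.

```python
import re


def is_valid_quantity(text):
    """True if text is an integer literal whose value is greater than zero."""
    text = text.strip()
    # Type check: an optional sign followed by one or more ASCII digits.
    if not re.fullmatch(r"[+-]?[0-9]+", text):
        return False
    # Range check: the value must be strictly greater than zero.
    return int(text) > 0
```

The check has two parts, mirroring the definition above: "blue" fails because it is not of the correct type, while "-5" is of the correct type but falls outside the legal values.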

The ability to express datatype validity in a schema is one of the really new features of XML Schema. Although database schemas have always had this ability, XML DTDs do not. DTDs have extremely limited datatyping.

Syntax

XML Schema documents are XML documents. This means that they use elements and attributes to express the semantics of the schema and that they can be edited and processed with the same tools that you use to process other XML documents.

The vocabulary of an XML Schema document comprises about thirty elements and attributes. (In a somewhat recursive manner, XML Schema documents are valid only if they conform to the schema for XML Schema. There is also a DTD for XML Schema.)

At bottom, a schema describes the content of elements and attributes, so let's begin with a simple example. Going back to the address example, a <name> could be defined this way in an XML Schema:

Example 1. The name Element Type
<elementType name="name">
  <mixed/>
</elementType>

This example defines the name element type. The term "element type" is used, rather than "element," to distinguish between the type of the thing and the thing itself. In practice, this distinction is usually fairly obvious.

The content of the <elementType> element defines the valid content of an element of that type. In this case, the content type is <mixed/>, meaning that the element can contain a mixture of character data and elements (since no elements are actually included in the definition, it can only contain character data).

The datatyping power of XML Schema can be seen in the declaration for <zip>. We begin by defining a zipCode datatype which is a string that can contain either exactly five digits or exactly five digits followed by a hyphen followed by exactly four digits:

Example 2. A ZIP Code Datatype
<datatype name="zipCode">
  <basetype name="string"/>
  <lexicalRepresentation>
    <lexical>99999</lexical>
    <lexical>99999-9999</lexical>
  </lexicalRepresentation>
</datatype>

This example uses "pictures" to define the datatype, but XML Schemas include a variety of other mechanisms allowing you to easily declare numbers with specific bounds and precision, dates, times, and so forth.
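To make the idea of a "picture" concrete, here is a Python sketch that turns pictures into regular expressions. It rests on an assumption drawn only from the ZIP code example: that a "9" in a picture stands for any single digit and that every other character stands for itself. The Working Draft may define pictures differently, and the function names are hypothetical.

```python
import re


def picture_to_regex(picture):
    """Translate a picture into a regular expression (assumed rule: 9 = one digit)."""
    return "".join("[0-9]" if ch == "9" else re.escape(ch) for ch in picture)


def matches_any(text, pictures):
    """True if the whole of text matches at least one of the pictures."""
    return any(re.fullmatch(picture_to_regex(p), text) for p in pictures)


# The two pictures from the zipCode datatype in Example 2.
ZIP_PICTURES = ["99999", "99999-9999"]
```

Under that assumption, matches_any("12481-6326", ZIP_PICTURES) accepts the ZIP code from the sample address, while "blue" and "1248" are rejected.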

With the zipCode datatype defined, it's now a simple matter to declare that a <zip> must be of that type:

Example 3. The zip Element Type
<elementType name="zip">
  <datatypeRef name="zipCode"/>
</elementType>

In a DTD, there is no way to express these sorts of constraints. The best we could do would be to say that a <zip> contained character data, just like <name>.

In our schema, we can build on these basic types to define aggregate element types like <address>:

Example 4. An Address in Schema Notation
<elementType name="address">
  <sequence>
    <elementTypeRef name="company" minOccur="0" maxOccur="1"/>
    <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
    <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
  </sequence>
</elementType>

This element type is a little different from the preceding ones; it defines the content of the <address> element in terms of other elements. It begins with a <sequence>. A sequence is like the "," separator in DTD syntax: it indicates that the things inside the sequence must occur in the order given. Inside the sequence we see references to other element types. Each element type so referenced must have a corresponding <elementType> declaration elsewhere in the schema.

The occurrence qualifiers indicate how often each element may occur. A minimum occurrence of zero makes the element optional. These indicators serve the same purpose as the "?", "*", and "+" qualifiers in DTD syntax, but they are more flexible since both minimum and maximum values may be specified.

The equivalent <address> declaration in DTD syntax looks like this:

Example 5. An Address in DTD Notation
<!ELEMENT address 
          (company?, name, street+, city, state, zip)>

Only that isn't quite equivalent, because you can put in as many "street" elements as you want, whereas the XML Schema version allows only one or two. It would be possible to get this effect in DTD syntax (street, street?), but it quickly becomes tedious (consider the case where you want between 5 and 50 occurrences).
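To see how tedious, consider a Python sketch that writes out the DTD content model fragment for a given minimum and maximum. The function dtd_particle is hypothetical; it nests the optional copies, as in (street, street?)?, which is one way to keep the resulting content model deterministic.

```python
def dtd_particle(name, min_occur, max_occur=None):
    """Spell out min/max occurrence bounds in DTD syntax (max None = unbounded)."""
    if min_occur < 0 or (max_occur is not None and (max_occur < 1 or max_occur < min_occur)):
        raise ValueError("unsupported occurrence bounds")
    # The four cases that DTD syntax can express directly.
    simple = {(1, 1): "", (0, 1): "?", (0, None): "*", (1, None): "+"}
    if (min_occur, max_occur) in simple:
        return name + simple[(min_occur, max_occur)]
    if max_occur is None:
        # "n or more": n - 1 required copies followed by name+.
        return ", ".join([name] * (min_occur - 1) + [name + "+"])
    # Required copies first, then a nested chain of optional copies.
    tail = ""
    for _ in range(max_occur - min_occur):
        tail = f"({name}, {tail})?" if tail else f"{name}?"
    return ", ".join([name] * min_occur + ([tail] if tail else []))
```

For the address example, dtd_particle("street", 1, 2) gives "street, street?". For the 5-to-50 case the result repeats the element name fifty times, where the schema notation needs only two attribute values.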

Suppose you wanted to have several addresses. Using DTD syntax, you'd create a parameter entity and then use that:

Example 6. An Address with Parameter Entities
<!ENTITY % address 
    "company?, name, street+, city, state, zip">

<!ELEMENT billing.address (%address;)>
<!ELEMENT shipping.address (%address;)>

In an XML Schema, you'd use an archetype:

Example 7. An Address Archetype in XML Schema
<archetype name="address" model="refinable">
  <sequence>
    <elementTypeRef name="company" minOccur="0" maxOccur="1"/>
    <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
    <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
  </sequence>
</archetype>

<elementType name="billing.address">
  <archetypeRef name="address"/>
</elementType>

<elementType name="shipping.address">
  <archetypeRef name="address"/>
</elementType>

This example demonstrates two significant advantages of an archetype:

  1. The archetype is refinable. This means that I can derive new, related address types from it. I could create, for example, a return address that included everything in an address but added an element to hold the RMA (return merchandise authorization) number.

  2. The relationship that a billing.address is an address and a shipping.address is an address is explicit. In the DTD case, the parser expands the parameter entities and you get what amounts to this:

    <!ELEMENT billing.address (company?, name, street+, city, state, zip)>
    <!ELEMENT shipping.address (company?, name, street+, city, state, zip)>
    

    With a complex enough content model, you can't immediately tell that two elements are the same. And there's no way for the parser to know if they're the same because they're really the same, or if they're the same just by coincidence.

For comparison, here's a more complete example. Example 8, "A Purchase Order," shows a sample document, an XML purchase order:

Example 8. A Purchase Order
<!DOCTYPE purchase.order SYSTEM "po.dtd">

<purchase.order>

<date>16 June 1967</date>

<billing.address>
  <name>Namron H. Slaw</name>
  <street>256 Eight Bit Lane</street>
  <city>East Yahoo</city>
  <state>MA</state>
  <zip>12481-6326</zip>
</billing.address>

<items>
  <item>
    <quantity>3</quantity>
    <product.number>248</product.number>
    <description>Decorative Widget, Red, Large</description>
    <unitcost>19.95</unitcost>
  </item>
  <item>
    <quantity>1</quantity>
    <product.number>1632</product.number>
    <description>Packed electron storage container, AA, 4-pack</description>
    <unitcost>4.95</unitcost>
  </item>
</items>

</purchase.order>

The rest of this section examines the schema for this document type in more detail. The text of the schema is available along with an equivalent DTD.

Example 9. A Schema for Purchase Orders
<!DOCTYPE schema SYSTEM "o:/reference/w3c/schema/structures.dtd">

<schema>

Since I don't have a schema processor, I'm using the schema DTD to validate my schema. All schemas begin with <schema>.

<archetype name="address" model="refinable">
  <sequence>
    <elementTypeRef name="company" minOccur="0" maxOccur="1"/>
    <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
    <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
  </sequence>
</archetype>

As discussed above, I define an archetype for addresses so that I can use it to define address elements.

<elementType name="billing.address">
  <archetypeRef name="address"/>
</elementType>

<elementType name="shipping.address">
  <archetypeRef name="address"/>
</elementType>

Now that I've got an archetype, I use it to define <billing.address> and <shipping.address>.

<elementType name="items">
  <elementTypeRef name="item" minOccur="1"/>
</elementType>

The <items> element is just a wrapper for one or more <item> elements.

<elementType name="item">
  <sequence>
    <elementTypeRef name="quantity" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="product.number" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="description" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="unitcost" minOccur="1" maxOccur="1"/>
  </sequence>
</elementType>

Each <item> contains exactly one <quantity>, <product.number>, <description>, and <unitcost>.

<elementType name="purchase.order">
  <sequence>
    <elementTypeRef name="date" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="billing.address" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="shipping.address" minOccur="0" maxOccur="1"/>
    <elementTypeRef name="items" minOccur="1" maxOccur="1"/>
  </sequence>
</elementType>

Similarly, the <purchase.order> consists of a date, a billing address, an optional shipping address, and a number of items.

<elementType name="company">
  <mixed/>
</elementType>

<elementType name="name">
  <mixed/>
</elementType>

<elementType name="street">
  <mixed/>
</elementType>

<elementType name="city">
  <mixed/>
</elementType>

<elementType name="state">
  <mixed/>
</elementType>

<datatype name="zipCode">
  <basetype name="string"/>
  <lexicalRepresentation>
    <lexical>99999</lexical>
    <lexical>99999-9999</lexical>
  </lexicalRepresentation>
</datatype>

<elementType name="zip">
  <datatypeRef name="zipCode"/>
</elementType>

<elementType name="product.number">
  <mixed/>
</elementType>

<elementType name="description">
  <mixed/>
</elementType>

Most of the address elements are just character data and the ZIP code is defined as described earlier. The <product.number> and <description> are also just character data.

<datatype name="quantityType">
  <basetype name="integer"/>
  <minExclusive>0</minExclusive>
</datatype>

<elementType name="quantity">
  <datatypeRef name="quantityType"/>
</elementType>

The content of the <quantity> element is defined to be an integer larger than zero.

<datatype name="currency">
  <basetype name="decimal"/>
  <precision>8</precision>
  <scale>2</scale>
</datatype>

<elementType name="unitcost">
  <datatypeRef name="currency"/>
</elementType>

For the <unitcost>, it's important that the data entered represent a reasonable price. In this case, I've chosen to allow prices to be up to eight digits long with two digits after the decimal point. That's enough for a million dollar order!

<elementType name="date">
  <datatypeRef name="dateTime"/>
</elementType>

</schema>

Finally, the <date> element uses the built-in dateTime type, and the schema ends with </schema>.
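The effect of the precision and scale settings on the currency datatype can be approximated in Python. This sketch follows the reading given in the text (at most eight digits in total, at most two of them after the decimal point); the Working Draft's exact rules may differ, and is_valid_currency is a hypothetical name.

```python
from decimal import Decimal, InvalidOperation


def is_valid_currency(text, precision=8, scale=2):
    """True if text is a decimal number that fits within precision and scale."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return False  # not a number at all, e.g. "blue"
    if not value.is_finite():
        return False  # reject NaN and Infinity
    # With 8 digits in total and 2 after the point, 6 remain before it.
    if abs(value) >= Decimal(10) ** (precision - scale):
        return False
    # No more than `scale` digits after the decimal point: rounding to
    # that many places must leave the value unchanged.
    return value == value.quantize(Decimal(1).scaleb(-scale))
```

Both unit costs in the sample purchase order, 19.95 and 4.95, pass this check; 19.955 fails on scale and 1000000.00 fails on precision. The sketch does not restrict the sign, because the datatype as declared sets no minimum.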

Conclusion

Looking at the scope and functionality that schemas will provide, they seem like a great improvement over DTDs. Certain kinds of applications, such as e-commerce and the exchange of information between databases, are clearly going to be made simpler and more interoperable by XML Schemas.

As I see it, the primary virtue of DTDs today is that they are well understood and they do offer a good way to describe the structure of a document for interchange. It will take some time before XML Schemas are as well understood. Until then, we'll be "flying without a net" to a certain extent, waiting for the final standard and practical, documented methodologies for schema creation to follow.

XML.com Copyright © 1998-2006 O'Reilly Media, Inc.