
Understanding XML Schemas

July 1, 1999

Norman Walsh

Editor's note: since the publication of this article the W3C has made significant progress on the XML Schema specification. For an updated reference please see Using W3C XML Schema, published on XML.com November 29, 2000.

Introduction

W3C's Schema Working Draft
6 May 1999

In May, the XML Schema Working Group (WG) published its first Working Draft (WD). Schemas will have a broad impact on the future of XML for two reasons: first because they will define what it means for an XML document to be valid and second because they are a radical departure from Document Type Definitions (DTDs), the existing schema mechanism inherited from SGML.

The Schema WD is published in two parts: Part 1: Structures and Part 2: Datatypes (more about each of these in a moment). Note, however, that the WG begins each of these documents with the forthright statement that they are expected to change in substantial ways. At this stage in the game, what's important is to understand the goals and motivations for XML Schemas. Don't sweat the details.

In this article, I'll explore what schemas are, what validity means, how schemas differ from DTDs, and what new functionality will be gained from adopting them. I'll be using the XML Schemas WD from 6 May 1999 to frame the discussion and as the source for concrete examples.

The following sections cover specific topics in more detail. The sections are independent, so you can read them in whatever order suits you.

Schemas

A schema is a model for describing the structure of information. It's a term borrowed from the database world to describe the structure of data in relational tables. In the context of XML, a schema describes a model for a whole class of documents. The model describes the possible arrangement of tags and text in a valid document. A schema might also be viewed as an agreement on a common vocabulary for a particular application that involves exchanging documents.

Schemas may sound a little technical, but we use them to analyze the world around us. For example, suppose I ask you, "is this a valid postal address?"


<address>
  <name>Namron H. Slaw</name>
  <street>256 Eight Bit Lane</street>
  <city>East Yahoo</city>
  <state>MA</state>
  <zip>12481-6326</zip>
</address>

Mentally, you compare the address presented with a schema that you have in your head for addresses. It probably goes something like this: a postal address consists of a person, possibly at a company or organization, one or more lines of street address, a city, a state or province, a postal code, and an optional country. So, yes, this address is valid.

In schemas, models are described in terms of constraints. A constraint defines what can appear in any given context. There are basically two kinds of constraints that you can give: content model constraints describe the order and sequence of elements and datatype constraints describe valid units of data.

For example, a schema might describe a valid <address> with the content model constraint that it consist of a <name> element, followed by one or more <street> elements, followed by exactly one <city>, <state>, and <zip> element. The content of a <zip> might have a further datatype constraint that it consist of either a sequence of exactly five digits or a sequence of five digits, followed by a hyphen, followed by a sequence of exactly four digits. No other text is a valid ZIP code.

The purpose of a schema is to allow machine validation of document structure. Every specific, individual document which doesn't violate any of the constraints of the model is, by definition, valid according to that schema.

Using the schema described (informally) above, a parser would be able to detect that the following address is not valid:


<address>
  <name>Namron H. Slaw</name>
  <street>256 Eight Bit Lane</street>
  <city>East Yahoo</city>
  <state>MA</state>
  <state>CT</state>
  <zip>blue</zip>
</address>

It violates two constraints of our schema: it does not contain exactly one <state> and the ZIP code is not of the proper form. A formal definition of this schema for addresses is presented in the syntax section.
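To make the idea of machine validation concrete, here is a minimal Python sketch of the informal address schema described above. It is not a schema processor and does not use any XML Schema syntax; the names ADDRESS_MODEL, ZIP_PATTERN, and address_violations are invented for illustration, and the sketch only counts elements rather than checking their order.

```python
import re
import xml.etree.ElementTree as ET

# Informal address model from the text: (element name, min, max); None = unbounded.
ADDRESS_MODEL = [("name", 1, 1), ("street", 1, None),
                 ("city", 1, 1), ("state", 1, 1), ("zip", 1, 1)]
# Five digits, optionally followed by a hyphen and four more digits.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")

def address_violations(xml_text):
    """Return a list of constraint violations (empty if the address is valid)."""
    address = ET.fromstring(xml_text)
    names = [child.tag for child in address]
    problems = []
    # Count check only; element order is ignored to keep the sketch short.
    for name, low, high in ADDRESS_MODEL:
        count = names.count(name)
        if count < low or (high is not None and count > high):
            problems.append(f"<{name}> occurs {count} time(s)")
    for zip_el in address.findall("zip"):
        if not ZIP_PATTERN.fullmatch(zip_el.text or ""):
            problems.append(f"<zip> has invalid content {zip_el.text!r}")
    return problems

GOOD = ("<address><name>Namron H. Slaw</name>"
        "<street>256 Eight Bit Lane</street><city>East Yahoo</city>"
        "<state>MA</state><zip>12481-6326</zip></address>")
BAD = ("<address><name>Namron H. Slaw</name>"
       "<street>256 Eight Bit Lane</street><city>East Yahoo</city>"
       "<state>MA</state><state>CT</state><zip>blue</zip></address>")

print(address_violations(GOOD))  # []
print(address_violations(BAD))   # one problem for <state>, one for <zip>
```

The second call reports the same two violations identified in the text: the repeated <state> and the malformed ZIP code.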

The ability to test the validity of documents is going to be an important aspect of large web applications that exchange information with many sources. If you're receiving XML transactions over the web, you don't want to process the content into your database if it doesn't conform to the proper schema. The earlier and more easily you can catch this sort of error, the better off you'll be. (You wouldn't want to issue someone a refund check because you allowed them to order -4 hammers, would you?)

DTDs

XML inherited Document Type Definitions (DTDs) from SGML, where they are the schema mechanism. XML Schemas are the first widespread attempt to replace DTDs with something "better".

DTDs can be used to define content models (the valid order and nesting of elements) and, to a limited extent, the datatypes of attributes, but they have a number of obvious limitations:

  • They are written in a different (non-XML) syntax.

  • They have no support for namespaces.

  • They offer only extremely limited datatyping. DTDs can express the datatype of attributes only in terms of explicit enumerations and a few coarse string formats; there's no facility for describing numbers, dates, currency values, and so forth. Furthermore, DTDs have no ability to express the datatype of character data in elements.

  • They have a complex and fragile extension mechanism based on little more than string substitution.

    The worst thing about the DTD extension mechanism (parameter entities) is that it doesn't really make relationships explicit. Two elements defined to have the same content models aren't the same thing in any explicit way. Likewise, a group of attributes defined as a parameter entity and reused aren't logically a group; they're just "coincidentally" a group.

XML Schemas overcome these limitations and are much more expressive than DTDs. The additional expressiveness will allow web applications to exchange XML data much more robustly without relying on ad hoc validation tools.

Although XML Schemas are poised to replace DTDs, in the short term DTDs still have a number of advantages:

  • Widespread tool support. All SGML tools and many XML tools can process DTDs.

  • Widespread deployment. A large number of document types are already defined using DTDs: HTML, XHTML, DocBook, TEI, J2008, CALS, etc.

  • Widespread expertise and many years of practical application.

Warts and all, DTDs are well understood by a large community of SGML and XML programmers and consultants.

Features

XML Schemas offer a range of new features.

  • Richer datatypes. Part 2 of the Schema draft defines booleans, numbers, dates and times, URIs, integers, decimal numbers, real numbers, intervals of time, etc.

    In addition to these simple, predefined types, there will be facilities for creating other types and aggregate types (although the mechanisms have not been finalized as of the 06 May 1999 draft).

  • User defined types, called Archetypes in the draft. An archetype allows you to define your own named datatype. For example, you might define a "PostalAddress" datatype and then define two elements, "ShippingAddress" and "BillingAddress" to be of that type.

    This is more powerful than simply defining the two elements to have the same structure because the shared archetype information is available to the processor.

  • Attribute grouping. It's not uncommon to have several attributes that "go together": for example, common attributes that apply to all elements, or several attributes that augment graphic or table elements. Attribute grouping allows the schema author to make this relationship explicit. In DTDs, the grouping can be achieved with a parameter entity, which simplifies authoring the DTD, but the information is not passed on to the processor.

  • Refinable archetypes, or "inheritance". This is probably the most significant new feature in XML Schemas.

    A content model defined by a DTD can be described as "closed": it describes all and only what may appear in the content of the element. XML Schemas admit two other possibilities: "open" and "refinable". In an open content model, all required elements must be present, but it is not an error for additional elements to also be present. A refinable content model is the middle ground: additional elements may be present, but only if the schema defines what they are. (Consider a schema that extends another: it might refine the content model of some element type to add new elements.)

  • Namespace support. Since the introduction of Namespaces in XML, validation has become much more difficult. In fact, until the XML Schema work is completed, it just isn't practical to validate documents that use namespaces.

    The XML Schema WD describes mechanisms for schema composition (allowing schemas for multiple namespaces to be combined in a rational way so that validation can be performed) and support for namespaces.
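The difference between closed, open, and refinable content models, as described in the list above, can be sketched in a few lines of Python. This is only an illustration of the three rules as the text states them, not of how the Working Draft specifies them; the function content_ok and its parameters are hypothetical, and the sketch ignores element order and optional elements.

```python
def content_ok(model, required, present, declared=frozenset()):
    """Check element names against a closed, open, or refinable model.

    required: names the content model demands
    present:  names actually found in the document
    declared: extra names that some schema has defined (refinable only)
    """
    required, present = set(required), set(present)
    if not required <= present:
        return False                    # a required element is missing
    extras = present - required
    if model == "closed":
        return not extras               # all and only the required names
    if model == "open":
        return True                     # any additional elements are tolerated
    if model == "refinable":
        return extras <= set(declared)  # extras must be defined by a schema
    raise ValueError(f"unknown model: {model}")

base = {"name", "street", "city", "state", "zip"}
with_rma = base | {"rma"}
print(content_ok("closed", base, with_rma))                      # False
print(content_ok("open", base, with_rma))                        # True
print(content_ok("refinable", base, with_rma))                   # False
print(content_ok("refinable", base, with_rma, declared={"rma"})) # True
```

The last two calls show the "middle ground": the extra rma element is accepted only once a schema declares it.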

Validity

Saying that a document is "valid" means that it fits within the described model of a class of documents. There are many reasons why you might want to make sure your documents are valid:

  • You're doing electronic commerce and you want to know that the purchase order you just received is exactly what you expect: it's not missing anything, it doesn't have anything extra, and everything that it does have is the right datatype (quantities are all positive numbers, prices are all decimal numbers with two digits after the decimal point, and so forth).

  • You're setting up some business-to-business process with another company. You've agreed to share information from your respective corporate databases, but they aren't quite identical. If you receive a record from your partner's database via XML, you want to be sure that it's valid before you hand it off to the conversion tool that will insert it into your database. Invalid transactions should be rejected immediately so that there's no possibility of bad data slipping into your database.

  • The XML document you're constructing is going to control some overnight batch process and you want to make sure that the instructions you're sending are ones the processor is going to understand. You don't want the process to stop at 2:00am because you forgot to include some required information.

  • You've got 1,000 XML documents that you want to publish on a CD-ROM. You want to be confident that your stylesheet will present each of them correctly without proofing each and every one by hand. If you know that your stylesheet handles all of the valid constructions in your schema, then you know it'll do the right thing if all your documents conform to the schema.

Using a schema and a validating parser offers one standard way to test your documents. (Valid documents can still be semantically wrong: you can submit a purchase order that asks for a hundred boxes of staples when you meant to ask for ten, but checking validity catches a lot of "obvious" errors.)

Every document that you encounter can be classified in one of four ways:

  1. If it is not well-formed, it isn't XML.

  2. If an XML document does not identify a schema to which it claims to conform (and no schema can be inferred), then it is simply well-formed.

  3. If a schema is (or can be) associated with a document, and the document does not fit within the model described by that schema, it is well-formed but not valid.

  4. If a schema is (or can be) associated with a document, and the document does not violate any of the constraints of that schema, it is well-formed and valid.
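The four cases above map naturally onto a small decision function. The sketch below uses Python's standard xml.etree.ElementTree parser for the well-formedness check; the is_valid argument is a hypothetical stand-in for whatever schema validator is available, since the standard library does not provide one.

```python
import xml.etree.ElementTree as ET

def classify(xml_text, is_valid=None):
    """Place a document in one of the four categories listed above.

    is_valid is a hypothetical callable standing in for a schema validator;
    pass None when no schema is associated with the document.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return "not XML"                  # case 1: not well-formed
    if is_valid is None:
        return "well-formed"              # case 2: no schema to test against
    if is_valid(root):
        return "well-formed and valid"    # case 4
    return "well-formed but not valid"    # case 3

def has_name(root):
    """Toy validator: the document must contain a <name> child."""
    return root.find("name") is not None

print(classify("<address><name>"))                            # not XML
print(classify("<address/>"))                                 # well-formed
print(classify("<address/>", has_name))                       # well-formed but not valid
print(classify("<address><name>N</name></address>", has_name))  # well-formed and valid
```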

There are basically two kinds of validity which most people expect schemas to be able to test: the validity of content models and the validity of specific units of data.

Content Model Validity

Content model validity tests whether the order and nesting of tags is correct. Part 1 of the XML Schema WD defines how a schema indicates the correct order and nesting of elements.

An address, for example, might be defined as having a required <name> tag, one or two <street> tags, a required <city>, a required <state>, a required <zip>, and an optional <country> tag.

In XML Schema syntax, the content model of an address could be described like this:

<elementType name="address">
  <sequence>
    <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
    <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="country" minOccur="0" maxOccur="1"/>
  </sequence>
</elementType>

(For a description of the syntax of XML Schema, see the syntax section.)

If you encounter an address that doesn't meet these criteria, it isn't valid (according to the address schema).
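As a rough illustration of what a processor does with the occurrence bounds in the content model above, here is a Python sketch that checks a list of child element names against an ordered sequence. The function matches_sequence and the ADDRESS list are invented for this example; the greedy strategy works only because no two adjacent entries share a name, so it should not be read as a general content-model matcher.

```python
def matches_sequence(children, model):
    """Greedy check of child names against [(name, minOccur, maxOccur), ...]."""
    pos = 0
    for name, low, high in model:
        count = 0
        # Consume as many consecutive matching children as the bound allows.
        while pos < len(children) and children[pos] == name and count < high:
            pos += 1
            count += 1
        if count < low:
            return False
    return pos == len(children)   # leftover children are not allowed

# The address content model shown above, as (name, minOccur, maxOccur).
ADDRESS = [("name", 1, 1), ("street", 1, 2), ("city", 1, 1),
           ("state", 1, 1), ("zip", 1, 1), ("country", 0, 1)]

print(matches_sequence(["name", "street", "city", "state", "zip"], ADDRESS))  # True
print(matches_sequence(["name", "street", "street", "street",
                        "city", "state", "zip"], ADDRESS))                    # False
print(matches_sequence(["name", "city", "street", "state", "zip"], ADDRESS))  # False
```

The second call fails because three <street> elements exceed the maximum of two; the third fails because the elements are out of order.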

Datatype Validity

Datatype validity is the ability to test whether specific units of information are of the correct type and fall within the specified legal values.

For example, if I am writing a schema for catalog order forms, I should be able to express the constraint that the quantity ordered is greater than zero. An order form isn't valid if the quantity of an item ordered is "-5" or "blue".

The ability to express datatype validity in a schema is one of the really new features of XML Schema. Although database schemas have always had this ability, XML DTDs do not. DTDs have extremely limited datatyping.
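The quantity constraint mentioned above is easy to state in code. The following Python sketch (the function name valid_quantity is made up for this example) accepts only text that parses as an integer greater than zero. Note that Python's int() is somewhat more lenient than a schema datatype is likely to be; for instance, it tolerates surrounding whitespace and a leading plus sign.

```python
def valid_quantity(text):
    """True if text is an integer greater than zero."""
    try:
        return int(text) > 0
    except ValueError:
        return False      # not an integer at all, e.g. "blue" or "2.5"

for sample in ["3", "0", "-5", "blue"]:
    print(sample, valid_quantity(sample))
```

Only "3" passes; "0" and "-5" fail the range check, and "blue" fails the type check.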

Syntax

XML Schema documents are XML documents. This means that they use elements and attributes to express the semantics of the schema and that they can be edited and processed with the same tools that you use to process other XML documents.

The vocabulary of an XML Schema document is comprised of about thirty elements and attributes. (In a somewhat recursive manner, XML Schema documents are valid only if they conform to the schema for XML Schema. There is also a DTD for XML Schema.)

At bottom, a schema describes the content of elements and attributes, so let's begin with a simple example. Going back to the address example, a <name> could be defined this way in an XML Schema:

Example 1. The name Element Type
<elementType name="name">
  <mixed/>
</elementType>

This example defines the name element type. The term "element type" is used, rather than "element" to distinguish between the type of the thing and the thing itself. In practice, this distinction is usually fairly obvious.

The content of the <elementType> element defines the valid content of an element of that type. In this case, the content type is <mixed/>, meaning that the element can contain a mixture of character data and elements (since no elements are actually included in the definition, it can only contain character data).

The datatyping power of XML Schema can be seen in the declaration for <zip>. We begin by defining a zipCode datatype which is a string that can contain either exactly five digits or exactly five digits followed by a hyphen followed by exactly four digits:

Example 2. A ZIP Code Datatype
<datatype name="zipCode">
  <basetype name="string"/>
  <lexicalRepresentation>
    <lexical>99999</lexical>
    <lexical>99999-9999</lexical>
  </lexicalRepresentation>
</datatype>

This example uses "pictures" to define the datatype, but XML Schemas include a variety of other mechanisms allowing you to easily declare numbers with specific bounds and precision, dates, times, and so forth.
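One way to think about a "picture" is as shorthand for a regular expression. The Python sketch below assumes that a 9 in a picture stands for exactly one digit and that every other character is a literal; that reading matches the description of the zipCode datatype above, but it is an assumption rather than the Working Draft's definition. The helper picture_to_regex is hypothetical.

```python
import re

def picture_to_regex(pictures):
    """Compile a list of pictures into one regular expression.

    Assumes '9' stands for a single digit and everything else is literal.
    """
    alternatives = []
    for picture in pictures:
        parts = [r"\d" if ch == "9" else re.escape(ch) for ch in picture]
        alternatives.append("".join(parts))
    return re.compile("|".join(alternatives))

# The two lexical forms from the zipCode datatype.
ZIP_CODE = picture_to_regex(["99999", "99999-9999"])

for sample in ["12481", "12481-6326", "1248", "12481-63", "blue"]:
    # fullmatch requires the whole string to fit one of the pictures
    print(sample, bool(ZIP_CODE.fullmatch(sample)))
```

The first two samples match; the remaining three do not.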

With the zipCode datatype defined, it's now a simple matter to declare that a <zip> must be of that type:

Example 3. The zip Element Type
<elementType name="zip">
  <datatypeRef name="zipCode"/>
</elementType>

In a DTD, there is no way to express these sorts of constraints. The best we could do would be to say that a <zip> contained character data, just like <name>.

In our schema, we can build on these basic types to define aggregate element types like <address>:

Example 4. An Address in Schema Notation
<elementType name="address">
  <sequence>
    <elementTypeRef name="company" minOccur="0" maxOccur="1"/>
    <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
    <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
  </sequence>
</elementType>

This element type is a little different from the preceding ones: it defines the content of the <address> element in terms of other elements. It begins with a <sequence>. A sequence is like the "," separator in DTD syntax: it indicates that the things inside the sequence must occur in the order given. Inside the sequence we see references to other element types. Each element type so referenced must have a corresponding <elementType> declaration elsewhere in the schema.

The occurrence qualifiers indicate how often each element may occur. A minimum occurrence of zero makes the element optional. These indicators serve the same purpose as the "?", "*", and "+" qualifiers in DTD syntax, but they are more flexible since both minimum and maximum values may be specified.

The equivalent <address> declaration in DTD syntax looks like this:

Example 5. An Address in DTD Notation
<!ELEMENT address (company?, name, street+, city, state, zip)>

Only that isn't quite equivalent, because you can put in as many "street" elements as you want, whereas the XML Schema version allows only one or two. It would be possible to get this effect in DTD syntax with (street, street?), but it quickly becomes tedious (consider the case where you want between 5 and 50 occurrences).
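To see how the minimum and maximum bounds relate to the DTD qualifiers, and why bounded ranges get tedious, consider this Python sketch. The function dtd_particle is invented for illustration; it maps a (minimum, maximum) pair to a DTD content particle, using "?", "*", and "+" where they fit and spelling out everything else. The optional tail is nested so that the resulting model stays unambiguous.

```python
def dtd_particle(name, min_occur, max_occur=None):
    """Render minimum/maximum occurrence bounds as a DTD content particle.

    max_occur=None means unbounded.
    """
    shorthand = {(1, 1): "", (0, 1): "?", (0, None): "*", (1, None): "+"}
    if (min_occur, max_occur) in shorthand:
        return name + shorthand[(min_occur, max_occur)]
    required = [name] * min_occur
    if max_occur is None:
        # "n or more": repeat the name and let the last copy carry "+"
        return ", ".join(required[:-1] + [name + "+"])
    # Nest the optional copies: up to two extra becomes (name, name?)?
    tail = ""
    for _ in range(max_occur - min_occur):
        tail = f"({name}, {tail})?" if tail else f"{name}?"
    return ", ".join(required + ([tail] if tail else []))

print(dtd_particle("company", 0, 1))  # company?
print(dtd_particle("street", 1))      # street+
print(dtd_particle("street", 1, 2))   # street, street?
print(dtd_particle("street", 1, 3))   # street, (street, street?)?
```

Calling dtd_particle("street", 5, 50) produces a particle with 45 levels of nesting, which is the tedium the text refers to.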

Suppose you wanted to have several addresses. Using DTD syntax, you'd create a parameter entity and then use that:

Example 6. An Address with Parameter Entities

<!ENTITY % address
    "company?, name, street+, city, state, zip">

<!ELEMENT billing.address (%address;)>
<!ELEMENT shipping.address (%address;)>

In an XML Schema, you'd use an archetype:

Example 7. An Address Archetype in XML Schema

<archetype name="address" model="refinable">
  <sequence>
    <elementTypeRef name="company" minOccur="0" maxOccur="1"/>
    <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
    <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
  </sequence>
</archetype>

<elementType name="billing.address">
  <archetypeRef name="address"/>
</elementType>

<elementType name="shipping.address">
  <archetypeRef name="address"/>
</elementType>

This example demonstrates two significant advantages of an archetype:

  1. The archetype is refinable. This means that I can derive new, related address types from it. I could create, for example, a return address that included everything in an address but added an element to hold the RMA (return merchandise authorization) number.

  2. The relationship that a billing.address is an address and a shipping.address is an address is explicit. In the DTD case, the parser expands the parameter entities and you get what amounts to this:

    
    <!ELEMENT billing.address (company?, name, street+, city, state, zip)>
    <!ELEMENT shipping.address (company?, name, street+, city, state, zip)>
    
    

    With a complex enough content model, you can't immediately tell that two elements are the same. And there's no way for the parser to know if they're the same because they're really the same, or if they're the same just by coincidence.
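The loss of information described in the second point can be simulated with plain string substitution. The Python sketch below is only an approximation (it ignores the finer rules of XML parameter entity processing), and the names expand_parameter_entities and ARCHETYPE_OF are invented for the example. It contrasts the expanded DTD text, where nothing links the two declarations, with a simple lookup table standing in for the archetype references a schema processor would retain.

```python
DTD_SOURCE = """<!ELEMENT billing.address (%address;)>
<!ELEMENT shipping.address (%address;)>"""
ENTITIES = {"address": "company?, name, street+, city, state, zip"}

def expand_parameter_entities(dtd_text, entities):
    """Replace each %name; reference with its replacement text."""
    for name, replacement in entities.items():
        dtd_text = dtd_text.replace(f"%{name};", replacement)
    return dtd_text

expanded = expand_parameter_entities(DTD_SOURCE, ENTITIES)
print(expanded)
print("%address;" in expanded)   # False: the shared name is gone after expansion

# With archetypes, the relationship survives as data a processor can query.
# (A plain dict stands in for the schema's archetypeRef information.)
ARCHETYPE_OF = {"billing.address": "address", "shipping.address": "address"}
same = ARCHETYPE_OF["billing.address"] == ARCHETYPE_OF["shipping.address"]
print(same)                      # True: both are explicitly "address"
```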

For comparison, here's a more complete example. Example 8, "A Purchase Order", shows a sample document, an XML purchase order:

Example 8. A Purchase Order
<!DOCTYPE purchase.order SYSTEM "po.dtd">

<purchase.order>

<date>16 June 1967</date>

<billing.address>
  <name>Namron H. Slaw</name>
  <street>256 Eight Bit Lane</street>
  <city>East Yahoo</city>
  <state>MA</state>
  <zip>12481-6326</zip>
</billing.address>

<items>
  <item>
    <quantity>3</quantity>
    <product.number>248</product.number>
    <description>Decorative Widget, Red, Large</description>
    <unitcost>19.95</unitcost>
  </item>
  <item>
    <quantity>1</quantity>
    <product.number>1632</product.number>
    <description>Packed electron storage container, AA, 4-pack</description>
    <unitcost>4.95</unitcost>
  </item>
</items>

</purchase.order>

The rest of this section examines the schema for this document type in more detail. The text of the schema is available along with an equivalent DTD.

Example 9. A Schema for Purchase Orders
<!DOCTYPE schema SYSTEM "o:/reference/w3c/schema/structures.dtd">

<schema>

Since I don't have a schema processor, I'm using the schema DTD to validate my schema. All schemas begin with <schema>.

<archetype name="address" model="refinable">
  <sequence>
    <elementTypeRef name="company" minOccur="0" maxOccur="1"/>
    <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
    <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
  </sequence>
</archetype>

As discussed above, I define an archetype for addresses so that I can use it to define address elements.

<elementType name="billing.address">
  <archetypeRef name="address"/>
</elementType>

<elementType name="shipping.address">
  <archetypeRef name="address"/>
</elementType>

Now that I've got an archetype, I use it to define <billing.address> and <shipping.address>.

<elementType name="items">
  <elementTypeRef name="item" minOccur="1"/>
</elementType>

The <items> element is just a wrapper for one or more <item> elements.

<elementType name="item">
  <sequence>
    <elementTypeRef name="quantity" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="product.number" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="description" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="unitcost" minOccur="1" maxOccur="1"/>
  </sequence>
</elementType>

Each <item> contains exactly one <quantity>, <product.number>, <description>, and <unitcost>.

<elementType name="purchase.order">
  <sequence>
    <elementTypeRef name="date" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="billing.address" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="shipping.address" minOccur="0" maxOccur="1"/>
    <elementTypeRef name="items" minOccur="1" maxOccur="1"/>
  </sequence>
</elementType>

Similarly, the <purchase.order> consists of a date, a billing address, an optional shipping address, and a list of items.

<elementType name="company">
  <mixed/>
</elementType>

<elementType name="name">
  <mixed/>
</elementType>

<elementType name="street">
  <mixed/>
</elementType>

<elementType name="city">
  <mixed/>
</elementType>

<elementType name="state">
  <mixed/>
</elementType>

<datatype name="zipCode">
  <basetype name="string"/>
  <lexicalRepresentation>
    <lexical>99999</lexical>
    <lexical>99999-9999</lexical>
  </lexicalRepresentation>
</datatype>

<elementType name="zip">
  <datatypeRef name="zipCode"/>
</elementType>

<elementType name="product.number">
  <mixed/>
</elementType>

<elementType name="description">
  <mixed/>
</elementType>

Most of the address elements are just character data and the ZIP code is defined as described earlier. The <product.number> and <description> are also just character data.

<datatype name="quantityType">
  <basetype name="integer"/>
  <minExclusive>0</minExclusive>
</datatype>

<elementType name="quantity">
  <datatypeRef name="quantityType"/>
</elementType>

The content of the <quantity> element is defined to be an integer larger than zero.

<datatype name="currency">
  <basetype name="decimal"/>
  <precision>8</precision>
  <scale>2</scale>
</datatype>

<elementType name="unitcost">
  <datatypeRef name="currency"/>
</elementType>

For the <unitcost>, it's important that the data entered represent a reasonable price. In this case, I've chosen to allow prices to be up to eight digits long with two digits after the decimal point. That's enough for an order of just under a million dollars!
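The effect of the precision and scale settings can be approximated in Python with the standard decimal module. This sketch assumes that precision means the total number of digits and scale means the maximum number of digits after the decimal point, which is how the paragraph above describes them; the Working Draft's exact definitions may differ. The function valid_currency is hypothetical.

```python
from decimal import Decimal, InvalidOperation

def valid_currency(text, precision=8, scale=2):
    """True if text is a decimal within the assumed precision and scale."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return False                       # not a number at all
    if not value.is_finite():
        return False                       # rejects NaN and Infinity
    _, digits, exponent = value.as_tuple()
    fraction_digits = max(0, -exponent)
    # Digits as written, counting any zeros implied by a positive exponent.
    total_digits = max(len(digits), fraction_digits) + max(0, exponent)
    return fraction_digits <= scale and total_digits <= precision

for sample in ["19.95", "4.95", "999999.99", "4.955", "blue"]:
    print(sample, valid_currency(sample))
```

The first three samples are accepted; "4.955" has too many fractional digits and "blue" is not a number. The sketch does not reject negative values, which a real currency type would probably need to address.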

<elementType name="date">
  <datatypeRef name="dateTime"/>
</elementType>

</schema>

Finally, the <date> element uses the built-in dateTime type, and the schema ends with </schema>.

Conclusion

Looking at the scope and functionality that schemas will provide, they seem like a great improvement over DTDs. Certain kinds of applications (exchanging information between databases, for example, and electronic commerce) are clearly going to be made simpler and more interoperable by XML Schemas.

As I see it, the primary virtue of DTDs today is that they are well understood and they do offer a good way to describe the structure of a document for interchange. It will take some time before XML Schemas are as well understood. Until then, we'll be "flying without a net" to a certain extent, waiting for the final standard and practical, documented methodologies for schema creation to follow.