
Syntax

July 1, 1999

Norman Walsh

XML Schema documents are XML documents. This means that they use elements and attributes to express the semantics of the schema and that they can be edited and processed with the same tools that you use to process other XML documents.

The vocabulary of an XML Schema document consists of about thirty elements and attributes. (In a somewhat recursive manner, XML Schema documents are valid only if they conform to the schema for XML Schema. There is also a DTD for XML Schema.)

At bottom, a schema describes the content of elements and attributes, so let's begin with a simple example. Going back to the address example, a <name> could be defined this way in an XML Schema:

Example 1. The name Element Type
<elementType name="name">
  <mixed/>
</elementType>

This example defines the name element type. The term "element type" is used, rather than "element", to distinguish between the type of the thing and the thing itself. In practice, this distinction is usually fairly obvious.

The content of the <elementType> element defines the valid content of an element of that type. In this case, the content type is <mixed/>, meaning that the element can contain a mixture of character data and elements (since no elements are actually included in the definition, it can only contain character data).

The datatyping power of XML Schema can be seen in the declaration for <zip>. We begin by defining a zipCode datatype, which is a string that can contain either exactly five digits or exactly five digits followed by a hyphen followed by exactly four digits:

Example 2. A ZIP Code Datatype
<datatype name="zipCode">
  <basetype name="string"/>
  <lexicalRepresentation>
    <lexical>99999</lexical>
    <lexical>99999-9999</lexical>
  </lexicalRepresentation>
</datatype>

This example uses "pictures" to define the datatype, but XML Schemas include a variety of other mechanisms allowing you to easily declare numbers with specific bounds and precision, dates, times, and so forth.
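To make the meaning of the two pictures concrete, here is a minimal Python sketch of the check they describe. It assumes that each "9" in a picture stands for exactly one decimal digit and that every other character must appear literally; the function names are mine and are not part of XML Schema.

```python
import re

# Assumption: "9" in a picture means one decimal digit (0-9);
# any other character in the picture must match itself.
def picture_to_regex(picture):
    parts = ["[0-9]" if ch == "9" else re.escape(ch) for ch in picture]
    return re.compile("".join(parts))

# The two pictures from the zipCode datatype.
ZIP_PICTURES = ["99999", "99999-9999"]
ZIP_PATTERNS = [picture_to_regex(p) for p in ZIP_PICTURES]

def is_zip_code(text):
    """Return True if text matches at least one of the zipCode pictures."""
    return any(p.fullmatch(text) is not None for p in ZIP_PATTERNS)
```

Under these assumptions, a value is accepted only if the whole string matches one of the pictures, so a four-digit code or a code with a truncated extension is rejected.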

With the zipCode datatype defined, it's now a simple matter to declare that a <zip> must be of that type:

Example 3. The zip Element Type
<elementType name="zip">
  <datatypeRef name="zipCode"/>
</elementType>

In a DTD, there is no way to express these sorts of constraints. The best we could do would be to say that a <zip> contained character data, just like <name>.

In our schema, we can build on these basic types to define aggregate element types like <address>:

Example 4. An Address in Schema Notation
<elementType name="address">
  <sequence>
    <elementTypeRef name="company" minOccur="0" maxOccur="1"/>
    <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
    <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
  </sequence>
</elementType>

This element type is a little different from the preceding ones; it defines the content of the <address> element in terms of other elements. It begins with a <sequence>. A sequence is like the "," separator in DTD syntax: it indicates that the things inside the sequence must occur in the order given. Inside the sequence we see references to other element types. Each element type so referenced must have a corresponding <elementType> declaration elsewhere in the schema.

The occurrence qualifiers indicate how often each element may occur. A minimum occurrence of zero makes the element optional. These indicators serve the same purpose as the "?", "*", and "+" qualifiers in DTD syntax, but they are more flexible since both minimum and maximum values may be specified.
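As a rough illustration of how a sequence with minimum and maximum occurrences constrains content, here is a small Python sketch. The tuple representation, the names DTD_QUALIFIERS, ADDRESS_MODEL, and matches_sequence are all hypothetical, and the greedy matching is a simplification that only works because neighbouring entries in the model have different names.

```python
import math

# The DTD qualifiers expressed as (minimum, maximum) pairs.
DTD_QUALIFIERS = {
    "": (1, 1),          # no qualifier: exactly once
    "?": (0, 1),         # optional
    "*": (0, math.inf),  # zero or more
    "+": (1, math.inf),  # one or more
}

# Hypothetical representation of the address sequence:
# (element name, minOccur, maxOccur), in the required order.
ADDRESS_MODEL = [
    ("company", 0, 1),
    ("name", 1, 1),
    ("street", 1, 2),
    ("city", 1, 1),
    ("state", 1, 1),
    ("zip", 1, 1),
]

def matches_sequence(children, model):
    """Return True if the list of child element names fits the model."""
    pos = 0
    for name, min_occur, max_occur in model:
        count = 0
        # Greedily consume consecutive children with the expected name.
        while pos < len(children) and children[pos] == name and count < max_occur:
            pos += 1
            count += 1
        if count < min_occur:
            return False
    # Every child must have been consumed by some entry in the model.
    return pos == len(children)
```

With this model, an address with one or two street lines passes, while one with three street lines, or with the city before the street, does not.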

The equivalent <address> declaration in DTD syntax looks like this:

Example 5. An Address in DTD Notation
<!ELEMENT address
          (company?, name, street+, city, state, zip)>

But that isn't quite equivalent, because you can put in as many <street> elements as you want, whereas the XML Schema version allows only one or two. It would be possible to get this effect in DTD syntax with (street, street?), but it quickly becomes tedious (consider the case where you want between 5 and 50 occurrences).
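To get a feel for how tedious this becomes, here is a hypothetical Python helper (dtd_occurrences is my own name) that spells out a bounded repetition as a DTD content-model fragment. It nests the optional occurrences rather than listing them side by side, on the assumption that the model should stay deterministic, which XML 1.0 asks for.

```python
def dtd_occurrences(name, min_occur, max_occur):
    """Spell out 'between min_occur and max_occur of name' in DTD syntax."""
    if min_occur < 0 or max_occur < max(min_occur, 1):
        raise ValueError("need 0 <= min_occur <= max_occur and max_occur >= 1")
    # Build the optional tail from the inside out: x?, (x, x?)?, (x, (x, x?)?)?, ...
    optional = ""
    for _ in range(max_occur - min_occur):
        optional = f"({name}, {optional})?" if optional else f"{name}?"
    # Required occurrences come first, followed by the nested optional tail.
    parts = [name] * min_occur
    if optional:
        parts.append(optional)
    return ", ".join(parts)
```

For one or two streets this yields the short form shown above. For 5 to 50 occurrences the element name has to be written out fifty times, with 44 levels of nested parentheses.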

Suppose you wanted to have several addresses. Using DTD syntax, you'd create a parameter entity and then use that:

Example 6. An Address with Parameter Entities

<!ENTITY % address
    "company?, name, street+, city, state, zip">

<!ELEMENT billing.address (%address;)>
<!ELEMENT shipping.address (%address;)>

In an XML Schema, you'd use an archetype:

Example 7. An Address Archetype in XML Schema

<archetype name="address" model="refinable">
  <sequence>
    <elementTypeRef name="company" minOccur="0" maxOccur="1"/>
    <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
    <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
  </sequence>
</archetype>

<elementType name="billing.address">
  <archetypeRef name="address"/>
</elementType>

<elementType name="shipping.address">
  <archetypeRef name="address"/>
</elementType>

This example demonstrates two significant advantages of an archetype:

  1. The archetype is refinable. This means that I can derive new, related address types from it. I could create, for example, a return address that included everything in an address but added an element to hold the RMA (return merchandise authorization) number.

  2. The relationship that a billing.address is an address and a shipping.address is an address is explicit. In the DTD case, the parser expands the parameter entities and you get what amounts to this:

    
    <!ELEMENT billing.address (company?, name, street+, city, state, zip)>
    <!ELEMENT shipping.address (company?, name, street+, city, state, zip)>
    
    

    With a complex enough content model, you can't immediately tell that two elements are the same. And there's no way for the parser to know if they're the same because they're really the same, or if they're the same just by coincidence.
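The loss of information in the DTD case can be sketched in a few lines of Python. This is a deliberately simplified model of parameter entity expansion (plain text substitution, ignoring details such as the padding a real parser adds around the replacement text), and expand_parameter_entities is a hypothetical name.

```python
import re

# Matches a parameter entity reference such as %address;
PE_REFERENCE = re.compile(r"%([A-Za-z_][A-Za-z0-9_.\-]*);")

def expand_parameter_entities(declaration, entities):
    """Replace each %name; reference with the entity's replacement text."""
    return PE_REFERENCE.sub(lambda m: entities[m.group(1)], declaration)

entities = {"address": "company?, name, street+, city, state, zip"}

billing = expand_parameter_entities(
    "<!ELEMENT billing.address (%address;)>", entities)
shipping = expand_parameter_entities(
    "<!ELEMENT shipping.address (%address;)>", entities)
```

After expansion, neither declaration mentions "address" as a shared definition any more. All that remains is two content models that happen to contain the same text.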

For comparison, here's a more complete example. Example 8, "A Purchase Order", shows a sample document, an XML purchase order:

Example 8. A Purchase Order
<!DOCTYPE purchase.order SYSTEM "po.dtd">

<purchase.order>

<date>16 June 1967</date>

<billing.address>
  <name>Namron H. Slaw</name>
  <street>256 Eight Bit Lane</street>
  <city>East Yahoo</city>
  <state>MA</state>
  <zip>12481-6326</zip>
</billing.address>

<items>
  <item>
    <quantity>3</quantity>
    <product.number>248</product.number>
    <description>Decorative Widget, Red, Large</description>
    <unitcost>19.95</unitcost>
  </item>
  <item>
    <quantity>1</quantity>
    <product.number>1632</product.number>
    <description>Packed electron storage container, AA, 4-pack</description>
    <unitcost>4.95</unitcost>
  </item>
</items>

</purchase.order>

The rest of this section examines the schema for this document type in more detail. The text of the schema is available along with an equivalent DTD.

Example 9. A Schema for Purchase Orders
<!DOCTYPE schema SYSTEM "o:/reference/w3c/schema/structures.dtd">

<schema>

Since I don't have a schema processor, I'm using the schema DTD to validate my schema. All schemas begin with <schema>.

<archetype name="address" model="refinable">
  <sequence>
    <elementTypeRef name="company" minOccur="0" maxOccur="1"/>
    <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
    <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
  </sequence>
</archetype>

As discussed above, I define an archetype for addresses so that I can use it to define address elements.

<elementType name="billing.address">
  <archetypeRef name="address"/>
</elementType>

<elementType name="shipping.address">
  <archetypeRef name="address"/>
</elementType>

Now that I've got an archetype, I use it to define <billing.address> and <shipping.address>.

<elementType name="items">
  <elementTypeRef name="item" minOccur="1"/>
</elementType>

The <items> element is just a wrapper for one or more <item> elements.

<elementType name="item">
  <sequence>
    <elementTypeRef name="quantity" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="product.number" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="description" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="unitcost" minOccur="1" maxOccur="1"/>
  </sequence>
</elementType>

Each <item> contains exactly one <quantity>, <product.number>, <description>, and <unitcost>.

<elementType name="purchase.order">
  <sequence>
    <elementTypeRef name="date" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="billing.address" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="shipping.address" minOccur="0" maxOccur="1"/>
    <elementTypeRef name="items" minOccur="1" maxOccur="1"/>
  </sequence>
</elementType>

Similarly, the <purchase.order> consists of a date, a billing address, an optional shipping address, and a number of items.

<elementType name="company">
  <mixed/>
</elementType>

<elementType name="name">
  <mixed/>
</elementType>

<elementType name="street">
  <mixed/>
</elementType>

<elementType name="city">
  <mixed/>
</elementType>

<elementType name="state">
  <mixed/>
</elementType>

<datatype name="zipCode">
  <basetype name="string"/>
  <lexicalRepresentation>
    <lexical>99999</lexical>
    <lexical>99999-9999</lexical>
  </lexicalRepresentation>
</datatype>

<elementType name="zip">
  <datatypeRef name="zipCode"/>
</elementType>

<elementType name="product.number">
  <mixed/>
</elementType>

<elementType name="description">
  <mixed/>
</elementType>

Most of the address elements are just character data and the ZIP code is defined as described earlier. The <product.number> and <description> are also just character data.

<datatype name="quantityType">
  <basetype name="integer"/>
  <minExclusive>0</minExclusive>
</datatype>

<elementType name="quantity">
  <datatypeRef name="quantityType"/>
</elementType>

The content of the <quantity> element is defined to be an integer larger than zero.
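In Python terms, the quantityType constraint amounts to something like the following sketch. It assumes that an integer is written as an optional sign followed by decimal digits, which may not match the draft's lexical rules exactly; is_valid_quantity is a hypothetical name.

```python
import re

# Assumption: an integer is an optional sign followed by decimal digits.
INTEGER = re.compile(r"[+-]?[0-9]+")

def is_valid_quantity(text):
    """Return True if text is an integer strictly greater than zero."""
    if INTEGER.fullmatch(text) is None:
        return False
    # minExclusive of 0: the bound itself is not allowed.
    return int(text) > 0
```

Because the lower bound is exclusive, a quantity of 0 is rejected along with negative numbers and non-integers such as 2.5.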

<datatype name="currency">
  <basetype name="decimal"/>
  <precision>8</precision>
  <scale>2</scale>
</datatype>

<elementType name="unitcost">
  <datatypeRef name="currency"/>
</elementType>

For the <unitcost>, it's important that the data entered represent a reasonable price. In this case, I've chosen to allow prices to be up to eight digits long with two digits after the decimal point. That allows a unit cost of up to 999999.99, just shy of a million dollars!
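Here is one way to read the currency constraint, sketched in Python with the standard decimal module. I'm assuming that precision is the maximum total number of digits and scale is the maximum number of digits after the decimal point, which leaves six digits before it; the draft may define these facets differently, and is_valid_currency is a hypothetical name.

```python
from decimal import Decimal, InvalidOperation

def is_valid_currency(text, precision=8, scale=2):
    """Check text against an assumed reading of precision and scale."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return False
    if not value.is_finite():
        return False  # rejects NaN and Infinity
    _sign, digits, exponent = value.as_tuple()
    # Digits after and before the decimal point.
    fraction_digits = max(0, -exponent)
    integer_digits = max(0, len(digits) + exponent)
    return fraction_digits <= scale and integer_digits <= precision - scale
```

Under this reading, both unit costs in the sample purchase order are accepted, while a value with three decimal places or seven digits before the decimal point is not.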

<elementType name="date">
  <datatypeRef name="dateTime"/>
</elementType>

</schema>

Finally, the <date> element uses the built-in dateTime type, and the schema ends with </schema>.