XML.com 
 Published on XML.com http://www.xml.com/pub/a/2002/11/20/schemas.html

W3C XML Schema Design Patterns: Avoiding Complexity
By Dare Obasanjo
November 20, 2002

Table of Contents

Introduction

Over the course of the past year, during which I've worked closely with W3C XML Schema (WXS), I've observed many schema authors struggle with various aspects of the language. Given the size and relative complexity of the WXS recommendation (parts one and two), it seems that many schema authors would be best served by understanding and utilizing an effective subset instead of attempting to comprehend all of its esoterica.

There have been a few public attempts to define an effective subset of W3C XML Schema for general usage; the most notable have been W3C XML Schema Made Simple by Kohsuke Kawaguchi and the X12 Reference Model for XML Design by the Accredited Standards Committee (ASC) X12. However, both documents are extremely conservative and advise against useful features of WXS without adequately describing the cost of doing so.

This article is primarily a counterpoint to Kohsuke's and considers each of his original guidelines; the goal is to provide a set of solid guidelines about what you should do and shouldn't do when working with WXS.

The Guidelines

I've altered some of Kohsuke's original guidelines:

I propose some additional guidelines as well:

The guidelines qualified with the word carefully are best avoided by novice users unless absolutely required by the problem being solved.

Why You Should Use Global And Local Element Declarations

An element declaration is used to specify the structure, type, occurrence, and value constraints for an element. The element declaration is the most important and common piece of a schema document.

Element declarations that appear as children of the xs:schema element are global elements, which can be reused by referencing them in other parts of the schema or from other schema documents. They can also be members of substitution groups. Since the WXS recommendation doesn't provide a mechanism for specifying the root element of the document being validated, any global element can be used as the root element for a valid document.

Element declarations that appear within complex type or model group definitions, and that aren't references to a global element, are local elements. Unlike global elements, there can be many local element declarations with the same name and differing types in a schema as long as the local elements are not declared at the same level. Section 3.3 of the W3C XML Schema Primer gives the following example:

You can only declare one global element called "title", and that element is bound to a single type (e.g., xs:string or PersonTitle). However, you can locally declare one element called "title" that has a string type, and is a subelement of "book". Within the same schema (target namespace) you can declare a second element also called "title" that is an enumeration of the values "Mr Mrs Ms".
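
The Primer's description can be sketched as a schema along the following lines. This is only an illustration: the person element and its name child are hypothetical, added to give the second title declaration somewhere to live.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <!-- local "title" declared inside book: a plain string -->
 <xs:element name="book">
  <xs:complexType>
   <xs:sequence>
    <xs:element name="title" type="xs:string" />
   </xs:sequence>
  </xs:complexType>
 </xs:element>

 <!-- a second local "title", declared inside the hypothetical
    person element: an enumeration of Mr, Mrs and Ms -->
 <xs:element name="person">
  <xs:complexType>
   <xs:sequence>
    <xs:element name="title">
     <xs:simpleType>
      <xs:restriction base="xs:string">
       <xs:enumeration value="Mr" />
       <xs:enumeration value="Mrs" />
       <xs:enumeration value="Ms" />
      </xs:restriction>
     </xs:simpleType>
    </xs:element>
    <xs:element name="name" type="xs:string" />
   </xs:sequence>
  </xs:complexType>
 </xs:element>

</xs:schema>
```

The two title declarations do not clash because each is local to a different content model. Had title been declared globally, only one declaration and one type would have been possible.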

Global element declarations should be used for elements that will be reused from the target schema as well as from other schema documents, when the element and its associated type are comfortably bound together for widespread use. Local elements are to be favored when element declarations only make sense in the context of the declaring type and are unlikely to be reused.

By default, global elements have a namespace name equivalent to that of the target namespace of the schema, while local elements have no namespace name. So, by default, elements in an XML document which are meant to be validated against global element declarations should have a namespace name identical to that of the global element's schema target namespace. Those which are to be validated against local elements should have no namespace name. For example, consider this schema:

test.xsd
<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
 targetNamespace="http://www.example.com"
 xmlns="http://www.example.com">

 <!-- global element declaration validates
    <language> elements from http://www.example.com
    namespace  -->
 <xs:element name="language" type="xs:string" />
 <xs:element name="Root" type="sequenceOfLanguages" />
 <xs:element name="Root2" type="sequenceOfLanguages2" />
 
 <!-- complex type with local element declaration
    validates <language> elements without a namespace
    name -->
 <xs:complexType name="sequenceOfLanguages" >  
  <xs:sequence>
   <xs:element name="language" type="xs:NMTOKEN" maxOccurs="unbounded" />
  </xs:sequence>
 </xs:complexType>

 <!-- complex type with reference to global
    element declaration -->
  <xs:complexType name="sequenceOfLanguages2" >  
  <xs:sequence>
   <xs:element ref="language" maxOccurs="10" />
  </xs:sequence>
 </xs:complexType>
</xs:schema>

test.xml
<?xml version="1.0"?>
<ex:Root xmlns:ex="http://www.example.com">
 <language>EN</language> 
</ex:Root> 

test2.xml
<?xml version="1.0"?>
<ex:Root2 xmlns:ex="http://www.example.com">
 <ex:language>English</ex:language> 
 <ex:language>Klingon</ex:language> 
</ex:Root2> 

Why You Should Use Global And Local Attribute Declarations

An attribute declaration is used to specify the type, optionality, and defaulting information for an attribute.

Attribute declarations that appear as children of the xs:schema element are global attributes, which can be reused by referencing them in other parts of the schema or from other schema documents. Attribute declarations that appear within complex type definitions, and that do not reference global attributes, are local attributes.

Global attribute declarations should be used for attributes that will be reused from the target schema as well as from other schema documents. Local attributes should be used when attribute declarations only make sense in the context of the declaring type and are unlikely to be reused. Since attributes are usually tightly coupled to their parent elements, local attribute declarations are typically favored by schema authors. But there are cases where global attributes which can apply to many elements from multiple namespaces are useful (for example, xsi:type and xsi:schemaLocation).

By default, global attributes have a namespace name equivalent to that of the target namespace of the schema, while local attributes have no namespace name. Thus, attributes which are to be validated against global attribute declarations should have a namespace name identical to that of the global attribute's schema target namespace. Those to be validated against local attributes should have no namespace name. For example:

test.xsd
<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
 targetNamespace="http://www.example.com" 
 xmlns="http://www.example.com">

 <!-- global attribute declaration validates
    language attributes from http://www.example.com namespace  --> 
 <xs:attribute name="language" type="xs:string" />
 <xs:element name="Root" type="sequenceOfNotes" />
 <xs:element name="Root2" type="sequenceOfNotes2" />

 <!-- complex type with local attribute
    declaration validates language attributes without a
    namespace name -->
 <xs:complexType name="sequenceOfNotes" >  
  <xs:sequence>
   <xs:element name="Note" type="xs:string" />
  </xs:sequence>
  <xs:attribute name="language" type="xs:NMTOKEN"  /> 
 </xs:complexType>

 <!-- complex type with reference to
    global attribute declaration -->
  <xs:complexType name="sequenceOfNotes2" >  
  <xs:sequence>
   <xs:element name="Note" type="xs:string" />
  </xs:sequence>
  <xs:attribute ref="language" />
 </xs:complexType>
</xs:schema>

test.xml
<?xml version="1.0"?>
<ex:Root xmlns:ex="http://www.example.com" language="EN" >
 <Note>Nothing to see here</Note> 
</ex:Root> 

test2.xml
<?xml version="1.0"?>
<ex:Root2 xmlns:ex="http://www.example.com" ex:language="The English Language">
 <Note>Nothing to see here</Note> 
</ex:Root2> 

Why You Should Understand How XML Namespaces Affect WXS

Support for XML Namespaces is woven tightly into the WXS recommendation. Namespaces are used in a number of places:

Thus, schema authors should be familiar with how namespaces work, including their effect on W3C XML Schema. I wrote two MSDN articles which address this issue: "XML Namespaces and How They Affect XPath and XSLT" provides a detailed overview of XML namespaces and "Working with Namespaces in XML Schema" explains the ramifications of namespaces in WXS.

Why You Should Always Set elementFormDefault to "qualified"

Elements or attributes with a namespace name are said to be "namespace qualified". It's possible to override whether local declarations validate namespace qualified elements and attributes or not. The xs:schema element has the elementFormDefault and attributeFormDefault attributes, which specify whether local declarations in the schema should validate namespace qualified elements and attributes respectively. The valid values for either attribute are "qualified" and "unqualified". The default value of both attributes is "unqualified".

The form attribute on local element and attribute declarations can be used to override the values of the elementFormDefault and attributeFormDefault attributes specified on the xs:schema element. This allows for fine-grained control over the way validation of elements and attributes in the instance document operates in relation to global or local declarations.
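
A minimal sketch of such an override follows. The element and attribute names (Root, language, note, id) are hypothetical and chosen only for illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
 targetNamespace="http://www.example.com"
 elementFormDefault="qualified">

 <xs:element name="Root">
  <xs:complexType>
   <xs:sequence>
    <!-- follows elementFormDefault: validates <language> elements
       from the http://www.example.com namespace -->
    <xs:element name="language" type="xs:string" />
    <!-- overrides elementFormDefault: validates <note> elements
       without a namespace name -->
    <xs:element name="note" type="xs:string" form="unqualified" />
   </xs:sequence>
   <!-- overrides the default attributeFormDefault of "unqualified":
      validates id attributes from http://www.example.com -->
   <xs:attribute name="id" type="xs:string" form="qualified" />
  </xs:complexType>
 </xs:element>

</xs:schema>
```

Assuming the prefix ex is bound to http://www.example.com and no default namespace is in scope, a conforming instance would use an ex:Root element carrying an ex:id attribute, with an ex:language child followed by an unprefixed note child.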

The following example, taken from Kohsuke's article (the "Why You Should Avoid Local Declarations" section), shows exactly how these attributes can significantly affect the outcome of validation:

This schema

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      targetNamespace="http://example.com">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="familyName" type="xs:string" />
        <xs:element name="firstName" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

validates the following document

<foo:person xmlns:foo="http://example.com">
  <familyName> KAWAGUCHI </familyName>
  <firstName> Kohsuke </firstName>
</foo:person>

which is unlikely to be what the schema author intended. And it's ugly, too. Altering the schema thus:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      targetNamespace="http://example.com"
      elementFormDefault="qualified">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="familyName" type="xs:string" />
        <xs:element name="firstName" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

allows it to validate

<person xmlns="http://example.com">
  <familyName> KAWAGUCHI </familyName>
  <firstName> Kohsuke </firstName>
</person>

or

<foo:person xmlns:foo="http://example.com">
  <foo:familyName> KAWAGUCHI </foo:familyName>
  <foo:firstName> Kohsuke </foo:firstName>
</foo:person>

Leaving the value of the attributeFormDefault attribute as "unqualified" makes sense because most schema authors don't want to have to namespace qualify all attributes explicitly by prefixing them.

Why You Should Use Attribute Groups

An attribute group definition is a way to create a named collection of attribute declarations and attribute wildcards. Attribute groups increase the modularity of schemas. You can declare a commonly used set of attributes in a single location and then reference them from other schemas.

Kohsuke's article describes attribute groups as an alternative to global attribute declarations, which is not entirely accurate and may give the incorrect impression that the two are mutually exclusive. A globally declared attribute is an individual, reusable attribute declaration. An attribute group is a modularly clustered set of attributes; the attribute declarations in an attribute group can either be local attribute declarations or references to global declarations.
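
To show that the two mechanisms can be combined, here is a small sketch of an attribute group that mixes a reference to a global attribute with local attribute declarations. All of the names in it (language, commonAttributes, id, revision, Note) are hypothetical.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
 targetNamespace="http://www.example.com"
 xmlns:ex="http://www.example.com">

 <!-- global attribute declaration, reusable on its own -->
 <xs:attribute name="language" type="xs:language" />

 <!-- attribute group containing a reference to the global
    attribute plus two local attribute declarations -->
 <xs:attributeGroup name="commonAttributes">
  <xs:attribute ref="ex:language" />
  <xs:attribute name="id" type="xs:string" />
  <xs:attribute name="revision" type="xs:positiveInteger" />
 </xs:attributeGroup>

 <!-- the whole group is pulled in with a single reference -->
 <xs:element name="Note">
  <xs:complexType>
   <xs:simpleContent>
    <xs:extension base="xs:string">
     <xs:attributeGroup ref="ex:commonAttributes" />
    </xs:extension>
   </xs:simpleContent>
  </xs:complexType>
 </xs:element>

</xs:schema>
```

In an instance, the referenced global attribute would appear namespace qualified (ex:language), while the two local ones would appear unqualified, following the defaults described earlier.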

Why You Should Use Model Groups

A model group definition is a mechanism for creating named groups of elements using the all, choice, or sequence compositors. Model groups are useful for reusing groups of elements without resorting to type derivation. However, model groups are not a replacement for complex types; they cannot contain attribute declarations and they cannot be specified as the type of an element declaration. Additionally, derivation of model groups is much more limited than derivation of complex types.
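
As a rough illustration, the sketch below defines one named model group and reuses it in two content models. The names nameParts, author, editor, and email are hypothetical.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <!-- named model group: a reusable sequence of elements -->
 <xs:group name="nameParts">
  <xs:sequence>
   <xs:element name="firstName" type="xs:string" />
   <xs:element name="familyName" type="xs:string" />
  </xs:sequence>
 </xs:group>

 <!-- the group is referenced inside a larger content model;
    attributes must be declared on the complex type, since
    the group itself cannot carry them -->
 <xs:element name="author">
  <xs:complexType>
   <xs:sequence>
    <xs:group ref="nameParts" />
    <xs:element name="email" type="xs:string" minOccurs="0" />
   </xs:sequence>
   <xs:attribute name="id" type="xs:string" />
  </xs:complexType>
 </xs:element>

 <!-- the same group reused as the entire content model -->
 <xs:element name="editor">
  <xs:complexType>
   <xs:group ref="nameParts" />
  </xs:complexType>
 </xs:element>

</xs:schema>
```

Note that neither author nor editor can name nameParts as its type; the group can only be referenced from within a complex type or another group.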

Why You Should Use The Builtin Simple Types

A major benefit of WXS over DTDs in XML 1.0 is the existence of datatypes. The ability to specify that the values of elements or attributes are strings, dates, or numeric data enables schema authors to specify and validate the contents of XML data in an interoperable and platform independent manner. Given the number of built-in datatypes (44 by my count), it may be wise for schema authors to standardize on a subset of the built-in types to avoid information overload.

In most cases users can do without the subtypes of xs:string (e.g. xs:ENTITY or xs:language), the subtypes of xs:integer (e.g. xs:short or xs:unsignedByte), or the Gregorian date types (e.g. xs:gMonthDay or xs:gYearMonth). Eliminating these types leaves a smaller, more easily managed set.

Why You Should Use Complex Types

A complex type definition is used to specify a content model consisting of elements and attributes. An element declaration can specify its content model by referring to a named or anonymous complex type. Named complex types can be referenced by name from the schema they are defined in or by external schema documents; anonymous complex types must be defined within the declaration for the element which uses the type. Additionally the content models of named complex types can be extended or restricted using WXS inheritance mechanisms.

Complex types are similar to model group definitions with two main differences. First, complex type definitions can include attributes in the content models they define. Second, it's possible to use type derivation with complex types, which isn't the case with named model groups. In Kohsuke's article he advocates using a combination of anonymous complex types, model group definitions, and attribute groups to specify the content model of an element instead of named complex types. He does so in an attempt to avoid dealing with what he sees as the complexity of named complex types. However, I'd counter that using three mechanisms instead of one to specify the content model of an element is actually more prone to confusion. Thus, in addition to the fact that named complex types allow for reuse of content models, they're also the most straightforward way of specifying the content model of an element.

Anonymous complex types should only be used if references to the type will not be needed outside the element declaration and there is no need for type derivation. It is important to note that it is not possible to derive a new type from an anonymous complex type. In general, schemas that make heavy use of anonymous types are likely to have problems with uniformity and consistency.

Why You Should Not Use Notation Declarations

Kohsuke's admonition to avoid notation declarations is spot on. They exist only to provide backward compatibility with DTDs, except they are not backward compatible with DTD notations. Pretend they do not exist. I certainly do.

Why You Should Use Substitution Groups Carefully

Substitution Groups provide a mechanism similar to subtype polymorphism in programming languages. One or more elements can be marked as being substitutable for a global element (also called the head element), which means that members of this substitution group are interchangeable with the head element in a content model. For example, for an Address substitution group with members USAddress and UKAddress, the generic element Address can be used in the content model, or it can be substituted by a USAddress or a UKAddress. The only requirement is that the members of the substitution group must be of the same type or be in the same type hierarchy as the head element.

The following is an example schema and the instance which it validates:

example.xsd:
 <xs:schema 
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 targetNamespace="http://www.example.com"
 xmlns:ex="http://www.example.com"
 elementFormDefault="qualified">

  <xs:element name="book" type="xs:string" />

  <xs:element name="magazine" type="xs:string" substitutionGroup="ex:book" />

 <xs:element name="library">
 <xs:complexType>
  <xs:sequence>
   <xs:element ref="ex:book" maxOccurs="unbounded"/>
  </xs:sequence>
 </xs:complexType>
 </xs:element>


</xs:schema>

example.xml:
<library xmlns="http://www.example.com">
 <magazine>MSDN Magazine</magazine>
 <book>Professional XML Databases</book>
</library>

The content model of the library element says that it can hold one or more book elements. Since magazine elements are in the book substitution group, it's valid for magazine elements to appear in the instance XML where book elements are expected.

Substitution groups make content models more flexible and allow extensibility in directions the schema author may not have anticipated. This flexibility is a two-edged sword: although it allows greater extensibility, it makes processing documents based on such schemas more difficult. For instance, the code that processes the library element must not only handle its child book elements but magazine elements as well. If the instance document specified additional schemas via the xsi:schemaLocation attribute, the processing application could have to deal with even more members of the book substitution group as children of the library element.

Another complication is that members of a substitution group can be of a type derived from the type of the substitution group's head. Writing code to properly handle any derived type generically is difficult, especially since there are two opposite notions of derivation. The first, restriction, restricts the range of values in the content model. The second, extension, adds elements or attributes to the content model.

Certain attributes on element declarations can be used to give schema authors more control over element substitutions and reduce the likelihood of unexpected substitutions in XML instance documents. The block attribute is used to specify whether elements whose types use a certain derivation method can substitute for the element in an instance document, while the final attribute is used to specify whether elements whose types use a certain derivation method can declare themselves to be part of the target element's substitution group. The default values of the block and final attributes for all element declarations in a schema can be specified via the blockDefault and finalDefault attributes of the root xs:schema element. By default all substitutions are allowed without limitation.
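
The sketch below shows how the block and final attributes might be applied. It is an illustration under assumed names (BookType, MagazineType, pamphlet), not a recommended design.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
 targetNamespace="http://www.example.com"
 xmlns:ex="http://www.example.com"
 elementFormDefault="qualified">

 <xs:complexType name="BookType">
  <xs:sequence>
   <xs:element name="title" type="xs:string" />
  </xs:sequence>
 </xs:complexType>

 <xs:complexType name="MagazineType">
  <xs:complexContent>
   <xs:extension base="ex:BookType">
    <xs:sequence>
     <xs:element name="issue" type="xs:positiveInteger" />
    </xs:sequence>
   </xs:extension>
  </xs:complexContent>
 </xs:complexType>

 <!-- block="extension": in an instance document, book may not be
    replaced by an element whose type is derived by extension -->
 <xs:element name="book" type="ex:BookType" block="extension" />

 <!-- this declaration is still legal, but because of the block
    attribute above a magazine is rejected where a book is expected -->
 <xs:element name="magazine" type="ex:MagazineType"
  substitutionGroup="ex:book" />

 <!-- final="#all": elements whose types are derived by extension
    or restriction cannot join pamphlet's substitution group -->
 <xs:element name="pamphlet" type="xs:string" final="#all" />

</xs:schema>
```

The difference is in where the check happens: final is enforced when the schema is processed, while block is enforced when an instance document is validated.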

Why You Should Favor key/keyref/unique Over ID/IDREF For Identity Constraints

DTDs provide a mechanism for specifying that an attribute's type is ID, i.e., its value will be unique within the document and matches the Name production in XML 1.0. IDs in XML 1.0 can also be referenced by attributes of type IDREF or IDREFS. For compatibility with DTDs, WXS has the xs:ID, xs:IDREF, and xs:IDREFS types.

WXS identity constraints are used for specifying unique values, keys, or references to keys using XPath expressions defined within the scope of an element declaration. Compared feature for feature, the identity constraint mechanisms offer more than ID/IDREF. First, there is no limit on the values or types that can be used as part of an identity constraint, whereas IDs can only be one of a specific range of values (e.g., 7 is not a valid ID). A more important benefit is scoping: an ID has to be unique within the entire document, but values governed by WXS identity constraints don't. The symbol space for unique IDs is the entire document, but for unique keys it's the target scope of the XPath. This is particularly useful if uniqueness is needed in two overlapping value spaces with different scopes in the same XML document. For example, consider an XML document that contains room numbers and table numbers for a hotel. It is likely that some of the numbers overlap (i.e. there is a room 18 and a table 18), but they should be unique within either value space.
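
The hotel scenario could be sketched as follows. The element and attribute names (hotel, room, table, number) and the constraint names are assumptions made for the example.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <xs:element name="hotel">
  <xs:complexType>
   <xs:sequence>
    <xs:element name="room" maxOccurs="unbounded">
     <xs:complexType>
      <xs:attribute name="number" type="xs:positiveInteger"
       use="required" />
     </xs:complexType>
    </xs:element>
    <xs:element name="table" maxOccurs="unbounded">
     <xs:complexType>
      <xs:attribute name="number" type="xs:positiveInteger"
       use="required" />
     </xs:complexType>
    </xs:element>
   </xs:sequence>
  </xs:complexType>

  <!-- room numbers must be unique among room elements only -->
  <xs:key name="roomNumber">
   <xs:selector xpath="room" />
   <xs:field xpath="@number" />
  </xs:key>

  <!-- table numbers must be unique among table elements only,
     so a table 18 can coexist with a room 18 -->
  <xs:key name="tableNumber">
   <xs:selector xpath="table" />
   <xs:field xpath="@number" />
  </xs:key>
 </xs:element>

</xs:schema>
```

With xs:ID this arrangement would not be possible: 18 is not a legal ID value, and even a legal value could not be shared between a room and a table in the same document.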

The WXS family of ID types are not exactly compatible with the DTD ID types. First, the xs:ID, xs:IDREF, and xs:IDREFS types can be applied to both elements and attributes in WXS, although they can only apply to attributes in their DTD equivalents. Second, there's no restriction on how many attributes of type xs:ID can appear on an element, although such a restriction exists for ID attributes in the DTD equivalents.

Why You Should Use Chameleon Schemas Carefully

The target namespace of a schema document identifies the namespace name of the elements and attributes which can be validated against the schema. A schema without a target namespace can typically only validate elements and attributes without a namespace name. However, if a schema without a target namespace is included in a schema with a target namespace, the schema without a target namespace assumes the target namespace of the including schema. This feature is typically called the Chameleon schema design pattern.

In Kohsuke's article he claims that the chameleon schema pattern does not work, which is incorrect. A full rebuttal of Kohsuke's claim was made by Michael Leditschke on XML-DEV, and it shows that the design pattern does work and is useful for creating a reusable module of type definitions and declarations.

There is a problem with combining chameleon schemas with identity constraints. Although QName references to types, definitions, and declarations in the chameleon schema are coerced into the namespace of the including schema, the same is not done for XPath expressions used by xs:key, xs:keyref, and xs:unique identity constraints. Consider the following schema:


<xs:schema
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 elementFormDefault="qualified">

 <xs:element name="Root">

  <xs:complexType>
    <xs:sequence>
     <xs:element name="person" type="PersonType" maxOccurs="unbounded" />
    </xs:sequence>
  </xs:complexType>

  <xs:key name="PersonKey">
   <xs:selector xpath="person"/>
   <xs:field xpath="@name"/>
  </xs:key>

  <xs:keyref name="BestFriendKey" refer="PersonKey">
   <xs:selector xpath="person"/>
   <xs:field xpath="@best-friend"/>
  </xs:keyref>

 </xs:element>

 <xs:complexType name="PersonType">
  <xs:simpleContent>
   <xs:extension base="xs:string">
    <xs:attribute name="best-friend" type="xs:string" />
    <xs:attribute name="name" type="xs:string" />
   </xs:extension>
  </xs:simpleContent>
 </xs:complexType>

</xs:schema>

If this schema is included in another schema with a target namespace, the XPath expressions in both the key and keyref will fail. In this specific example, the person element is in no namespace in the chameleon schema, but once included in another schema it picks up that schema's target namespace. The XPath expressions, which match a person element with no namespace name, then stop matching anything, and they do so silently, since processors are not obliged to ensure that path expressions in identity constraints actually return results.

The point is that it is not advisable to use identity constraints in chameleon schemas.

Why You Should Not Use Default Or Fixed Values, Especially For Types Of xs:QName

The primary complaint against default and fixed values is that they cause new data to be inserted into the source XML after validation, thus changing the data. This means that an unvalidated document that has a schema with default values is incomplete. Tying the actual content of the XML document to the validation process is unwise since a schema may not always be available. It's also unwise to assume that consumers of the document will always perform validation.

The xs:QName type has additional validation problems caused by the fact that it has no canonical form. Consider this schema and XML instance:


 <xs:schema
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 targetNamespace="http://www.example.com"
 xmlns:ex="http://www.example.com"
 xmlns:ex2="ftp://ftp.example.com"
 elementFormDefault="qualified">

 <xs:element name="Root">
  <xs:complexType>
    <xs:sequence>
     <xs:element name="Node" type="xs:QName" default="ex2:FtpSite" />
    </xs:sequence>
  </xs:complexType>
 </xs:element>

</xs:schema>

<Root xmlns="http://www.example.com" 
  xmlns:ex2="smtp://smtp.example.org" 
  xmlns:foo="ftp://ftp.example.com">
 <Node />
</Root>

What value should be inserted into the Node element upon validation? Should it be "ex2:FtpSite"? Even if the ex2 prefix is mapped to a different namespace in the instance document than in the schema? Maybe it should be "foo:FtpSite" because the prefix "foo" is mapped to the same namespace that "ex2" was mapped to in the schema. But then what would happen if no XML namespace declaration existed for the ftp://ftp.example.com namespace? Would a namespace declaration have to be inserted? None of these questions can be answered in a satisfactory manner without violating some opinions as to what the correct behavior should be. It is best to avoid using xs:QName default values because it's unlikely that different implementations agree on the relevant semantics.

Why You Should Use Restriction And Extension Of Simple Types

Restriction of a simple type involves constraining the facets of the type, thus reducing the permitted values of the type. Such restrictions involve specifying a maximum length for a string value, specifying a date range, or enumerating the list of permitted values. Types constrained in this manner are very commonly used by schema authors and account for most uses of type derivation in WXS. Such types can be used by both elements and attributes as their type definition.
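
The three kinds of restriction just mentioned might look like this. The type names and the particular limits are hypothetical.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <!-- maximum length for a string value -->
 <xs:simpleType name="shortName">
  <xs:restriction base="xs:string">
   <xs:maxLength value="40" />
  </xs:restriction>
 </xs:simpleType>

 <!-- a date range -->
 <xs:simpleType name="dateIn2002">
  <xs:restriction base="xs:date">
   <xs:minInclusive value="2002-01-01" />
   <xs:maxInclusive value="2002-12-31" />
  </xs:restriction>
 </xs:simpleType>

 <!-- an enumerated list of permitted values -->
 <xs:simpleType name="personTitle">
  <xs:restriction base="xs:string">
   <xs:enumeration value="Mr" />
   <xs:enumeration value="Mrs" />
   <xs:enumeration value="Ms" />
  </xs:restriction>
 </xs:simpleType>

 <!-- restricted types can be used by elements and by attributes -->
 <xs:element name="subscribed" type="dateIn2002" />
 <xs:attribute name="title" type="personTitle" />

</xs:schema>
```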

Extension of simple types allows one to create a complex type (i.e. an element content model) with simple content that has attributes. A typical extension scenario is any situation where an element declaration has a simple type as its content and one or more attributes. Since such element content models occur commonly in XML documents, derivation by extension is another commonly used feature.
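
A minimal sketch of this scenario, using a hypothetical price element whose content is a decimal and which carries a currency attribute:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <!-- complex type with simple content: a decimal value
    extended with one attribute -->
 <xs:complexType name="PriceType">
  <xs:simpleContent>
   <xs:extension base="xs:decimal">
    <xs:attribute name="currency" type="xs:string" use="required" />
   </xs:extension>
  </xs:simpleContent>
 </xs:complexType>

 <xs:element name="price" type="PriceType" />

</xs:schema>
```

An instance such as a price element with currency="USD" and the text 19.95 would validate against this declaration, while one containing child elements or non-numeric text would not.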

As with complex types, there are named and anonymous simple types. Named simple types can be referenced by name from the schema they are defined in or from external schema documents. Anonymous simple types must be defined within the declaration for the element or attribute which uses the type. And type derivation can only be performed on named types.

A common misconception is that anonymous types with the same structure are the same type. In other words, assuming that this schema fragment


<!-- fragment A -->

<xs:element name="quantity">
 <xs:simpleType>
   <xs:restriction base="xs:positiveInteger">
    <xs:maxExclusive value="100"/>
   </xs:restriction>
  </xs:simpleType>
</xs:element>

<xs:element name="size">
 <xs:simpleType>
   <xs:restriction base="xs:positiveInteger">
    <xs:maxExclusive value="100"/>
   </xs:restriction>
  </xs:simpleType>
</xs:element>

is equivalent to

<!-- fragment B -->

<xs:simpleType name="underHundred">
 <xs:restriction base="xs:positiveInteger">
  <xs:maxExclusive value="100"/>
 </xs:restriction>
</xs:simpleType>

<xs:element name="size" type="underHundred"/> 

<xs:element name="quantity" type="underHundred"/>

is incorrect: the element declarations in fragment A do not have the same type. Various aspects of WXS may require element declarations to have the same type (substitution groups, specifying key/keyref pairs, and type derivation). For instance, a keyref must be of the same type as a key. Most features of WXS treat the element declarations in fragment A as having different types and those in fragment B as having the same type.

Why You Should Use Extension Of Complex Types

Extension of a complex type involves adding extra attributes or elements to the content model in the derived type. Elements added via extension are treated as if they were appended to the content model of the base type in sequence. This technique is useful for extracting the common aspects of a set of complex types and then reusing these commonalities via extending the base type definition. The following schema fragment showing how extension enables the reuse of common aspects of a mailing address is taken from the discussion on complex type extension and example in the WXS Primer.


<xs:complexType name="Address">
  <xs:sequence>
   <xs:element name="name"   type="xs:string"/>
   <xs:element name="street" type="xs:string"/>
   <xs:element name="city"   type="xs:string"/>
  </xs:sequence>
 </xs:complexType>

 <xs:complexType name="USAddress">
  <xs:complexContent>
   <xs:extension base="Address">
    <xs:sequence>
     <xs:element name="state" type="USState"/>
     <xs:element name="zip"   type="xs:positiveInteger"/>
    </xs:sequence>
   </xs:extension>
  </xs:complexContent>
 </xs:complexType>

 <xs:complexType name="UKAddress">
  <xs:complexContent>
   <xs:extension base="Address">
    <xs:sequence>
     <xs:element name="postcode" type="UKPostcode"/>
    </xs:sequence>
    <xs:attribute name="exportCode" type="xs:positiveInteger" fixed="1"/>
   </xs:extension>
  </xs:complexContent>
 </xs:complexType>

In this schema the Address type defines the information common to addresses in general; its derived types add information specific to addresses from the United States and United Kingdom, respectively. The ability to reuse and build upon content models using extension is a powerful and useful feature of WXS that promotes modularity and content uniformity.

There is a caveat for type-aware processors that deal with types derived by extension, and it concerns the elements added to a content model by extension. In the future it is possible that type-aware languages like XQuery or XSLT 2.0 will be able to process XML elements and attributes polymorphically. For instance, an application can decide to process all elements of type Address or that have Address as their base type, choosing to process the information that is common to all types. However, a query such as

//*[. instance of Address]/city

could return unexpected results if dealing with a derived type that extended the content model in the following way


 <xs:complexType name="BadAddress">
  <xs:complexContent>
   <xs:extension base="Address">
    <xs:sequence>
     <!-- address format has two city entries, one for neighborhood
          and another for the actual city -->
     <xs:element name="city" type="xs:string"/>
     <xs:element name="state" type="xs:string"/>
     <xs:element name="country" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="exportCode" type="xs:positiveInteger" fixed="1"/>
   </xs:extension>
  </xs:complexContent>
 </xs:complexType>

Although the example is contrived and the scenario seems unlikely, it demonstrates a real risk. A more detailed exposition on this potential problem has been provided by Paul Prescod on XML-DEV.

Why You Should Very Carefully Use Restriction Of Complex Types

Restriction of complex types involves creating a derived complex type whose content model is a subset of the base type.

The parts of the WXS spec which describe derivation by restriction in complex types (Section 3.4.6 and Section 3.9.6) are generally considered to be its most complex parts. Most bugs in implementations cluster around this feature, and it is quite common to see implementers express exasperation when discussing the various nuances of derivation by restriction in complex types. Further, this kind of derivation does not neatly map to concepts in either object-oriented programming or relational database theory, the two worlds that are the primary producers and consumers of XML data. This is the exact opposite of the situation with derivation by extension of complex types.

Another challenge in using derivation by restriction of complex types arises from the way in which restrictions are declared: when a given complex type is to be derived by restriction from another complex type, its content model must be duplicated and refined. This duplication repeats definitions, possibly down a long derivation chain, so any modification to an ancestor type must be manually propagated down the derivation tree. Furthermore, such replication cannot cross namespace boundaries: deriving ns2:SlowCar from ns1:Car may not work if ns2:SlowCar has a child element, ns2:MaxSpeed, because it cannot be correctly derived from ns1:Car's child element ns1:MaxSpeed.

The following schema uses derivation by restriction to restrict a complex type, which describes a subscriber to the XML-DEV mailing list, to a type that describes me. Any element that conforms to the DareObasanjo type can also be validated as an instance of the XML-Deviant type.


<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <!-- base type -->
 <xs:complexType name="XML-Deviant">
  <xs:sequence>
   <xs:element name="numPosts" type="xs:integer" minOccurs="0" maxOccurs="1" />
   <xs:element name="signature" type="xs:string" nillable="true" />
  </xs:sequence>
  <xs:attribute name="firstSubscribed" type="xs:date" use="optional" />
  <xs:attribute name="mailReader" type="xs:string"/>
 </xs:complexType>

 <!-- derived type -->
 <xs:complexType name="DareObasanjo">
  <xs:complexContent>
   <xs:restriction base="XML-Deviant">
    <xs:sequence>
     <xs:element name="numPosts" type="xs:integer" minOccurs="1" />
     <xs:element name="signature" type="xs:string" nillable="false" />
    </xs:sequence>
    <xs:attribute name="firstSubscribed" type="xs:date" use="required" />
    <xs:attribute name="mailReader" type="xs:string" fixed="Microsoft Outlook" />
   </xs:restriction>
  </xs:complexContent>
 </xs:complexType>

</xs:schema>

Derivation by restriction of complex types is a multifaceted feature that is useful in situations where secondary types need to conform to a generic primary type, but also add their own constraints which go beyond those of the primary type. However, its extreme complexity requires that it be used only by those who have a firm grasp of WXS.

Why You Should Carefully Use Abstract Types

Borrowing a concept from OOP languages like C# and Java, both element declarations and complex type definitions can be made abstract. An abstract element declaration cannot be used to validate an element in an XML instance document and can only appear in content models via substitution. Similarly, an abstract complex type definition cannot be used to validate an element in an XML instance document, but it can be used as the abstract parent of an element's derived type or in cases where the element's type is overridden in the instance using xsi:type.

Abstract complex types and element declarations are useful for creating generic base types which contain information common to a set of types (such as Shape vs. Circle or Square), yet the definition is not deemed "complete" unless further derivation (extension or restriction) has been applied. While this feature is not complicated to use, some implications of its use are subtle and complex. Abstract types should be used with care.
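
As a minimal sketch of the Shape and Circle case, the schema below declares an abstract base type and one concrete type derived from it by extension. The child elements color and radius and the global element shape are hypothetical names chosen only for illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <!-- abstract base type: holds what is common to all shapes, but
      cannot itself be the type used to validate an element -->
 <xs:complexType name="Shape" abstract="true">
  <xs:sequence>
   <xs:element name="color" type="xs:string"/>
  </xs:sequence>
 </xs:complexType>

 <!-- concrete type derived from the abstract base -->
 <xs:complexType name="Circle">
  <xs:complexContent>
   <xs:extension base="Shape">
    <xs:sequence>
     <xs:element name="radius" type="xs:double"/>
    </xs:sequence>
   </xs:extension>
  </xs:complexContent>
 </xs:complexType>

 <!-- declared with the abstract type, so instances must name a
      concrete derived type using xsi:type -->
 <xs:element name="shape" type="Shape"/>

</xs:schema>
```

An instance would then have to select the concrete type explicitly. A shape element without xsi:type should be rejected, because Shape is abstract.

```xml
<shape xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:type="Circle">
 <color>red</color>
 <radius>2.5</radius>
</shape>
```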

Do Use Wildcards to Provide Well Defined Points Of Extensibility

WXS provides the wildcards xs:any and xs:anyAttribute which can be used to allow the occurrence of elements and attributes from specified namespaces into a content model. Wildcards allow schema authors to enable extensibility of the content model while maintaining a degree of control over the occurrence of elements and attributes. A good discussion of the benefits of using wildcards is available in an XML.com article, "W3C XML Schema Design Patterns: Dealing With Change".

Cautious schema authors, concerned with the problems posed by type derivation, may choose to block attempts at type derivation using the final attribute on complex type definitions and element declarations (similar to sealed in C# and final in Java). They may then choose to allow extensibility at specific parts of the content model by using wildcards. This gives schema authors more control over the content models they define and may reduce some of the problems with various aspects of complex type derivation (specifically derivation by extension).
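
The approach just described might look like the following sketch. The type name PostalAddress, its child elements, and the target namespace are hypothetical. The type is sealed with final="#all", and extension is permitted only at the end of the content model and only for elements and attributes from other namespaces.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
 targetNamespace="http://www.example.com/address/"
 elementFormDefault="qualified">

 <!-- final="#all" blocks derivation by both extension and restriction -->
 <xs:complexType name="PostalAddress" final="#all">
  <xs:sequence>
   <xs:element name="street" type="xs:string"/>
   <xs:element name="city" type="xs:string"/>
   <!-- the single, well-defined extension point: zero or more
        elements from any namespace other than the target namespace -->
   <xs:any namespace="##other" processContents="lax"
           minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
  <!-- attributes from other namespaces are allowed as well -->
  <xs:anyAttribute namespace="##other" processContents="lax"/>
 </xs:complexType>

</xs:schema>
```

Because the wildcard admits only foreign namespaces, an extension cannot introduce a second city element in the schema's own namespace, which is the kind of clash that derivation by extension makes possible.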

It should be noted that wildcards, if used improperly, can introduce non-determinism that violates the Unique Particle Attribution rule. The following schema causes such a problem.


<?xml version="1.0" encoding="utf-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
 targetNamespace="http://www.example.com/fruit/"
 elementFormDefault="qualified">

<xs:complexType name="myKitchen">
        <xs:choice maxOccurs="unbounded">
              <xs:any processContents="skip" />
              <xs:element name="apple" type="xs:string"/>
              <xs:element name="cherry" type="xs:string"/>            
        </xs:choice>
</xs:complexType>

</xs:schema>

The content model of the myKitchen type is such that it can contain one or more apple, cherry, or any other element. However, during validation, if an apple element is seen, the schema processor cannot tell whether it should be validated against the wildcard or the apple element declaration.
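
One common way to remove the ambiguity, offered here as a sketch and not as the only fix, is to limit the wildcard to namespaces other than the target namespace. Since elementFormDefault is qualified, apple and cherry belong to the target namespace and can no longer match the wildcard.

```xml
<?xml version="1.0" encoding="utf-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
 targetNamespace="http://www.example.com/fruit/"
 elementFormDefault="qualified">

<xs:complexType name="myKitchen">
        <xs:choice maxOccurs="unbounded">
              <!-- only elements from other namespaces match the wildcard -->
              <xs:any namespace="##other" processContents="skip" />
              <xs:element name="apple" type="xs:string"/>
              <xs:element name="cherry" type="xs:string"/>
        </xs:choice>
</xs:complexType>

</xs:schema>
```

The trade-off is that the content model no longer accepts arbitrary elements from the target namespace, and in WXS 1.0 a ##other wildcard does not match unqualified elements either.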

There are subtle but potentially profound ramifications to the selection of both the namespace attribute and the processContents attribute. Overly restrictive values can impede extensibility; overly loose values can open the schema up to abuse. Controlling the supported namespaces for a wildcard can also be bewildering, especially when the set of allowable namespaces is subject to change.

Do Not Use Group or Type Redefinition

Redefinition is a feature of WXS that allows you to change the meaning of an included type or group definition. Using xs:redefine, schema authors can include type or group definitions from schema documents and alter these definitions in a pervasive manner. Redefinition is pervasive because it not only affects type or group definitions in the including schema but also those in the included schema as well. Thus all references to the original type or group in both schemas refer to the redefined type, while the original definition is overshadowed. This leads to the problems pointed out in "W3C XML Schema Design Patterns: Dealing With Change":

This causes a certain degree of fragility because redefined types can adversely interact with derived types and generate conflicts. A common conflict is when a derived type uses extension to add an element or attribute to a type's content model, and a redefinition also adds a similarly named element or attribute to the content model.
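
For readers who have not seen the syntax, a redefinition looks roughly like the sketch below. The file name address.xsd is hypothetical, and the included schema is assumed to have no target namespace and to define a complex type named Address.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <!-- address.xsd is a hypothetical included schema that defines Address -->
 <xs:redefine schemaLocation="address.xsd">
  <!-- the redefined type must derive from itself -->
  <xs:complexType name="Address">
   <xs:complexContent>
    <xs:extension base="Address">
     <xs:sequence>
      <!-- every reference to Address, in both schemas, now sees this -->
      <xs:element name="country" type="xs:string"/>
     </xs:sequence>
    </xs:extension>
   </xs:complexContent>
  </xs:complexType>
 </xs:redefine>

</xs:schema>
```

If some type elsewhere already extends Address with its own country element, that type ends up with two country elements after the redefinition, which is the kind of conflict the quoted passage warns about.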

A major problem with type redefinition is that, unlike type derivation, it cannot be prevented by using the block or final attributes. Any schema can therefore have its types redefined in a pervasive manner, altering their semantics completely. It is advisable to avoid this feature due to the potential conflicts it can cause.

Many schema authors attempt to use type redefinition to increase the value space of an enumeration but this does not work. The only way to increase the number of values accepted by an enumeration used as a base type is to create a union. However, those additional values are only available to applications of the resulting union type, not for the applications of the original base type. Also note that chained redefinitions (redefining a redefine) can be problematic, resulting in unexpected definition clashes.
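
A minimal sketch of the union approach follows. The type names Fruit and MoreFruit and the enumeration values are hypothetical, and the xs prefix is assumed to be bound to the XML Schema namespace on the enclosing xs:schema element.

```xml
<!-- original enumeration, left untouched -->
<xs:simpleType name="Fruit">
 <xs:restriction base="xs:string">
  <xs:enumeration value="apple"/>
  <xs:enumeration value="cherry"/>
 </xs:restriction>
</xs:simpleType>

<!-- union of the original enumeration and an anonymous type
     that carries the additional values -->
<xs:simpleType name="MoreFruit">
 <xs:union memberTypes="Fruit">
  <xs:simpleType>
   <xs:restriction base="xs:string">
    <xs:enumeration value="mango"/>
   </xs:restriction>
  </xs:simpleType>
 </xs:union>
</xs:simpleType>
```

Elements and attributes declared with MoreFruit accept apple, cherry, and mango, while anything still declared with Fruit accepts only the original two values.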

Conclusion

The WXS recommendation is a complex specification because it attempts to solve complex problems. One can reduce its burdens by utilizing its simpler aspects. Schema authors should ensure that their schemas validate in multiple schema processors. Schemas are an important facilitator of interoperability. It's foolish to depend on the nuances of a specific implementation and inadvertently give up this interoperability.

Acknowledgments

I'd like to thank Priya Lakshminarayanan and Mark Feblowitz for their help with this article.
