W3C XML Schema Design Patterns: Avoiding Complexity
Over the course of the past year, during which I've worked closely with W3C XML Schema (WXS), I've observed many schema authors struggle with various aspects of the language. Given the size and relative complexity of the WXS recommendation (parts one and two), it seems that many schema authors would be best served by understanding and utilizing an effective subset instead of attempting to comprehend all of its esoterica.
There have been a few public attempts to define an effective subset of W3C XML Schema for general usage; the most notable have been W3C XML Schema Made Simple by Kohsuke Kawaguchi and the X12 Reference Model for XML Design by the Accredited Standards Committee (ASC) X12. However, both documents are extremely conservative and advise against useful features of WXS without adequately describing the cost of doing so.
This article is primarily a counterpoint to Kohsuke's and considers each of his original guidelines; the goal is to provide a set of solid guidelines about what you should do and shouldn't do when working with WXS.
I've altered some of Kohsuke's original guidelines:
targetNamespace attribute (aka chameleon schema).
I propose some additional guidelines as well:
The guidelines qualified with the word carefully are best avoided by novice users unless absolutely required by the problem being solved.
An element declaration is used to specify the structure, type, occurrence, and value constraints for an element. The element declaration is the most important and common piece of a schema document.
Element declarations that appear as children of the xs:schema element are global elements, which can be reused by referencing them in other parts of the schema or from other schema documents. They can also be members of substitution groups. Since the WXS recommendation doesn't provide a mechanism for specifying the root element of the document being validated, any global element can be used as the root element for a valid document.
Element declarations that appear within complex type or model group definitions, and that aren't references to a global element, are local elements. Unlike global elements, there can be many local element declarations with the same name and differing types in a schema as long as the local elements are not declared at the same level. Section 3.3 of the W3C XML Schema Primer gives the following example:
You can only declare one global element called "title", and that element is bound to a single type (e.g., xs:string or PersonTitle). However, you can locally declare one element called "title" that has a string type, and is a subelement of "book". Within the same schema (target namespace) you can declare a second element also called "title" that is an enumeration of the values "Mr Mrs Ms".
Global element declarations should be used for elements that will be reused from the target schema as well as from other schema documents, when the element and its associated type are comfortably bound together for widespread use. Local elements are to be favored when element declarations only make sense in the context of the declaring type and are unlikely to be reused.
By default, global elements have a namespace name equivalent to that of the target namespace of the schema, while local elements have no namespace name. So, by default, elements in an XML document which are meant to be validated against global element declarations should have a namespace name identical to that of the global element's schema target namespace. Those which are to be validated against local elements should have no namespace name. For example, consider this schema:
test.xsd
<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.example.com"
xmlns="http://www.example.com">
<!-- global element declaration validates
<language> elements from http://www.example.com
namespace -->
<xs:element name="language" type="xs:string" />
<xs:element name="Root" type="sequenceOfLanguages" />
<xs:element name="Root2" type="sequenceOfLanguages2" />
<!-- complex type with local element declaration
validates <language> elements without a namespace
name -->
<xs:complexType name="sequenceOfLanguages" >
<xs:sequence>
<xs:element name="language" type="xs:NMTOKEN" maxOccurs="unbounded" />
</xs:sequence>
</xs:complexType>
<!-- complex type with reference to global
element declaration -->
<xs:complexType name="sequenceOfLanguages2" >
<xs:sequence>
<xs:element ref="language" maxOccurs="10" />
</xs:sequence>
</xs:complexType>
</xs:schema>
test.xml
<?xml version="1.0"?>
<ex:Root xmlns:ex="http://www.example.com">
<language>EN</language>
</ex:Root>
test2.xml
<?xml version="1.0"?>
<ex:Root2 xmlns:ex="http://www.example.com">
<ex:language>English</ex:language>
<ex:language>Klingon</ex:language>
</ex:Root2>
An attribute declaration is used to specify the type, optionality, and defaulting information for an attribute.
Attribute declarations that appear as children of the xs:schema element are global attributes, which can be reused by referencing them in other parts of the schema or from other schema documents. Attribute declarations that appear within complex type definitions, and that do not reference global attributes, are local attributes.
Global attribute declarations should be used for attributes that will be reused from the target schema as well as from other schema documents. Local attributes should be used when attribute declarations only make sense in the context of the declaring type and are unlikely to be reused. Since attributes are usually tightly coupled to their parent elements, local attribute declarations are typically favored by schema authors. But there are cases where global attributes which can apply to many elements from multiple namespaces are useful (for example, xsi:type and xsi:schemaLocation).
By default, global attributes have a namespace name equivalent to that of the target namespace of the schema, while local attributes have no namespace name. Thus, attributes which are to be validated against global attribute declarations should have a namespace name identical to that of the global attribute's schema target namespace. Those to be validated against local attributes should have no namespace name. For example:
test.xsd
<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.example.com"
xmlns="http://www.example.com">
<!-- global attribute declaration validates
language attributes from http://www.example.com namespace -->
<xs:attribute name="language" type="xs:string" />
<xs:element name="Root" type="sequenceOfNotes" />
<xs:element name="Root2" type="sequenceOfNotes2" />
<!-- complex type with local attribute
declaration validates language attributes without a
namespace name -->
<xs:complexType name="sequenceOfNotes" >
<xs:sequence>
<xs:element name="Note" type="xs:string" />
</xs:sequence>
<xs:attribute name="language" type="xs:NMTOKEN" />
</xs:complexType>
<!-- complex type with reference to
global attribute declaration -->
<xs:complexType name="sequenceOfNotes2" >
<xs:sequence>
<xs:element name="Note" type="xs:string" />
</xs:sequence>
<xs:attribute ref="language" />
</xs:complexType>
</xs:schema>
test.xml
<?xml version="1.0"?>
<ex:Root xmlns:ex="http://www.example.com" language="EN" >
<Note>Nothing to see here</Note>
</ex:Root>
test2.xml
<?xml version="1.0"?>
<ex:Root2 xmlns:ex="http://www.example.com" ex:language="The English Language">
<Note>Nothing to see here</Note>
</ex:Root2>
Support for XML Namespaces is woven tightly into the WXS recommendation. Namespaces are used in a number of places:
Thus, schema authors should be familiar with how namespaces work, including their effect on W3C XML Schema. I wrote two MSDN articles which address this issue: "XML Namespaces and How They Affect XPath and XSLT" provides a detailed overview of XML namespaces and "Working with Namespaces in XML Schema" explains the ramifications of namespaces in WXS.
Elements or attributes with a
namespace name are said to be "namespace qualified". It's possible to
override whether local declarations validate namespace qualified elements
and attributes or not. The
xs:schema element has the elementFormDefault and attributeFormDefault attributes, which specify whether local
declarations in the schema should validate namespace qualified elements
and attributes respectively. The valid values for either attribute are
"qualified" and "unqualified". The default value of both attributes is
"unqualified".
The form attribute on local element and attribute declarations can be
used to override the values of the elementFormDefault and
attributeFormDefault attributes specified on the
xs:schema element. This allows for fine-grained control over the way
validation of elements and attributes in the instance document operates
in relation to global or local declarations.
The following example, taken from Kohsuke's article (the "Why You Should Avoid Local Declarations" section), shows exactly how these attributes can significantly affect the outcome of validation:
This schema
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://example.com">
<xs:element name="person">
<xs:complexType>
<xs:sequence>
<xs:element name="familyName" type="xs:string" />
<xs:element name="firstName" type="xs:string" />
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
validates the following document
<foo:person xmlns:foo="http://example.com">
<familyName> KAWAGUCHI </familyName>
<firstName> Kohsuke </firstName>
</foo:person>
which is unlikely to be what the schema author intended. And it's ugly, too. Altering the schema thus:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://example.com"
elementFormDefault="qualified">
<xs:element name="person">
<xs:complexType>
<xs:sequence>
<xs:element name="familyName" type="xs:string" />
<xs:element name="firstName" type="xs:string" />
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
allows it to validate
<person xmlns="http://example.com">
<familyName> KAWAGUCHI </familyName>
<firstName> Kohsuke </firstName>
</person>
or
<foo:person xmlns:foo="http://example.com">
<foo:familyName> KAWAGUCHI </foo:familyName>
<foo:firstName> Kohsuke </foo:firstName>
</foo:person>
Leaving the value of the attributeFormDefault attribute as
"unqualified" makes sense because most schema authors don't want to have
to namespace qualify all attributes explicitly by prefixing them.
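The per-declaration form attribute described earlier can be sketched as follows. This is only an illustration: the nickname element and the id attribute are hypothetical additions to the person example, not part of Kohsuke's schema.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.com"
           elementFormDefault="qualified">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <!-- inherits elementFormDefault, so it must be namespace qualified -->
        <xs:element name="familyName" type="xs:string" />
        <!-- hypothetical element: form overrides the schema-wide default -->
        <xs:element name="nickname" type="xs:string"
                    form="unqualified" minOccurs="0" />
      </xs:sequence>
      <!-- hypothetical attribute: form overrides attributeFormDefault -->
      <xs:attribute name="id" type="xs:string" form="qualified" />
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Under these assumptions, an instance has to mix qualified and unqualified names, which is one reason to use form sparingly:

```xml
<foo:person xmlns:foo="http://example.com" foo:id="p1">
  <foo:familyName>KAWAGUCHI</foo:familyName>
  <nickname>Kohsuke</nickname>
</foo:person>
```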
An attribute group definition is a way to create a named collection of attribute declarations and attribute wildcards. Attribute groups increase the modularity of schemas. You can declare a commonly used set of attributes in a single location and then reference them from other schemas.
Kohsuke's article describes attribute groups as an alternative to global attribute declarations, which is not entirely accurate and may give the impression that the two are mutually exclusive. A globally declared attribute is an individual, reusable attribute declaration. An attribute group is a modularly clustered set of attributes; the attribute declarations in an attribute group can be either local attribute declarations or references to global declarations.
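To show that the two mechanisms combine rather than compete, here is a minimal sketch in which one attribute group holds both a reference to a global attribute and a local declaration. The names noteAttributes and author are hypothetical.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.com"
           xmlns="http://www.example.com">
  <!-- global attribute declaration, reusable on its own -->
  <xs:attribute name="language" type="xs:language" />

  <!-- attribute group: one reference to the global attribute,
       one local attribute declaration -->
  <xs:attributeGroup name="noteAttributes">
    <xs:attribute ref="language" />
    <xs:attribute name="author" type="xs:string" />
  </xs:attributeGroup>

  <!-- the whole group is pulled in with a single reference -->
  <xs:element name="Note">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attributeGroup ref="noteAttributes" />
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
```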
A model group definition is a mechanism for creating named groups of elements using the all, choice, or sequence compositors. Model groups are useful for reusing groups of elements without resorting to type derivation. However, model groups are not a replacement for complex types; they cannot contain attribute declarations and they cannot be specified as the type of an element declaration. Additionally, derivation of model groups is much more limited than derivation of complex types.
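As a rough illustration, the sketch below reuses one named model group in two otherwise unrelated complex types. All names (personName, AuthorType, EmployeeType) are hypothetical. Note that the attribute has to live on the complex type, since the group cannot carry it.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.com"
           xmlns="http://www.example.com"
           elementFormDefault="qualified">
  <!-- named model group: a reusable run of elements -->
  <xs:group name="personName">
    <xs:sequence>
      <xs:element name="familyName" type="xs:string" />
      <xs:element name="firstName" type="xs:string" />
    </xs:sequence>
  </xs:group>

  <xs:complexType name="AuthorType">
    <xs:sequence>
      <xs:group ref="personName" />
      <xs:element name="publisher" type="xs:string" />
    </xs:sequence>
    <!-- attributes are declared on the type, not in the group -->
    <xs:attribute name="id" type="xs:string" />
  </xs:complexType>

  <xs:complexType name="EmployeeType">
    <xs:sequence>
      <xs:group ref="personName" />
      <xs:element name="department" type="xs:string" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```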
A major benefit of WXS over DTDs in XML 1.0 is the existence of datatypes. The ability to specify that the values of elements or attributes are strings, dates, or numeric data enables schema authors to specify and validate the contents of XML data in an interoperable and platform independent manner. Given the number of built-in datatypes (44 by my count), it may be wise for schema authors to standardize on a subset of the built-in types to avoid information overload.
In most cases users can do without the subtypes of xs:string (e.g. xs:ENTITY or xs:language), the subtypes of xs:integer (e.g. xs:short or xs:unsignedByte), or the Gregorian date types (e.g. xs:gMonthDay or xs:gYearMonth). Eliminating these types leaves a smaller, more manageable set to learn.
A complex type definition is used to specify a content model consisting of elements and attributes. An element declaration can specify its content model by referring to a named or anonymous complex type. Named complex types can be referenced by name from the schema they are defined in or by external schema documents; anonymous complex types must be defined within the declaration for the element which uses the type. Additionally the content models of named complex types can be extended or restricted using WXS inheritance mechanisms.
Complex types are similar to model group definitions with two main differences. First, complex type definitions can include attributes in the content models they define. Second, it's possible to use type derivation with complex types, which isn't the case with named model groups. In his article, Kohsuke advocates using a combination of anonymous complex types, model group definitions, and attribute groups to specify the content model of an element instead of named complex types. He does so in an attempt to avoid dealing with what he sees as the complexity of named complex types. However, I'd counter that using three mechanisms instead of one to specify the content model of an element is actually more prone to confusion. Thus, in addition to the fact that named complex types allow for reuse of content models, they're also the most straightforward way of specifying the content model of an element.
Anonymous complex types should only be used if references to the type will not be needed outside the element declaration and there is no need for type derivation. It is important to note that it is not possible to derive a new type from an anonymous complex type. In general, schemas that make heavy use of anonymous types are likely to have problems with uniformity and consistency.
Kohsuke's admonition to avoid notation declarations is spot on. They exist only to provide backward compatibility with DTDs, except they are not backward compatible with DTD notations. Pretend they do not exist. I certainly do.
Substitution Groups provide a mechanism similar to subtype polymorphism in programming languages. One or more elements can be marked as being substitutable for a global element (also called the head element), which means that members of this substitution group are interchangeable with the head element in a content model. For example, for an Address substitution group with members USAddress and UKAddress, the generic element Address can be used in the content model, or it can be substituted by a USAddress or a UKAddress. The only requirement is that the members of the substitution group must be of the same type or be in the same type hierarchy as the head element.
The following is an example schema and the instance which it validates:
example.xsd:
<xs:schema
xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.example.com"
xmlns:ex="http://www.example.com"
elementFormDefault="qualified">
<xs:element name="book" type="xs:string" />
<xs:element name="magazine" type="xs:string" substitutionGroup="ex:book" />
<xs:element name="library">
<xs:complexType>
<xs:sequence>
<xs:element ref="ex:book" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
example.xml:
<library xmlns="http://www.example.com">
<magazine>MSDN Magazine</magazine>
<book>Professional XML Databases</book>
</library>
The content model of the library element says that it can
hold one or more book elements. Since magazine
elements are in the book substitution group, it's valid for
magazine elements to appear in the instance XML where
book elements are expected.
Substitution groups make content models more flexible and allow
extensibility in directions the schema author may not have
anticipated. This flexibility is a two-edged sword: although it allows
greater extensibility, it makes processing documents based on such schemas
more difficult. For instance, the code that processes the
library element must not only handle its child
book elements but magazine elements as well. If
the instance document specified additional schemas via the
xsi:schemaLocation attribute, the processing application could have to
deal with even more members of the book substitution group as
children of the library element.
Another complication is that members of a substitution group can be of
a type derived from the substitution group's head. Writing code to
properly handle any derived type generically is difficult, especially
since there are two opposite notions of derivation. The first,
restriction, restricts the range of values in the content model. The
second, extension, adds elements or attributes to the content
model. Certain attributes on element declarations can be used to give
schema authors more control over element substitutions in instance
documents and reduce the likelihood of unexpected substitutions in XML
instance documents. The block attribute is used to specify
whether elements whose types use a certain derivation method can
substitute for the element in an instance document, while the
final attribute is used to specify whether elements whose
types use a certain derivation method can declare themselves to be part of
the target element's substitution group. The default values of the
block and final attributes for all element
declarations in a schema can be specified via the
blockDefault and finalDefault attributes of the
root xs:schema element. By default all substitutions are
allowed without limitation.
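A minimal sketch of the two attributes follows. The declarations book, periodical, and magazine and the type PublicationType are hypothetical, and the comments restate the behavior described above rather than any particular processor's diagnostics.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.com"
           xmlns:ex="http://www.example.com"
           elementFormDefault="qualified">
  <xs:complexType name="PublicationType">
    <xs:sequence>
      <xs:element name="title" type="xs:string" />
    </xs:sequence>
  </xs:complexType>

  <!-- block="substitution": in an instance, no member of the
       substitution group may appear in place of book -->
  <xs:element name="book" type="ex:PublicationType"
              block="substitution" />

  <!-- final="extension": a declaration whose type is derived by
       extension from PublicationType may not name periodical as
       its substitutionGroup -->
  <xs:element name="periodical" type="ex:PublicationType"
              final="extension" />

  <!-- still allowed: magazine has the same type as the head -->
  <xs:element name="magazine" type="ex:PublicationType"
              substitutionGroup="ex:periodical" />
</xs:schema>
```

Setting blockDefault or finalDefault on xs:schema would apply the same choice to every declaration that does not specify its own value.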
DTDs provide a mechanism for specifying that an attribute's type is ID, i.e., its value will be unique within the document and matches the Name production in XML 1.0. IDs in XML 1.0 can also be referenced by attributes of type IDREF or IDREFS. For compatibility with DTDs, WXS has the xs:ID, xs:IDREF, and xs:IDREFS types.
WXS identity constraints are used for specifying unique values, keys, or references to keys using XPath expressions defined within the scope of an element declaration. Comparing feature for feature, the identity constraint mechanisms offer more than ID/IDREF. First, there is no limit on the values or types that can be used as part of an identity constraint. IDs can only be one of a specific range of values (e.g., 7 is not a valid ID). A more important benefit of the schema identity constraints is that an ID or IDREF has to be unique within the document, but WXS identity constraints don't. The symbol space for unique IDs is the entire document, but for unique keys it's the target scope of the XPath. This is particularly useful if uniqueness is needed in two overlapping value spaces with different scopes in the same XML document. For example, consider an XML document that contained room numbers and table numbers for a hotel. It is likely that some of the numbers overlap (i.e. there is a room 18 and a table 18), but they should be unique within either value space.
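The hotel scenario can be sketched like this. The element and constraint names are hypothetical, and the schema deliberately has no target namespace so that the unprefixed XPath expressions match.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="hotel">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="room" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="number" type="xs:positiveInteger"
                          use="required" />
          </xs:complexType>
        </xs:element>
        <xs:element name="table" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="number" type="xs:positiveInteger"
                          use="required" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <!-- room numbers are unique among rooms only -->
    <xs:unique name="uniqueRoomNumber">
      <xs:selector xpath="room" />
      <xs:field xpath="@number" />
    </xs:unique>
    <!-- table numbers are unique among tables only -->
    <xs:unique name="uniqueTableNumber">
      <xs:selector xpath="table" />
      <xs:field xpath="@number" />
    </xs:unique>
  </xs:element>
</xs:schema>
```

With these two separately scoped constraints, room 18 and table 18 can coexist, while a second room 18 would be rejected:

```xml
<hotel>
  <room number="18" />
  <room number="19" />
  <table number="18" />
</hotel>
```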
The WXS family of ID types are not exactly compatible with the DTD ID
types. First, the xs:ID, xs:IDREF, and
xs:IDREFS types can be applied to both elements and
attributes in WXS, although they can only apply to attributes in their DTD
equivalents. Second, there's no restriction on how many attributes of type
xs:ID can appear on an element, although such a restriction
exists for ID attributes in the DTD equivalents.
The target namespace of a schema document identifies the namespace name of the elements and attributes which can be validated against the schema. A schema without a target namespace can typically only validate elements and attributes without a namespace name. However, if a schema without a target namespace is included in a schema with a target namespace, the schema without a target namespace assumes the target namespace of the including schema. This feature is typically called the Chameleon schema design pattern.
In Kohsuke's article he claims that the chameleon schema pattern does not work, which is incorrect. A full rebuttal of Kohsuke's claim was made by Michael Leditschke on XML-DEV, and it shows that the design pattern does work and is useful for creating a reusable module of type definitions and declarations.
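A minimal sketch of the pattern, assuming two hypothetical files named common.xsd and main.xsd. The first has no target namespace and only defines a reusable type:

```xml
<!-- common.xsd (hypothetical file name): no targetNamespace -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="PersonType">
    <xs:sequence>
      <xs:element name="name" type="xs:string" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```

The second includes it, at which point PersonType takes on the including schema's target namespace and can be referenced with the ex prefix:

```xml
<!-- main.xsd (hypothetical file name) -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.com"
           xmlns:ex="http://www.example.com">
  <xs:include schemaLocation="common.xsd" />
  <xs:element name="employee" type="ex:PersonType" />
</xs:schema>
```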
There is a problem with combining chameleon schemas with
identity constraints. Although QName references to types,
definitions, and declarations in the chameleon schema are coerced into
the namespace of the including schema, the same is not done for XPath
expressions used by xs:key, xs:keyref, and
xs:unique identity constraints. Consider the following
schema:
<xs:schema
xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified">
<xs:element name="Root">
<xs:complexType>
<xs:sequence>
<xs:element name="person" type="PersonType" maxOccurs="unbounded" />
</xs:sequence>
</xs:complexType>
<xs:key name="PersonKey">
<xs:selector xpath="person"/>
<xs:field xpath="@name"/>
</xs:key>
<xs:keyref name="BestFriendKey" refer="PersonKey">
<xs:selector xpath="person"/>
<xs:field xpath="@best-friend"/>
</xs:keyref>
</xs:element>
<xs:complexType name="PersonType">
<xs:simpleContent>
<xs:extension base="xs:string">
<xs:attribute name="best-friend" type="xs:string" />
<xs:attribute name="name" type="xs:string" />
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:schema>
If this schema is included in another schema with a target namespace,
the XPath expressions in both the key and keyref will fail. In this
specific example, the person element is in no namespace in
the chameleon schema, but once included in another schema it picks up that
target namespace. The XPath expressions, which match a person element in no
namespace, will then stop matching, and nothing signals the failure, because
processors are not obliged to ensure that path expressions in identity
constraints actually return results.
The point is that it is not advisable to use identity constraints in chameleon schemas.
The primary complaint against default and fixed values is that they cause new data to be inserted into the source XML after validation, thus changing the data. This means that an unvalidated document that has a schema with default values is incomplete. Tying the actual content of the XML document to the validation process is unwise since a schema may not always be available. It's also unwise to assume that consumers of the document will always perform validation.
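The concern can be seen in a small sketch, where the order element and its currency attribute are hypothetical:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:string" />
      </xs:sequence>
      <!-- hypothetical attribute whose value the schema supplies -->
      <xs:attribute name="currency" type="xs:string" default="USD" />
    </xs:complexType>
  </xs:element>
</xs:schema>
```

```xml
<!-- As written, this document carries no currency at all. Only a
     consumer that validates against the schema above will see
     currency="USD". -->
<order>
  <item>Professional XML Databases</item>
</order>
```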
The xs:QName type has additional validation problems caused by the fact that it has no canonical form. Consider this schema and XML instance:
<xs:schema
xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.example.com"
xmlns:ex="http://www.example.com"
xmlns:ex2="ftp://ftp.example.com"
elementFormDefault="qualified">
<xs:element name="Root">
<xs:complexType>
<xs:sequence>
<xs:element name="Node" type="xs:QName" default="ex2:FtpSite" />
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
<Root xmlns="http://www.example.com"
xmlns:ex2="smtp://smtp.example.org"
xmlns:foo="ftp://ftp.example.com">
<Node />
</Root>
What value should be inserted into the Node element upon
validation? Should it be "ex2:FtpSite"? Even if the ex2 prefix is mapped
to a different namespace in the instance document than in the schema?
Maybe it should be "foo:FtpSite" because the prefix "foo" is mapped to the
same namespace that "ex2" was mapped to in the schema. But then what would
happen if no XML namespace declaration existed for the
ftp://ftp.example.com namespace? Would a namespace
declaration have to be inserted? None of these questions can be answered
in a satisfactory manner without violating some opinions as to what the
correct behavior should be. It is best to avoid using
xs:QName default values because it's unlikely that different
implementations agree on the relevant semantics.
Restriction of a simple type involves constraining the facets of the type, thus reducing the permitted values of the type. Such restrictions involve specifying a maximum length for a string value, specifying a date range, or enumerating the list of permitted values. Types constrained in this manner are very commonly used by schema authors and account for most uses of type derivation in WXS. Such types can be used by both elements and attributes as their type definition.
Extension of simple types allows one to create a complex type (i.e. an element content model) with simple content that has attributes. A typical extension scenario is any situation where an element declaration has a simple type as its content and one or more attributes. Since such element content models occur commonly in XML documents, derivation by extension is another commonly used feature.
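Both kinds of simple type derivation can be sketched together. The names CurrencyCode, Price, and price are hypothetical, and the schema has no target namespace to keep the type references short.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- restriction: enumerate the permitted values of a string -->
  <xs:simpleType name="CurrencyCode">
    <xs:restriction base="xs:string">
      <xs:enumeration value="USD" />
      <xs:enumeration value="GBP" />
      <xs:enumeration value="EUR" />
    </xs:restriction>
  </xs:simpleType>

  <!-- extension: simple decimal content plus an attribute -->
  <xs:complexType name="Price">
    <xs:simpleContent>
      <xs:extension base="xs:decimal">
        <xs:attribute name="currency" type="CurrencyCode"
                      use="required" />
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

  <xs:element name="price" type="Price" />
</xs:schema>
```

An instance such as the following would validate against this sketch:

```xml
<price currency="USD">19.99</price>
```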
As with complex types, there are named and anonymous simple types. Named simple types can be referenced by name from the schema they are defined in or from external schema documents. Anonymous simple types must be defined within the declaration for the element or attribute which uses the type. And type derivation can only be performed on named types.
A common misconception is that anonymous types with the same structure are the same type. In other words, assuming that this schema fragment
<!-- fragment A -->
<xs:element name="quantity">
<xs:simpleType>
<xs:restriction base="xs:positiveInteger">
<xs:maxExclusive value="100"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
<xs:element name="size">
<xs:simpleType>
<xs:restriction base="xs:positiveInteger">
<xs:maxExclusive value="100"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
is equivalent to
<!-- fragment B -->
<xs:simpleType name="underHundred">
<xs:restriction base="xs:positiveInteger">
<xs:maxExclusive value="100"/>
</xs:restriction>
</xs:simpleType>
<xs:element name="size" type="underHundred"/>
<xs:element name="quantity" type="underHundred"/>
is incorrect with regard to whether both element declarations have the same type. Various aspects of WXS may require element declarations to have the same type (substitution groups, specifying key/keyref pairs, and type derivation). For instance, a keyref must be of the same type as a key. Most features of WXS treat the element declarations in fragment A as having different types and those in fragment B as having the same type.
Extension of a complex type involves adding extra attributes or elements to the content model in the derived type. Elements added via extension are treated as if they were appended to the content model of the base type in sequence. This technique is useful for extracting the common aspects of a set of complex types and then reusing these commonalities via extending the base type definition. The following schema fragment showing how extension enables the reuse of common aspects of a mailing address is taken from the discussion on complex type extension and example in the WXS Primer.
<xs:complexType name="Address">
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="street" type="xs:string"/>
<xs:element name="city" type="xs:string"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="USAddress">
<xs:complexContent>
<xs:extension base="Address">
<xs:sequence>
<xs:element name="state" type="USState"/>
<xs:element name="zip" type="xs:positiveInteger"/>
</xs:sequence>
</xs:extension>
</xs:complexContent>
</xs:complexType>
<xs:complexType name="UKAddress">
<xs:complexContent>
<xs:extension base="Address">
<xs:sequence>
<xs:element name="postcode" type="UKPostcode"/>
</xs:sequence>
<xs:attribute name="exportCode" type="xs:positiveInteger" fixed="1"/>
</xs:extension>
</xs:complexContent>
</xs:complexType>
In this schema the Address type defines the information
common to addresses in general; its derived types add information specific
to addresses from the United States and United Kingdom, respectively. The
ability to reuse and build upon content models using extension is a
powerful and useful feature of WXS that promotes modularity and content uniformity.
There is a caveat for processors that deal with types derived by
extension. This caveat has to do with type-aware processors and the
elements added to a content model by extension. In the future it is
possible that type-aware languages like XQuery or XSLT 2.0 will be able to process
XML elements and attributes polymorphically. For instance, an application
can decide to process all elements of type Address or that
have Address as their base type, choosing to process the
information that is common to all types. However a query such as
//*[. instance of Address]/city
could return unexpected results if dealing with a derived type that extended the content model in the following way:
<xs:complexType name="BadAddress">
<xs:complexContent>
<xs:extension base="Address">
<xs:sequence>
<!-- address format has two city entries, one for neighborhood
and another for the actual city -->
<xs:element name="city" type="xs:string"/>
<xs:element name="state" type="xs:string"/>
<xs:element name="country" type="xs:string"/>
</xs:sequence>
<xs:attribute name="exportCode" type="xs:positiveInteger" fixed="1"/>
</xs:extension>
</xs:complexContent>
</xs:complexType>
Although the example is contrived and the scenario seems unlikely, it demonstrates a real risk. A more detailed exposition on this potential problem has been provided by Paul Prescod on XML-DEV.
Restriction of complex types involves creating a derived complex type whose content model is a subset of the base type.
The parts of the WXS spec which describe derivation by restriction in complex types (Section 3.4.6 and Section 3.9.6) are generally considered to be its most complex parts. Most bugs in implementations cluster around this feature, and it is quite common to see implementers express exasperation when discussing the various nuances of derivation by restriction in complex types. Further, this kind of derivation does not neatly map to concepts in either object oriented programming or relational database theory, which are the primary producers and consumers of XML data. This is the exact opposite of the situation with derivation by extension of complex types.
Another challenge in using derivation by restriction of complex types
arises from the way in which restrictions are declared: when a given
complex type is to be derived by restriction from another complex type,
its content model must be duplicated and refined. Duplication of a
definition replicates definitions, possibly down a long derivation chain,
so any modification to an ancestor type must be manually propagated down
the derivation tree. Furthermore, such replication cannot cross namespace
boundaries -- deriving ns2:SlowCar from ns1:Car
may not work if ns2:SlowCar has a child element,
ns2:MaxSpeed, because it cannot be correctly derived from
ns1:Car's child element ns1:MaxSpeed.
The following schema uses derivation by restriction to restrict a complex
type, which describes a subscriber to the XML-DEV mailing list, to a type
that describes me. Any element that conforms to the
DareObasanjo type can also be validated as an instance of the
XML-Deviant type.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<!-- base type -->
<xs:complexType name="XML-Deviant">
<xs:sequence>
<xs:element name="numPosts" type="xs:integer" minOccurs="0"
maxOccurs="1" />
<xs:element name="signature" type="xs:string" nillable="true" />
</xs:sequence>
<xs:attribute name="firstSubscribed" type="xs:date" use="optional" />
<xs:attribute name="mailReader" type="xs:string"/>
</xs:complexType>
<!-- derived type -->
<xs:complexType name="DareObasanjo">
<xs:complexContent>
<xs:restriction base="XML-Deviant">
<xs:sequence>
<xs:element name="numPosts" type="xs:integer" minOccurs="1" />
<xs:element name="signature" type="xs:string" nillable="false" />
</xs:sequence>
<xs:attribute name="firstSubscribed" type="xs:date" use="required" />
<xs:attribute name="mailReader" type="xs:string" fixed="Microsoft Outlook" />
</xs:restriction>
</xs:complexContent>
</xs:complexType>
</xs:schema>
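To make the relationship between the two types concrete, here is a sketch of an instance. It assumes the schema above also contained a global declaration such as `<xs:element name="subscriber" type="XML-Deviant"/>`, which is not part of the original listing; the numPosts, signature, and firstSubscribed values are placeholders.

```xml
<!-- Assumes a hypothetical global element "subscriber" of type XML-Deviant -->
<subscriber xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:type="DareObasanjo"
            firstSubscribed="2000-01-01"
            mailReader="Microsoft Outlook">
  <numPosts>42</numPosts>
  <signature>placeholder signature</signature>
</subscriber>
```

This instance satisfies the tighter constraints of DareObasanjo (numPosts present, signature not nil, firstSubscribed present, mailReader equal to the fixed value), and would still be valid against XML-Deviant if the xsi:type attribute were removed.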
Derivation by restriction of complex types is a multifaceted feature that is useful in situations where secondary types need to conform to a generic primary type, but also add their own constraints which go beyond those of the primary type. However, its extreme complexity requires that it be used only by those who have a firm grasp of WXS.
Borrowing a concept from OOP languages like C# and Java, both element declarations and complex type definitions can be made abstract. An abstract element declaration cannot be used to validate an element in an XML instance document and can only appear in content models via substitution. An abstract complex type definition similarly cannot be used to validate an element in an XML instance document, but it can be used as the abstract parent of an element's derived type or in cases where the element's type is overridden in the instance using xsi:type.
Abstract complex types and element declarations are useful for creating
generic base types which contain information common to a set of types
(such as Shape vs. Circle or Square), yet the definition is
not deemed "complete" unless further derivation (extension or restriction)
has been applied. While this feature is not complicated to use, some
implications of its use are subtle and complex. Abstract types should be
used with care.
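A minimal sketch of the Shape and Circle case follows. The element names shape, circle, and drawing, and the color and radius members, are assumptions chosen for illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- abstract base type: holds what all shapes share -->
  <xs:complexType name="Shape" abstract="true">
    <xs:attribute name="color" type="xs:string"/>
  </xs:complexType>
  <!-- concrete type derived by extension -->
  <xs:complexType name="Circle">
    <xs:complexContent>
      <xs:extension base="Shape">
        <xs:sequence>
          <xs:element name="radius" type="xs:decimal"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <!-- abstract element: may only appear via substitution -->
  <xs:element name="shape" type="Shape" abstract="true"/>
  <xs:element name="circle" type="Circle" substitutionGroup="shape"/>
  <!-- content model refers to the abstract head element -->
  <xs:element name="drawing">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="shape" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

With these declarations a drawing may contain circle elements but never a literal shape element. If shape were declared without abstract="true", it could appear in an instance only with an xsi:type attribute naming a concrete derived type such as Circle, because its declared type is still abstract.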
WXS provides the wildcards
xs:any and xs:anyAttribute, which can be used to
allow elements and attributes from specified namespaces
to occur in a content model. Wildcards allow schema authors to enable
extensibility of the content model while maintaining a degree of control
over the occurrence of elements and attributes. A good discussion of the
benefits of using wildcards is available in an XML.com article, "W3C XML
Schema Design Patterns: Dealing With Change".
Cautious schema authors, concerned with the problems posed by type
derivation, may choose to block attempts at type derivation using the
final attribute on complex type definitions and element
declarations (similar to sealed in C# and final
in Java). They may then choose to allow extensibility at specific parts of
the content model by using wildcards. This gives schema authors more
control over the content models they define and may reduce some of the
problems with various aspects of complex type derivation (specifically
derivation by extension).
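The combination described above might look like the following sketch. The Book type, its child elements, and the namespace URI are assumptions made for illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.com/books"
           elementFormDefault="qualified">
  <!-- final="#all" blocks derivation by extension and by restriction -->
  <xs:complexType name="Book" final="#all">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <xs:element name="author" type="xs:string"/>
      <!-- explicit extension point: elements from other namespaces only -->
      <xs:any namespace="##other" processContents="lax"
              minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <!-- attributes from other namespaces are also allowed -->
    <xs:anyAttribute namespace="##other" processContents="lax"/>
  </xs:complexType>
</xs:schema>
```

Here nobody can derive a new type from Book, but instance authors can still append elements and attributes from their own namespaces at the one place the schema author has chosen. With processContents="lax", those additions are validated only when a declaration for them is available.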
It should be noted that wildcards, if used improperly, can introduce non-determinism that violates the Unique Particle Attribution rule. The following schema causes such a problem.
<?xml version="1.0" encoding="utf-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.example.com/fruit/"
elementFormDefault="qualified">
<xs:complexType name="myKitchen">
<xs:choice maxOccurs="unbounded">
<xs:any processContents="skip" />
<xs:element name="apple" type="xs:string"/>
<xs:element name="cherry" type="xs:string"/>
</xs:choice>
</xs:complexType>
</xs:schema>
The content model of the myKitchen type is such that it
can contain one or more apple, cherry, or any
other elements. However, during validation, if an apple
element is seen, the schema processor cannot tell whether it should be
validated against the wildcard or the apple element declaration.
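One common way to remove this ambiguity, sketched here as a variation on the myKitchen schema above, is to limit the wildcard to namespaces other than the target namespace. The apple and cherry elements are qualified with the target namespace, so they can no longer match the wildcard.

```xml
<?xml version="1.0" encoding="utf-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.com/fruit/"
           elementFormDefault="qualified">
  <xs:complexType name="myKitchen">
    <xs:choice maxOccurs="unbounded">
      <!-- ##other: any namespace except http://www.example.com/fruit/ -->
      <xs:any namespace="##other" processContents="skip" />
      <xs:element name="apple" type="xs:string"/>
      <xs:element name="cherry" type="xs:string"/>
    </xs:choice>
  </xs:complexType>
</xs:schema>
```

The trade-off is that unknown elements from the fruit namespace itself are no longer accepted; only foreign-namespace elements pass through the wildcard.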
There are subtle but potentially profound ramifications to the
selection of both the namespace attribute and the
processContents attribute. Overly restrictive values can
impede extensibility; overly loose values can open the schema up to
abuse. Controlling the supported namespaces for a wildcard can also be
bewildering, especially when the set of allowable namespaces is subject to
change.
Redefinition is a feature of WXS that allows you to change the meaning of an included type or group definition. Using xs:redefine, schema authors can include type or group definitions from schema documents and alter these definitions in a pervasive manner. Redefinition is pervasive because it not only affects type or group definitions in the including schema but also those in the included schema as well. Thus all references to the original type or group in both schemas refer to the redefined type, while the original definition is overshadowed. This leads to the problems pointed out in "W3C XML Schema Design Patterns: Dealing With Change":
This causes a certain degree of fragility because redefined types can adversely interact with derived types and generate conflicts. A common conflict is when a derived type uses extension to add an element or attribute to a type's content model, and a redefinition also adds a similarly named element or attribute to the content model.
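As a sketch of what such a redefinition looks like, assume a schema document address.xsd (a hypothetical file with no target namespace) that defines a complex type named Address with a sequence content model. A schema that redefines it could be written as follows.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- address.xsd and the Address type are assumed for illustration -->
  <xs:redefine schemaLocation="address.xsd">
    <!-- the redefined type must derive from itself -->
    <xs:complexType name="Address">
      <xs:complexContent>
        <xs:extension base="Address">
          <xs:sequence>
            <xs:element name="country" type="xs:string"/>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:redefine>
</xs:schema>
```

Every reference to Address, including those inside address.xsd, now sees the country element. If address.xsd also contained a type derived from Address by extension that added its own country element, the two additions would collide in the way the quotation describes.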
A major problem with type redefinition is that, unlike type derivation,
it cannot be prevented by using the block or
final attributes. Any schema can therefore have its types
redefined in a pervasive manner, altering their semantics
completely. It is advisable to avoid this feature due to the potential
conflicts it can cause.
Many schema authors attempt to use type redefinition to enlarge the value space of an enumeration, but this does not work. The only way to increase the number of values accepted by an enumeration used as a base type is to create a union. However, those additional values are available only where the resulting union type is used, not where the original base type is used. Also note that chained redefinitions (redefining a redefine) can be problematic, resulting in unexpected definition clashes.
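The union approach can be sketched as follows. The type names Fruit and ExtendedFruit and the enumeration values are assumptions made for illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- original enumeration -->
  <xs:simpleType name="Fruit">
    <xs:restriction base="xs:string">
      <xs:enumeration value="apple"/>
      <xs:enumeration value="cherry"/>
    </xs:restriction>
  </xs:simpleType>
  <!-- union of the original values and some additional ones -->
  <xs:simpleType name="ExtendedFruit">
    <xs:union memberTypes="Fruit">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="mango"/>
          <xs:enumeration value="papaya"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:union>
  </xs:simpleType>
</xs:schema>
```

An element or attribute declared with type ExtendedFruit accepts all four values, while anything still declared with type Fruit continues to accept only apple and cherry.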
The WXS recommendation is a complex specification because it attempts to solve complex problems. One can reduce its burdens by utilizing its simpler aspects. Schema authors should ensure that their schemas validate in multiple schema processors. Schemas are an important facilitator of interoperability. It's foolish to depend on the nuances of a specific implementation and inadvertently give up this interoperability.
I'd like to thank Priya Lakshminarayanan and Mark Feblowitz for their help with this article.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.