W3C XML Schema Made Simple

June 6, 2001

Kohsuke Kawaguchi

Overview

It's easy to learn and use W3C XML Schema once you know how to avoid the pitfalls. You should at least learn the following things.

  • Do use element declarations, attribute groups, model groups, and simple types.
  • Do use XML namespaces as much as possible. Learn the correct way to use them.
  • Do not try to be a master of XML Schema. It would take months.
  • Do not use complex types (why?), attribute declarations (why?), or notations (why?).
  • Do not use local declarations (why?).
  • Do not use substitution groups (why?).
  • Do not use a schema without the targetNamespace attribute, also known as a chameleon schema (why?).

You won't lose anything by following these guidelines, as the rest of this article demonstrates.

Too long to remember? Then try the one-line version:

Consider W3C XML Schema as DTD + datatype + namespace

The rest of this article justifies these recommendations. At times it gets a bit hairy, so if you're willing to take my word for it, you can stop reading now.

Motivation for this Article

Several similar documents on XML Schema are already available. I discovered, however, that they're written by brilliant people who always drive things to the limit. They simply can't stop inventing cool tricks that even working group members can't imagine, and XML Schema is their new favorite toy.

This document is for those who want to use W3C XML Schema for business, and for those who are at a loss as to how to use it. The goal is to provide a set of solid guidelines about what you should and shouldn't do.

Why You Should Avoid Complex Types

If you don't know what a complex type is, then don't let it trouble you. Whatever small gain this functionality offers is vastly outweighed by its complexity. Furthermore, you won't lose anything by not using complex types: if a schema can be written by using complex types, then you can always write it without complex types. To be precise, you can always write it without understanding complex types, but unfortunately you have to type <complexType> elements.

Just consider a <complexType> as something you have to write as a sole child of the <element> element. That is, you write element declarations as follows:

<xs:element name="head">
  <xs:complexType>   <!-- consider this as a place holder -->

    <!-- define content model by using model groups. -->
    ...
    <!-- then refer to attribute groups -->
    <xs:attributeGroup ref="head.attributes" />

  </xs:complexType>
</xs:element>

Why spend your precious time learning something you don't need? Convinced? Then there is no need to read more.

In short, a complex type is a model group, plus inheritance, minus ease of use. A complex type and a model group are similar in the sense that they are used to define content models. A complex type lacks ease of use because you can't use it from other complex types or model groups. On the other hand, model groups can be used without such restriction.
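
To illustrate the reuse that model groups allow, here is a minimal sketch. The group names name.model and person.model, the email element, and the example.com namespace are assumptions made for this illustration only. One named group is referenced from another, and the element declaration uses <complexType> purely as the placeholder described above.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://example.com"
           targetNamespace="http://example.com">

  <!-- a named model group: a reusable piece of content model -->
  <xs:group name="name.model">
    <xs:sequence>
      <xs:element ref="familyName" />
      <xs:element ref="firstName" />
    </xs:sequence>
  </xs:group>

  <!-- a second model group that builds on the first one -->
  <xs:group name="person.model">
    <xs:sequence>
      <xs:group ref="name.model" />
      <xs:element ref="email" minOccurs="0" />
    </xs:sequence>
  </xs:group>

  <!-- the complexType is only a placeholder around the group reference -->
  <xs:element name="person">
    <xs:complexType>
      <xs:group ref="person.model" />
    </xs:complexType>
  </xs:element>

  <xs:element name="familyName" type="xs:string" />
  <xs:element name="firstName" type="xs:string" />
  <xs:element name="email" type="xs:string" />
</xs:schema>
```

The default namespace declaration on the schema element is what lets the unprefixed ref values resolve to declarations in the target namespace.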

Inheritance

Inheritance is a complex type's only advantage, but you really don't want to use it. There are two kinds of inheritance: extension and restriction.

Extension allows you to append additional elements after the content model of the base type. The following model group reproduces the semantics of the extension, showing that you don't need a complex type to do this.

<xs:group name="extendedType">
  <xs:sequence>
    <xs:group ref="baseType"/>

    <!-- append things that you want -->
    ....
  </xs:sequence>
</xs:group>

Restriction allows you to restrict the content model of the base type. But even if you use this functionality, you still have to write the whole content model of the new type. Basically you type the same thing whether you use a complex type or a model group.

What do you get by using the restriction? Error checking. That's it. Validators are supposed to report an error if you fail to make a content model a restricted one. Unfortunately, this is hardly an advantage.

First, strictly enforcing this check is a difficult job for validators. You can look at the part of the spec that defines this constraint. The entire section 3.9.6 is devoted to specifying what is allowed and what is not. There's a strong temptation for developers to skip the enforcement of this constraint because most people won't notice that the check is skipped. At the time of this writing, no validators are known to strictly enforce this constraint.

It's unlikely that your validator is even capable of fully enforcing this constraint, which removes the only advantage of restriction.

Second, even if you write the restriction correctly, you may get an error from your validator. Consider the following example:

Base type:

<xs:all>
  <xs:element name="a" />
  <xs:element name="b" />
  <xs:element name="c" minOccurs="0" />
</xs:all>



New type derived by restriction:

<xs:all>
  <xs:element name="b" />
  <xs:element name="a" />
</xs:all>

The latter looks like a proper restriction of the former. In fact, every content model that is accepted by the new type is also accepted by the base type. But W3C XML Schema prohibits this. Specifically, this derivation violates "schema component constraint: particle derivation OK (all:all, sequence:sequence -- recurse)". This is just the tip of the iceberg. If you are interested in this issue, consult the last page of MSL.
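
For context, the two fragments above might sit inside full type definitions roughly like this. It is only a sketch: the type names baseType and restrictedType, the prefix t, and the example.com namespace are assumptions, not taken from the spec. According to the constraint quoted above, a strict validator would be expected to reject restrictedType, although many processors may not report it.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:t="http://example.com"
           targetNamespace="http://example.com">

  <!-- base type: a and b are required, c is optional, in any order -->
  <xs:complexType name="baseType">
    <xs:all>
      <xs:element name="a" />
      <xs:element name="b" />
      <xs:element name="c" minOccurs="0" />
    </xs:all>
  </xs:complexType>

  <!-- attempted restriction: c is dropped and b is listed before a -->
  <xs:complexType name="restrictedType">
    <xs:complexContent>
      <xs:restriction base="t:baseType">
        <xs:all>
          <xs:element name="b" />
          <xs:element name="a" />
        </xs:all>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>
```

Note that the whole content model has to be written out again inside the restriction, which is the extra typing described earlier.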

None of these problems occur if you use model groups instead of complex types. When it comes to derivation by restriction, a general understanding isn't enough; you need a very detailed understanding of how it works.

Why You Should Avoid Attribute Declarations

To be precise, what you should avoid is global attribute declarations, not local attribute declarations. The following is an example of a global attribute declaration.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      xmlns="http://example.com"
      targetNamespace="http://example.com">
  <!-- attribute whose name is foo -->
  <xs:attribute name="foo" type="xs:float" />

  <xs:element name="root">
    <xs:complexType>
      <xs:attribute ref="foo" />
    </xs:complexType>
  </xs:element>
</xs:schema>

This schema does not accept the following instance.

<root xmlns="http://example.com" foo="5.12"/>

Rather, it accepts the following instance, which likely isn't what you want:

<root xmlns="http://example.com"
      ns:foo="5.12" xmlns:ns="http://example.com" />

Attribute groups do not have this problem. So instead of using an attribute declaration, you should use an attribute group.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      xmlns="http://example.com"
      targetNamespace="http://example.com">
  <xs:attributeGroup name="root.attributes">
    <!-- attribute whose name is foo -->
    <xs:attribute name="foo" type="xs:float" />
  </xs:attributeGroup>

  <xs:element name="root">
    <xs:complexType>
      <!-- content model -->
      ...
      <xs:attributeGroup ref="root.attributes" />
    </xs:complexType>
  </xs:element>
</xs:schema>

An attribute group can refer to other attribute groups. In this way, you can write common attributes in one attribute group, then refer to it from others.
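
A minimal sketch of that layering follows. The group name common.attributes and its id and lang attributes are hypothetical, chosen only for illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://example.com"
           targetNamespace="http://example.com">

  <!-- attributes shared by many elements -->
  <xs:attributeGroup name="common.attributes">
    <xs:attribute name="id" type="xs:ID" />
    <xs:attribute name="lang" type="xs:language" />
  </xs:attributeGroup>

  <!-- attributes of root: the common ones plus foo -->
  <xs:attributeGroup name="root.attributes">
    <xs:attributeGroup ref="common.attributes" />
    <xs:attribute name="foo" type="xs:float" />
  </xs:attributeGroup>

  <xs:element name="root">
    <xs:complexType>
      <xs:attributeGroup ref="root.attributes" />
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Under these assumptions an instance such as <root xmlns="http://example.com" id="r1" foo="5.12"/> should be accepted, because attributes declared inside an attribute group stay unqualified by default.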

Why You Should Avoid Notation Declarations

If you haven't heard about notations, please be assured that you aren't missing anything. Notations exist only for backward compatibility. There is no need to learn about them.

If you do know notations, you should know that notations in W3C XML Schema are not compatible with notations in DTDs, because a Schema notation is a QName.

The following example is from the spec.

<xs:notation name="jpeg"
             public="image/jpeg" system="viewer.exe" />

<xs:element name="picture">
 <xs:complexType>
  <xs:simpleContent>
   <xs:extension base="xs:hexBinary">
    <xs:attribute name="pictype">
     <xs:simpleType>
      <xs:restriction base="xs:NOTATION">
       <xs:enumeration value="jpeg"/>
       <xs:enumeration value="png"/>
        ...
      </xs:restriction>
     </xs:simpleType>
    </xs:attribute>
   </xs:extension>
  </xs:simpleContent>
 </xs:complexType>
</xs:element>

<picture pictype="jpeg"> ... </picture>

This example is okay. But the following fragment is not valid even if the prefix "pic" is properly declared.

<pic:picture pictype="jpeg"> ... </pic:picture>

Confused? You have to write it as follows because it's a QName.

<pic:picture pictype="pic:jpeg"> ... </pic:picture>

So a Schema notation fails to provide the backward compatibility that is its only reason for existing. There's really no reason to use notations. Notations are for SGML.

Why You Should Avoid Local Declarations

W3C XML Schema allows you to declare elements inside another element:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      targetNamespace="http://example.com">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="familyName" type="xs:string" />
        <xs:element name="firstName" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

But generally you should avoid this if possible, because the above schema does not match the following instance:

<person xmlns="http://example.com">
  <familyName> KAWAGUCHI </familyName>
  <firstName> Kohsuke </firstName>
</person>

Rather, you have to write it as

<foo:person xmlns:foo="http://example.com">
  <familyName> KAWAGUCHI </familyName>
  <firstName> Kohsuke </firstName>
</foo:person>

Not only does this require more typing, it is also a bad use of XML Namespaces. To avoid this problem, you should write

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      xmlns="http://example.com"
      targetNamespace="http://example.com">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="familyName" />
        <xs:element ref="firstName" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="familyName" type="xs:string" />
  <xs:element name="firstName" type="xs:string" />
</xs:schema>

Another way to solve the problem is to add elementFormDefault="qualified" to the schema element. You can then safely use local element declarations. It probably isn't worth the effort to understand exactly what this means. Just understand that it makes the schema behave in the "right" way.
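
As a sketch of that alternative, here is the person schema with local declarations kept and only the elementFormDefault attribute added.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.com"
           elementFormDefault="qualified">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <!-- local declarations, now placed in the target namespace -->
        <xs:element name="familyName" type="xs:string" />
        <xs:element name="firstName" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

With this setting, the instance that declares xmlns="http://example.com" on person and leaves its children unprefixed should validate, since familyName and firstName are then expected in the same namespace as person.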

Why You Should Avoid Substitution Groups

In short, substitution groups are too complex to be a practical mechanism. There are two main difficulties.

  • To write substitution groups correctly, you have to master the complex type, which is itself another beast that you should avoid.
  • It is hard to tell which elements are actually substitutable.

Simply put, a substitution group is another way to write a <choice>. So you can always use a <choice> instead of a substitution group; and <choice> is necessary anyway.
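
One way to get the same effect with a <choice> is to wrap it in a named model group and refer to that group wherever the alternatives are allowed. The sketch below assumes hypothetical element names (section, para, list, table) and a hypothetical group name block.class.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://example.com"
           targetNamespace="http://example.com">

  <!-- the full list of alternatives lives in one visible place -->
  <xs:group name="block.class">
    <xs:choice>
      <xs:element ref="para" />
      <xs:element ref="list" />
      <xs:element ref="table" />
    </xs:choice>
  </xs:group>

  <!-- a section holds any number of blocks, in any order -->
  <xs:element name="section">
    <xs:complexType>
      <xs:group ref="block.class" minOccurs="0" maxOccurs="unbounded" />
    </xs:complexType>
  </xs:element>

  <xs:element name="para" type="xs:string" />
  <xs:element name="list" type="xs:string" />
  <xs:element name="table" type="xs:string" />
</xs:schema>
```

Adding a new alternative means adding one line to block.class, and a reader can see every allowed element without hunting for substitutionGroup attributes across the schema.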

To use substitution groups properly, first you have to learn complex types, then several additional attributes, rules to use them, and finally the effect of using them. Even if you manage to get through this brave new world, your document authors still need to follow the same path all over again because otherwise they can't write documents properly. What a pity.

If you still want to use substitution groups, be warned that it's not as easy as you think.

First, the content model of substitution group members must be related to each other by type derivation. That means you cannot write content models freely. Soon you'll find yourself writing an abstract element as a substitution group head with a strange content model, just to maintain proper derivations between members. That's not right.

Second, attributes to control the substitution behavior are difficult to use and understand. There is an attribute called block, which is one of the attributes you use to control the substitution group. There is another attribute called final, which basically takes one of "extension", "restriction", or "#all" as its value.

final may look irrelevant to the substitution group, but the truth is that it's called "substitution group exclusions" internally and, as its name suggests, it controls the behavior of the substitution group. The internal name of the block attribute is "disallowed substitutions". Having trouble understanding the difference? Yeah, me too. Actually, both are used to control the substitution behavior, but in a different way.

The only way to prohibit the substitution of element Y by element Z is to add block="substitution" to Y. But even with the presence of this attribute, it is not an error to have Z in the substitution group of Y. It's just that you can never substitute Y with Z in your documents.

Even worse, if Y designates yet another element X as its substitution group head ( X <- Y <- Z ), then it's okay to substitute X with Z.
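
The X <- Y <- Z situation can be sketched as follows. The wrapper elements wantsX and wantsY, the xs:string types, and the namespace are assumptions added only to give the three elements somewhere to appear.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://example.com"
           targetNamespace="http://example.com">

  <xs:element name="X" type="xs:string" />

  <!-- Y may stand in for X; block="substitution" forbids replacing Y itself -->
  <xs:element name="Y" type="xs:string"
              substitutionGroup="X" block="substitution" />

  <!-- declaring Z as a member of Y's group is still not an error -->
  <xs:element name="Z" type="xs:string" substitutionGroup="Y" />

  <!-- per the rules described above, only Y itself is accepted here -->
  <xs:element name="wantsY">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Y" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- per the rules described above, X, Y, or Z is accepted here -->
  <xs:element name="wantsX">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="X" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Nothing in the declaration of wantsX hints that Z is allowed there, which is the second difficulty listed at the start of this section.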

All these things make it impractical to use a substitution group in the real world, although it may look harmless when you are experimenting. And that's why you should avoid it.

Why You Should Avoid Chameleon Schemas

W3C XML Schema allows a schema element without the targetNamespace attribute. Some people call such schemas chameleons. Why they are called that is irrelevant; what you should know is that you should avoid them.

One reason is that it's highly likely that validators will have interoperability problems here. Another reason is that some people like to invent cool tricks by using a chameleon schema. But don't be fooled by those tricks; they are for schema hackers, not for ordinary good citizens.

Unfortunately, if you want to know exactly why you should avoid them, then you have to learn what they are. Consider the following chameleon schema.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- note that targetNamespace attribute is absent. -->

  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="familyName" />
        <xs:element ref="firstName" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="familyName" type="xs:string"/>
  <xs:element name="firstName" type="xs:string"/>
</xs:schema>

Then you write another schema file and include the one above by using the include element.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://example.com"
           targetNamespace="http://example.com">

  <xs:include schemaLocation="above.xsd" />

  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="person" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

It seems OK, but actually it isn't. Look at the line <xs:element ref="familyName" /> in the chameleon schema. It looks like a reference to the familyName element, but it isn't. Since this chameleon schema is included by a schema with targetNamespace="http://example.com", the familyName element is in that namespace. So to refer to this declaration, you have to rewrite that line as

<xs:element ref="bp:familyName" xmlns:bp="http://example.com" />

Now what happens if you want to reuse this chameleon schema from a schema whose target namespace is http://www.foo.com? The answer: you can't.

As you can see, the sole merit of using the chameleon schema is gone.

Even worse, you can't detect this error in some validators because they think that those missing components may appear afterward.
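
If the goal is simply to share the person declarations between schemas, one alternative is to give the shared schema a target namespace of its own and pull it in with xs:import instead of xs:include. This is only a sketch under assumptions: the file names person.xsd and root.xsd, the namespace http://example.com/person, and the prefix p are all placeholders, and the two schema documents are shown one after the other.

```xml
<!-- person.xsd: the shared schema, with its own target namespace -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:p="http://example.com/person"
           targetNamespace="http://example.com/person">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="p:familyName" />
        <xs:element ref="p:firstName" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="familyName" type="xs:string" />
  <xs:element name="firstName" type="xs:string" />
</xs:schema>

<!-- root.xsd: a schema with a different target namespace imports it -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:p="http://example.com/person"
           targetNamespace="http://www.foo.com">
  <xs:import namespace="http://example.com/person"
             schemaLocation="person.xsd" />
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="p:person" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

The trade-off is that person and its children keep their own namespace in every instance rather than taking on the namespace of the schema that uses them, but every reference stays unambiguous.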

Conclusion

W3C XML Schema has many pitfalls. Avoiding them makes your life easier, because you'll have less to learn, and it costs you none of the expressiveness of W3C XML Schema. Keep it simple and have a happy life!