Understanding W3C Schema Complex Types
Are W3C XML Schema complex types so difficult to understand that you shouldn't even bother trying? Kohsuke Kawaguchi thinks so; or so he claimed in his recent XML.com article, in which he offered assurances that you can write complex types without understanding them.
My response to that assertion is to ask why you would want to write complex types without understanding them, especially when they are easily understandable. There are four things you need to know in order to understand complex types in W3C Schemas. These four things are easy to understand. See what you think about Kawaguchi's argument after learning them.
One of the most important, but least emphasized, aspects of W3C schemas is the type hierarchy. The importance of the type hierarchy can hardly be overstated. Why? Because the syntax for expressing types in schemas follows precisely from the type hierarchy.
XML Schema Part 2: Datatypes, Section 3 contains a helpful graphic that explains the schema type hierarchy.
Schema types form a hierarchy because they all derive, directly or
indirectly, from the root type. The root type is anyType. (You can
actually use anyType in an element declaration; it allows any content
whatsoever.) The type hierarchy first branches into two groups: simple
types and complex types. Here we encounter the first two of the four
things you need to know in order to understand complex types: first,
derivation is the basis of connection between types in the type
hierarchy; and, second, the initial branching of the hierarchy is into
simple and complex types.
To derive a type means to take an existing type (called the "base") and modify it in some way so as to produce a new type. There are four kinds of derivation: restriction, extension, list, and union. This discussion looks at derivation by restriction and extension since they are the most commonly used.
Derivation by restriction takes an existing type as the base and creates a new type by limiting its allowed content to a subset of that allowed by the base type. Derivation by extension takes an existing type as the base and creates a new type by adding to its allowed content.
Simple types and complex types differ in this way: simple types cannot have element children or attributes; complex types may have element children and attributes.
Tracing the type hierarchy down the branch of simple types, we see
that the first simple type is anySimpleType, which is also a type that
you could actually use. W3C XML Schemas has 44 built-in simple types,
each of which derives from anySimpleType, and all but three of which
derive by restriction. Not a single one derives by extension. To
extend a simple type would mean to add element children or an
attribute. This contradicts the definition of simple type in W3C
Schemas and is thus prohibited.
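The built-in types that do not derive by restriction derive by list. As a rough sketch of what the two derivation kinds not covered in this article look like, here are a list type and a union type. The type names are hypothetical, and the fragment assumes, like the other fragments in this article, that the XML Schema namespace is the default namespace.

```xml
<!-- Hypothetical type names, for illustration only. -->

<!-- Derivation by list: the value is a whitespace-separated
     sequence of integers, for example "90 85 77". -->
<simpleType name="scoreListType">
  <list itemType="integer" />
</simpleType>

<!-- Derivation by union: the value must be valid against at least
     one member type, for example "42" or "2001-08-22". -->
<simpleType name="numberOrDateType">
  <union memberTypes="integer date" />
</simpleType>
```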
Thinking in terms of the type hierarchy, it ought to be relatively
straightforward to derive a new simple type, "myNameType", that
restricts its base type, "string", to a specific, fixed subset, "Don
Smith". The W3C XML Schema fragment for expressing myNameType is
<simpleType name="myNameType">
  <restriction base="string">
    <enumeration value="Don Smith" />
  </restriction>
</simpleType>
As you can see, the XML Schema for this type definition follows the type hierarchy exactly -- except for the enumeration element, one of the twelve facets that can be used to qualify types. We won't look at facets here since our concern is with the relation between the type hierarchy and the Schema syntax for expressing types. I simply used this one to complete the example. (See Table B1.a.Simple Types and Applicable Facets in Schema Part 0: Primer for a convenient list of facets.)
Now I just have to associate myNameType with an element, and then I
can use the type in an XML document. So the declaration
<element name="employee" type="dc:myNameType" />
lets me use <dc:employee>Don Smith</dc:employee> in an XML document
instance.
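The fragments in this article leave out the enclosing schema element. A minimal sketch of a complete schema document that would make the "dc" prefix resolve might look like the following. The namespace URI http://example.com/dc is a made-up placeholder, not one the article specifies, and making the XML Schema namespace the default is an assumption inferred from the unprefixed element names in the fragments.

```xml
<?xml version="1.0"?>
<!-- Assumption: http://example.com/dc is a placeholder target namespace. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:dc="http://example.com/dc"
        targetNamespace="http://example.com/dc">

  <!-- The simple type from the article: a string restricted to one value. -->
  <simpleType name="myNameType">
    <restriction base="string">
      <enumeration value="Don Smith" />
    </restriction>
  </simpleType>

  <!-- A global element declaration that uses the type. -->
  <element name="employee" type="dc:myNameType" />
</schema>
```

Under those assumptions, an instance document binds the same prefix to the same namespace:

```xml
<dc:employee xmlns:dc="http://example.com/dc">Don Smith</dc:employee>
```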
Does the syntax for complex types also follow the logic of the type hierarchy? Yes. But the type hierarchy diagram doesn't help us at this point because it doesn't provide two crucial pieces of information about complex types. However, once you understand these two points, complex types lose their indecipherable complexity and become quite intelligible.
Complex types are divided into two groups: those with simple content and those with complex content. And that leads us to the third thing you need to know in order to understand complex types: while both forms of complex type allow attributes, only those with complex content allow child elements; those with simple content only allow character content.
In other words, the difference between complex types with simple content and complex types with complex content is that the former do not allow element children while the latter do. That's it. The two forms of complex type are represented in what I call the Schema Type Decision Tree (PDF) under the complex type branch.
Let's suppose I want to add an attribute to myNameType. Adding an
attribute to a simple type always moves it into the complex type
branch of the type hierarchy. Once on the complex type branch, I must
ask a second question. Do I want the new type to allow element
content? If I don't, then my new type must be a complex type with
simple content. After that, I simply take dc:myNameType as the base
type and extend it by adding an attribute:
<complexType name="myNewNameType">
  <simpleContent>
    <extension base="dc:myNameType">
      <attribute name="position" type="string" />
    </extension>
  </simpleContent>
</complexType>
Now, after declaring my element "employee" to be of type
myNewNameType, I can have
<dc:employee position="trainer">Don Smith</dc:employee> in my XML
document.
It may seem odd that adding an attribute to a simple type requires the creation of a new complex type, one that has simple content to boot. But that's the logic of the type hierarchy: a type that has attributes must be a complex type, and that type can either allow element children or not. Perhaps an odd logic, but it is intelligible.
Let's suppose now that I want my complex type to have child elements. That requires a complex type with complex content. So I simply add my content model and attributes (if any). That's easy. But maybe too easy. We must be careful not to skip over a crucial fact that makes a big difference.
Adding a content model is still a derivation of a new type from
some base type. If I do not take an existing complex type as the base
for the new derivation, what will I use for a base type? I'll use
anyType. The vast majority of types that allow element content are
restrictions of anyType. For example,
<complexType name="myNewNameType">
  <complexContent>
    <restriction base="anyType">
      <sequence>
        <element name="name" type="string" />
        <element name="location" type="string" />
      </sequence>
      <attribute name="position" type="string" />
    </restriction>
  </complexContent>
</complexType>
<element name="employee" type="dc:myNewNameType" />
The type associated with "employee" now has an element named "name" followed by an element named "location". Further, "employee" can have an attribute named "position":
<dc:employee position="trainer">
  <dc:name>Don Smith</dc:name>
  <dc:location>Dallas, TX</dc:location>
</dc:employee>
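One detail the fragment leaves implicit: "name" and "location" are local element declarations, and local elements are unqualified by default. For the prefixed children in the instance above to validate, the enclosing schema element would need elementFormDefault="qualified". A minimal sketch of such a wrapper follows; the namespace URI http://example.com/dc is a made-up placeholder, not one given in the article.

```xml
<?xml version="1.0"?>
<!-- Assumption: http://example.com/dc is a placeholder target namespace. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:dc="http://example.com/dc"
        targetNamespace="http://example.com/dc"
        elementFormDefault="qualified">

  <!-- The complexType and element declarations shown above go here.
       With elementFormDefault="qualified", the locally declared "name"
       and "location" elements belong to the target namespace, so they
       appear as dc:name and dc:location in an instance. -->

</schema>
```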
The logic behind the syntax is straightforward. I want a type that
allows child elements. That requires a complex type with complex
content, while still deriving a new type from a base type. In this
case I'm restricting
anyType; I could as easily extend another type. I
add my content model and an attribute declaration. I'm done, and it
was all pretty easy.
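To illustrate the "extend another type" alternative, here is a sketch that takes the complex-content version of myNewNameType as its base and adds to it. The names "myManagerType", "department", and "level" are hypothetical and do not come from the article.

```xml
<!-- Hypothetical extension of the complex-content myNewNameType. -->
<complexType name="myManagerType">
  <complexContent>
    <extension base="dc:myNewNameType">
      <!-- Appended after the base type's content model, so the
           effective sequence is name, location, department. -->
      <sequence>
        <element name="department" type="string" />
      </sequence>
      <!-- Added alongside the inherited "position" attribute. -->
      <attribute name="level" type="string" />
    </extension>
  </complexContent>
</complexType>
```

Extension of complex content always appends: the new elements follow the base type's elements, and the base type's attributes are inherited unchanged.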
But can't this be expressed more concisely? Yes, it can. There is
an abbreviated form for all complex type definitions that have complex
content and restrict
anyType. You simply leave out the
<complexContent> and <restriction base="anyType"> elements:
<complexType name="myNewNameType"> <sequence> <element name="name" type="string" /> <element name="location" type="string" /> </sequence> <attribute name="position" type="string" /> </complexType>
This type definition is equivalent to the previous one. And that
leads us to the fourth thing you need to know in order to understand
complex types: the default syntax for complex types is complex
content that restricts anyType.
Why didn't I show you the abbreviated syntax first? Because the abbreviation obscures the logic behind the default syntax. If all you see is <complexType> followed by a content model, it's totally confusing as to why complex types sometimes have <complexContent> or <simpleContent> child elements or, often, neither.
Now that you know the logic behind the two forms of complex type, you won't be confused when you see a complex type that has neither <complexContent> nor <simpleContent>. You know what the default is.
Writing type definitions for empty elements turns out to be counter-intuitive, but, fortunately, the logic behind the complex type syntax still holds. Remember that an empty element is one that has neither data content nor child elements. It may have an attribute. Let's take the case of an empty element that doesn't have an attribute.
Your first inclination might be to associate the empty element with a simple type. But that won't work since simple types allow data content. So it must be a complex type. Then, ask yourself the next question. Will it allow element children? No. We need a <complexType> with <simpleContent>, right?
Wrong. Complex types with simple content also allow data content, and we want an empty element. That leaves us with <complexType> with <complexContent>, which ensures that there will not be any data content in the element. But we don't want child elements, either, and a complex type with complex content allows child elements. The key is that it doesn't require them. What do we do? Simply leave the content model out of the type definition:
<complexType name="processingHook">
  <complexContent>
    <restriction base="anyType">
    </restriction>
  </complexContent>
</complexType>
<element name="callMyApp" type="dc:processingHook" />
Our type definition, now associated with the element "callMyApp", allows the markup <callMyApp/> to occur in my XML document instance.
Now apply the default syntax for complex types to this type definition. A definition equivalent to the one above is
<complexType name="processingHook"> </complexType>
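The other case mentioned earlier, an empty element that does carry an attribute, follows the same logic: keep the content model empty and add only the attribute declaration. The sketch below uses the abbreviated syntax; the names "processingHookWithTarget", "callOtherApp", and "target" are hypothetical.

```xml
<!-- Hypothetical names. No content model, so no children and no
     character data are allowed; only the attribute is permitted. -->
<complexType name="processingHookWithTarget">
  <attribute name="target" type="string" />
</complexType>
<element name="callOtherApp" type="dc:processingHookWithTarget" />
```

With this declaration, markup such as <dc:callOtherApp target="indexer"/> would be allowed in an instance document.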
It's no wonder that people get confused about complex types. They generally don't realize that all complex types are divisible into two kinds: those with simple content and those with complex content. They don't realize this because they normally learn the abbreviated syntax first. But, as we've seen, if you learn the full syntax and the logic behind it first, then the abbreviated syntax, and complex types in general, cease to be a befuddling conundrum.
If all of this is now as clear to you as it is to me, you don't have to trust anyone's assurances that you should use complex types without understanding them. You can now use and understand them.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.