Profiling XML Schema
September 20, 2006
XML Schema is now five years old, having matured from a newborn into an active youngster. So what have we learned about this young one's personality? We've always known it was complex. Indeed, the original debate about whether to make it a Recommendation indicated concern. (See Last Word and Questionnaire.) This rich toolset has caused schema designers to wonder which features they should or should not use. If we analyze what people are actually implementing, perhaps we can glean some guidance. I decided to embark on a quest to see if we can put together a profile of XML Schema based on experiences thus far.
A profile is a set of agreed upon practices reflecting the most commonly accepted usage patterns of a given technology. A usage profile of XML Schema would indicate a series of features that are commonly implemented and supported in tools.
The concept of a profile of XML Schema has been debated over the years. In 2004, the Web Services Interoperability Consortium went so far as to charter a work group (PDF) to examine the idea of a formal profile for XML Schema. It led the W3C to hold a Workshop on XML Schema 1.0 User Experiences, where input from many sources was culled into a plan of action. The resultant work in the XML Schema Patterns for Databinding Working Group is ongoing, having recently issued a draft document.
In addition, many industry consortia have issued design guidelines and/or patterns for developing libraries of schemas according to their profile. Having read many of these, either formally or informally, I have found that they are often explicit about which features of XML Schema are allowed or disallowed. Indeed, tools for enforcing schema profiles have emerged. Schematron is often used to add constraints on top of the ones in a schema. Mindreef's SOAPScope Server has explicit support for creating customizable profiles of schema, offering a standard check box listing of constructs that are enforced in test cases. However, I wondered if there was a cross-industry usage profile.
Strengths and Weaknesses
The first stop in the search for a profile is with an analysis of schema itself. In fact, for some time now, we've known about the pluses and minuses of many schema design techniques, as put together by Roger Costello on his Best Practices website. He has done an excellent job of gathering commentary and opinion from implementers, and assembling it into a cohesive analysis of the benefits and drawbacks of many design criteria. Wise schema designers have referred to this site for years.
On a practical level, schema designers may also wonder what impact each feature of XML Schema has down the road when it comes time for implementation. Some constructs are ubiquitous and well supported in IDEs and other tools such as code generating software. However, there are some that are not commonly used and can cause problems when coders are confronted with a lack of tool support.
What Are Schema Designers Actually Doing?
In the next phase of my profile quest, I wanted to take what Costello has done a step further and ask: what features of XML Schema are folks actually using? Is there a consensus of opinion on the most common constructs? Are there features schema designers are avoiding? I accumulated data on over 1,400 schemas from numerous standards consortia to see if there is a common XML Schema profile reflecting a consensus of practice.
I focused on consortia schemas, thinking that these should reflect a group's consensus of design criteria and also have a disproportionate impact on the marketplace, as they are standard and implemented many times across a domain. These schemas are also all freely available.
I examined schemas from the following organizations:
- The Open Applications Group (OAGi)
- The Open Travel Alliance (OTA)
- Human Resources XML (HR-XML)
- Chemical Industry Data Exchange (CIDX)
- IMS Global Learning Consortium (IMS)
- Association for Retail Technology Standards (ARTS)
- Mortgage Industry Standards Maintenance Organization (MISMO)
- World Wide Web Consortium (W3C, including MathML)
- Global Justice XML
There are many other consortia that could, and with infinite time would, be added to this analysis.
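The kind of counting behind such a survey can be sketched in a few lines of Python. This is only an illustration, not the tooling actually used for this analysis: the `FEATURES` table, the `profile` function, and the two sample schemas are all hypothetical, and only a handful of features are detected. It tallies both how many schemas use a feature and how many times the feature occurs overall.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical detectors: feature name -> test applied to each element.
FEATURES = {
    "xsd:sequence": lambda e: e.tag == XSD + "sequence",
    "xsd:all": lambda e: e.tag == XSD + "all",
    "xsd:choice": lambda e: e.tag == XSD + "choice",
    "enumeration": lambda e: e.tag == XSD + "enumeration",
    "nillable": lambda e: e.get("nillable") == "true",
    "abstract": lambda e: e.get("abstract") == "true",
}

def profile(schemas):
    """Return (schemas_using, occurrences) Counters for a list of XSD strings."""
    schemas_using, occurrences = Counter(), Counter()
    for text in schemas:
        seen = Counter()
        for elem in ET.fromstring(text).iter():
            for name, test in FEATURES.items():
                if test(elem):
                    seen[name] += 1
        occurrences.update(seen)            # total hits across all schemas
        schemas_using.update(seen.keys())   # one vote per schema per feature
    return schemas_using, occurrences

# Two made-up schemas standing in for a downloaded library.
SAMPLES = [
    """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
         <xsd:element name="Order">
           <xsd:complexType>
             <xsd:sequence>
               <xsd:element name="Id" type="xsd:string"/>
               <xsd:element name="Note" type="xsd:string" nillable="true"/>
             </xsd:sequence>
           </xsd:complexType>
         </xsd:element>
       </xsd:schema>""",
    """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
         <xsd:simpleType name="Status">
           <xsd:restriction base="xsd:string">
             <xsd:enumeration value="open"/>
             <xsd:enumeration value="closed"/>
           </xsd:restriction>
         </xsd:simpleType>
         <xsd:complexType name="Pair">
           <xsd:sequence>
             <xsd:element name="A" type="xsd:string"/>
             <xsd:element name="B" type="xsd:string"/>
           </xsd:sequence>
         </xsd:complexType>
       </xsd:schema>""",
]

using, occ = profile(SAMPLES)
for name in FEATURES:
    print(f"{name}: in {using[name]} schema(s), {occ[name]} occurrence(s)")
```

Parsing the schema rather than searching its text keeps closing tags, comments, and documentation strings from inflating the counts.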
Many of the more mature tools have a high level of support for XML Schema features. In particular, IDEs that edit schemas have good support even for problematic features such as xsd:union elements. The problem with tool support comes in two forms. First, some have chosen not to support selected schema constructs. This amounts to a profile by design. Second, "best-of-breed" tools can offer support only for the most common constructs in their early releases. As a tool matures, support for additional features may be added. I've blogged about tool support of XML Schema and shown references to a few code-generation tools and their publicly available claims of support.
The Results: A Profile of XML Schema
There is a clear tendency toward simplicity in the usage patterns of the 1,414 schemas tested. Just six design features are used in at least one-third of the schemas, whereas 17 features occur in 10 percent or less of the test cases. Many of the constructs would be used only in very specific situations with a special need. In addition to simplicity, explicitness is a secondary pattern, reflected in the high level of namespace qualification, the lack of mixed content models and abstract types, and the overwhelming preference for
Features Avoided (Occurring in Less than 10 Percent of the Schemas)
These XML Schema design features were either used minimally or not at all in the schemas tested.
- xsd:all: The use of the xsd:all compositor. The clear preference is to use
- Finalizing: The use of the @finalDefault attributes. None of the tested schemas used these. The test cases seem to focus on enabling features rather than disabling ones such as these.
- substitutionGroup: Allows elements to be substituted for other elements. While this feature was not used in the schemas tested, it is a common extension mechanism. The lack of substitutionGroups is probably due to the nature of the test cases. Open, standard domain consortia schemas may not need this feature, although organizations implementing them may need substitutionGroups for extensibility.
- Uniqueness: The use of the unique element, which requires the contents to be distinctive within its scope. I found this interesting, as the need for unique IDs was common; however, most often simple strings were used for that data. Perhaps uniqueness is enforced in the business layer of a data transfer between systems.
- Qualified attributes: This is the use of attributeFormDefault="qualified". Almost none of the schema designers felt the need to do this, although the vast majority qualified the elements.
- Keys: The use
- Redefine: Using the redefine element to change the definition of an existing component. This feature is also among the least supported in tools, and it was consistently avoided.
- Nillable: The use of the @nillable attribute, allowing the use of xsi:nil in the instance to indicate that the contents have a null value.
- Block: The use of the @block attribute to disallow derivations.
- complexType restriction: Restricting the content model of a complexType. Years ago at HR-XML, we looked at this feature to enable us to have a generalized data type that is constrained depending on the context of its usage. However, we found restricting complexTypes to be cumbersome, verbose, and not well supported at that time.
- Abstract types: The use of abstract="true" on elements or types.
- Mixed: Setting mixed="true" combines data and child elements in one place. Schema designers have clearly separated these concepts into distinct types.
- Groups: The use of xsd:group is a way to define a group for later reuse. It may be that elements were simply referred to with "@ref" rather than put into groups.
- Fixed values: The @fixed attribute on elements, attributes, or
- No @targetNamespace attribute: Having no @targetNamespace may be used in late binding of schemas. However, the vast majority of test cases used it. I have almost always seen it as a requirement in schema design guides.
- No default namespace declared: Containing no "@xmlns" default namespace (no prefix). Again, this may be used in late binding. Default namespaces were consistently used in the test cases.
- Default namespace not equal to @targetNamespace (PDF): This occurs when the default namespace value does not match the @targetNamespace. Not only did the vast majority of schemas have both a default namespace and a @targetNamespace, but they were the same value. This reflects the tendency toward simplicity.
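Several of the items above are decided entirely on the root xsd:schema element, so they are cheap to check. The following is a rough sketch, assuming each schema is available as a string and declares its namespaces on the root; `namespace_profile` and the example.com sample schemas are hypothetical names used only for illustration. It relies on the "start-ns" events of ElementTree's `iterparse` to see the default namespace declaration, which the parsed tree itself does not retain.

```python
import io
import xml.etree.ElementTree as ET

def namespace_profile(xsd_text):
    """Summarize the namespace-related choices made on the schema root."""
    default_ns = None
    root_attrs = {}
    source = io.BytesIO(xsd_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            if prefix == "":            # xmlns="..." with no prefix
                default_ns = uri
        else:                           # first "start" event is the root element
            root_attrs = dict(payload.attrib)
            break
    target_ns = root_attrs.get("targetNamespace")
    return {
        "has_target_namespace": target_ns is not None,
        "has_default_namespace": default_ns is not None,
        "default_equals_target": default_ns is not None and default_ns == target_ns,
        "elements_qualified": root_attrs.get("elementFormDefault") == "qualified",
        "attributes_qualified": root_attrs.get("attributeFormDefault") == "qualified",
    }

# A made-up schema following the majority pattern described above.
TYPICAL = (
    '<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" '
    'xmlns="http://example.com/po" '
    'targetNamespace="http://example.com/po" '
    'elementFormDefault="qualified"/>'
)

# A made-up schema with neither a target nor a default namespace.
NO_NAMESPACE = '<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"/>'

for label, text in (("typical", TYPICAL), ("no namespace", NO_NAMESPACE)):
    print(label, namespace_profile(text))
```

The sketch stops at the root element, so a default namespace declared only further down in the schema would not be noticed.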
Features Used Frequently (Occurring in at Least One-Third of the Schemas)
These are the most commonly used XML Schema design features. Here again, simplicity rules. It is wise to begin schema design with this toolset.
- Namespace qualified elements: elementFormDefault="qualified" for explicitness of the element namespaces.
- xsd:sequence: The use of the xsd:sequence element. The most common compositor. I've recommended this over xsd:all because it leads to fewer ambiguous content models and its child elements occur in a predictable order.
- complexType extension: Creating a type that extends another is a key reusability and extensibility point.
- Anonymous types: These occur when types are created that are locally scoped and thus have no "@name" attribute. Very commonly done. Tools may prefer no anonymous types, but none I have tried was unable to accommodate them.
- simpleType restriction: Derivation of a simpleType by restricting its base type.
- Enumerations: Enumerated values were the most frequently used construct.
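To make the six features concrete, here is a small, made-up schema that uses all of them, wrapped in a Python check confirming that each one is present. The schema content, the example.com namespace, and the `frequent_features` helper are illustrative assumptions, not taken from any of the consortia libraries.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
NS = {"xsd": XSD_NS}

# Hypothetical schema exercising the six frequently used features.
SCHEMA = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="http://example.com/staff"
            targetNamespace="http://example.com/staff"
            elementFormDefault="qualified">
  <!-- simpleType restriction with enumerations -->
  <xsd:simpleType name="StatusType">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="active"/>
      <xsd:enumeration value="inactive"/>
    </xsd:restriction>
  </xsd:simpleType>
  <!-- named, reusable type built from a sequence -->
  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <xsd:element name="Name" type="xsd:string"/>
      <xsd:element name="Status" type="StatusType"/>
    </xsd:sequence>
  </xsd:complexType>
  <!-- complexType extension -->
  <xsd:complexType name="EmployeeType">
    <xsd:complexContent>
      <xsd:extension base="PersonType">
        <xsd:sequence>
          <xsd:element name="EmployeeId" type="xsd:string"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <!-- anonymous (unnamed, locally scoped) type -->
  <xsd:element name="Staff">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Employee" type="EmployeeType" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
"""

def frequent_features(xsd_text):
    """Report which of the six frequently used features a schema contains."""
    root = ET.fromstring(xsd_text)
    complex_types = root.iter("{%s}complexType" % XSD_NS)
    return {
        "qualified elements": root.get("elementFormDefault") == "qualified",
        "xsd:sequence": root.find(".//xsd:sequence", NS) is not None,
        "complexType extension": root.find(
            ".//xsd:complexType/xsd:complexContent/xsd:extension", NS) is not None,
        "anonymous types": any(t.get("name") is None for t in complex_types),
        "simpleType restriction": root.find(
            ".//xsd:simpleType/xsd:restriction", NS) is not None,
        "enumerations": root.find(".//xsd:enumeration", NS) is not None,
    }

for feature, present in frequent_features(SCHEMA).items():
    print(f"{feature}: {present}")
```

The check only looks at schema structure; it does not validate the schema or any instance document.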
Problems in the Middle
These XML Schema design features were commonly used but may have problematic tool support. It is a good idea to check with the tools you plan to use before adding these features to your design.
- attributeGroup: A grouping of attributes by name for reusability, similar to xsd:group. This feature may have shown up in the test cases more often than it is actually used. First, one organization used them heavily, obscuring the fact that the others tended to avoid them. Second, the analysis searched for "xsd:attributeGroup", which matches both declarations and reuse via "@ref". So the actual number of these may be much smaller. My experience with tools was that attributeGroups were not hard to support, but simply weren't the highest priority.
- xsd:choice: The use of the xsd:choice compositor. Some tool makers have expressed concern about this feature because it is not easily mapped to a programming construct. However, its usage is common.
- Default values: Declaring default values for data in the XML instance. I've blogged about default values before.
- xsd:union: The use of xsd:union to combine types in a declaration. I've found this is the least supported feature of XML Schema in tools.
- Pattern: Uses regular expressions to subset strings, among other things.
- Other facets: Includes facets other than pattern and enumeration, namely maxLength. Again, tool support varies.
- List types: The use of the xsd:list element. This feature occurred in only about 10 percent of the test cases. It is sometimes unsupported in tools and may be a cause of concern. I've heard many complaints from coders about writing code to parse and process list types. They much preferred the use of enumerations or separate data types.
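The overcounting caveat noted for attributeGroup can be illustrated with a short sketch that separates named definitions from "@ref" uses. The `declared_vs_referenced` function and the sample schema are hypothetical, and the text search shown for comparison is only an approximation of the one used in this analysis.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

def declared_vs_referenced(xsd_text, local_name="attributeGroup"):
    """Count named definitions and ref= uses of one kind of XSD component."""
    declared = referenced = 0
    for elem in ET.fromstring(xsd_text).iter(XSD + local_name):
        if elem.get("ref") is not None:
            referenced += 1
        elif elem.get("name") is not None:
            declared += 1
    return declared, referenced

# Made-up schema: one attributeGroup defined once and reused twice.
SAMPLE = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:attributeGroup name="AuditAttributes">
    <xsd:attribute name="createdBy" type="xsd:string"/>
    <xsd:attribute name="createdOn" type="xsd:date"/>
  </xsd:attributeGroup>
  <xsd:complexType name="OrderType">
    <xsd:attributeGroup ref="AuditAttributes"/>
  </xsd:complexType>
  <xsd:complexType name="InvoiceType">
    <xsd:attributeGroup ref="AuditAttributes"/>
  </xsd:complexType>
</xsd:schema>
"""

declared, referenced = declared_vs_referenced(SAMPLE)
text_hits = SAMPLE.count("<xsd:attributeGroup")  # naive search on opening tags

print(f"declared={declared} referenced={referenced} text_hits={text_hits}")
# prints "declared=1 referenced=2 text_hits=3"
```

Passing a different `local_name`, such as "group", applies the same split to other reusable components.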
A Note on Wildcards
These were in the middle of the list; however, I suspect that they are actually used more frequently. Some of these consortia create a single wildcard extension element that is simply referred to (with "@ref") as needed. So the actual number of wildcard elements is lower than the usage of those elements.
Examining what schema designers are actually implementing can indeed reveal a usage profile of XML Schema. It is in this profile of practice that our five-year-old's personality emerges. The clearest message is one of simplicity. The most commonly used constructs involve merely creating reusable types, assembling them into sequences of elements, and augmenting them with enumerations. Many of the more complex features went unused. In addition, the test cases also reflected explicitness in their schemas, as evidenced in the avoidance of mixing or abstracting content and the qualifying of element form defaults. Adhering to the design patterns reflected in this usage profile will serve schema designers well.
Appendix: The Data
The data in these tables indicate the results of my research. The schemas were all downloaded in early September 2006 from their respective websites (many of them are listed here). Figure 1 is a summary, Figure 2 indicates how many schemas contain the XML Schema design feature listed, and Figure 3 shows the number of times the feature occurred.
Figure 1. Summary of data
A few duplicative schemas were removed from the analysis, such as the schema for schemas (XMLSchema.xsd), which was commonly distributed with many libraries. ACORD also offers no-namespace equivalents of their schemas. For this analysis, the namespaced versions were used. In both the HR-XML and OAGi test files, the developer or "non-standalone"
versions of the schemas were analyzed. While there are no
the OAGi schemas, the global element design is intended to enable substitutions as an extension point. The W3C list of schemas includes MathML.