Profiling XML Schema

September 20, 2006

XML Schema is now 5 years old, having matured from a newborn into an active youngster. So what have we learned about this young one's personality? We've always known it was complex. Indeed, the original debate about whether to make it a Recommendation indicated concern. (See Last Word and Questionnaire.) This rich toolset has caused schema designers to wonder which features they should or should not use. If we analyze what people are actually implementing, perhaps we can glean some guidance. I decided to embark on a quest to see if we can put together a profile of XML Schema based on experiences thus far.

Background

A profile is a set of agreed upon practices reflecting the most commonly accepted usage patterns of a given technology. A usage profile of XML Schema would indicate a series of features that are commonly implemented and supported in tools.

The concept of a profile of XML Schema has been debated over the years. In 2004, the Web Services Interoperability Consortium went so far as to charter a work group (PDF) to examine the idea of a formal profile for XML Schema. It led the W3C to hold a Workshop on XML Schema 1.0 User Experiences, where input from many sources was culled into a plan of action. The resultant work in the XML Schema Patterns for Databinding Working Group is ongoing, having recently issued a draft document.

In addition, many industry consortia have issued design guidelines and/or patterns for developing libraries of schemas according to their profile. Having read many of these, either formally or informally, I have found that they are often explicit about which features of XML Schema are allowed or disallowed. Indeed, tools for enforcing schema profiles have emerged. Schematron is often used to add constraints on top of the ones in a schema. Mindreef's SOAPScope Server has explicit support for creating customizable profiles of schema, offering a standard check box listing of constructs that are enforced in test cases. However, I wondered if there was a cross-industry usage profile.

Strengths and Weaknesses

The first stop in the search for a profile is an analysis of XML Schema itself. For some time now, we've known about the pluses and minuses of many schema design techniques, as put together by Roger Costello on his Best Practices website. He has done an excellent job of gathering commentary and opinion from implementers, and assembling it into a cohesive analysis of the benefits and drawbacks of many design criteria. Wise schema designers have referred to this site for years.

On a practical level, schema designers may also wonder what impact each feature of XML Schema has down the road when it comes time for implementation. Some constructs are ubiquitous and well supported in IDEs and other tools such as code generating software. However, there are some that are not commonly used and can cause problems when coders are confronted with a lack of tool support.

What Are Schema Designers Actually Doing?

In the next phase of my profile quest, I wanted to take what Costello has done a step further and ask: what features of XML Schema are folks actually using? Is there a consensus of opinion on the most common constructs? Are there features schema designers are avoiding? I accumulated data on over 1,400 schemas from numerous standards consortia to see if there is a common XML Schema profile reflecting a consensus of practice.

I focused on consortia schemas thinking that these should reflect a group's consensus of design criteria as well as have a disproportionate impact on the marketplace, as they are standard and implemented many times across a domain. These schemas are also all freely available.

The Sources

I examined schemas from the following organizations:

The Open Applications Group (OAGi)
The Open Travel Alliance (OTA)
Human Resources XML (HR-XML)
Chemical Industry Data Exchange (CIDX)
IMS Global Learning Consortium (IMS)
Association for Retail Technology Standards (ARTS)
Mortgage Industry Standards Maintenance Organization (MISMO)
World Wide Web Consortium (W3C, including MathML)
Global Justice XML
ACORD

There are many other consortia that could, and with infinite time would, be added to this analysis.

Tool Support

Many of the more mature tools have a high level of support for XML Schema features. In particular, IDEs that edit schemas have good support even for problematic features such as xsd:union. The problem with tool support comes in two forms. First, some tools have chosen not to support selected schema constructs. This amounts to a profile by design. Second, "best-of-breed" tools may offer support only for the most common schema constructs in their early releases, with support for additional features planned as the tool matures. I've blogged about tool support of XML Schema and shown references to a few code-generation tools and their publicly available claims of support.

The Results: A Profile of XML Schema

There is a clear tendency toward simplicity in the usage patterns of the 1,414 schemas tested. Just six design features are used in at least one third of the schemas, while 17 features occur in 10 percent or less of the test cases. Many of the avoided constructs would only be used in very specific situations with a special need. In addition to simplicity, explicitness is a secondary pattern, reflected in the high level of element namespace qualification, the lack of mixed content models and abstract types, and in the overwhelming preference for xsd:sequence over xsd:all compositors.
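The two thresholds above can be expressed as a small bucketing function. This is only a sketch: the function name and bucket labels are my own, the "avoided" cutoff follows the "10 percent or less" wording, and the counts in the loop are hypothetical values chosen to sit on either side of each threshold, not results from the survey.

```python
from fractions import Fraction

TOTAL_SCHEMAS = 1414  # number of schemas tested, as stated in the article


def classify(schemas_using_feature, total=TOTAL_SCHEMAS):
    """Bucket a feature by the share of schemas that use it."""
    share = Fraction(schemas_using_feature, total)
    if share >= Fraction(1, 3):
        return "frequent"   # used in at least one third of the schemas
    if share <= Fraction(1, 10):
        return "avoided"    # used in 10 percent or less of the schemas
    return "middle"         # everything in between


if __name__ == "__main__":
    # Hypothetical counts for illustration only; not the article's data.
    for count in (0, 141, 142, 471, 472, 1414):
        print(count, classify(count))
```

Exact fractions are used rather than floating point so that counts right at a boundary land in a predictable bucket.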

Features Avoided (Occurring in Less than 10 Percent of the Schemas)

These XML Schema design features were either used minimally or not at all in the schemas tested.

xsd:all: The use of the xsd:all compositor. The clear preference is to use xsd:choice or xsd:sequence instead.
Finalizing: The use of the @final or @finalDefault attributes. None of the tested schemas used these. The test cases seem to focus on enabling features rather than disabling ones such as these.
substitutionGroup: Allows elements to be substituted for other elements. While this feature was not used in the schemas tested, it is a common extension mechanism. The lack of substitutionGroups is probably due to the nature of the test cases. Open, standard domain consortia schemas may not need this feature, although organizations implementing them may need substitutionGroups for extensibility.
Uniqueness: The use of the unique element requiring the contents to be distinctive within its scope. I found this interesting, as the need for unique IDs was common; however, most often simple strings were used for that data. Perhaps uniqueness is enforced in the business layer of a data transfer between systems.
Qualified attributes: This is the use of attributeFormDefault="qualified". Almost none of the schema designers felt the need to do this, although the vast majority qualified the elements.
Keys: The use of the key and keyref elements.
Redefine: Using the redefine element to change the definition of an existing component. This feature is also among the least supported in tools. Consistently avoided.
Nillable: The use of the @nillable attribute, allowing the use of xsi:nil in the instance indicating the contents have a null value.
Block: The use of the @block attribute to disallow derivations.
complexType restriction: Restricting the content model of a complexType. Years ago at HR-XML, we looked at this feature to enable us to have a generalized data type that is constrained depending on the context of its usage. However, we found restricting complexTypes to be cumbersome, verbose, and not well supported at that time.
Abstract types: The use of abstract="true" on elements or types.
Mixed: Setting the attribute mixed="true" combines data and child elements in one place. Schema designers have clearly separated these concepts into distinct types.
Groups: The use of xsd:group is a way to define a group for later reuse. It may be that elements were simply referred to with "@ref" rather than put into groups.
Fixed values: The @fixed attribute on elements, attributes, or simpleTypes.
No @targetNamespace attribute: Having no @targetNamespace may be used in late binding of schemas. However, the vast majority of test cases used it. I have almost always seen it as a requirement in schema design guides.
No default namespace declared: Containing no "@xmlns" default namespace (no prefix). Again, this may be used in late binding. Default namespaces were consistently used in the test cases.
Default namespace not equal to @targetNamespace (PDF): This occurs when the default namespace value does not match the @targetNamespace. Not only did the vast majority of schemas have both a default and a @targetNamespace, but they were the same value. This reflects the tendency toward simplicity.
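The namespace-related items in this list lend themselves to a mechanical check. The following is a minimal sketch, not the tooling used for the survey: the function name, the returned keys, and the sample schema (with its example.org namespace) are all hypothetical. It uses the standard library's iterparse because the default namespace declaration is not kept as an ordinary attribute on the parsed root element.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample schema; the namespace URI is made up for illustration.
SAMPLE = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns="http://example.org/po"
    targetNamespace="http://example.org/po"
    elementFormDefault="qualified">
  <xsd:element name="Order" type="xsd:string"/>
</xsd:schema>"""


def namespace_profile(xsd_text):
    """Report the namespace choices made on the xsd:schema root element."""
    default_ns = None
    root = None
    source = io.BytesIO(xsd_text.encode("utf-8"))
    # "start-ns" events for the root's declarations arrive before its "start".
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            if not prefix:          # empty prefix means the default namespace
                default_ns = uri
        else:
            root = payload          # first "start" event is the root element
            break
    target = root.get("targetNamespace")
    return {
        "has_target_namespace": target is not None,
        "has_default_namespace": default_ns is not None,
        "default_equals_target": default_ns is not None and default_ns == target,
        "elements_qualified": root.get("elementFormDefault") == "qualified",
        "attributes_qualified": root.get("attributeFormDefault") == "qualified",
    }


if __name__ == "__main__":
    for key, value in namespace_profile(SAMPLE).items():
        print(key, value)
```

The sample follows the majority pattern described above: a target namespace, a matching default namespace, qualified elements, and unqualified attributes.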

Features Used Frequently (Occurring in at Least One-Third of the Schemas)

These were the most commonly used XML Schema design features. Here again, simplicity rules. It is wise to begin schema design with this toolset.

Namespace qualified elements: Use elementFormDefault="qualified" for explicitness of the element namespaces.
xsd:sequence: The use of the xsd:sequence element. The most common compositor. I've recommended this over xsd:all because it leads to fewer ambiguous content models and its child elements occur in a predictable order.
complexType extension: Creating a type that extends another is a key reusability and extensibility point.
Anonymous types: These occur when types are created that are locally scoped and thus have no "@name" attribute. Very commonly done. Tools may prefer no anonymous types, but none I have tried was unable to accommodate them.
simpleType restriction: Derivation of a simpleType restricting its base type.
Enumerations: Enumerated values were the most frequently used construct.
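To make the six features concrete, here is a small schema that uses all of them, wrapped in a Python sketch that counts each one. Everything in it is illustrative: the type and element names, the example.org namespace, and the counting function are my own inventions and are not drawn from any of the consortia schemas.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema built only from the six frequently used features.
SAMPLE = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns="http://example.org/hr" targetNamespace="http://example.org/hr"
    elementFormDefault="qualified">
  <xsd:simpleType name="StatusType">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="Active"/>
      <xsd:enumeration value="Inactive"/>
    </xsd:restriction>
  </xsd:simpleType>
  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <xsd:element name="Name" type="xsd:string"/>
      <xsd:element name="Status" type="StatusType"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="EmployeeType">
    <xsd:complexContent>
      <xsd:extension base="PersonType">
        <xsd:sequence>
          <xsd:element name="Position">
            <xsd:complexType>
              <xsd:sequence>
                <xsd:element name="Title" type="xsd:string"/>
              </xsd:sequence>
            </xsd:complexType>
          </xsd:element>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
</xsd:schema>"""


def frequent_feature_counts(xsd_text):
    """Count occurrences of the six frequently used features in one schema."""
    root = ET.fromstring(xsd_text)
    types = list(root.iter(XSD + "complexType")) + list(root.iter(XSD + "simpleType"))
    return {
        "qualified elements": root.get("elementFormDefault") == "qualified",
        "xsd:sequence": len(list(root.iter(XSD + "sequence"))),
        # xsd:extension only appears inside a complexType definition.
        "complexType extension": len(list(root.iter(XSD + "extension"))),
        # A type definition with no name attribute is anonymous.
        "anonymous types": sum(1 for t in types if t.get("name") is None),
        "simpleType restriction": len(
            root.findall(f".//{XSD}simpleType/{XSD}restriction")
        ),
        "enumerations": len(list(root.iter(XSD + "enumeration"))),
    }


if __name__ == "__main__":
    for feature, count in frequent_feature_counts(SAMPLE).items():
        print(feature, count)
```

Note how little is needed: named types, one extension, one locally scoped type, and an enumerated string cover the whole list.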

Problems in the Middle

These XML Schema design features were commonly used but may have problematic tool support. It is a good idea to check with the tools you plan to use before adding these features to your design.

attributeGroup: A grouping of attributes by name for reusability, similar to xsd:group. This feature may have shown up in the test cases more than is actually used. First, one organization used them heavily, obscuring the fact that the others tended to avoid them. Second, the analysis searched for "xsd:attributeGroup" resulting in matches for both declarations and reuse or "@ref". So the actual number of these may be much smaller. My experience with tools was that attributeGroups were not hard to support, but simply weren't the highest priority.
xsd:choice: The use of the xsd:choice compositor. Some tool makers have expressed concern about this feature because it is not easily mapped to a programming construct. However, its usage is common.
Default values: Declaring default values for data in the XML instance. I've blogged about default values before.
xsd:union: The use of xsd:union to combine types in a declaration. I've found this is the least supported feature of XML Schema in tools.
Pattern: Uses regular expressions to subset strings, among other things.
Other facets: Includes facets other than pattern and enumeration, namely minInclusive, maxInclusive, maxExclusive, minExclusive, whiteSpace, fractionDigits, length, minLength, and maxLength. Again, tool support varies.
List types: The use of the xsd:list element. This feature only occurred in about 10 percent of the test cases. It is sometimes unsupported in tools and may be a cause of concern. I've heard many complaints from coders about programming to parse through and process list types. They much preferred the use of enumerations or separate data types.
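The overcounting described under attributeGroup is easy to reproduce. In the sketch below, a text search for the opening tag is compared with a parse that separates declarations (which carry a name) from references (which carry a ref). The sample schema, its names, and the function are hypothetical, and the text search is my approximation of the kind of string match described above, not the exact query used for the survey.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema: one attributeGroup declared once and reused twice.
SAMPLE = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:attributeGroup name="AuditAttributes">
    <xsd:attribute name="createdBy" type="xsd:string"/>
    <xsd:attribute name="createdOn" type="xsd:date"/>
  </xsd:attributeGroup>
  <xsd:complexType name="OrderType">
    <xsd:attributeGroup ref="AuditAttributes"/>
  </xsd:complexType>
  <xsd:complexType name="InvoiceType">
    <xsd:attributeGroup ref="AuditAttributes"/>
  </xsd:complexType>
</xsd:schema>"""


def attribute_group_counts(xsd_text):
    """Compare a plain text match with a declaration/reference breakdown."""
    root = ET.fromstring(xsd_text)
    groups = list(root.iter(XSD + "attributeGroup"))
    return {
        # Text search on the opening tag; assumes the "xsd" prefix is used.
        "text_matches": xsd_text.count("<xsd:attributeGroup"),
        "declarations": sum(1 for g in groups if g.get("name") is not None),
        "references": sum(1 for g in groups if g.get("ref") is not None),
    }


if __name__ == "__main__":
    print(attribute_group_counts(SAMPLE))
```

Here the text search reports three matches for what is really a single declaration, which is the effect that may have inflated the attributeGroup numbers.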

A Note on Wildcards

These were in the middle of the list; however, I suspect that they are actually used much more frequently. Some of these consortia create a single wildcard extension element which is simply referred to (with "@ref") as needed. So the actual number of wildcard elements is lower than the usage of those elements.
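The pattern described here, one global extension element that wraps a wildcard and is then referenced wherever needed, can be sketched as follows. The schema, the element name UserArea, and the counting function are illustrative assumptions; the function also ignores namespace prefixes on ref values, which a real analysis would need to resolve.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema: a single wildcard carrier referenced from two types.
SAMPLE = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="UserArea">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:any namespace="##any" processContents="lax"
                 minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:complexType name="OrderType">
    <xsd:sequence>
      <xsd:element name="Id" type="xsd:string"/>
      <xsd:element ref="UserArea" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="InvoiceType">
    <xsd:sequence>
      <xsd:element name="Id" type="xsd:string"/>
      <xsd:element ref="UserArea" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""


def wildcard_usage(xsd_text):
    """Count xsd:any wildcards and references to the elements that carry them."""
    root = ET.fromstring(xsd_text)
    # Global elements (direct children of xsd:schema) that contain xsd:any.
    carriers = {
        el.get("name")
        for el in root.findall(XSD + "element")
        if next(el.iter(XSD + "any"), None) is not None
    }
    references = sum(
        1
        for el in root.iter(XSD + "element")
        # Simplification: compare only the local part of the ref value.
        if el.get("ref") is not None and el.get("ref").split(":")[-1] in carriers
    )
    return {"wildcards": len(list(root.iter(XSD + "any"))), "references": references}


if __name__ == "__main__":
    print(wildcard_usage(SAMPLE))
```

A count of xsd:any alone sees one wildcard in this sample, even though two types are open to extension through it.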

Conclusion

Examining what schema designers are actually implementing can indeed reveal a usage profile of XML Schema. It is in this profile of practice that our five-year-old's personality emerges. The clearest message is one of simplicity. The most commonly used constructs involve merely creating reusable types, assembling them into sequences of elements, and augmenting them with enumerations. Many of the more complex features went unused. In addition, the test cases also reflected explicitness in their schemas, as evidenced in the avoidance of mixing or abstracting content and the qualifying of element form defaults. Adhering to the design patterns reflected in this usage profile will serve schema designers well.

Appendix: The Data

The data in these tables show the results of my research. The schemas were all downloaded in early September 2006 from their respective websites (many of them are listed here). Figure 1 is a summary, Figure 2 indicates how many schemas contain the XML Schema design feature listed, and Figure 3 shows the number of times the feature occurred.

Figure 1. Summary of data

Figure 2. Number of schemas using XML Schema features

Figure 3. Number of occurrences of XML Schema features

Figure Notes

A few duplicative schemas were removed from the analysis, such as the schema for schemas (XMLSchema.xsd), which was commonly distributed with many libraries. ACORD also offers no-namespace equivalents of its schemas. For this analysis, the namespaced versions were used. In both the HR-XML and OAGi test files, the developer or "non-standalone" versions of the schemas were analyzed. While there are no substitutionGroups in the OAGi schemas, the global element design is intended to enable substitutions as an extension point. The W3C list of schemas includes MathML.