Designing Extensible, Versionable XML Formats
by Dare Obasanjo
Using XML Schema to Design an Extensible XML Format
W3C XML Schema provides a number of features that promote extensibility in XML vocabularies, such as wildcards, substitution groups, and xsi:type. I've written about a number of techniques for adding extensibility to XML formats using W3C XML Schema in my article, W3C XML Schema Design Patterns: Dealing With Change. So as not to repeat myself, I will merely provide a brief overview of the various options described in that article.
1. Using Wildcards to create open-content models: The wildcards xs:any and xs:anyAttribute allow elements and attributes from specified namespaces to occur in a content model. Wildcards let schema authors make the content model extensible while maintaining a degree of control over which elements and attributes may occur. The most important attributes for wildcards are namespace and processContents. The namespace attribute specifies the namespace or namespaces from which the elements or attributes matched by the wildcard can come. The processContents attribute specifies if and how the XML content matched by the wildcard should be validated.
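As a minimal sketch of an open-content model, the schema below adds wildcards to a book element. The namespace http://example.org/books and all element names are assumptions made up for illustration; they do not come from the article.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/books"
           xmlns="http://example.org/books"
           elementFormDefault="qualified">

  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="author" type="xs:string"/>
        <!-- Extension point: any number of elements from namespaces other
             than the target namespace. "lax" validates matched elements
             only when a declaration for them is available. -->
        <xs:any namespace="##other" processContents="lax"
                minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <!-- Extension point for attributes from other namespaces.
           "skip" requires only that the content be well-formed. -->
      <xs:anyAttribute namespace="##other" processContents="skip"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Under this schema, an instance such as the following should be valid even though the schema author never declared the (hypothetical) ext:edition attribute or ext:rating element:

```xml
<book xmlns="http://example.org/books"
      xmlns:ext="http://example.org/extensions"
      ext:edition="2">
  <title>Example Title</title>
  <author>Example Author</author>
  <ext:rating>5</ext:rating>
</book>
```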
2. Gaining flexibility from Substitution Groups and Abstract Elements: A substitution group contains elements that can appear interchangeably in an XML instance document in a manner reminiscent of subtype polymorphism in OOP languages. Elements in a substitution group must be of the same type as the head element of the group or have types derived from it.
An element declaration that is marked abstract indicates that a member of its substitution group must appear in its place in the instance document. A schema designer can build an extensibility point into a schema by defining an abstract element, which must be replaced in instance documents by elements, typically defined by those extending the format, that are members of the abstract element's substitution group.
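The following sketch shows one way such an extensibility point could look. The shape, circle, and drawing names and the http://example.org/shapes namespace are hypothetical, chosen only to illustrate the technique.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/shapes"
           xmlns="http://example.org/shapes"
           elementFormDefault="qualified">

  <xs:complexType name="ShapeType">
    <xs:attribute name="id" type="xs:string"/>
  </xs:complexType>

  <!-- Abstract head element: it cannot appear in an instance itself. -->
  <xs:element name="shape" type="ShapeType" abstract="true"/>

  <!-- A type derived from the head element's type. -->
  <xs:complexType name="CircleType">
    <xs:complexContent>
      <xs:extension base="ShapeType">
        <xs:attribute name="radius" type="xs:decimal" use="required"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <!-- Member of the substitution group headed by shape. -->
  <xs:element name="circle" type="CircleType" substitutionGroup="shape"/>

  <!-- The content model refers only to the abstract head. -->
  <xs:element name="drawing">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="shape" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

With these declarations, a drawing element may contain circle elements wherever shape is referenced. Another schema could later add further members to the substitution group without changing the declaration of drawing.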
3. Runtime polymorphism via xsi:type and Abstract Types: Abstract types are complex type definitions that have true as the value of their abstract attribute, which indicates that elements in an instance document cannot be of that type but must instead use a type derived from it, either by restriction or by extension. The xsi:type attribute can be placed on an element in an XML instance document to change its type, as long as the new type is derived from the element's declared type. Although it's not necessary to use abstract types in conjunction with xsi:type, abstract types provide some benefit when creating a generic format for which most users will create domain-specific extensions, because they require each instance to name a concrete derived type.
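To make the interplay of abstract types and xsi:type concrete, here is a small sketch. The payment vocabulary and the http://example.org/payments namespace are invented for this example.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/payments"
           xmlns="http://example.org/payments"
           elementFormDefault="qualified">

  <!-- Abstract base type: no element in an instance may have exactly
       this type. -->
  <xs:complexType name="PaymentType" abstract="true">
    <xs:sequence>
      <xs:element name="amount" type="xs:decimal"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Concrete type derived by extension. -->
  <xs:complexType name="CardPaymentType">
    <xs:complexContent>
      <xs:extension base="PaymentType">
        <xs:sequence>
          <xs:element name="cardNumber" type="xs:string"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <!-- Declared with the abstract type, so instances must select a
       derived type via xsi:type. -->
  <xs:element name="payment" type="PaymentType"/>
</xs:schema>
```

An instance then selects the derived type explicitly:

```xml
<p:payment xmlns:p="http://example.org/payments"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:type="p:CardPaymentType">
  <p:amount>19.99</p:amount>
  <p:cardNumber>0000-0000-0000-0000</p:cardNumber>
</p:payment>
```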
4. Using xs:redefine to update type definitions: The types in a schema can be redefined in a process whereby the type effectively derives from itself. xs:redefine, used for redefinition, performs two tasks. The first is to act as an xs:include element by bringing in declarations and definitions from another schema document and making them available as part of the current target namespace. The included declarations and types must come from a schema that either has the same target namespace or has no target namespace.
Second, types can be redefined in a manner similar to type derivation, with the new definition replacing the old one. Type redefinition is pervasive because it affects not only elements in the including schema but also those in the included schema. Thus all references to the original type in both schemas refer to the redefined type, while the original type definition is overshadowed. Using xs:redefine doesn't provide extensibility in the traditional sense but instead allows one to effectively alter the definitions of types in a given schema.
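A brief sketch of redefinition follows, assuming a hypothetical schema document named base.xsd with the made-up target namespace http://example.org/people:

```xml
<!-- base.xsd (hypothetical original schema) -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/people"
           xmlns="http://example.org/people"
           elementFormDefault="qualified">
  <xs:complexType name="PersonType">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:element name="person" type="PersonType"/>
</xs:schema>
```

A second schema document with the same target namespace could then redefine PersonType by extending it from itself:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/people"
           xmlns="http://example.org/people"
           elementFormDefault="qualified">
  <xs:redefine schemaLocation="base.xsd">
    <!-- The base attribute names the type being redefined, so the
         type effectively derives from itself. -->
    <xs:complexType name="PersonType">
      <xs:complexContent>
        <xs:extension base="PersonType">
          <xs:sequence>
            <xs:element name="email" type="xs:string" minOccurs="0"/>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:redefine>
</xs:schema>
```

Because redefinition is pervasive, the person element declared in base.xsd would now accept the optional email child as well.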
Guidelines for Designing Versionable XML Formats
The following guidelines for designing XML formats in a way that makes them resilient in the face of changes in subsequent versions are also modified from those in David Orchard's Versioning XML Vocabularies article.
If the next version of a format is backward compatible with previous versions, then the old namespace name must be used in conjunction with XML's extensibility model.
A new namespace name must be used when backward compatibility is not permitted. That is, software must break if it does not understand the new language components.
Formats should specify a mustUnderstand model for dealing with backward-incompatible changes to the format that don't change the namespace name.
The following discussions explore each of the above guidelines in more detail.
Why the same namespace name should be used for backward-compatible versions of a format.
The namespace name of an element or attribute is part of its identity. The name of an element or attribute is syntactically in the form of a qualified name, also known as a QName. The QName is an XML name, called the local name, optionally preceded by another XML name, called the prefix, and a colon (':') character. The prefix of a qualified name must have been mapped to a namespace URI through an in-scope namespace declaration.
Although the prefixes in QNames are useful mnemonic guides to determining what namespace the elements and attributes within a document belong to, they are rarely important to XML processors. For example, the following three XML documents would be treated identically by a range of XML technologies including, of course, XML Schema validators.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType id="id123" name="fooType"/>
</xs:schema>

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType id="id123" name="fooType"/>
</xsd:schema>

<schema xmlns="http://www.w3.org/2001/XMLSchema">
  <complexType id="id123" name="fooType"/>
</schema>
The W3C XML Path Language recommendation describes an expanded name as a pair consisting of a namespace name and a local name. A universal name is an alternate term coined by James Clark to describe the same concept. To many XML applications, the universal name of the elements and attributes in an XML document is what is important, and not the values of the prefixes used in specific QNames.
This means that changing the namespace name of an XML vocabulary renames all of its elements and global attributes from the perspective of namespace-aware XML technologies such as XPath, XSLT, XML parsers, and a host of others. If the new version of the format is backward compatible with the original version, then elements and global attributes should retain the same names so as not to break namespace-aware applications that consume the format.
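As an illustration, consider a hypothetical order format (the namespace http://example.org/orders and the element names are assumptions). A backward-compatible second version might look like this, on the further assumption that version 1 consumers ignore content they do not recognize:

```xml
<!-- Hypothetical version 2 document. It keeps the version 1 namespace
     name, so {http://example.org/orders}order and
     {http://example.org/orders}item keep their expanded names. -->
<order xmlns="http://example.org/orders">
  <item>widget</item>
  <!-- New in version 2. -->
  <giftNote>Happy birthday</giftNote>
</order>
```

An XPath expression or XSLT template written against the version 1 expanded names would still match order and item here. Had the namespace name changed, it would match nothing.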
Why a new namespace name must be used when backward compatibility is not permitted.
As mentioned in the previous section, changing the namespace name of an XML vocabulary renames all the elements and global attributes in the vocabulary. In certain cases, changes to an XML format can make it differ drastically from one version to the next in a backward-incompatible manner. In such cases, it is best to change the namespace name so namespace-aware XML applications rightly fail to identify the new version of the format as being the same as the original, and thus reject documents in the new format.
Why XML formats should specify a mustUnderstand model for dealing with backward-incompatible changes to the format.
If a newer version of an XML format is not backward compatible with its predecessor, but does not use a new namespace name, then there should be a way to tell consumers of the format to raise an error on changed or new constructs that they do not understand.
A simple solution is for the format to provide a version number on its root element, which consumers can test before processing the XML document. In this case the mustUnderstand model is that a consumer must understand all elements from the target namespace of the format if it supports the version number specified on the root element.
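A version number on the root element might look like the following sketch. The query vocabulary, its namespace, and the attribute name version are assumptions made for illustration.

```xml
<!-- A consumer that supports only version 1.0 of this hypothetical
     format would reject the document after reading the root element's
     version attribute, before processing any content. -->
<query xmlns="http://example.org/query" version="2.0">
  <select from="customers"/>
</query>
```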
In cases where new elements that are not backward compatible with older versions are added to the format, it may be best to tag such elements with a mustUnderstand attribute. This preserves some degree of interoperability: as long as the producer generates documents in the new format that do not contain the new constructs, older consumers can still process them.
For example, imagine an XML-based query language that adds update constructs in a newer version (such as create, replace, update, and delete). In such a situation, a producer of the format that has upgraded to the newer version can still generate documents that contain the original query constructs in the language without worrying about compatibility. However, if the producer is generating documents using the new constructs, then it adorns them with mustUnderstand attributes whose value is "true," which indicates to older clients that they are to fail if they don't understand how to perform a delete (for example).
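A document in such a query language might be sketched as follows. The element names, the namespace, and the exact spelling and placement of the mustUnderstand attribute are hypothetical.

```xml
<query xmlns="http://example.org/query">
  <!-- Original construct: understood by old and new consumers. -->
  <select from="customers"/>
  <!-- New construct: an older consumer that does not recognize delete
       must fail rather than silently ignore it. -->
  <delete from="customers" mustUnderstand="true"/>
</query>
```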
It should be noted that the mustUnderstand construct
does not have to be an attribute. A limitation of using an
attribute is that it isn't easy to use it to mark a new
attribute as having to be understood. Another drawback of using an
attribute is that it has to be repeated on each occurrence of an
element that must be understood. This is needlessly repetitive if
that element appears multiple times in a document. Another
approach could be specifying a mustUnderstand element
that identifies which new items must be understood.
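One possible shape for such a mustUnderstand element is sketched below. Its structure is purely an assumption for illustration; the article does not prescribe one.

```xml
<query xmlns="http://example.org/query">
  <!-- Declared once, up front: the new items a consumer must
       understand in order to process this document. -->
  <mustUnderstand>
    <element name="delete"/>
    <attribute name="priority"/>
  </mustUnderstand>
  <select from="customers" priority="high"/>
  <delete from="customers"/>
  <delete from="orders"/>
</query>
```

This form can cover new attributes as well as elements, and it avoids repeating a marker on every occurrence of delete.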