Extensibility, XML Vocabularies, and XML Schema
by David Orchard
Identifying and Extending Languages
Designing extensibility into languages typically results in systems that are more loosely coupled. Extensibility allows authors to change instances without going through a centralized authority, and may allow the centralized authority greater opportunities for versioning. The common characteristic of a compatible change is the use of extensibility.
A supreme example of the benefits of extensibility is HTML. The first version of HTML was designed for extensibility; it said that “unknown markup” may be encountered. An example of this in action is the addition of the IMG tag by the Mosaic browser team.
The first rule introduced in this article relating to extensibility is:
1. Allow Extensibility rule: Languages SHOULD be designed for extensibility.
A fundamental requirement for extensibility is to be able to determine the language of elements and attributes. XML namespaces [13] provide a mechanism for associating a URI with an XML element or attribute name, thus specifying the language of the name. This also serves to prevent name collisions.
HTML did not have the ability to distinguish between the namespaces of extensions. This meant that authors could use the same element name with different interpretations, and software had no way of determining which interpretation applied. This is a large part of the motivation for moving from HTML to XHTML, the XML vocabulary of HTML.
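As a minimal sketch of how namespaces separate two interpretations of one element name, the instance below uses a "title" element from two vocabularies. Both namespace URIs and all element names are placeholders invented for illustration.

```xml
<!-- Two hypothetical vocabularies both define a "title" element.
     The example.org namespace URIs are placeholders. -->
<person xmlns="http://example.org/person"
        xmlns:book="http://example.org/book">
  <!-- "title" in the person vocabulary (the default namespace) -->
  <title>Dr.</title>
  <!-- "title" in the book vocabulary -->
  <book:title>XML Basics</book:title>
</person>
```

Software that compares the namespace URI together with the local name can tell the two "title" elements apart, which plain HTML extensions could not do.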
W3C XML Schema [14] provides a mechanism called a wildcard,
<xs:any>, for controlling
where elements from certain namespaces are allowed. The
wildcard indicates that elements in specified namespaces are
allowed in instance documents where the wildcard occurs. This
allows for later extension of a schema in a well-defined
manner. Consumers of extended documents can identify and,
depending upon the processing model, safely ignore the extensions
they don't understand.
<xs:any> uses the namespace attribute to control what
namespaces extension elements can come from. The most interesting
values for this attribute are: ##any, which means one can extend the schema
using an element from any possible namespace; ##other, which only allows extension
elements from namespaces other than the target namespace of the
schema; and ##targetNamespace,
which only allows extension elements from the target namespace of
the schema.
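The three values can be compared side by side in the fragment below. These are three alternative declarations, not one content model; only one of them would normally appear at a given extension point, and the xs prefix is assumed to be bound to the XML Schema namespace as in the other listings.

```xml
<!-- accepts an element from any namespace, including no namespace -->
<xs:any namespace="##any" minOccurs="0" maxOccurs="unbounded"/>

<!-- accepts only elements from namespaces other than the target namespace -->
<xs:any namespace="##other" minOccurs="0" maxOccurs="unbounded"/>

<!-- accepts only elements from the schema's own target namespace -->
<xs:any namespace="##targetNamespace" minOccurs="0" maxOccurs="unbounded"/>
```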
<xs:any> uses the processContents attribute to control how an
XML parser validates extended elements. The permissible values are
“lax” (validate any elements from supported namespaces but ignore
all other elements), “strict” (validate all elements), and
“skip” (validate no elements). This article
recommends “lax” validation, as it is the most flexible
and is the typical choice for web services specifications.
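To make the three modes concrete, consider a small extension schema that a validator may or may not have available. This is only a sketch: the namespace URI and the "age" element are hypothetical.

```xml
<!-- Hypothetical extension schema; the namespace URI is a placeholder. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/ext">
  <!-- If the validator has this declaration available, a lax wildcard
       validates an "age" extension element against xs:nonNegativeInteger. -->
  <xs:element name="age" type="xs:nonNegativeInteger"/>
</xs:schema>
```

With processContents="lax", an "age" extension element is checked when this schema is loaded, while an extension element with no available declaration is accepted unchecked. With "strict", the undeclared element is a validation error. With "skip", neither element is validated.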
The main goal of the "Must Ignore" pattern of extensibility is to allow backwards and forwards compatible changes to documents.
Example
Suppose that you have designed a language for handling
personal information. The personal information consists of a
“Name” element. The first version of the Name contains
a “first” and a “last”
element. Our preference
would be to have an extensibility style of ##any. An XML Schema
“name” type that uses this is:
<xs:complexType name="name">
  <xs:sequence>
    <xs:element name="first" type="xs:string"/>
    <xs:element name="last" type="xs:string"/>
    <xs:any namespace="##any" processContents="lax"
            minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
  <xs:anyAttribute/>
</xs:complexType>
Example 1 – A name schema using ##any for extensibility
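An instance that uses the extension point of Example 1 might look like the sketch below. It assumes a global element "name" of this type, declared in a schema with no target namespace; the listing defines only the type, so that declaration, the sample data, and the extension namespace are all assumptions.

```xml
<!-- Assumes a global element "name" of the Example 1 type,
     declared in a schema with no target namespace. -->
<name>
  <first>Jane</first>
  <last>Doe</last>
  <!-- extension element from a placeholder namespace, matched by xs:any -->
  <ext:nickname xmlns:ext="http://example.org/ext">JD</ext:nickname>
</name>
```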
However, the determinism constraint of XML Schema, described in more detail later, prevents this approach from working across versions. The problem arises when a later version adds an optional element immediately before the wildcard and extensibility is still desired. An optional middle name added in a subsequent version is a good example. This is a not-so-gentle introduction to the difference between extensibility and versioning.
Consumers should be able to continue processing if they don’t understand an additional optional middle name, and we want to keep the extensibility point in the new version. We can write a schema that contains the optional middle name and a wildcard for extensibility. The following schema is roughly what is desired using wildcards, but it is illegal because of the determinism constraint:
<xs:complexType name="name">
  <xs:sequence>
    <xs:element name="first" type="xs:string"/>
    <xs:element name="last" type="xs:string"/>
    <xs:element name="middle" type="xs:string" minOccurs="0"/>
    <xs:any namespace="##any" processContents="lax"
            minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
  <xs:anyAttribute/>
</xs:complexType>
Example 2 – An illegal schema type for a backwards compatible version of the name schema
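The ambiguity behind the rejection can be seen in a would-be instance. The sketch assumes a global element "name" of the Example 2 type in a schema with no target namespace; the sample data is invented.

```xml
<name>
  <first>Jane</first>
  <last>Doe</last>
  <!-- Matches the optional "middle" declaration, but would equally match
       the ##any wildcard that follows it, so the content model is ambiguous. -->
  <middle>Q.</middle>
</name>
```

Because the conflict exists in the content model itself, a conforming processor rejects the schema before any instance is validated.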
Since the above pattern does not work, we need to create a design pattern that enables roughly equivalent behavior in order to achieve the original goals.
All Compatible Changes in New Namespaces
The most common solution is to put all new components, whether extensions or compatible versions, in a namespace different from the original namespace. This corresponds to the "##other" namespace option of wildcards. We show a complete schema instance:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.openuri.org/name/1"
           xmlns:name="http://www.openuri.org/name/1">
  <xs:complexType name="name">
    <xs:sequence>
      <xs:element name="first" type="xs:string"/>
      <xs:element name="last" type="xs:string"/>
      <xs:element name="middle" type="xs:string" minOccurs="0"/>
      <xs:any namespace="##other" processContents="lax"
              minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:anyAttribute/>
  </xs:complexType>
</xs:schema>
Example 3 – New components in new namespace(s) schema Version 1
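A sketch of an instance that carries a third-party extension under this schema follows. It assumes a global element "name" of type name:name has been added to the schema, since the listing declares only the type. The child elements are unqualified because the schema as printed does not set elementFormDefault. The extension namespace and sample data are placeholders.

```xml
<!-- Assumes a global element "name" of type name:name is declared
     in the Example 3 schema; the listing declares only the type. -->
<n:name xmlns:n="http://www.openuri.org/name/1"
        xmlns:ext="http://example.org/ext">
  <first>Jane</first>
  <last>Doe</last>
  <middle>Q.</middle>
  <!-- third-party extension in its own placeholder namespace,
       matched by the ##other wildcard -->
  <ext:suffix>Jr.</ext:suffix>
</n:name>
```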
The language designer and third parties can now extend the language only by using different namespaces. To allow new extensions in the same namespace, the author must create an extension type that allows extensions in the same namespace. The extension type should be used only for future compatible extensions in the same namespace. The reader is urged to keep in mind that all of these restrictions on behavior are a consequence of the W3C's XML Schema design and are unnecessary in other schema languages such as RELAX NG. We need two more rules to allow proper versioning of XML language definitions. First, the rule for namespaces:
2. Allow Extensions in Other Namespace rule: The extensibility point SHOULD at least allow for extension in other namespaces.
The rule for allowing extensibility:
3. Full Extensibility rule: All XML Elements SHOULD allow for element extensibility after element definitions, and allow any attributes.
In general, an extension can be defined by a new specification that makes a normative reference to the earlier specification and then defines the new element. No permission should be needed from the authors of the specification to make such an extension. In fact, the major design point of XML namespaces is to allow decentralized extensions. The corollary is that permission is required for extensions in the same namespace. A namespace has an owner; non-owners changing the meaning of something can be harmful.
Attribute extensions do not have non-determinism issues because the attributes are always unordered and the model group for attributes uses a different mechanism for associating attributes with schema types than the model group for elements.
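As a sketch of attribute extensibility, the type below pairs the element content with an attribute wildcard whose settings are spelled out. The explicit namespace and processContents values are a choice made for this illustration, not taken from the listings above.

```xml
<!-- Sketch: the name type with an explicitly lax attribute wildcard. -->
<xs:complexType name="name" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:sequence>
    <xs:element name="first" type="xs:string"/>
    <xs:element name="last" type="xs:string"/>
  </xs:sequence>
  <!-- Attributes are unordered, so the wildcard cannot conflict
       with a preceding declaration the way an element wildcard can. -->
  <xs:anyAttribute namespace="##any" processContents="lax"/>
</xs:complexType>
```

When processContents is omitted on a wildcard, XML Schema defaults it to "strict", so stating "lax" explicitly is what allows attributes without an available declaration to pass validation.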
Understanding Extensions
Ideally, producers should be able to extend existing XML documents with new elements without consumers having to change existing implementations. Extensibility is one step toward this goal, but achieving compatibility also requires a processing model for the extensions. The behavior of software when it encounters an extension should be clear. For this, we introduce the next rule:
4. Provide Processing Model rule: Languages SHOULD specify a processing model for dealing with extensions.
The simplest processing model that enables compatible changes is to ignore content that is not understood. This rule is:
5. Must Ignore rule: Document consumers MUST ignore any XML attributes or elements in a valid XML document that they do not recognize.
This rule does not require that the elements be physically removed; only ignored for processing purposes. There is a great deal of historic usage of the Must Ignore rule. HTML 1, 2 and 3.2 follow the Must Ignore rule as they specify that any unknown start tags or end tags are mapped to nothing during tokenization. HTTP 1.1 [7] specifies that a consumer should ignore any headers it doesn't understand: "Unrecognized header fields SHOULD be ignored by the recipient and MUST be forwarded by transparent proxies." The Must Ignore rule for XML was first standardized in the WebDAV specification RFC 2518 [6] section 14 and later separately published as the Flexible XML Processing Profile [3].
There are two broad types of Must Ignore rules for dealing with extensions, either ignoring the entire tree or just the unknown element. The rule for ignoring the entire tree is:
6. Must Ignore All rule: The Must Ignore rule applies to unrecognized elements and their descendants in data-oriented formats.
For example, if a message is received with unrecognized elements in a SOAP header block, they must be ignored unless marked as “Must Understand” (see Rule 10 below). Note that this rule is not broken if the unrecognized elements are written to a log file. That is, “ignored” doesn’t mean that unrecognized extensions can’t be processed; only that they can’t be the grounds for failure to process.
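The SOAP case can be sketched with a SOAP 1.2 envelope carrying two header blocks. The envelope namespace and the mustUnderstand attribute come from SOAP 1.2; the two header blocks, their namespaces, and their contents are hypothetical.

```xml
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Header>
    <!-- Hypothetical header block: a receiver that does not recognize it
         ignores the block together with all of its descendants. -->
    <t:trace xmlns:t="http://example.org/trace">
      <t:hop>gateway-1</t:hop>
    </t:trace>
    <!-- Hypothetical header block marked mustUnderstand: a receiver it
         targets must fault if it does not recognize the block. -->
    <p:payment xmlns:p="http://example.org/payment"
               env:mustUnderstand="true">
      <p:token>abc123</p:token>
    </p:payment>
  </env:Header>
  <env:Body/>
</env:Envelope>
```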
Other applications may need a different rule as the application will typically want to retain the content of an unknown element, perhaps for display purposes. The rule for ignoring the element only is:
7. Must Ignore Container rule: The Must Ignore rule applies only to unrecognized elements in presentation-oriented formats.
This retains the element descendants in the processing model so that they can still affect interpretation of the document, such as for display purposes.
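A small XHTML fragment illustrates the difference between the two rules. The "highlight" element and its namespace are hypothetical.

```xml
<p xmlns="http://www.w3.org/1999/xhtml"
   xmlns:x="http://example.org/markup">
  Extensibility is
  <!-- x:highlight is a hypothetical extension element -->
  <x:highlight>a design goal</x:highlight>
  of this language.
</p>
```

A consumer applying the Must Ignore Container rule drops only the unknown tags and still shows the words "a design goal". A consumer applying the Must Ignore All rule would drop those words along with the element.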
Ignoring content is a simple solution to the problem of substitution. In order to achieve a compatible evolution, the newer instances of a language must be transformable (or substitutable) into older instances. Object systems typically call this “polymorphism”, where a new type can behave as the old type.
Other substitution models have been successfully deployed. One such model is a fallback model, where alternate elements are provided if the consumer does not understand the extension. XSLT 2.0 provides such a model. Another model is that a transform from the new type to the old type is made available, either by value or reference.
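The fallback model can be sketched with the xsl:fallback instruction. The "audit" extension instruction and its namespace are hypothetical, and the stylesheet is a minimal illustration, not a recommended design.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ext="http://example.org/xslt-ext"
    extension-element-prefixes="ext">
  <xsl:template match="/">
    <!-- ext:audit is a hypothetical extension instruction -->
    <ext:audit>
      <!-- evaluated only by processors that do not implement ext:audit -->
      <xsl:fallback>
        <xsl:message>ext:audit is not supported; continuing without it</xsl:message>
      </xsl:fallback>
    </ext:audit>
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
```

A processor that implements the extension runs it and skips the fallback; any other processor evaluates the fallback content and carries on, so the stylesheet remains usable in both cases.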
As desirable as compatible evolution often is, sometimes a language may not want to allow it. In this model, a consumer will generate a fault if it finds a component it doesn’t understand. An example might be a security specification where a consumer must understand each and every extension. This suffers from the significant drawback that it does not allow compatible changes to occur in the language, as any changes require both consumer and producer to change.