The Beginning of the Endgame

September 27, 2000

Table of Contents

• A Look at the Changes in the Pre-CR W3C XML Schemas Draft
• Now anyType Is Top
• New Namespace Identifier
• New Syntax for Type Declarations
• Union Datatypes
• Redefine
• Error Codes and Post-Schema-Validation Infoset Contributions
• Small Changes
• Priority Feedback Items

A Look at the Changes in the Pre-CR W3C XML Schemas Draft

This article looks at the changes in the recent Pre-CR draft of W3C XML Schemas that will most affect developers and users. Requirements for data interchange with database systems have been important during W3C XML Schema's development. The recent changes also give better support for markup languages and schema construction.

The Candidate Recommendation (CR) drafts are slated to appear hot on the heels of the current drafts. The XML Schema Working Group was aware that authors, implementers, schema writers, and technical evaluators needed to know the most recent changes, especially since they include some syntax changes that will affect schemas using type derivation.

Now `anyType` Is Top

The old ur-type (i.e., the supertype) can now be used as a type in declarations. It has been given the friendlier and less literary name anyType to reflect that in XML the top-level type can be thought of as the union of every possible subtype. In the unified model underneath W3C XML Schemas, even brand new complex types are considered restrictions of the ur-type.
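As a minimal sketch of what this permits, an element whose content is deliberately left unconstrained might be declared with anyType directly. The element name payload is hypothetical, and the xsd: prefix is assumed to be bound to the schema namespace of the current draft.

```xml
<!-- Hypothetical declaration: the type places no constraints on the
     element's content or attributes -->
<xsd:element name="payload" type="xsd:anyType"/>
```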

In the Datatypes draft, the schema declarations use anySimpleType, the ur-type of all simple types. This ur-type has been provided to allow bootstrapping declarations of the primitive built-in types in the schema for schemas, and also to help reasoning about the simple type system. It is not available for use in schemas.

New Namespace Identifier

Type names belong to the built-in datatypes namespace; the drafts use the typical prefix xsd:. Schema declarations use the same namespace. The namespace URI for the current draft has changed from that used by previous drafts; this indicates that the usage of various elements has changed substantially. The schema namespace URI reference is now http://www.w3.org/2000/10/XMLSchema.

The attributes that W3C XML Schemas defines for use in document instances use the namespace URI reference http://www.w3.org/2000/10/XMLSchema-instance, and the drafts use the typical prefix xsi:.
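To make the two namespaces concrete, here is a small sketch of where each one would typically be declared. The element name quantity is hypothetical; only the two namespace URI references come from the drafts. The instance uses xsi:type, one of the instance attributes, to name a built-in type derived from the declared one.

```xml
<!-- Schema document: xsd: is bound to the schema namespace of this draft -->
<xsd:schema xmlns:xsd="http://www.w3.org/2000/10/XMLSchema">
  <xsd:element name="quantity" type="xsd:decimal"/>
</xsd:schema>

<!-- Instance document: xsi: is bound to the instance namespace;
     xsd: is declared only because xsi:type refers to a built-in type -->
<quantity xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
          xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
          xsi:type="xsd:integer">3</quantity>
```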

New Syntax for Type Declarations

The most visible change in the new schema draft is a set of syntactic changes to the elements complexType and simpleType. These do not alter functionality.

There are four changes to complexType:

the old content attribute has been replaced by various elements and attributes, in particular simpleContent and complexContent;
the old derivedBy attribute, used to specify whether a type is derived by extension or restriction from a base type, has been replaced by subelements extension and restriction, available on complexContent and simpleContent; similarly the attribute base has been moved (this change also applies to simpleType);
a Boolean attribute mixed is now available on complexType and complexContent to specify that general data content is allowed in addition to elements;
empty elements can be defined by complexTypes with no values.
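The mixed attribute mentioned in the list above might be used along the following lines. This is only a sketch: the element names para and emph are hypothetical, and the explicit complexContent and restriction wrappers follow the pattern of the empty-element example in this article rather than any shorthand the draft may also allow.

```xml
<!-- Hypothetical mixed-content element: text interleaved with emph children -->
<xsd:element name="para">
  <xsd:complexType mixed="true">
    <xsd:complexContent>
      <xsd:restriction base="xsd:anyType">
        <xsd:sequence>
          <xsd:element name="emph" type="xsd:string" minOccurs="0" maxOccurs="3"/>
        </xsd:sequence>
      </xsd:restriction>
    </xsd:complexContent>
  </xsd:complexType>
</xsd:element>

<!-- An instance this is intended to accept -->
<para>Ship by <emph>Friday</emph> at the latest.</para>
```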

Here's an example declaration in the old syntax, from the old Primer.


<xsd:element name='internationalPrice'>
  <xsd:complexType base='xsd:decimal' derivedBy='extension'>
    <xsd:attribute name='currency' type='xsd:string' />
  </xsd:complexType>
</xsd:element>

And the equivalent declaration in the new syntax, taken from the Primer.


<xsd:element name="internationalPrice">
  <xsd:complexType>
    <xsd:simpleContent>
      <xsd:extension base="xsd:decimal">
        <xsd:attribute name="currency" type="xsd:string" />
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>
</xsd:element>

An example of a declaration for an empty element:


<xsd:element name="a">
  <xsd:complexType>
    <xsd:complexContent>
      <xsd:restriction base="xsd:anyType" />
    </xsd:complexContent>
  </xsd:complexType>
</xsd:element>

Alarming as this may be compared to the equivalent declaration using XML DTDs,


<!ELEMENT a EMPTY >

one can expect that empty elements will usually belong to types which also have attributes. XML Schemas allow many permutations of empty elements not available in DTDs.
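For instance, an empty element that carries attributes could be sketched as follows, reusing the restriction of anyType shown earlier. The element and attribute names are hypothetical.

```xml
<!-- Hypothetical empty element with two attributes and no content -->
<xsd:element name="img">
  <xsd:complexType>
    <xsd:complexContent>
      <xsd:restriction base="xsd:anyType">
        <xsd:attribute name="src" type="xsd:string"/>
        <xsd:attribute name="width" type="xsd:integer"/>
      </xsd:restriction>
    </xsd:complexContent>
  </xsd:complexType>
</xsd:element>

<!-- An instance this is intended to accept -->
<img src="logo.png" width="120"/>
```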

Comment

This change is needed because W3C XML Schemas 1.0 doesn't allow the use of attributes to select the type information of elements. In order for the schema for schemas to represent the W3C XML Schema language well, the common XML idiom of using attributes to subtype the type identified by the element name cannot be supported. This involves more than idiom: by requiring the use of subelements rather than attributes, either the subelements must wrap the contents, or the subelements appear as the first siblings to select particular content sequences. Both of these solutions have problems of scale (combinatorial explosions when there are several attributes with different values) and effect (nested elements do not indicate which of their ancestors they relate to).

In my view, this makes W3C XML Schemas 1.0 not necessarily suitable for defining idiomatic, user-oriented markup languages. Start tags sometimes need to act as simple field names, as with data from a database, but at other times more like a parameterized function, a template-ized class definition, or a shell command with named arguments. W3C XML Schemas 1.0 undoubtedly fits database-style uses better than markup uses; or data where there needs to be ad hoc overriding of type facets.

However, it is important not to make too much of this point, as it just means that XML schemas are no more powerful than DTDs in this regard. Because attributes cannot be used to select a subtype, we can carry on using those attributes, but in our schema we must actually declare the looser type made from the union of the different subtypes. On the whole, W3C XML Schemas have a much wider range of options than DTDs, which may rapidly degenerate into use of an ANY declared content type in the same situation.

This is one area that I hope a W3C XML Schemas 1.1 or 2.0 may fix as soon as possible.

Union Datatypes

The big news on the datatypes front is the introduction of union datatypes, which join atomic types and list datatypes. This allows definition of types such as


<xsd:simpleType name="infinityToken">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="infinity"/>
  </xsd:restriction>
</xsd:simpleType>

<xsd:simpleType name="infNum">
  <xsd:union memberTypes="xsd:integer infinityToken"/>
</xsd:simpleType>

The example declares a simple type called infNum that may have any integer value or the token infinity.

Union datatypes can be used for

extensible lists of values (open, controlled vocabularies);
modular construction of lists of values;
expressing exceptional values using tokens of a different lexical pattern to the normal data, such as the example above;
richer controlled vocabularies where it is desirable to cater to the different conventions people may use for the same thing: for example, to allow yes or no, 1 or 0.
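The last use in the list, accepting yes or no alongside 1 or 0, might be sketched as a union of two small enumerated types. All three type names here are hypothetical; the construction simply follows the infNum example above.

```xml
<!-- First convention: the words yes and no -->
<xsd:simpleType name="yesNoToken">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="yes"/>
    <xsd:enumeration value="no"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- Second convention: the digits 1 and 0 -->
<xsd:simpleType name="bitToken">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="1"/>
    <xsd:enumeration value="0"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- The union accepts a value from either vocabulary -->
<xsd:simpleType name="flag">
  <xsd:union memberTypes="yesNoToken bitToken"/>
</xsd:simpleType>
```

Because the two member types do not overlap lexically, the order of names in memberTypes makes no difference in this particular sketch.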

It's not an error if the lexical spaces of two types used in a union overlap -- the first type specified in the memberTypes attribute wins. This is an exception to the general design rule in W3C XML Schemas that declarations should be position independent. Another useful exception to this is the new rule that, in instances to be validated, an element with a schemaLocation attribute for a namespace should not be preceded by elements or attributes which use names from that namespace. This allows streaming construction of schemas on an as-needed basis rather than requiring that validation be a separate pass (not counting validating the IDREFs and the keyref identity constraints, which can only be known at the end of the document, and thus require a separate pass).
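As an illustration of the schemaLocation rule, the sketch below puts the hint on the first element that uses the namespace, so a streaming processor would meet the hint before any name it needs to validate. The namespace URI, the prefix po, and the file name po.xsd are all hypothetical; the attribute value is assumed to be a namespace name followed by a schema location.

```xml
<!-- The hint for the po namespace appears on the first element using it -->
<po:order xmlns:po="http://example.org/po"
          xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
          xsi:schemaLocation="http://example.org/po po.xsd">
  <po:item>widget</po:item>
</po:order>
```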

Comment

Union types go a good way toward overcoming the inflexibility in handling user-defined notations in previous drafts. W3C XML Schemas still lack any general mechanism for handling complex data in arbitrary notations: regular expressions can define some complex patterns, but there is no way to treat individual parts as lexical values and check them against any value space, for example.

Whitespace

To make union types possible, the issue of handling whitespace is important: when is the whitespace merely a token separator, and when is it part of the data? There are three kinds of whitespace normalization:

preserve, keeps things as they are (but note that normal XML processing may have altered some whitespace from the original markup);
replace, ASCII whitespace characters are each replaced by a single space: this is used in types derived from simple type string;
collapse, the same as replace but leading and trailing whitespace is stripped and sequences of whitespace are replaced by single spaces. This is what usually happens.

Whitespace normalization applies to data of any simple type, whether it is the value of an attribute or the contents of a simple-typed element. The new text in the draft for this is unclear, but it seems that the rule is that in all simple types except strings whitespace serves only to separate tokens, and in string types newlines and tabs are replaced by spaces but not collapsed.
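A schema author might select among the three behaviours roughly as follows. This sketch assumes the facet is an element named whiteSpace with a value attribute, as in later published versions of the Datatypes specification; the draft's exact spelling may differ, and the three type names are hypothetical.

```xml
<!-- preserve: the value is used exactly as the XML parser reported it -->
<xsd:simpleType name="rawText">
  <xsd:restriction base="xsd:string">
    <xsd:whiteSpace value="preserve"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- replace: each tab or newline becomes one space; runs are kept -->
<xsd:simpleType name="lineText">
  <xsd:restriction base="xsd:string">
    <xsd:whiteSpace value="replace"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- collapse: as replace, then leading and trailing whitespace is
     stripped and each run of spaces becomes a single space -->
<xsd:simpleType name="label">
  <xsd:restriction base="xsd:string">
    <xsd:whiteSpace value="collapse"/>
  </xsd:restriction>
</xsd:simpleType>
```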

Redefine

This is syntactic sugar which allows variant schemas to be created without having to cut and paste the original. Redefine is like a special form of include that allows extensions or refinements to be made to elements and attributes of a schema while maintaining the original namespaces. The extensions and refinements can only be those allowed in the schema being replaced.

It's a novel feature based on the requirements of evolving, or variant, schemas applicable to the same target namespace: you could use this to make your own version of XHTML enforcing in-house rules, in the same way as ISO HTML is a restricted version of HTML 4.

Redefine can be thought of as the successor to parameter entity declarations in the internal subset of an XML document: in XML, subsequent parameter entity re-declarations (such as re-declarations in the external subset) have no effect.
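A variant schema built with redefine might look something like the sketch below. It assumes that doc-base.xsd is a hypothetical schema for the same target namespace which already defines a simple type called statusType, and that the element syntax matches the later published form of the feature; the draft's details may differ.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
            xmlns="http://example.org/doc"
            targetNamespace="http://example.org/doc">

  <!-- Pull in the original schema, replacing one of its definitions -->
  <xsd:redefine schemaLocation="doc-base.xsd">

    <!-- In-house rule: narrow statusType to two permitted values.
         The base is the original definition of the same name. -->
    <xsd:simpleType name="statusType">
      <xsd:restriction base="statusType">
        <xsd:enumeration value="draft"/>
        <xsd:enumeration value="final"/>
      </xsd:restriction>
    </xsd:simpleType>

  </xsd:redefine>
</xsd:schema>
```

Everything else in doc-base.xsd would be included unchanged, so instances keep using the original namespace.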

Error Codes and Post-Schema-Validation Infoset Contributions

XML has thrived because of the availability of the industry-standard APIs, DOM and SAX. These allow a variety of implementations with different characteristics and, very importantly, promote synergy: you can beg, borrow, or steal (or do I mean import, redefine and include?) the documentation and applications from third parties.

In the future we can expect schema-aware extensions or modules for DOM, XPath, and XSLT. XML Query in particular will rely on the results of processing an information set according to XML Schemas.

There are two results of schema-processing an XML document's information set:

First, in the new Appendix D of the Structures document, there is a list of all the possible validity errors. With these codes, and some future standard error API, error-reporting and repair tools can be developed independent of the schema processor implementation.
Second, the new drafts flesh out the additions to the information set that a schema-aware processor can make. In the previous drafts, knowing the type of an element or attribute was the main contribution. The new draft corrals other contributions from processing into the infoset, such as default attribute values.

Small Changes

The old term element equivalence class has been replaced by the clearer substitution groups.
The defaults for the attributes minOccurs and maxOccurs are now both "1" in all cases; previously they varied depending on the element on which they appeared.
The values allowed for simple type boolean are true or false. The values 0 and 1 are no longer allowed.
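The occurrence defaults and the boolean change can be read off a small content model. The type and element names below are hypothetical, and the wrappers follow the long form used elsewhere in this article.

```xml
<xsd:complexType name="addressType">
  <xsd:complexContent>
    <xsd:restriction base="xsd:anyType">
      <xsd:sequence>
        <!-- Neither attribute given: both default to 1, so exactly once -->
        <xsd:element name="street" type="xsd:string"/>
        <!-- minOccurs lowered, maxOccurs still defaults to 1: optional -->
        <xsd:element name="unit" type="xsd:string" minOccurs="0"/>
        <!-- Both given explicitly: between zero and three times -->
        <xsd:element name="phone" type="xsd:string" minOccurs="0" maxOccurs="3"/>
        <!-- Under this draft the content must be true or false -->
        <xsd:element name="verified" type="xsd:boolean"/>
      </xsd:sequence>
    </xsd:restriction>
  </xsd:complexContent>
</xsd:complexType>
```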

Priority Feedback Items

Scattered throughout the drafts are notes called "Priority Feedback Items." These ask implementers to report whether a feature has proved difficult to implement or has some flaw. Some of these will probably change after the review of feedback from the Candidate Recommendation. Authors and trainers may want to treat these sections with a little more caution than other parts.

Here are some of the more important features labelled as PFIs:

xsi:null is an attribute available for use in document instances to specify that an element (or, rather, that the content of an element) is null, which is a property rather than a value. There is no way to specify in an instance that an attribute is null: perhaps null can be inferred from the absence of the attribute. The issue here is that null is a database concept: what relevance does it have for data interchange? If one view is that null has no place, another is that nulls represent a particular case of the more general phenomenon of exceptions to data, which should be supported.
Local names are bindings between an element name and a type scoped to a parent type. For example a mouse subelement of a computer element could have a different type than a mouse subelement of a catfood element. The issue is whether it's good for one name to have different meanings in a document; some say this forces XML processing systems to be more complex, since they require knowledge of the parent or type. Others say it reduces complexity because it removes a distinction between attributes and elements. The feature has been requested to aid mappings to database schemas and programming languages.
The top-most element of a schema, xsd:schema, has attributes available for setting overrideable defaults that apply across the rest of the schema:
- elementFormDefault and attributeFormDefault determine whether a local element or attribute name is a qualified name (qname) or not (a qualified name is one associated with a namespace: it's not the same as a prefixed name because a name with no prefix could still be qualified with a default namespace declaration, i.e., the vanilla xmlns attribute),
- blockDefault determines whether derived types can be used in place of the current type in instances,
- finalDefault determines the default for controlling whether subsequent schemas can derive types based on the ones defined in the current schema, and
- targetNamespace sets the URI reference used in universal names.
The main issue is what the best default values should be. The issue of defaulting also requires general design decisions concerning the scope of the default values: should a top-level default apply to imported declarations or included schemas? Do specifiable top-level defaults actually simplify schemas, or do they make them more complicated to understand?
Indeed, the whole area of how to construct schemas from existing schemas is one where the Working Group requires the results of field trials before ultimately recommending the current methods. Note that the type derivation mechanism is not a PFI: instead the elements involved are import, include, and the new redefine.
Turning to datatypes, the current design uses arbitrary-precision decimal numbers (including integers), rather than setting a fixed precision or providing minimum-required precision. The availability of libraries for arbitrary-precision decimal operations (bignums) was important here. Furthermore, there are many advantages in requiring that all conforming implementations provide exactly the same validation result on the same documents: so implementation-specific precision, while convenient for implementers, may be counter-productive.
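Several of the items above are easier to picture side by side. The following sketch combines schema-level defaults, two local declarations of mouse with different types, and a null element in an instance. The namespace, element names, and values are hypothetical, and the declaration-side attribute nullable is an assumption about how the draft lets a declaration permit xsi:null; it is not taken from the text above.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
            targetNamespace="http://example.org/inventory"
            elementFormDefault="unqualified"
            attributeFormDefault="unqualified">

  <!-- Here the local name mouse is bound to a string type -->
  <xsd:element name="computer">
    <xsd:complexType>
      <xsd:complexContent>
        <xsd:restriction base="xsd:anyType">
          <xsd:sequence>
            <xsd:element name="mouse" type="xsd:string"/>
            <!-- Assumed attribute: allows xsi:null in instances -->
            <xsd:element name="price" type="xsd:decimal" nullable="true"/>
          </xsd:sequence>
        </xsd:restriction>
      </xsd:complexContent>
    </xsd:complexType>
  </xsd:element>

  <!-- Here the same local name is bound to an integer type -->
  <xsd:element name="catfood">
    <xsd:complexType>
      <xsd:complexContent>
        <xsd:restriction base="xsd:anyType">
          <xsd:sequence>
            <xsd:element name="mouse" type="xsd:integer"/>
          </xsd:sequence>
        </xsd:restriction>
      </xsd:complexContent>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

<!-- Instance: local elements are unqualified; price is marked null -->
<inv:computer xmlns:inv="http://example.org/inventory"
              xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance">
  <mouse>optical</mouse>
  <price xsi:null="true"/>
</inv:computer>
```

Only the globally declared computer element carries the namespace prefix in the instance, which is the effect of choosing unqualified for elementFormDefault.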

Disclaimer: The current drafts still warn that the Working Group may revise any aspect of the language before CR or after. The most obvious candidates are those marked Priority Feedback Items; however, syntax or features that have changed recently are obviously less stable than those that have not changed for several drafts.