XML.com 
 Published on XML.com http://www.xml.com/pub/a/2000/09/27/schemas1.html

 

The Beginning of the Endgame
By Rick Jelliffe
September 27, 2000

Table of Contents
A Look at the Changes in the Pre-CR W3C XML Schemas Draft

Now anyType Is Top

New Namespace Identifier

New Syntax for Type Declarations

Union Datatypes

Redefine

Error Codes and Post-Schema-Validation Infoset Contributions

Small Changes

Priority Feedback Items

A Look at the Changes in the Pre-CR W3C XML Schemas Draft

This article looks at the changes in the recent Pre-CR draft of W3C XML Schemas that will most affect developers and users. Requirements for data interchange with database systems have been important during the development of W3C XML Schemas. The recent changes also improve support for markup languages and schema construction.

The Candidate Recommendation (CR) drafts are slated to appear hot on the heels of the current drafts. The XML Schema Working Group was aware that authors, implementers, schema writers, and technical evaluators needed to know the most recent changes, especially since they include some syntax changes that will affect schemas using type derivation.

Now anyType Is Top

The old ur-type (i.e., the supertype) can now be used as a type in declarations. It has been given the friendlier and less literary name anyType to reflect that in XML the top-level type can be thought of as the union of every possible subtype. In the unified model underneath W3C XML Schemas, even brand new complex types are considered restrictions of the ur-type.
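As a minimal sketch of what this enables, an element declaration can now name the top type directly. The element name note is hypothetical, and the xsd prefix is assumed to be bound to the schema namespace.

```xml
<!-- Hypothetical declaration: the element accepts content of any type -->
<xsd:element name="note" type="xsd:anyType"/>
```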

In the Datatypes draft, the schema declarations use anySimpleType, the ur-type of all simple types. This ur-type has been provided to allow bootstrapping declarations of the primitive built-in types in the schema for schemas, and also to help reasoning about the simple type system. It is not available for use in schemas.

New Namespace Identifier

Type names belong to the built-in datatypes namespace; the drafts use the typical prefix xsd:. Schema declarations use the same namespace. The namespace URI for the current draft has changed from that used by previous drafts; this indicates that the usage of various elements has changed substantially. The schema namespace URI reference is now http://www.w3.org/2000/10/XMLSchema.

The attributes that W3C XML Schemas defines for use in document instances use the namespace URI reference http://www.w3.org/2000/10/XMLSchema-instance, and the drafts use the typical prefix xsi:.
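To show where the two namespace URI references appear, here is a small sketch of a schema document. The target namespace http://www.example.com/PO and the element name purchaseOrder are assumptions made for illustration only.

```xml
<!-- Schema document: xsd: is bound to the new schema namespace -->
<xsd:schema xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
            targetNamespace="http://www.example.com/PO">
 <xsd:element name="purchaseOrder" type="xsd:string"/>
</xsd:schema>
```

A matching instance would bind xsi: to the instance namespace. The file name po.xsd is likewise hypothetical.

```xml
<!-- Instance document: xsi: attributes come from the instance namespace -->
<po:purchaseOrder
    xmlns:po="http://www.example.com/PO"
    xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
    xsi:schemaLocation="http://www.example.com/PO po.xsd">Twelve widgets</po:purchaseOrder>
```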

New Syntax for Type Declarations

The most visible change in the new schema draft is a set of syntactic changes to the elements complexType and simpleType. These do not alter functionality.

There are four changes to complexType:

Here's an example declaration in the old syntax, from the old Primer.

<xsd:element name='internationalPrice'>
 <xsd:complexType base='xsd:decimal' derivedBy='extension'>
  <xsd:attribute name='currency' type='xsd:string' />
 </xsd:complexType>
</xsd:element>

And the equivalent declaration in the new syntax, taken from the Primer.

 <xsd:element name="internationalPrice"> 
  <xsd:complexType> 
   <xsd:simpleContent> 
    <xsd:extension base="xsd:decimal"> 
     <xsd:attribute name="currency" type="xsd:string" /> 
    </xsd:extension> 
   </xsd:simpleContent> 
  </xsd:complexType> 
 </xsd:element>

An example of a declaration for an empty element:

<xsd:element name="a"> 
 <xsd:complexType> 
  <xsd:complexContent> 
    <xsd:restriction base="xsd:anyType" /> 
  </xsd:complexContent> 
 </xsd:complexType> 
</xsd:element>

Alarming as this may be compared to the equivalent declaration using XML DTDs,

<!ELEMENT a EMPTY >

one can expect that empty elements will usually belong to types which also have attributes. XML Schemas allow many permutations of empty elements not available in DTDs.
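The more typical case, an empty element that carries an attribute, might look like the following sketch. The element name marker and the attribute name ref are hypothetical; the structure simply adds an attribute declaration to the empty-element pattern shown above.

```xml
<!-- Hypothetical empty element with one attribute -->
<xsd:element name="marker">
 <xsd:complexType>
  <xsd:complexContent>
   <xsd:restriction base="xsd:anyType">
    <xsd:attribute name="ref" type="xsd:string"/>
   </xsd:restriction>
  </xsd:complexContent>
 </xsd:complexType>
</xsd:element>
```

An instance such as <marker ref="fig1"/> would then be valid, while <marker>text</marker> would not.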

Comment

This change is needed because W3C XML Schemas 1.0 doesn't allow the use of attributes to select the type information of elements. In order for the schema for schemas to represent the W3C XML Schema language well, the common XML idiom of using attributes to subtype the type identified by the element name cannot be supported. This involves more than idiom: by requiring the use of subelements rather than attributes, either the subelements must wrap the contents, or the subelements appear as the first siblings to select particular content sequences. Both of these solutions have problems of scale (combinatorial explosions when there are several attributes with different values) and effect (nested elements do not indicate which of their ancestors they relate to).

In my view, this makes W3C XML Schemas 1.0 not necessarily suitable for defining idiomatic, user-oriented markup languages. Start tags sometimes need to act as simple field names, as with data from a database, but at other times more like a parameterized function, a templatized class definition, or a shell command with named arguments. W3C XML Schemas 1.0 undoubtedly fits database-style uses better than markup uses; or data where there needs to be ad hoc overriding of type facets.

However, it is important not to make too much of this point, as it just means that XML schemas are no more powerful than DTDs in this regard. Because attributes cannot be used to select a subtype, we can still carry on using those attributes, but in our schema we must actually use the looser type made from the union of the different subtypes. On the whole, W3C XML Schemas have a much wider range of options than DTDs, which may rapidly degenerate into use of an ANY declared content type in the same situation.

This is one area that I hope a W3C XML Schemas 1.1 or 2.0 may fix as soon as possible.

Union Datatypes

The big news on the datatypes front is the introduction of union datatypes, which join atomic types and list datatypes. This allows definition of types such as

<xsd:simpleType name="infinityToken"> 
 <xsd:restriction base="xsd:string"> 
   <xsd:enumeration value="infinity"/> 
 </xsd:restriction> 
</xsd:simpleType> 

<xsd:simpleType name="infNum"> 
  <xsd:union memberTypes="xsd:integer infinityToken"/> 
</xsd:simpleType>

The example declares a simple type called infNum that may have any integer value or the token infinity.
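As a brief sketch of how such a type might be used, an element can be declared with the union type and then hold either kind of value. The element name maxCopies is hypothetical.

```xml
<!-- Hypothetical element using the infNum union type -->
<xsd:element name="maxCopies" type="infNum"/>

<!-- Valid instances:   <maxCopies>42</maxCopies>
                        <maxCopies>infinity</maxCopies>
     Invalid instance:  <maxCopies>many</maxCopies>  -->
```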

Union datatypes can be used for

It's not an error if the lexical spaces of two types used in a union overlap -- the first type specified in the memberTypes attribute wins. This is an exception to the general design rule in W3C XML Schemas that declarations should be position independent. Another useful exception to this is the new rule that, in instances to be validated, an element with a schemaLocation attribute for a namespace should not be preceded by elements or attributes which use names from that namespace. This allows streaming construction of schemas on an as-needed basis rather than requiring that validation be a separate pass (not counting validating the IDREFs and the keyref identity constraints, which can only be known at the end of the document, and thus require a separate pass).
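The ordering rule for overlapping lexical spaces can be sketched with a union whose member types overlap. The type name intOrString is an assumption for illustration.

```xml
<!-- Hypothetical union whose member types overlap:
     every integer literal is also a legal string -->
<xsd:simpleType name="intOrString">
 <xsd:union memberTypes="xsd:integer xsd:string"/>
</xsd:simpleType>

<!-- "42"    matches xsd:integer first, so it is treated as an integer.
     "forty" does not match xsd:integer, so it is treated as a string. -->
```

If the member types were listed in the opposite order, xsd:string would win for every value, since any value that is a legal integer is also a legal string.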

Comment

Union types go a good way toward overcoming the inflexibility in handling user-defined notations in previous drafts. W3C XML Schemas still lack any general mechanism for handling complex data in arbitrary notations: regular expressions can define some complex patterns, but there is no way to treat individual parts as lexical values and check them against any value space, for example.

Whitespace

To make union types possible, the issue of handling whitespace is important: when is the whitespace merely a token separator, and when is it part of the data? There are three kinds of whitespace normalization:

Whitespace normalization applies to data of any simple type, whether it is the value of an attribute or the contents of a simple-typed element. The new text in the draft for this is unclear, but it seems that the rule is that in all simple types except strings, whitespace serves only to separate tokens, and in string types newlines and tabs are replaced by spaces but not collapsed.
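A schema author can also request a particular normalization explicitly. The following is a sketch only: it assumes the facet is spelled whiteSpace and takes the values preserve, replace, and collapse, as in the later published Recommendation, and the type name collapsedString is hypothetical.

```xml
<!-- Hypothetical string type whose whitespace is fully collapsed:
     runs of whitespace become one space, and leading and trailing
     whitespace is removed -->
<xsd:simpleType name="collapsedString">
 <xsd:restriction base="xsd:string">
  <xsd:whiteSpace value="collapse"/>
 </xsd:restriction>
</xsd:simpleType>
```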

Redefine

This is syntactic sugar which allows variant schemas to be created without having to cut and paste the original. Redefine is like a special form of include that allows extensions or refinements to be made to elements and attributes of a schema while maintaining the original namespaces. The extensions and refinements can only be those allowed in the schema being replaced.

It's a novel feature based on the requirements of evolving, or variant, schemas applicable to the same target namespace: you could use this to make your own version of XHTML enforcing in-house rules, in the same way as ISO HTML is a restricted version of HTML 4.

Redefine can be thought of as the successor to parameter entity declarations in the internal subset of an XML document: in XML, subsequent parameter entity re-declarations (such as re-declarations in the external subset) have no effect.
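A redefinition might look like the following sketch, which extends a complex type from another schema document while keeping its name. The file name address.xsd, the type name Address, and the added country element are all hypothetical, and the syntax shown follows the form used in the later published Recommendation, which may differ in detail from this draft.

```xml
<!-- Hypothetical variant schema: pulls in address.xsd and
     extends its Address type in place -->
<xsd:redefine schemaLocation="address.xsd">
 <xsd:complexType name="Address">
  <xsd:complexContent>
   <!-- The base refers to the original definition of Address -->
   <xsd:extension base="Address">
    <xsd:sequence>
     <xsd:element name="country" type="xsd:string"/>
    </xsd:sequence>
   </xsd:extension>
  </xsd:complexContent>
 </xsd:complexType>
</xsd:redefine>
```

Every other declaration in address.xsd is included unchanged, so nothing has to be copied and pasted from the original.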

Error Codes and Post-Schema-Validation Infoset Contributions

XML has thrived because of the availability of the industry-standard APIs, DOM and SAX. These allow a variety of implementations with different characteristics and, very importantly, promote synergy: you can beg, borrow, or steal (or do I mean import, redefine and include?) the documentation and applications from third parties.

In the future we can expect schema-aware extensions or modules for DOM, XPath, and XSLT. XML Query in particular will rely on the results of processing an information set according to XML Schemas.

There are two results of schema-processing an XML document's information set:

Small Changes

Priority Feedback Items

Scattered throughout the drafts are notes called "Priority Feedback Items." These ask implementers to report whether a feature has proved difficult to implement or has some flaw. Some of these will probably change after the review of feedback from the Candidate Recommendation. Authors and trainers may want to treat these sections with a little more caution than other parts.

Here are some of the more important features labelled as PFIs:

Disclaimer: The current drafts still warn that the Working Group may revise any aspect of the language before or after CR. The most obvious candidates are those marked as Priority Feedback Items; however, syntax or features that have changed recently are obviously less stable than those that have not changed for several drafts.

XML.com Copyright © 1998-2006 O'Reilly Media, Inc.