The Beginning of the Endgame
by Rick Jelliffe

Union Datatypes

The big news on the datatypes front is the introduction of union datatypes, which join atomic types and list datatypes. This allows the definition of types such as the following:

<xsd:simpleType name="infinityToken"> 
 <xsd:restriction base="xsd:string"> 
   <xsd:enumeration value="infinity"/> 
 </xsd:restriction> 
</xsd:simpleType> 

<xsd:simpleType name="infNum"> 
  <xsd:union memberTypes="xsd:integer infinityToken"/> 
</xsd:simpleType>

The example declares a simple type called infNum that may have any integer value or the token infinity.

Union datatypes can be used for:

  • extensible lists of values (open, controlled vocabularies);
  • modular construction of lists of values;
  • expressing exceptional values using tokens of a different lexical pattern from the normal data, as in the example above;
  • richer controlled vocabularies where it is desirable to cater to the different conventions people may use for the same thing: for example, to allow yes or no, 1 or 0.
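
The last case might look like the following minimal sketch (the type names boolWord, boolDigit, and yesNoFlag are invented for illustration):

<xsd:simpleType name="boolWord">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="yes"/>
    <xsd:enumeration value="no"/>
  </xsd:restriction>
</xsd:simpleType>

<xsd:simpleType name="boolDigit">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="1"/>
    <xsd:enumeration value="0"/>
  </xsd:restriction>
</xsd:simpleType>

<xsd:simpleType name="yesNoFlag">
  <xsd:union memberTypes="boolWord boolDigit"/>
</xsd:simpleType>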

It's not an error if the lexical spaces of two types used in a union overlap -- the first type specified in the memberTypes attribute wins. This is an exception to the general design rule in W3C XML Schemas that declarations should be position-independent. Another useful exception is the new rule that, in instances to be validated, an element with a schemaLocation attribute for a namespace should not be preceded by elements or attributes that use names from that namespace. This allows streaming construction of schemas on an as-needed basis, rather than requiring that validation be a separate pass (not counting validation of IDREFs and the keyref identity constraints, which can only be checked at the end of the document and thus require a separate pass).
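
For instance, placing the hint on the document's root element satisfies this rule (the namespace URI, file name, and element names below are invented; the exact schema-instance namespace URI is tied to the draft in use):

<inv:invoice xmlns:inv="http://example.org/invoice"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://example.org/invoice invoice.xsd">
  <!-- nothing from the invoice namespace precedes this element, so a
       streaming processor can fetch invoice.xsd before it is needed -->
  <inv:total>42.00</inv:total>
</inv:invoice>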

Comment

Union types go a good way to overcoming the inflexibility in handling user-defined notations in previous drafts. W3C XML Schemas still lack any general mechanism for handling complex data in arbitrary notations: regular expressions can define some complex patterns, but there is no way to treat individual parts as lexical values and check them against any value space, for example.
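
For example, a pattern facet can pin down the shape of a value, but its parts remain opaque (the type name partCode and the pattern are invented for illustration; the four digits cannot be extracted and checked against, say, an integer range):

<xsd:simpleType name="partCode">
  <xsd:restriction base="xsd:string">
    <!-- two capital letters, a hyphen, then four digits -->
    <xsd:pattern value="[A-Z]{2}-[0-9]{4}"/>
  </xsd:restriction>
</xsd:simpleType>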

Whitespace

To make union types work, the handling of whitespace must be pinned down: when is whitespace merely a token separator, and when is it part of the data? There are three kinds of whitespace normalization:

  • preserve keeps things as they are (but note that normal XML processing may already have altered some whitespace from the original markup);
  • replace replaces each tab, newline, and carriage return with a single space: this is used in types derived from the simple type string;
  • collapse is the same as replace, except that leading and trailing whitespace is stripped and sequences of whitespace are replaced by single spaces. This is what usually happens.

Whitespace normalization applies to data of any simple type, whether it is the value of an attribute or the content of a simple-typed element. The new text in the draft on this is unclear, but the rule seems to be that in all simple types except strings whitespace serves only to separate tokens, while in string types newlines and tabs are replaced by spaces but not collapsed.
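
As a sketch of how a schema author chooses between these behaviors, assuming the whiteSpace facet of the current drafts (the type name oneLineString is invented):

<xsd:simpleType name="oneLineString">
  <xsd:restriction base="xsd:string">
    <!-- strip leading and trailing whitespace and squeeze internal runs -->
    <xsd:whiteSpace value="collapse"/>
  </xsd:restriction>
</xsd:simpleType>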

Redefine

This is syntactic sugar that allows variant schemas to be created without having to cut and paste the original. Redefine is like a special form of include that allows extensions or refinements to be made to the components of a schema while maintaining the original namespaces. The extensions and refinements can only be those allowed by the schema being redefined.
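
A minimal sketch of the syntax (the file name base.xsd and the type name addressType are invented; assume base.xsd declares addressType as a string-derived type in the same target namespace):

<xsd:redefine schemaLocation="base.xsd">
  <xsd:simpleType name="addressType">
    <!-- the redefinition restricts the original addressType in place -->
    <xsd:restriction base="addressType">
      <xsd:maxLength value="80"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:redefine>

Every other component of base.xsd is included unchanged, and references to addressType elsewhere now pick up the tightened definition.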

It's a novel feature based on the requirements of evolving, or variant, schemas applicable to the same target namespace: you could use this to make your own version of XHTML enforcing in-house rules, in the same way as ISO HTML is a restricted version of HTML 4.

Redefine can be thought of as the successor to parameter entity declarations in the internal subset of an XML document: in XML, subsequent parameter entity re-declarations (such as re-declarations in the external subset) have no effect.

Error Codes and Post-Schema-Validation Infoset Contributions

XML has thrived because of the availability of the industry-standard APIs, DOM and SAX. These allow a variety of implementations with different characteristics and, very importantly, promote synergy: you can beg, borrow, or steal (or do I mean import, redefine, and include?) documentation and applications from third parties.

In the future we can expect schema-aware extensions or modules for DOM, XPath, and XSLT. XML Query in particular will rely on the results of processing an information set according to XML Schemas.

There are two results of schema-processing an XML document's information set:

  • First, in the new Appendix D of the Structures document, there is a list of all the possible validity errors. With these codes, and some future standard error API, error-reporting and repair tools can be developed independent of the schema processor implementation.

  • Second, the new drafts flesh out the additions to the information set that a schema-aware processor can make. In the previous drafts, knowing the type of an element or attribute was the main contribution. The new draft corrals other contributions from processing into the infoset, such as default attribute values.

Small Changes

  • The old term element equivalence class has been replaced by the clearer substitution groups.
  • The defaults for the attributes minOccurs and maxOccurs are now both "1" in all cases; previously the default varied depending on the containing element.
  • The values allowed for the simple type boolean are true and false; the values 0 and 1 are no longer allowed. Both changes are sketched below.
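
A hedged sketch of the new defaults (the element names are invented for illustration):

<xsd:element name="order">
  <xsd:complexType>
    <xsd:sequence>
      <!-- no minOccurs/maxOccurs: exactly one occurrence under the new defaults -->
      <xsd:element name="shipTo" type="xsd:string"/>
      <!-- anything other than exactly-once must now be spelled out -->
      <xsd:element name="item" type="xsd:string" minOccurs="1" maxOccurs="unbounded"/>
      <!-- content must be the literal true or false -->
      <xsd:element name="giftWrap" type="xsd:boolean" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>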

Priority Feedback Items

Scattered throughout the drafts are notes called "Priority Feedback Items." These ask implementers to report whether a feature has proved difficult to implement or has some flaw. Some of these features will probably change after the Working Group reviews feedback on the Candidate Recommendation. Authors and trainers may want to treat these sections with a little more caution than other parts.

Here are some of the more important features labelled as PFIs:

  • xsi:null is an attribute available for use in document instances to specify that an element (or, rather, that the content of an element) is null, which is a property rather than a value. There is no way to specify in an instance that an attribute is null: perhaps null can be inferred from the absence of the attribute. The issue here is that null is a database concept: what relevance does it have for data interchange? If one view is that null has no place, another is that nulls represent a particular case of the more general phenomenon of exceptions to data, which should be supported.
  • Local names are bindings between an element name and a type, scoped to a parent type. For example, a mouse subelement of a computer element could have a different type than a mouse subelement of a catfood element (see the sketch after this list). The issue is whether it's good for one name to have different meanings in a document; some say this forces XML processing systems to be more complex, since they require knowledge of the parent or type. Others say it reduces complexity because it removes a distinction between attributes and elements. The feature has been requested to aid mappings to database schemas and programming languages.
  • The top-most element of a schema, xsd:schema, has attributes available for setting overrideable defaults that apply across the rest of the schema:
    • elementFormDefault and attributeFormDefault determine whether a local element or attribute name is a qualified name (qname) or not (a qualified name is one associated with a namespace: it's not the same as a prefixed name, because a name with no prefix could still be qualified by a default namespace declaration, i.e., the vanilla xmlns attribute),
    • blockDefault determines whether derived types can be used in place of the current type in instances,
    • finalDefault determines the default for controlling whether subsequent schemas can derive types based on the ones defined in the current schema, and
    • targetNamespace sets the URI reference used in universal names.

    The main issue is what the best default values should be. The issue of defaulting also requires general design decisions concerning the scope of the default values: should a top-level default apply to imported declarations or included schemas? Do specifiable top-level defaults actually simplify schemas, or do they make them more complicated to understand?

  • Indeed, the whole area of how to construct schemas from existing schemas is one where the Working Group requires the results of field trials before ultimately recommending the current methods. Note that the type derivation mechanism is not a PFI: instead the elements involved are import, include, and the new redefine.
  • Turning to datatypes, the current design uses arbitrary-precision decimal numbers (including integers), rather than setting a fixed precision or providing minimum-required precision. The availability of libraries for arbitrary-precision decimal operations (bignums) was important here. Furthermore, there are many advantages in requiring that all conforming implementations provide exactly the same validation result on the same documents: so implementation-specific precision, while convenient for implementers, may be counter-productive.
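
A sketch of the local-names case described above (the type names pointingDeviceType and ingredientType are invented; the mouse example comes from the text): each parent type scopes its own declaration of a mouse child, so the same element name carries a different type in each context.

<xsd:complexType name="computerType">
  <xsd:sequence>
    <xsd:element name="mouse" type="pointingDeviceType"/>
  </xsd:sequence>
</xsd:complexType>

<xsd:complexType name="catfoodType">
  <xsd:sequence>
    <xsd:element name="mouse" type="ingredientType"/>
  </xsd:sequence>
</xsd:complexType>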

Disclaimer

The current drafts still warn that the Working Group may revise any aspect of the language before or after CR. The most obvious candidates are those marked as Priority Feedback Items; however, syntax or features that have changed recently are obviously less stable than those that have not changed for several drafts.