Not My Type: Sizing Up W3C XML Schema Primitives
Continuing our occasional series of opinion pieces from members of the XML community, Amy Lewis takes a hard look at W3C XML Schema datatypes.
Since the application of XML to data representation first gained public visibility, there has been a movement to enhance its type system beyond that originally provided by DTDs. Several attempts were made (SOX, XML Data and XML Data Reduced, Datatypes for DTDs, and others) before the W3C handed the problem to the XML Schema Working Group.
What is the goal of data type definitions for XML? For one thing, they establish "strong typing" in XML in a fashion that corresponds with strong typing in programming languages. Various commercial interests have been vocal supporters of strong typing in XML because they see typed generic data representation as their best hope for interoperability and increased automation. With typing in schemas extended into the textual content of simple types, and not just the structural content of complex types, businesses can enforce contracts for data exchange. In other words, strong typing enables electronic commerce.
To phrase it a little differently, the data types defined in DTDs were considered inadequate to support the requirements of electronic commerce or, more generally, of commercially reliable electronic information exchange.
The publication of W3C XML Schema (or WXS), in which one half of the specification was devoted to the definition of a type library (part two), seemed to resolve the problem. Certainly, with forty-four built-in data types, nineteen of them primitive, it seemed at first glance to cover the field. The increasing visibility of WXS and the efforts to layer additional specifications on top of it -- XML Query, the PSVI, data types in XPath 2.0, typing in web services -- have begun to raise serious questions about WXS part two, even among proponents of strong types, including the author of this article.
There are two fundamental problems with WXS datatyping. The first is its design: it's not a type system -- there is no system -- and not even a type collection. Rather, it's a collection of collections of types with no coherent or consistent set of interrelations. The second problem is a single sentence in the specification: "Primitive datatypes can only be added by revisions to this specification". This sentence exists because of the design problem; lacking a concept for what a primitive data type is, the only way to define new types is by appeal to authority. The data type library is wholly inextensible, internally inconsistent, and at once bloated and incomplete for most application domains.
Not a type system
The data type library defined in WXS part two is not a type system. It's not possible to examine the built-in types and determine the guiding principles which dictated which types were to be defined and which were to be defined as primitives.
Consider a contrasting example. The type system used by C and related languages is clearly based on bit patterns and register sizes. The bit pattern 10011001 fits into registers of a certain size, but has a different meaning depending on its type: character, unsigned byte, or signed byte. The type assigned to a bit pattern determines certain behaviors. If the above pattern is X, and Y is the bit pattern 00010001, then X > Y if both are unsigned bytes, and X < Y if both are signed bytes. The same bit patterns may represent characters (or strings of characters), integers of various sizes, and floating point numbers (again with various constraints), but the fundamental limitation is the number of bits that can be stuffed into a register. By interpreting the identical bits in different fashions, the languages achieve different effects.
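The comparison is easy to check. The following is a minimal sketch, written in Python with the standard struct module rather than in C purely for convenience; the variable names are illustrative only.

```python
import struct

# The two bit patterns from the text, each stored in a single byte.
X = bytes([0b10011001])
Y = bytes([0b00010001])

# Format "B" reads the byte as an unsigned char, "b" as a signed char
# (two's complement), mirroring C's unsigned char and signed char.
x_unsigned, = struct.unpack("B", X)
y_unsigned, = struct.unpack("B", Y)
x_signed, = struct.unpack("b", X)
y_signed, = struct.unpack("b", Y)

print(x_unsigned, y_unsigned, x_unsigned > y_unsigned)  # 153 17 True
print(x_signed, y_signed, x_signed < y_signed)          # -103 17 True
```

The bits never change; only the type applied to them does, and the ordering flips with it.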
One mandate for WXS was that it should reproduce the limited type system of the DTD plus the namespace extensions. It stands to reason that, given the definition of QName and NCName in the namespaces specification and Name in the original XML 1.0 specification, these types would be found in some rational relationship to one another. In the WXS definition, NCName is a subtype of Name, which is a subtype of token, which is a subtype of normalizedString, which is a subtype of string, which is a primitive type. However, QName is also a primitive type, implying that it is not a string, not a normalizedString, not a token, and not a Name, even though it is composed lexically of an NCName, a colon, and another NCName.
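To make the gap concrete, here is a small sketch that records the base type of each built-in named above and walks the derivation chain. The table covers only the types discussed in this paragraph, omits the common root anySimpleType, and the names BASE and ancestors are invented for illustration.

```python
# Base type of each built-in named above, as described in the text.
# None marks a primitive type. Only the types discussed here are listed.
BASE = {
    "string": None,
    "normalizedString": "string",
    "token": "normalizedString",
    "Name": "token",
    "NCName": "Name",
    "QName": None,
}

def ancestors(type_name):
    """Return the chain of base types, nearest first."""
    chain = []
    base = BASE[type_name]
    while base is not None:
        chain.append(base)
        base = BASE[base]
    return chain

print(ancestors("NCName"))  # ['Name', 'token', 'normalizedString', 'string']
print(ancestors("QName"))   # []
```

NCName reaches string in four steps; QName, which is built out of NCNames, reaches nothing.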
WXS also represents numbers of various sorts. Given the requirement to support decimal, integer, float, and double, which should be considered primitive types, and which derived? What criterion should be used for derivation? Your answer should allow for the further derivation of various bounded-range integers, but needn't worry about number systems solely of interest to fusty ivory-tower academics. Data typing isn't particularly useful in science, of course.
Nine times too many
Why is anyURI a primitive type? Why are there nine separate and unrelated primitive types all concerned with measurement of time? Even though early drafts of WXS included three time instant measuring types (dateTime, date, and time, which are not, despite lexical and conceptual overlap, related to one another by derivation in WXS), in the last stages of specification drafting one or more interested constituencies raised such a fuss that five more time instant measuring types were added. Despite lexical and conceptual overlap, all five were made primitive types, unrelated to one another by derivation. Clearly, the committee was too exhausted to fight about it any more, so gHorribleKludge (gYearMonth, gYear, gMonthDay, gMonth, gDay...the "g" stands for "Gregorian," not "good") made it into the specification.
At least three constituencies are easily identifiable with type subcollections in WXS: the original XML/DTD collection (rooted at string, and one of two derivation trees, plus unrelated primitives); the strongly typed programming language collection (rooted at decimal, and the other derivation tree, plus unrelated primitives); and the database collection (mostly available in the strongly typed tree, plus the time instant primitives, and assorted others). Why are the chosen primitives primitive? Why aren't base64Binary and hexBinary related? Why aren't float and double related to each other or to the rest of the numbers? Certainly if derivation in the integer tree can proceed based on register size (which it does), then one ought to be able to derive float from double. Isn't anyURI a token? No? normalizedString? No? Not even a string?
No? Really, all these date and time thingies don't have any relation to one another at all? No. There's no method to this madness. There is no way to guess whether a particular built-in type will be declared primitive or derived from another type. Nor is there any apparent value to derivation of built-in types, since validity according to the least-derived type does not guarantee validity according to the most-derived type.
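The last point can be illustrated with the register-sized integers. The sketch below assumes the usual two's-complement ranges for the signed chain integer, long, int, short, byte, and uses Python's int() parsing as a stand-in for the WXS lexical rules; RANGES and valid_types are hypothetical names.

```python
# Value ranges of the signed integer chain, from least to most derived:
# integer -> long -> int -> short -> byte. None means unbounded.
RANGES = [
    ("integer", None, None),
    ("long", -2**63, 2**63 - 1),
    ("int", -2**31, 2**31 - 1),
    ("short", -2**15, 2**15 - 1),
    ("byte", -2**7, 2**7 - 1),
]

def valid_types(lexical):
    """Return the names of the types in RANGES that accept the literal."""
    # Simplification: Python's int() parsing, not the WXS lexical rules.
    value = int(lexical)
    return [
        name
        for name, low, high in RANGES
        if (low is None or value >= low) and (high is None or value <= high)
    ]

print(valid_types("100"))  # ['integer', 'long', 'int', 'short', 'byte']
print(valid_types("300"))  # ['integer', 'long', 'int', 'short']
```

A value that passes as an integer tells you nothing about whether it passes as a byte; the check has to be repeated at every level of the chain.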