Not My Type: Sizing Up W3C XML Schema Primitives
by Amelia Lewis
Given that there are so many types defined by WXS part two, everyone ought to be happy. Right? Well, everyone except scientists or anyone else who might want things like complex numbers, rational numbers, even imaginary numbers, or particular precision. But we've already agreed that academics don't need data types. Real applications are all handled. I won't have any trouble representing an ISBN or a credit card with its embedded check digits. Lisp programmers can use rational numbers. The boolean type can express true and false in any language. Certainly I can specify that a node has type XPath. Yes?
No.
Well, perhaps this is because there is a strong, conscious attempt to keep the number of primitive types to a minimum. That's why there are only eight time instant datatypes, and... Let's not continue down that path. It leads nowhere; there was no attempt to keep the number of types to a minimum. Therefore, a scientific computing application must handle the possibility of the declaration of a NOTATION type, or NMTOKENS, or language. It does not matter that the problem domain does not need these data types. They are defined, so the application had better be prepared to cope with them.
Semi-structured types
A further problem lies in the inclusion of semi-structured types. Almost all of the time instant types have this portmanteau characteristic; a type with a name of the pattern floorwaxDessertTopping should alert the reader to an imminent experience, live and from New York. Even lists (the simplest of non-simple data types) are potentially problematic. If XML preserved the markup minimization feature of SGML, lists would be utterly superfluous. As it stands, the locus of validation is each component of the list, not the list as a whole. Instead of tags supplying context for simple content validation, position does so. And it does the same for other semi-structured types, most notably dates.
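To make the point about lists concrete, here is a minimal sketch, assuming a whitespace-separated list whose items are all meant to be integers. The function names are invented for illustration and come from no specification. Notice that the list itself is never validated; only its components are.

```python
import re

# Hypothetical item type: an optional sign followed by ASCII digits.
INTEGER = re.compile(r"[+-]?[0-9]+")


def is_integer(item):
    """Validate one list component in isolation."""
    return INTEGER.fullmatch(item) is not None


def validate_list(text, validate_item):
    """Split on whitespace, then validate each component separately.

    Nothing here looks at the list as a whole: position in the string is
    the only context an item receives.
    """
    return all(validate_item(item) for item in text.split())
```

Calling validate_list("1 2 3", is_integer) succeeds, while validate_list("1 two 3", is_integer) fails on the second component alone.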
If a structural schema definition language happened to include support for co-occurrence constraints, it's quite likely that no one would need to demand portmanteau gDayYear-style types. "Thirty days hath September ..." is a children's rhyme, and a mnemonic, but it is also an algorithm. The second month may only have twenty-eight days, unless the year is evenly divisible by four, except when the year is evenly divisible by one hundred and is not evenly divisible by four hundred. A language supporting co-occurrence constraints could say "gBye" to semi-structured types.
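The rhyme really is an algorithm, and a short one. The following is a minimal sketch of the co-occurrence constraint it describes, assuming the Gregorian leap-year rule quoted above and assuming that year, month, and day arrive as three separately typed integers rather than as one portmanteau string. The function names are made up for the example.

```python
# Days per month in a common (non-leap) year.
DAYS_IN_MONTH = {
    1: 31, 2: 28, 3: 31, 4: 30, 5: 31, 6: 30,
    7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31,
}


def is_leap_year(year):
    """Divisible by four, except century years not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def day_is_valid(year, month, day):
    """Co-occurrence check: the legal range of day depends on month and year."""
    if month not in DAYS_IN_MONTH:
        return False
    limit = DAYS_IN_MONTH[month]
    if month == 2 and is_leap_year(year):
        limit = 29
    return 1 <= day <= limit
```

Each of the three values is a plain integer; only the relationship among them needs checking, which is exactly what a co-occurrence constraint would express.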
Nothing can be done to fix WXS, until the single sentence -- "Primitive datatypes can only be added by revisions to this specification" -- is fixed. On the other hand, it says nothing about removing types, so perhaps we can clean it up after all.
Getting it right
"If you can't say something nice, don't say anything at all." Good advice from my mother. All of the foregoing has been not only completely destructive criticism but, in places, offensively phrased as well. Since I personally know some of the current and former members of the XML Schema Working Group, I may well suffer for it. Taking that advice, I will say something nice. Admittedly, I'm going to say it about Relax NG, but you can't have everything.
The Relax NG specification does not resolve data typing problems. However, it took the separation of focus in the XML Schema specification and made it more robust and more flexible. In Relax NG, any data type library may be used. Definition of a data type library is not supplied, except by reference to XML Schema part two, and by definition of a minimal type library (string and token).
However, an effort outside the OASIS technical committee has established data type library interfaces suitable for use with Relax NG validators. These interface definitions (available from SourceForge, for several target languages) are extremely valuable in refining the concept of a data type in XML.
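What might a pluggable data type library look like? The sketch below is an assumption-laden illustration, not the SourceForge interfaces themselves: the class name, the method names, and the example.org URI are all hypothetical. It shows only the general shape of the idea, in which a library is identified by a URI and answers one question about a named type: is this string a valid lexical form?

```python
from typing import Callable, Dict

Validator = Callable[[str], bool]


class DatatypeLibrary:
    """Hypothetical pluggable library: a URI plus a set of named validators."""

    def __init__(self, uri: str) -> None:
        self.uri = uri
        self._validators: Dict[str, Validator] = {}

    def register(self, name: str, validator: Validator) -> None:
        self._validators[name] = validator

    def is_valid(self, name: str, lexical: str) -> bool:
        # An unknown type name is a mistake in the schema, not in the instance.
        if name not in self._validators:
            raise KeyError(f"unknown datatype {name!r} in library {self.uri}")
        return self._validators[name](lexical)


def is_complex(lexical: str) -> bool:
    """Accept whatever Python's built-in complex() can parse, e.g. '1+2j'."""
    try:
        complex(lexical)
    except ValueError:
        return False
    return True


# The URI is a made-up placeholder, not a registered library.
science = DatatypeLibrary("http://example.org/datatypes/science")
science.register("complex", is_complex)
```

With this in place, science.is_valid("complex", "1+2j") accepts the value, and a scientific application gets its complex numbers without waiting for anyone to revise a specification.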
The goal, stated previously, was loosely to enable computer-to-computer interactions with strong typing in XML. What does it mean to define a data type in XML? Clearly, from the point of view taken here, in WXS part two, and in the interface specification for type validation in RNG, we are discussing the typing of "simple" types. That is, types of nodes that contain textual content, rather than or in addition to element content. Attributes may have types, and the ephemeral "text node" children of elements may have types, and these "simple types" are what we are concerned with.
What is a data type in XML? Four answers: a string; something that can be expressed as a string, following certain constraints; something that can be validated by a specified algorithm; or something that corresponds to a simple concept in my problem domain.
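The difference between the second and third answers shows up in the ISBN mentioned earlier. The following is a minimal sketch, assuming the ten-character ISBN form with hyphens already removed. A pattern can say what an ISBN looks like, but only an algorithm can confirm the embedded check digit. The function names are invented for the example.

```python
import re

# Second answer: a string following certain constraints.
# Nine digits, then a check character that is a digit or X.
ISBN10_PATTERN = re.compile(r"[0-9]{9}[0-9X]")


def matches_pattern(text):
    """True if text has the lexical shape of an ISBN-10."""
    return ISBN10_PATTERN.fullmatch(text) is not None


def passes_check_digit(text):
    """Third answer: something validated by a specified algorithm.

    Characters are weighted 10 down to 1 from left to right, X counts
    as ten, and the weighted sum must be divisible by eleven.
    """
    if not matches_pattern(text):
        return False
    total = 0
    for position, char in enumerate(text):
        value = 10 if char == "X" else int(char)
        total += (10 - position) * value
    return total % 11 == 0
```

The string 0306406153 satisfies the pattern but fails the algorithm, since its final digit should be 2. A type system limited to pattern facets cannot tell the two apart.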
In XML, everything is a string. Since XML contains text, everything is, by definition, expressible as a string. If we take this as fundamental, then every type in XML is simply a string with certain patterning constraints. There are some problems with this concept, but it's useful to keep in mind. If every data type in XML must be derived from string, then there is no need for anySimpleType, either. Any simple type is a string. Relax NG enhances this notion to add token, relying upon the potentially special treatment that an XML parser can give to whitespace. But a token is just a kind of string, one whose whitespace has been normalized: leading and trailing whitespace is dropped, and internal runs collapse to a single space.

