DSDL Interoperability Framework
April 30, 2003
1. What's DSDL?
DSDL ("Document Schema Definition Languages") is a project of the ISO/IEC JTC 1/SC 34 (chair, Jim Mason) Working Group 1 (chair, Charles Goldfarb). The word "document" is meant to be read as "XML document-oriented applications", as opposed to data-oriented applications. "Languages" is used in the plural form because DSDL is not intended to point to a One True Schema Language.
DSDL is chaired by Martin Bryan, and its editors include James Clark, Murata Makoto, Rick Jelliffe, Martin Bryan, Diederik Gerth van Wijk, Ken Holman and me (Eric van der Vlist).
DSDL is necessary because other XML schema languages (primarily W3C XML Schema) do not meet the needs of "document heads", and document validation is too complex to be done using a single language. Our goal is to propose a set of specifications which will include a framework, several schema languages (including Relax NG and Schematron), a datatype system, and other pieces needed for document validation.
2. Why an interoperability framework?
Why does DSDL need an Interoperability Framework? The quick answer is that the Interoperability Framework is the glue between all the pieces of DSDL. The chief design principle of DSDL is to split the issue of describing and validating documents into simpler issues: grammar based validation, rule based validation, content selection, datatypes, and so on. Different types of validations and transformations, defined inside or outside the DSDL project, often need to be associated with each other. The framework allows for the integration of these validations and transformations.
Examples of such mixing include localization of numeric or date formats, prevalidation canonicalization to simplify the expression of a schema, independent content separated into different documents to be validated independently, aggregation of complex content into a single text node, separation of structured simple content into a set of elements, and so on.
3. At the beginning: two complementary proposals
The DSDL interoperability framework is a work in progress. Its first wave gave birth to two different proposals, based on two different and complementary approaches: Rick Jelliffe's Schemachine and my Xvif.
3.1. Rick Jelliffe's Schemachine
We can think of Rick Jelliffe's Schemachine as "traditional" in the sense that his proposal is a continuation of XPipe or the W3C "XML-Pipeline" Note. It describes pipes of transformations and validations applied to full documents.
3.1.1. Schemachine basics
Rick Jelliffe gives the following description of his proposal. It is based on XML Pipeline structures, but with rearrangement and renaming. It is embedded in a Schematron-like superstructure with titles and phases, and it can be implemented minimally: all validators and translators are command-line executable programs, and the framework document is translated into BAT files or Bourne shell scripts (i.e., validators etc. are treated as black boxes). Schemachine aims at validation rather than declarative description per se. (In particular, the further down a transformation chain that data gets, the more difficult it will be to tie the effect of a schema to the original document.) It supports both validation of explicit structure and validation of complex data values. It leaves issues of simple datatyping to particular validators, viewing validation as a tree of processes. Finally, it supports both in-band (@exclude) and out-of-band (@haltOnFail) signaling.
3.1.2. Schemachine example
A couple of short examples are better than a long explanation.
<schemachine xmlns="....">
  <title>Example Schema</title>
  <pass>
    <validate engine="schemachine:xsd" />
    <validate engine="schemachine:schematron">
      <param name="schema" href="a Schematron schema"/>
    </validate>
  </pass>
</schemachine>
This first example passes a document through a W3C XML Schema validation followed by a Schematron validation.
<schemachine xmlns="....">
  <title>Another Example Schema</title>
  <ns prefix="html" url="..." />
  <pass>
    <select engine="schemachine:namespace_selector">
      <param name="pattern">html:body</param>
      <output name="htmlbody" />
    </select>
    <validate engine="schemachine:relax_ng">
      <param name="schema" href="...."/>
      <param name="feasible">true</param>
      <input name="htmlbody"/>
    </validate>
  </pass>
</schemachine>
Here the document is passed through a "selector" which selects the html:body element. The output of the selection is used as the input of a Relax NG validation.
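The select-then-validate flow of this second example can be sketched in a few lines of Python. This is only an illustration of the idea, not Schemachine itself: the namespace URI `urn:example:html` is a hypothetical placeholder for the URL elided above, and `validate_body` is a hand-written stand-in for the Relax NG stage, whose schema is not shown in the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace, standing in for the URL elided in the example.
HTML_NS = "urn:example:html"


def select(document, pattern, namespaces):
    """Selector stage: keep only the descendant subtrees matching the pattern."""
    return document.findall(".//" + pattern, namespaces)


def validate_body(body):
    """Stand-in for the Relax NG stage (assumption: a body may only hold html:p)."""
    return all(child.tag == "{%s}p" % HTML_NS for child in body)


def run_pass(xml_text):
    """Feed the output of the selector into the validator, as the pass does."""
    document = ET.fromstring(xml_text)
    selected = select(document, "html:body", {"html": HTML_NS})
    return all(validate_body(body) for body in selected)
```

The point of the sketch is that the validator never sees the full document, only what the selector hands over.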
3.1.3. Schemachine features
Rick Jelliffe carefully crafted a proposal with all the features needed to validate complex documents. Some concepts (e.g., phases) are inherited from Schematron, and Schemachine has all the bells and whistles needed to fly:
Phases let users define different validation phases.
Selectors are filters which retain only the part of a document on which a partial schema will be applied.
Validators are containers to invoke schema validation.
Tokenizers split a text node into a set of elements.
Titles let you define info for the validation report.
3.2. My own XVIF
While Jelliffe has come up with a solid proposal that is obviously easy to implement, I wanted to explore more adventurous fields and felt that a proof of concept was needed to check the dangers and potential of my ideas.
XVIF ("XML Validation Interoperability Framework") is both a framework proposal and a prototype written in Python. It is available under an MPL open source license.
3.2.1. XVIF basics
XVIF has both similarities with and differences from the approach taken by the Schemachine. It's designed to be used within a "host language" -- which could be a schema language (Relax NG, W3C XML Schema, Schematron), a transformation language (XSLT, Regular Fragmentations, STX) or a "pipelining" language (XVIF could be embedded within the structure of the Schemachine, Ant, XPipe). The current version of the prototype implements XVIF only within Relax NG. XVIF defines "micro-pipes" of transformations and validations applied locally on the "current" node. It integrates tightly with hosting languages: for Relax NG, XVIF pipes are patterns; for XSLT they would be extension elements. XVIF has fallback mechanisms to ensure that a schema or transformation can be read by non-XVIF-aware processors. It is currently minimalist: bells and whistles will be added if it flies. XVIF takes advantage of the structures of the host language for complex features. Finally, it's focused on defining the basic building blocks. Shortcuts will be added later on where needed; verbosity isn't an issue at this stage.
3.2.2. XVIF example
Let's look at our first example of XVIF:
<?xml version="1.0" encoding="utf-8"?>
<element xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:if="http://namespaces.xmlschemata.org/xvif/iframe"
         name="foo">
  <if:pipe>
    <if:transform type="http://namespaces.xmlschemata.org/xvif/regexp" apply="split/,/"/>
    <if:validate type="http://relaxng.org/ns/structure/1.0">
      <if:apply>
        <oneOrMore>
          <choice>
            <value>foo</value>
            <value>bar</value>
          </choice>
        </oneOrMore>
      </if:apply>
    </if:validate>
  </if:pipe>
</element>
This example defines a Relax NG schema where the implicit "start" pattern is an element with name "foo," and whose content is validated by a pattern "if:pipe". This is a micro-pipe of transformations and validations applied to all the elements, text nodes and attributes found in the "foo" element.
The pipe itself is a transformation, splitting text nodes using the regular expression /,/, and a Relax NG validation applied to the result of this transformation. A text node will thus be interpreted as a comma-separated list of values, and the list validates against a Relax NG schema expecting one or more values equal to "foo" or "bar".
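To make the effect of this micro-pipe concrete, here is a minimal Python sketch of the same two steps: a regular-expression split followed by a check that every piece is "foo" or "bar". It does not use the XVIF prototype; the function names are invented for the illustration, and the whitespace handling assumes that values are compared the way the Relax NG "token" datatype compares them.

```python
import re

ALLOWED_VALUES = {"foo", "bar"}


def split_transform(text, pattern=","):
    """Transformation step: split a text node on a regular expression."""
    return re.split(pattern, text)


def normalize_token(value):
    """Strip and collapse whitespace, as the Relax NG 'token' datatype does."""
    return " ".join(value.split())


def validate_values(tokens):
    """Validation step: one or more values, each equal to 'foo' or 'bar'."""
    normalized = [normalize_token(token) for token in tokens]
    return len(normalized) >= 1 and all(t in ALLOWED_VALUES for t in normalized)


def validate_foo_text(text):
    """The micro-pipe: validate the result of the split, not the raw text."""
    return validate_values(split_transform(text))
```

With these definitions, `validate_foo_text("foo,bar,foo")` is accepted, while `validate_foo_text("foo,baz")` and an empty text node are rejected.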
3.2.3. XVIF features
The most basic building block of XVIF is "if:transform", which defines a transformation.
<!-- The context nodeset "x" is defined by the host language here -->
<if:transform type="URI identifying the nature of T">
  <if:apply> Implementation of T </if:apply>
</if:transform>
<!-- The result of the transformation "y=T(x)" is the context nodeset here -->
Note that the implementation of T may be held in an "apply" element or attribute, and it may be located in an external resource (if:apply/@href).
A validation is simply a transformation which returns either its input or an error.
<!-- The context nodeset "x" is defined by the host language here -->
<if:validate type="URI identifying the nature of V">
  <if:apply> Implementation of V </if:apply>
</if:validate>
<!-- The pipe is aborted if the result is false, otherwise, the context nodeset is left unchanged -->
Transformations and validations can be chained in pipes.
<if:pipe>
  <!-- The context nodeset "x" is defined by the host language here -->
  <if:transform type="URI identifying the nature of T2">
    <if:apply> Implementation of T2 </if:apply>
  </if:transform>
  <!-- The result T2(x) is the context nodeset here -->
  <if:validate type="URI identifying the nature of V1">
    <if:apply> Implementation of V1 </if:apply>
  </if:validate>
  <!-- The pipe is aborted with an exception if the validation fails. The context node is unchanged otherwise. -->
  <if:transform type="URI identifying the nature of T1">
    <if:apply> Implementation of T1 </if:apply>
  </if:transform>
  <!-- The result y=T1(T2(x)) is the context nodeset here -->
  <if:validate type="URI identifying the nature of V">
    <if:apply> Implementation of V </if:apply>
  </if:validate>
  <!-- The result of the validation of y by V is the result of the pipe. -->
</if:pipe>
That's all there is to it.
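The semantics described in this section (a transformation replaces the context, a validation returns its input or an error, and a pipe chains both) can be modelled with plain function composition. The following Python sketch is an assumption-laden illustration, not the prototype's code: `transform`, `validate`, `pipe`, `matches` and `PipeAborted` are names made up here.

```python
class PipeAborted(Exception):
    """Raised when a validation step inside a pipe fails."""


def transform(function):
    """A transformation step: the result becomes the new context."""
    def stage(context):
        return function(context)
    return stage


def validate(predicate):
    """A validation step: returns its input unchanged, or aborts the pipe."""
    def stage(context):
        if not predicate(context):
            raise PipeAborted("validation failed for %r" % (context,))
        return context
    return stage


def pipe(*stages):
    """Chain stages: each one receives the context left by the previous one."""
    def run(context):
        for stage in stages:
            context = stage(context)
        return context
    return run


def matches(pipeline, context):
    """How a host language might read the outcome: the pipe matches or not."""
    try:
        pipeline(context)
        return True
    except PipeAborted:
        return False


# Usage: split on commas, then require every piece to be "foo" or "bar".
csv_pipe = pipe(
    transform(lambda text: text.split(",")),
    validate(lambda tokens: all(t in {"foo", "bar"} for t in tokens)),
)
```

Treating a validation as "identity or error" is what lets the two kinds of step share one chaining rule.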
3.2.4. Why micro-pipes
The examples shown so far were simple and do not demonstrate any differences from the Schemachine approach. There are a couple of reasons why micro-pipes are useful.
Modularity: these pipes can be used in named patterns and reused in lieu of native Relax NG patterns:
<define name="csv">
  <if:pipe>
    <if:transform type="http://namespaces.xmlschemata.org/xvif/regexp" apply="split/,/"/>
    <if:validate type="http://relaxng.org/ns/structure/1.0">
      <if:apply>
        <oneOrMore>
          <choice>
            <value>foo</value>
            <value>bar</value>
          </choice>
        </oneOrMore>
      </if:apply>
    </if:validate>
  </if:pipe>
</define>
.../...
<element name="foo">
  <ref name="csv"/>
</element>
Micro-pipes play nicely with the schema language. If we want to validate the content as a comma-separated list when a type attribute is "csv", and as a whitespace-separated list when the type attribute is "list", we can write:
<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:if="http://namespaces.xmlschemata.org/xvif/iframe">
  <start>
    <element name="foo">
      <choice>
        <group>
          <attribute name="type">
            <value>csv</value>
          </attribute>
          <ref name="csv"/>
        </group>
        <group>
          <attribute name="type">
            <value>list</value>
          </attribute>
          <list>
            <ref name="check-values"/>
          </list>
        </group>
      </choice>
    </element>
  </start>
  <define name="check-values">
    <oneOrMore>
      <choice>
        <value>foo</value>
        <value>bar</value>
      </choice>
    </oneOrMore>
  </define>
  <define name="csv">
    <if:pipe>
      <if:transform type="http://namespaces.xmlschemata.org/xvif/regexp" apply="split/,/"/>
      <if:validate type="http://relaxng.org/ns/structure/1.0">
        <if:apply>
          <ref name="check-values"/>
        </if:apply>
      </if:validate>
    </if:pipe>
  </define>
</grammar>
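What this grammar accepts can be spelled out procedurally. The Python sketch below hand-codes the same decision (it is not generated from the schema and does not use the XVIF prototype): the type attribute picks the splitting rule, and both branches share the "check-values" test. The function names are illustrative only.

```python
import xml.etree.ElementTree as ET

ALLOWED_VALUES = {"foo", "bar"}


def check_values(tokens):
    """Shared rule: one or more values, each equal to 'foo' or 'bar'."""
    return bool(tokens) and all(token in ALLOWED_VALUES for token in tokens)


def validate_foo(xml_text):
    """Accept <foo type='csv'> or <foo type='list'> with matching content."""
    element = ET.fromstring(xml_text)
    # The grammar allows only a 'type' attribute and no child elements.
    if element.tag != "foo" or len(element) != 0 or set(element.attrib) != {"type"}:
        return False
    text = element.text or ""
    kind = element.get("type")
    if kind == "csv":
        # Micro-pipe branch: split on commas before checking the values.
        return check_values([token.strip() for token in text.split(",")])
    if kind == "list":
        # Native Relax NG <list> branch: split on whitespace.
        return check_values(text.split())
    return False
```

For instance, `<foo type="csv">foo,bar</foo>` and `<foo type="list">foo bar</foo>` are accepted, while swapping the two type values makes both documents invalid.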
In complicated cases, micro-pipes keep the transformations and validations close to the locations where they are needed. I think that it is important to ensure the structure of the document is coded with a schema language, instead of being a combination of selectors and bits of schemas. Of course, these are only guesses and I don't think anyone has enough experience to have the final word in this debate.
An interoperability framework can't be an isolated technology, and XVIF is linked to many other developments. These links to other technologies include:
W3C XML Schema: I see no reason why XVIF couldn't be associated with W3C XML Schema as it is with Relax NG.
Schema annotation: we have seen how closely related transformation and validation are: validations could be extended to add annotations to instance documents.
XPath 2.0/XPath NG axis: these annotations could be used by the proposal from Jeni Tennison to add extension axes to XPath 2.0 and/or an eventual "XPath NG".
Schemachine: some features of Schemachine could be added to XVIF, or XVIF could be used within the Schemachine framework, or a standalone version could be developed.
XSLT: XVIF could be used as an XSLT extension element.
Finally, I am considering adding some features to XVIF:
4. The latest proposal
The two initial proposals (Schemachine and XVIF) were presented to the ISO DSDL working group in Baltimore (December 2002); although they were considered a valuable input, both were rejected, for different reasons:
Schemachine was considered "too procedural": its focus is on defining pipes, that is, defining the algorithm used to validate a document, while it would be more appropriate to focus on defining the rules to meet to consider that a document is valid.
XVIF was considered too intrusive: to fully support XVIF, the semantics of the different schema languages must be extended and the schema validators need to be upgraded. An interoperability framework should work with existing schema languages and processors without requiring any update.
To take these two requirements into account, a new proposal has been made which builds upon ideas from Schemachine and XVIF, but also from XSLT and Schematron. This proposal has been named "XVIF/Outie", after a joke from Rick Jelliffe. A description of XVIF/Outie can be found at http://downloads.xmlschemata.org/python/xvif/outie/about.xhtml and a prototype implementation is available.
4.1. XVIF/Outie basics
The basic ideas behind Outie are pretty simple:
Outie is all about defining assertions.
These assertions are schema validations applied on instance documents.
These instance documents can be the instance document presented for validation, other documents, or results of transformations.
Assertions about the same instance can be grouped into rules.
The basic building blocks of an Outie framework are rules. Each rule is about checking one and only one instance document. By default this instance document is the instance document presented for validation.
Other instance documents may be selected inline, by specifying a transformation to apply to an existing instance, or by reference, through a URL or a reference to a variable. Global variables may be defined to store the result of a transformation.
Rules may belong to a "mode", and rules for a mode are explicitly applied. Outie is purely declarative and side effect free: rules and variable definitions may appear in any order; the order in which rules and assertions are processed is not guaranteed; variables which are not used may never be evaluated.
4.2. XVIF/Outie example
Here's an example that shows most of the features. We define that a document is valid if and only if it is valid after transformation by "normalize1.xsl" per the schemas "schema1.sch" and "schema1.rng" or if it is valid after transformation by "normalize2.xsl" per the schemas "schema2.sch" and "schema2.rng".
A framework to express this, using a variable to store the result of the transformation by "normalize2.xsl", could be:
<?xml version="1.0" encoding="utf-8"?>
<framework>
  <rule>
    <assert>
      <choice>
        <apply-rules mode="mode1"/>
        <apply-rules mode="mode2"/>
      </choice>
    </assert>
  </rule>
  <rule mode="mode1">
    <instance>
      <transform transformation="normalize1.xsl"/>
    </instance>
    <assert>
      <isValid schema="schema1.sch"/>
      <isValid schema="schema1.rng"/>
    </assert>
  </rule>
  <rule mode="mode2" instance="$instance2">
    <assert>
      <isValid schema="schema2.sch"/>
      <isValid schema="schema2.rng"/>
    </assert>
  </rule>
  <variable name="instance2">
    <transform transformation="normalize2.xsl"/>
  </variable>
</framework>
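The logic of this framework can be paraphrased in Python to show how the choice, the two modes and the lazily evaluated variable fit together. This is a sketch under explicit assumptions, not the Outie prototype: `transform(stylesheet, instance)` and `is_valid(schema, instance)` are hypothetical callables supplied by the caller, and the sketch fixes one evaluation order even though Outie itself guarantees none.

```python
from functools import lru_cache


def make_framework(transform, is_valid):
    """Build an evaluator from two hypothetical callables:
    transform(stylesheet, instance) and is_valid(schema, instance)."""

    def evaluate(instance):
        @lru_cache(maxsize=None)
        def instance2():
            # The variable: computed at most once, and only if it is used.
            return transform("normalize2.xsl", instance)

        def mode1():
            # Inline instance: the result of normalize1.xsl.
            normalized = transform("normalize1.xsl", instance)
            return all(is_valid(schema, normalized)
                       for schema in ("schema1.sch", "schema1.rng"))

        def mode2():
            # Instance taken by reference from the variable.
            return all(is_valid(schema, instance2())
                       for schema in ("schema2.sch", "schema2.rng"))

        # The unnamed rule: a choice between the two modes.
        return mode1() or mode2()

    return evaluate
```

Because the variable is wrapped in a cached function, a document accepted through mode1 never triggers the normalize2.xsl transformation, which is the "variables which are not used may never be evaluated" behaviour described earlier.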
4.3. More XVIF/Outie features
We have seen most of the features of Outie in this first example; let's review some of the more "hidden" aspects.
Outie is purely declarative and side effect free:
Rules and variable definitions may appear in any order.
The order in which rules and assertions are processed is not guaranteed.
Variables which are not used may never be evaluated.
The tool to apply for a transformation or schema validation is implicit.
The choice of the tool is a function of the document type.
The document type is assumed from the extension of the document.
Implementations need to provide a way to define the match between extensions and tools.
Schemas can also be the result of transformations.
To illustrate this last point, we can take the example of a schema created by the "getStage" transformation proposed by Bob DuCharme:
<?xml version="1.0" encoding="utf-8"?>
<framework>
  <rule>
    <assert>
      <isValid>
        <schema>
          <transform extension=".xsd" instance="schema.xsd" transformation="getStage.xsl">
            <with-param name="stageName" select="'final'"/>
          </transform>
        </schema>
      </isValid>
    </assert>
  </rule>
</framework>
Or, using a variable:
<?xml version="1.0" encoding="utf-8"?>
<framework>
  <variable name="final">
    <transform extension=".xsd" instance="schema.xsd" transformation="getStage.xsl">
      <with-param name="stageName" select="'final'"/>
    </transform>
  </variable>
  <rule>
    <assert>
      <isValid schema="$final"/>
    </assert>
  </rule>
</framework>
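The implicit choice of tool mentioned above, driven by the extension of the document, could be implemented with a simple lookup table. The Python sketch below is hypothetical: the tool names in the table are placeholders, and the `extension` argument assumes that the extension attribute seen on the transform elements declares the type of a result that has no file name of its own.

```python
import os

# Hypothetical mapping; a real implementation would load it from configuration.
DEFAULT_TOOLS = {
    ".xsl": "xslt-processor",
    ".rng": "relaxng-validator",
    ".sch": "schematron-validator",
    ".xsd": "w3c-xml-schema-validator",
}


def tool_for(resource, extension=None, registry=DEFAULT_TOOLS):
    """Pick a tool from the resource's extension, or from an explicit override."""
    chosen = (extension or os.path.splitext(resource)[1]).lower()
    if chosen not in registry:
        raise LookupError("no tool registered for extension %r" % chosen)
    return registry[chosen]
```

A call such as `tool_for("schema1.rng")` resolves from the file name alone, whereas a transformation result referenced as `$final` needs the explicit `extension=".xsd"` hint to be resolved.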
The intention for Outie is to get the approval of the ISO DSDL working group and, ultimately, to become an ISO DIS (Draft International Standard).
But first some issues need to be fixed:
Some transformations may split an instance document into several pieces; how do we address the different pieces in this case?
Should the format of the validation report be specified?
Should the format of the configuration file matching extensions and tools be specified?
More issues will probably be raised during the ISO meetings in London (May 2003), held during the XML Europe conference.
XVIF/Outie or something derived from it should become an ISO DIS. I am also committed to developing XVIF and its micro-pipes. When Outie becomes more stable, I will make sure to find a convergence between the two XVIF flavors.