February 7, 2001

Leigh Dodds

During the last week, XML-DEV has been the scene of a series of interesting and innovative discussions concerning schemas in general and also specific schema languages. The XML-Deviant provides a round-up.

Grammars Versus Rules

Most schema languages rely on regular grammars for specifying schema constraints, a fundamental paradigm in the design of these languages. The one exception is Schematron, produced by Rick Jelliffe. Schematron throws out the regular grammar approach, replacing it with a rule-based system that uses XPath expressions to define assertions that are applied to documents. Further background on Schematron can be found in a recent tutorial ("Validating XML with Schematron").

Jelliffe suggested that a critique of the different paradigms may throw light on which is best, a suggestion that started the latest round of schema discussion on XML-DEV. Jelliffe said,

I believe proponents of various schema paradigms (grammars and rule-based systems) need to justify that their paradigm is useful. When we only had grammars, it was a moot point. Now we have Schematron and other rule-based systems, ... I think we can be bold enough to start to critique grammars-as-schemas. (Of course, this is a two-way street.)

...Might we get to higher-level schema languages faster by completely ditching the grammar paradigm and treating schemas as systems of logical assertions which can be queried?

XML does not have SGML's short refs and delimiter maps, so why does it now need grammar-based content-modeling?

Jelliffe's comments stem from the observation that some constraints are difficult or impossible to model using regular grammars. Commonly cited examples are co-occurrence constraints (if an element has attribute A, it must also have attribute B) and context-sensitive content models (if an element has a parent X, then it must have an attribute Y).
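To make the two kinds of constraint concrete, here is a minimal sketch in Python using only the standard library. It is not taken from any schema language implementation; the function names, the sample document, and the attribute names (price, currency, qty) are all invented for illustration.

```python
import xml.etree.ElementTree as ET

def check_cooccurrence(root, tag, attr_a, attr_b):
    """Co-occurrence: return `tag` elements that carry attr_a but lack attr_b."""
    return [el for el in root.iter(tag)
            if attr_a in el.attrib and attr_b not in el.attrib]

def check_context(root, parent_tag, child_tag, required_attr):
    """Context sensitivity: return child_tag elements whose parent is
    parent_tag and which lack required_attr."""
    violations = []
    for parent in root.iter(parent_tag):
        for child in parent.findall(child_tag):
            if required_attr not in child.attrib:
                violations.append(child)
    return violations

# Hypothetical sample document.
doc = ET.fromstring(
    "<order>"
    "<item sku='1' currency='USD' price='9.99'/>"
    "<item sku='2' price='4.50'/>"
    "<bundle><item sku='3'/></bundle>"
    "</order>"
)

# An item with a price must also state its currency.
missing_currency = [el.get("sku")
                    for el in check_cooccurrence(doc, "item", "price", "currency")]

# An item inside a bundle must carry a qty attribute.
missing_qty = [el.get("sku")
               for el in check_context(doc, "bundle", "item", "qty")]

print(missing_currency)  # ['2']
print(missing_qty)       # ['3']
```

Both checks are a few lines when written as assertions over the tree, which is the kind of case the rule-based camp points to.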

Jelliffe further noted that alternate approaches were not considered in the design of W3C XML Schemas.

As far as I know, the decision to use grammars in XML Schemas was made completely uncritically. And how could it be otherwise if there has been no discussion? So the utility of grammar-based schemas is as well-established as the utility of the horse-drawn stump-jump plough: yes we can do excellent things with them that we would not do without them, but ... If we know XML documents need to be graphs, why are we working as if they are trees? Why do we have schema languages that enforce the treeness of the syntax rather than provide the layer to free us from it?

Response to Jelliffe's comments was mixed. While most were very keen on a rules-based approach in general and Schematron in particular, few were keen to see grammars thrown away completely. James Clark, whose own schema language TREX was released recently, was moved to respond.

Why does one kind of schema have to be better than another? It's as pointless as arguing whether hammers are better than saws. It depends [on] what problem you are trying to solve.

Grammars are better than rule-based systems for some things. Rule-based systems are better than grammars for other things. If something can be expressed simply using a grammar, it's probably a good idea to use a grammar, because, amongst other reasons, it can be implemented very efficiently. If it can't be expressed simply using a grammar, then use a rule-based system.

Joe English also said that grammar-based languages should not be rejected out of hand.

I think the utility of grammar-based schemas has been well established. Just because there are some things they can't do doesn't mean the things they *can* do aren't useful.

Summing up the core differences between the two paradigms, Jelliffe reiterated that rules-based systems are more expressive.

I am not trying to say that Schematron is strictly more powerful than RELAX or TREX, because they can model different things. I am saying, however, that I think the structures that path-based rule systems can model are better than those that grammar-based-systems can model. I don't buy the ancestor-only rule for context-processing here: there is no reason why any arbitrary set of data should form a tree rather than a graph, and consequently there is no reason why some data or structure in one part of a document may not constrain the data in another part in some important way.

A unique feature of Schematron is its user-centric approach, allowing useful feedback messages to be associated with each assertion. This allows individual patterns in a schema to be documented, giving very direct feedback to users. Indeed, a recent comparison of six schema languages highlights how far Schematron's design differs from that of the others.
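The idea of pairing each assertion with a human-readable message can be sketched in a few lines of Python. This is only an illustration of the principle, not Schematron itself: the Assertion class, the validate function, and the sample rules are hypothetical, and the context paths use the limited path syntax of the standard library's ElementTree rather than full XPath.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Assertion:
    context: str                        # ElementTree path selecting context nodes
    test: Callable[[ET.Element], bool]  # must hold for every context node
    message: str                        # text reported to the user on failure

def validate(root: ET.Element, assertions: List[Assertion]) -> List[str]:
    """Return the message of every assertion that fails, once per failing node."""
    messages = []
    for assertion in assertions:
        for node in root.findall(assertion.context):
            if not assertion.test(node):
                messages.append(assertion.message)
    return messages

# Hypothetical rules for a hypothetical vocabulary.
rules = [
    Assertion(".//person", lambda p: p.find("name") is not None,
              "A person must have a name."),
    Assertion(".//person", lambda p: "id" in p.attrib,
              "A person must carry an id attribute."),
]

doc = ET.fromstring(
    "<people>"
    "<person id='p1'><name>Ann</name></person>"
    "<person><name>Bob</name></person>"
    "<person id='p3'/>"
    "</people>"
)

messages = validate(doc, rules)
for line in messages:
    print(line)
```

The output is the list of sentences the schema author wrote, rather than a parser-level error about an unexpected element.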

At times the discussion strayed into comparisons of several schema languages. Rick Jelliffe provided his interpretation of the different approaches behind TREX, RELAX, XML Schemas and Schematron:

Underlying Murata-san's RELAX seems to be that we should start from desirable properties that web documents need: lightweightedness, functioning even if the schema goes offline (hence no PSVI) and modularity. I think underneath James Clark's TREX is that we can support plurality if we have a powerful-enough low-level schema language into which others can be well translated. I think underlying W3C XML Schemas is that a certain level of features and monolithicity is appropriate (though perhaps regrettable) because of the need to support a comprehensive set of tasks and to make sure that there are no subset processors (validity should always mean validity); however the processors are monolithic but the schemas are fragmented by namespace. Underlying Schematron is that we need to model the strong (cohesive) directed relationships in a dataset and ignore the weak ones, that constraints vary through a document's life cycle, and that lists of natural language propositions can be clearer than grammars.

James Clark's summary of the advantages of TREX over W3C XML Schemas is also worth reading in its entirety. TREX, like Schematron, is a very simple yet powerful schema language.

Categories of Schema Languages

In addition to the rules versus grammar distinction, the discussion ranged over other ways of categorizing schema languages, with a view to adopting a layered approach that takes advantage of the best features of each.

Rick Jelliffe categorized schema languages into two kinds: "maps" and "routes".

I think there are two kinds of schemas and therefore schema languages: one tries to express what is true of all data of that type at all times (e.g. for storage and 80/20 requirements) and another tries to express the things that make that particular information at that particular time and context different from other data of the same type. One tries to abstract away invariants, the other tries to find abstractions to express these variations.

The first kind is a map, the second kind is a route. The first kind is good for automatically generating interfaces and for coarse validation, the second kind is what is required for data-entry and debugging all data at all. (As for the status quo, I don't believe XML Schemas and DTDs pay much or any attention to this second kind of schema: maybe TREX and RELAX do a little bit and I hope Schematron is closer to the other end of the spectrum.)

Eric van der Vlist suggested that there are four types of schema languages, with categories covering structures, datatypes, rules and semantics. Van der Vlist noted that the best combination of each is application dependent; there is no single perfect schema language.

Prompted to attempt a further refinement of Eric van der Vlist's schema language categorization, Uche Ogbuji said,

I would probably re-state Eric's first and third entries as "structural patterns" and "structural rules". The subtle difference is that the former uses a pattern language and the processing is the binary result of matching the source against these patterns. The latter evaluates generalized expressions against the source, whose results trigger actions. One possible set of actions is to accept or reject the source as valid.

...It looks as if my version of Eric's suspect line-up is as follows:

  • Structural patterns - DTD, TREX, RELAX, XSchema content model
  • Datatypes - XSchema data types, UML/XMI
  • Structural rules - Schematron
  • Semantics - RDF(S)?, XMI?, UREP?, eCo?
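The distinction between "structural patterns" and "structural rules" can be illustrated with a small Python sketch. It assumes a toy content model written as a regular expression over child element names, and rules written as pairs of (expression, action); both representations, and every name below, are invented for this example and do not correspond to any of the languages listed.

```python
import re
import xml.etree.ElementTree as ET

def matches_pattern(element, content_model):
    """Structural pattern: the binary result of matching the sequence of
    child names against a regular expression."""
    children = "".join(child.tag + " " for child in element)
    return re.fullmatch(content_model, children) is not None

def apply_rules(element, rules):
    """Structural rules: evaluate each expression against the source;
    a true result triggers the associated action."""
    for expression, action in rules:
        if expression(element):
            action(element)

doc = ET.fromstring("<book><title/><author/><author/><chapter/></book>")

# Pattern: one title, then one or more authors, then one or more chapters.
ok = matches_pattern(doc, r"title (author )+(chapter )+")

# Rules: the actions here only record a note; rejecting the document
# would be one possible action among many.
log = []
apply_rules(doc, [
    (lambda e: len(e.findall("author")) > 1, lambda e: log.append("multiple authors")),
    (lambda e: e.find("index") is None, lambda e: log.append("no index")),
])

print(ok)   # True
print(log)  # ['multiple authors', 'no index']
```

The pattern can only answer yes or no, while the rules decide what to do with each finding.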

James Clark agreed that layering is a useful approach for XML applications.

In SGML, validation was not cleanly separated from parsing. I think this was a significant problem with SGML: you couldn't do anything to an SGML document without running it through a complex, validating parser. XML changed this with the introduction of the concept of well-formedness, which separated out validation from parsing. It didn't in my view go quite far enough in this separation: certain tasks were lumped in with validation that were logically separable. For example, validating parsers are required to handle default attributes declared in the external DTD, whereas non-validating parsers are not. This has caused nothing but trouble. Users naturally want to get the same results whether they are validating or not, which means they need non-validating parsers to do all the things required of validating parsers that affect the infoset.

Dan Brickley also noted that Schematron is leading the way in the separation of validation checking from other aspects of schema definition.

Once we have XML Query, many of the applications that folk have looked to XML schema languages to support will be rather easily implementable on top of a query processor. Schematron IMHO leads the way on this. Any state of affairs one can characterize to a query engine for the sake of finding stuff out, one can also supply to a data validity checker for the sake of making sure one's data is in the right shape. To overload Schema languages with all this work is IMHO a mistake, since we risk confusion between constraints from the core structures of an XML-based data format and the (often tighter) constraints we wish to apply when using that data format in practical applications. Same goes for RDF of course.

It's possible that after several years of experience building XML systems it's time for a rationalization of the XML framework, allowing a cleaner, layered approach.

Mixing Schema Languages

Not completely sold on a purely rules-based approach, Eric van der Vlist suggested that a combination may yield the best benefits.

I don't think it would be very practical to design a vocabulary using only rules since there are many of them to write even for a simple structure and if you forget one of them you get a flaw in your validation check.

On the other hand, rule based tools are much more flexible than datatypes or structure schema languages.

Using them together would allow to keep the best of both and should be possible "both ways":

1) Use this schema and check these rules
but also
2) if this rule is true then use this schema
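The two directions of combination can be sketched as follows. This is a rough illustration under simplifying assumptions, not van der Vlist's system: a "schema" is reduced to a check on the exact sequence of child element names, a "rule" is any predicate over an element, and all names are hypothetical.

```python
import xml.etree.ElementTree as ET

def has_children(*names):
    """Build a toy 'schema': the element's children are exactly `names`, in order."""
    def check(el):
        return [child.tag for child in el] == list(names)
    return check

article_schema = has_children("title", "body")
draft_schema = has_children("title")

def schema_and_rules(doc, schema, rules):
    """Direction 1: use this schema and check these rules."""
    return schema(doc) and all(rule(doc) for rule in rules)

def schema_chosen_by_rule(doc, choices):
    """Direction 2: if a rule is true, use the schema paired with it."""
    for rule, schema in choices:
        if rule(doc):
            return schema(doc)
    return False  # no rule held, so no schema was selected

doc = ET.fromstring("<article status='draft'><title>Hi</title></article>")

title_not_empty = lambda el: bool((el.findtext("title") or "").strip())
is_draft = lambda el: el.get("status") == "draft"
is_final = lambda el: el.get("status") == "final"

# Fails: the full article schema requires a body.
result_1 = schema_and_rules(doc, article_schema, [title_not_empty])

# Passes: the draft rule holds, so the looser draft schema is applied.
result_2 = schema_chosen_by_rule(doc, [(is_draft, draft_schema),
                                       (is_final, article_schema)])

print(result_1, result_2)  # False True
```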

He later described a simple publishing system constructed on these principles.

James Clark discussed how Schematron patterns might be embedded within a TREX schema.

For TREX, at least, I can think of several ways in which you could integrate it with something like Schematron. For example, you could add a <validate> element to Schematron that would occur as a child of <rule> just like <assert> and <report>. The semantics would be an assertion that the tree rooted at the context node matched the TREX pattern in the <validate> element. A more elaborate possibility would be to have a top-level element (i.e. child of the schema element) that defines named TREX patterns, then add an XPath extension function that tests whether the tree rooted at the current node matches a particular named pattern; you would probably also need some way to give a helpful message pinpointing how it failed to match.

Kohsuke Kawaguchi produced a similar example demonstrating Schematron-RELAX integration. The converse, plugging in different kinds of constraint mechanisms within a Schematron schema, is also possible and under consideration for future versions of Schematron. Dan Brickley has nicknamed this "Schemarama".

It's interesting that while many of the schema debates that have taken place over the last 18 months have centered on usability and usefulness of individual schema languages, the future might actually involve a combination of languages and constraint mechanisms.

Benchmarking and Performance

Rick Jelliffe added another twist to the debate, wondering about the relative performance of different schema language implementations.

It occurs to me that perhaps there are some basic questions we can ask (of a schema implementer or language designer) which may give a head start in the absence of benchmarking:

- Are there any innocent-looking structures that explode (or may explode) (perhaps this is the same as asking are there any constructs which, when used, may have more than an O(n) effect on performance) and what are the workarounds (e.g. in XML Schemas case, to detect unbounded particles in unbounded choice groups) ?

- How is schema evaluation affected by a slow network or unavailable websites (e.g. to use local caching, to give a user option to progress with validation as far as possible even if some components are missing, etc.) ?

In addition to giving useful feedback to developers on best practices, this analysis might highlight consistent problems with one schema paradigm. While some aspects of this discussion may seem esoteric, and perhaps not of immediate concern to those hard at work building systems today, there are still important lessons to be learned. For example, in the relational database world, developers have learned the hard way that certain query and schema constructs can cause performance (and worse) problems. XML is still in its infancy, and many of the equivalent lessons have still to be learned. Explorations of some of the theoretical underpinnings of our tools are important to highlight and document best practices.