Constraining Validation

August 25, 2004

The vacation season has brought renewed vigor in discussion among XML developers. A recent thread about validation illustrated one of the most useful properties of XML-DEV as a community: technical discussions often evolve to encompass the wider context of practice in which they are relevant.

Constraints and the Role of Validation

One of the advantages of the Web is that rough and ready data on the spread of ideas and technologies can be gleaned quickly. Whether it's searching weblogs or watching book sales, handy first-pass market research can be gotten on the cheap.

The XML world has seen similar benefits. The FOAF project uses statistics from spidered FOAF files to inform decisions about the development of its vocabulary. A recent mail from Priti Patil to XML-DEV aimed to gather such deployment information for the identity-constraints feature of W3C XML Schema (WXS). The resulting discussion not only shed light on schema usage but also opened up the larger question of the role of validation and business rules in a system.

Identity constraints (declared in WXS with the unique, key, and keyref elements) allow a schema author to state that a value selected from an XML document must be unique within a given scope of that document. The database analogy is a key. WXS also supports the analog of a foreign key: a value that must match an existing key. Patil's mail to XML-DEV asked about the use of such constraints:

I surveyed the schemas that are provided in the repository of XML.org. But none of the schema[s] contains the use of integrity constraints.

Thus [the] question that comes into my mind is ... Are people really using Identity constraints specified in XML schema?

Michael Kay, developer of the SAXON XSLT and XQuery processor, explained why such usage is rare:

The biggest problem here is that the scope of a database and a document are different. A document described by an XML schema usually describes one business object, whereas a relational database describes many. Most of the constraints in a relational database are cross-object constraints, whereas XML schema can only describe intra-object constraints.

Jonathan Robie agreed with Kay's diagnosis, saying it matched his own experience. Henry Thompson contributed data gleaned from a survey of publicly accessible schemas on the Web, which showed that just over 5% of them used identity constraints.
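
To make the key and foreign-key analogy concrete, here is a minimal Python sketch of the kind of check that identity constraints express. It is not WXS itself: the company and employee vocabulary, the id and manager attributes, and the check_identity function are all invented for illustration. In line with Kay's observation, the check can only see a single document.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: one duplicate id and one dangling reference.
DOC = """
<company>
  <employee id="e1" manager="e2"/>
  <employee id="e2"/>
  <employee id="e2" manager="e9"/>
</company>
"""


def check_identity(root):
    """Return violation messages for a key/keyref-style check on one document."""
    problems = []
    seen = set()
    employees = root.findall("employee")

    # Key: every employee needs an id, and ids must be unique in the document.
    for emp in employees:
        emp_id = emp.get("id")
        if emp_id is None:
            problems.append("employee without id")
        elif emp_id in seen:
            problems.append(f"duplicate key: {emp_id}")
        else:
            seen.add(emp_id)

    # Keyref: every manager attribute must refer to an id that exists.
    for emp in employees:
        manager = emp.get("manager")
        if manager is not None and manager not in seen:
            problems.append(f"dangling reference: {manager}")

    return problems


problems = check_identity(ET.fromstring(DOC))
for message in problems:
    print(message)
```

A reference to an employee described in some other document is out of reach for a check like this, which is the cross-object limitation Kay describes.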

The conversation then took an interesting turn in response to the second half of Kay's message, in which he expressed reservations about including too many constraints:

If I see a schema (XML or RDB) with the constraint that employees must be over 16, I ask myself what the IT department would do if the business decided to hire someone under 16. If there's a rule that an employee's manager must themselves be an employee, I ask what would happen when someone is told that they now report to a contractor. It's not the job of computers to limit what people are allowed to do (or the job of the IT department to regulate the business). A guideline I use is that constraints should be there only to protect the IT system itself from data that it cannot handle.

Talk of constraints raises the question of what role XML schema validation should play in a system as a whole. Roger Costello found Kay's arguments on constraints persuasive and asked him to elaborate on what he saw as the role of validation. Kay responded that he saw two main roles:

(a) to protect the system from data that it cannot handle ...

(b) to enforce a contract. If you have a contract with the supplier of a news feed that all articles sent will carry either today's or yesterday's date, then you should check that your supplier is keeping to the contract.

Does Kay, then, rule out using schemas to enforce business rules, tempting as that use case may be? Not entirely:

I think there is also room for validation processes that check data to see if it conforms to business rules. But very often, that should result in some kind of exception reporting, rather than rejection of the data. Often it will be correct, valid data, revealing that the business rules have indeed been breached.
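
Kay's distinction between rejecting data and reporting on it can be sketched in a few lines of Python. This is only an illustration under assumed names: the record layout, the process_employee function, and the over-16 rule (borrowed from his earlier example) are hypothetical, not something proposed in the thread.

```python
def process_employee(record):
    """Return (accepted, report) for one hypothetical employee record.

    Data the system cannot handle is rejected. Data that merely breaches a
    business rule is accepted, and the breach goes into an exception report.
    """
    # Role (a): protect the system from data it cannot handle.
    if not isinstance(record.get("id"), str) or not isinstance(record.get("age"), int):
        return False, ["rejected: missing or malformed id or age"]

    # Business rule: flag the breach, but keep the (possibly correct) data.
    report = []
    if record["age"] <= 16:
        report.append(
            f"business rule breached: employee {record['id']} is not over 16"
        )
    return True, report


accepted, report = process_employee({"id": "e7", "age": 15})
print(accepted, report)
```

The under-16 record is stored and flagged rather than turned away, so someone in the business can decide whether the data or the rule needs attention.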

Along similar lines, Bruce Cox wanted to know which technologies should be used in which places:

Since XML Schema data typing cannot express all the business rules that constrain, for example, patent document numbers (from about 100 issuing offices), what other technologies can be invoked that would? What combination of technologies should be used, and in what order, to accomplish the goal? I'm interested in standards-based technologies, such as XML Schema, to express such rules in a fashion that removes as much variation as possible among systems that implement them around the world.

This question brought out several technology recommendations.

Dare Obasanjo noted that Schematron is a worthwhile and oft-overlooked stage in expressing constraints. Mark Seaborne proposed the XForms model as a possible alternative to Schematron at some future point, noting that there was no reason that the model had to be used with a form.

Christian Nentwich mentioned a product he works on, CliX/xlinkit, which uses XLink-based constraints and has a rule editor. "If you're going to have a large number of rules, having an editor for them will make you happy." I suspect we'll see more of such editors in the future.

The longest thread of response came from an observation by a user identifiable only as bry@itnisk.com. Mr. Itnisk (so I shall name him for the sake of convenience) mentioned that CAM (Content Assembly Mechanism) sounded like a solution for the scenario Bruce Cox mentioned. According to OASIS, which hosts the CAM committee, a CAM processor is able to "provide documentation of information exchange formats, validation of transaction instances, and runtime creation of valid transaction documents."

CAM is one of those technologies I've heard a lot about from those who are on the committee, but very little from outside that group. So what did XML-DEV make of it?

Peter Hunsberger asked the salient questions right away:

1) Just glancing at the spec it appears to have at least some overlap with Schematron for parts of it. Anyone looked at a Schematron to CAM(/subcomponent?) conversion or the converse?

2) Anyone using this for anything production like?

3) Any recommended software?

Itnisk's response pointed to a single test deployment and implementation, and the indications were that CAM was still at a very early stage.

David Webber, chair of the CAM committee, explained CAM's relationship to Schematron:

So CAM really extends the [S]chematron work - and combines it with the original work from UN/CEFACT on core components and use of a semantic registry to provide vocabularies of components -- and then into a content assembly mechanism for ebusiness transactions.

Len Bullard wanted to know why CAM was any better than "SQL stored queries and merge-laden scripted functions."

Webber responded that CAM included both generation and validation abilities:

CAM certainly does not have any lock-in on doing that classic content-merge-output stuff. It's just another option in that particular camp ... The differentiator vis SQL is that CAM can work directly off an input XML instance as its content source.

... that's one way to use CAM -- the other way is as a validation tool So people can answer the question -- does my XML instance here conform to your business rules? And if not -- why not!

... you can publish the CAM template as the yardstick ... for creating valid instances.

All of which brings us back to the earlier question of whether business rules are the same thing as validation anyway. Though the discussion continued, we will leave it there for now.

There is clearly more experience to be gained in the business of validation and in deciding where it is most appropriately performed. The current consensus on implementation is that schemas and a rule language of choice should handle what they can (and no more than they should), with application logic filling in the gaps. Perhaps in the future we will find a consensus around architecture, too.


Births, Deaths, and Marriages

The latest announcements from the XML-DEV mailing list.

Keynote Speakers for XML 2004 Announced: A high-flying line-up, most intriguing of whom would seem to be Michael Daconta, Metadata Program Manager, U.S. Department of Homeland Security.
Program Announced for Semantic Technologies for eGov Conference 2004: So now you can find out exactly what they are. W3C's Eric Miller keynotes.
Slides from "Web Services Security Issues" Online: Presentation slides from the Aug. 10 SDSIC web services panel discussion are now online. They include some from the ever-interesting Michael Leventhal.
Open Source CMS Conference 4: Not announced to XML-DEV, but of interest to the community. Open Source content management conference to be held Sept. 29 – Oct. 1, 2004, in Zurich, Switzerland.

Scrapings

Sufficiently advanced software indistinguishable from magic? ... Lack of name in "From" header considered harmful: somebody please offer mailer configuration courses ... 140 messages to XML-DEV last week, Len rating 15.7% (vigor renewed!) ... An ill-boding forerunner of binary XML? ... CAM too much for some. Please David, make it stop!