July 1, 1999

Norman Walsh

Saying that a document is "valid" means that it fits within the described model of a class of documents. There are many reasons why you might want to make sure your documents are valid:

  • You're doing electronic commerce and you want to know that the purchase order you just received is exactly what you expect: it's not missing anything, it doesn't have anything extra, and everything that it does have is the right datatype (quantities are all positive numbers, prices are all decimal numbers with two digits after the decimal point, and so forth).

  • You're setting up some business-to-business process with another company. You've agreed to share information from your respective corporate databases, but they aren't quite identical. If you receive a record from your partner's database via XML, you want to be sure that it's valid before you hand it off to the conversion tool that will insert it into your database. Invalid transactions should be rejected immediately so that there's no possibility of bad data slipping into your database.

  • The XML document you're constructing is going to control some overnight batch process and you want to make sure that the instructions you're sending are ones the processor is going to understand. You don't want the process to stop at 2:00am because you forgot to include some required information.

  • You've got 1,000 XML documents that you want to publish on a CD-ROM. You want to be confident that your stylesheet will present each of them correctly without proofing each and every one by hand. If you know that your stylesheet handles all of the valid constructions in your schema, then you know it'll do the right thing if all your documents conform to the schema.

Using a schema and a validating parser offers one standard way to test your documents. (Valid documents can still be semantically wrong: you can submit a purchase order that asks for a hundred boxes of staples when you meant to ask for ten, but checking validity catches a lot of "obvious" errors.)

Every document that you encounter falls into one of four categories:

  1. If it is not well-formed, it isn't XML.

  2. If an XML document does not identify a schema to which it claims to conform (and no schema can be inferred), then it is simply well-formed.

  3. If a schema is (or can be) associated with a document, and the document does not fit within the model described by that schema, it is well-formed but not valid.

  4. If a schema is (or can be) associated with a document, and the document does not violate any of the constraints of that schema, it is well-formed and valid.
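
The four cases above can be sketched as a small classifier. This is only an illustration in Python, not anything defined by XML Schema: the names classify and is_order are hypothetical, the validator argument is an assumed stand-in for a real schema, and the standard library's xml.etree.ElementTree is used solely for the well-formedness check because it is not a validating parser.

```python
import xml.etree.ElementTree as ET


def classify(text, validator=None):
    """Return which of the four cases a document falls into.

    `validator` is a hypothetical callable standing in for a schema: it
    takes the parsed root element and returns True if the document fits.
    Pass None when no schema is (or can be) associated with the document.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "not XML"                      # case 1: not well-formed
    if validator is None:
        return "well-formed"                  # case 2: no schema to test against
    if not validator(root):
        return "well-formed but not valid"    # case 3: violates the schema
    return "well-formed and valid"            # case 4: fits the schema


def is_order(root):
    """Toy stand-in for a schema: the root element must be <order>."""
    return root.tag == "order"
```

Calling classify("<memo/>") exercises the second case, while classify("<memo/>", is_order) exercises the third. A real validating parser would replace both the try block and the callable.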

Most people expect schemas to be able to test two kinds of validity: the validity of content models and the validity of specific units of data.

Content Model Validity

Content model validity tests whether the order and nesting of tags is correct. Part 1 of the XML Schema WD defines how a schema indicates the correct order and nesting of elements.

An address, for example, might be defined as having a required <name> tag, one or two <street> tags, a required <city>, a required <state>, a required <zip>, and an optional <country> tag.

In XML Schema syntax, the content model of an address could be described like this:

<elementType name="address">
    <elementTypeRef name="name" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="street" minOccur="1" maxOccur="2"/>
    <elementTypeRef name="city" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="state" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="zip" minOccur="1" maxOccur="1"/>
    <elementTypeRef name="country" minOccur="0" maxOccur="1"/>
</elementType>

(For a description of the syntax of XML Schema, see the syntax section.)

If you encounter an address that doesn't meet these criteria, it isn't valid (according to the address schema).
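
To make the idea concrete, here is a rough Python sketch of what a content model check does with the address example. It is not how a schema processor is implemented: ADDRESS_MODEL and fits_content_model are hypothetical names, the table simply restates the minOccur and maxOccur values from the schema fragment, and the sketch assumes the children must appear in the listed order. The greedy matching works only for simple sequences like this one, not for content models in general.

```python
import xml.etree.ElementTree as ET

# (element name, minOccur, maxOccur), in the order the schema lists them.
ADDRESS_MODEL = [
    ("name", 1, 1),
    ("street", 1, 2),
    ("city", 1, 1),
    ("state", 1, 1),
    ("zip", 1, 1),
    ("country", 0, 1),
]


def fits_content_model(element, model):
    """Check that the children of `element` appear in model order, each
    within its occurrence bounds, with no unexpected children left over."""
    tags = [child.tag for child in element]
    pos = 0
    for name, min_occur, max_occur in model:
        count = 0
        # Consume as many consecutive matching children as the model allows.
        while pos < len(tags) and tags[pos] == name and count < max_occur:
            pos += 1
            count += 1
        if count < min_occur:
            return False          # a required element is missing or misplaced
    return pos == len(tags)       # anything left over is extra content


address = ET.fromstring(
    "<address><name>N. Walsh</name><street>1 Main St.</street>"
    "<city>Amherst</city><state>MA</state><zip>01002</zip></address>"
)
print(fits_content_model(address, ADDRESS_MODEL))
```

An address with three <street> elements, a missing <city>, or children in the wrong order fails the same check.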

Datatype Validity

Datatype validity is the ability to test whether specific units of information are of the correct type and fall within the specified legal values.

For example, if I am writing a schema for catalog order forms, I should be able to express the constraint that the quantity ordered is greater than zero. An order form isn't valid if the quantity of an item ordered is "-5" or "blue".
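
The kind of test involved can be sketched in a few lines of Python. The function names are hypothetical, and two details are assumptions rather than anything the order form example states: that a quantity must be a whole number, and that a price follows the "two digits after the decimal point" rule from the purchase order example earlier.

```python
import re


def is_valid_quantity(text):
    """True if `text` is a whole number greater than zero.

    Assumption: quantities are whole numbers written with ASCII digits.
    """
    text = text.strip()
    if not re.fullmatch(r"[0-9]+", text):
        return False              # rejects "-5", "blue", "2.5", and ""
    return int(text) > 0          # rejects "0"


def is_valid_price(text):
    """True if `text` is a decimal number with exactly two digits after
    the decimal point, such as "19.95"."""
    return re.fullmatch(r"[0-9]+\.[0-9]{2}", text.strip()) is not None
```

Both "-5" and "blue" fail is_valid_quantity, which is exactly the sort of error that structural checks alone cannot catch.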

The ability to express datatype validity in a schema is one of the really new features of XML Schema. Although database schemas have always had this ability, XML DTDs do not: DTDs have extremely limited datatyping.