XML.com: XML From the Inside Out


Validation by Instance

August 28, 2002

Most people these days develop XML documents and schemas with a visual editor of some sort, perhaps Altova's XML Spy, Tibco's TurboXML, xmlHack from SysOnyx, or Oxygen. Some even use several editors on a single project, switching among them to take advantage of each one's strengths.

Others prefer to work closer to the bone. I usually develop my schemas and instances by hand, using the vi editor along with other Unix utilities (actually, I use Cygwin on a Windows 2000 box). I don't want to make more work for myself, but I prefer free, open source tools that let me make low-level changes to suit my needs. If you prefer to work this way, you should enjoy this piece.

In this article, I will explore how you can translate an XML document into a Document Type Definition (DTD), then into a RELAX NG schema, and finally into a W3C XML Schema (WXS) schema. I'll do this with the aid of several open source tools, and I'll also cover a way to validate the original XML instance against the various schemas.

The tools are all Java-based. To get things to work, you will need to have version 1.2 of Java or later installed on your system, have your path and classpath variables set correctly, and be ready to download and install several free tools. I used Java 2 v1.4 when testing the examples in the article. When I use the word install in relation to a JAR file, I mean that it is somewhere on your file system and is within reach of the classpath.

All the schema, instance, and batch files mentioned in this article are stored in a ZIP archive that is available for download.

Translating an XML Document into a DTD

Consider a simple XML document that describes the date of an event in several formats:

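<?xml version="1.0" encoding="UTF-8"?>

<event>
 <description>Final sale of property.</description>
 <date type="ISO">
  <year>...</year>
  <month>...</month>
  <day>...</day>
 </date>
 <date type="Euro">
  <day>...</day>
  <month>...</month>
  <year>...</year>
 </date>
 <date type="US">
  <month>...</month>
  <day>...</day>
  <year>...</year>
 </date>
</event>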

To translate the XML document into a DTD, I'll use Michael Kay's DTDGenerator. Originally, DTDGenerator was part of the Saxon XSLT processor, but it is now distributed separately. At just 17 KB, it's a pretty small download. DTDGenerator does a fair amount of work for you, but it doesn't produce parameter entities, notation declarations, or entity declarations. It's also not namespace-aware, but DTDs aren't inherently aware of namespaces or qualified names anyway.

With dtdgen.jar in the current directory (the -cp option below places it on the classpath), enter the following command line:

java -cp dtdgen.jar DTDGenerator event.xml > event.dtd

This command produces the following output, redirecting it to the file event.dtd:

<!ELEMENT date ( day | month | year )* >
<!ATTLIST date type ( Euro | ISO | US ) #REQUIRED >
<!ELEMENT day ( #PCDATA ) >
<!ELEMENT description ( #PCDATA ) >
<!ELEMENT event ( description, date+ ) >
<!ELEMENT month ( #PCDATA ) >
<!ELEMENT year ( #PCDATA ) >

Of course, this is only one possible DTD for the data model in event.xml. DTDGenerator makes educated guesses about the content models it sees in an instance. The result may not be exactly what you intend, but at least you are several rungs up the ladder.
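One quick way to confirm that the generated declarations at least accept an instance shaped like event.xml is the validating parser that ships with the JDK. The following is a minimal sketch, not part of DTDGenerator: the class name DtdCheck and the date values are made up for illustration, and the declarations are placed in an internal DTD subset so the example needs no external files.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class; the name is an assumption for this sketch.
class DtdCheck {
    // The declarations generated above, wrapped in an internal DTD subset.
    static final String DOCTYPE =
        "<!DOCTYPE event [\n"
      + "<!ELEMENT date ( day | month | year )* >\n"
      + "<!ATTLIST date type ( Euro | ISO | US ) #REQUIRED >\n"
      + "<!ELEMENT day ( #PCDATA ) >\n"
      + "<!ELEMENT description ( #PCDATA ) >\n"
      + "<!ELEMENT event ( description, date+ ) >\n"
      + "<!ELEMENT month ( #PCDATA ) >\n"
      + "<!ELEMENT year ( #PCDATA ) >\n"
      + "]>\n";

    /** Returns true only if the text is well-formed and valid against its DTD. */
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // switch on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            final boolean[] valid = { true };
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                // Validity violations are reported here as recoverable errors.
                public void error(SAXParseException e) { valid[0] = false; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return valid[0];
        } catch (SAXException | IOException e) {
            return false; // not well-formed, or could not be read
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up date values, for illustration only.
        String body = "<event><description>Final sale of property.</description>"
            + "<date type=\"ISO\"><year>2002</year><month>08</month><day>28</day></date>"
            + "</event>";
        System.out.println(isValid(DOCTYPE + body));                      // true
        System.out.println(isValid(DOCTYPE + body.replace("ISO", "UK"))); // false
    }
}
```

The second call fails because UK is not one of the enumerated values for the type attribute, which shows how tightly the generated DTD is bound to the values that happened to occur in the instance.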

There are a few things to note. The element type declarations are listed in alphabetical order, not in the order of appearance in the instance. The content model for the date element allows any number of day, month, or year children in any order, reflecting the variations among the date elements in the instance.

There is only one description element, so the content model in the DTD reflects that. Likewise, because the event element contains more than one date element, the content model allows one or more (date+).
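To make the guessing concrete, here is a rough sketch of how a generator could arrive at content models like the two above. It is not DTDGenerator's actual algorithm; the class ContentModelSketch and its infer method are hypothetical, and the sketch assumes element-only content. If every instance of an element lists its children in the same order, it emits a sequence and marks adjacent repeats with +; otherwise it falls back to a repeatable choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not taken from the DTDGenerator source.
class ContentModelSketch {
    /**
     * Guesses a content model from the child-element names observed in
     * each instance of one element type (element-only content assumed).
     */
    static String infer(List<List<String>> instances) {
        // Child name -> was it ever seen twice in a row? Sorted alphabetically.
        Map<String, Boolean> repeats = new TreeMap<String, Boolean>();
        List<String> shape = null;   // child order shared by all instances so far
        boolean sameShape = true;

        for (List<String> children : instances) {
            // Collapse runs: [description, date, date] becomes [description, date].
            List<String> collapsed = new ArrayList<String>();
            for (String name : children) {
                int last = collapsed.size() - 1;
                if (last >= 0 && collapsed.get(last).equals(name)) {
                    repeats.put(name, true);
                } else {
                    collapsed.add(name);
                    if (!repeats.containsKey(name)) repeats.put(name, false);
                }
            }
            boolean distinct = new HashSet<String>(collapsed).size() == collapsed.size();
            if (!distinct || (shape != null && !shape.equals(collapsed))) sameShape = false;
            if (shape == null) shape = collapsed;
        }

        if (repeats.isEmpty()) return "EMPTY";
        if (sameShape) {
            List<String> parts = new ArrayList<String>();
            for (String name : shape) parts.add(repeats.get(name) ? name + "+" : name);
            return "( " + String.join(", ", parts) + " )";
        }
        // Orders differ between instances: allow any child, any number of times.
        return "( " + String.join(" | ", repeats.keySet()) + " )*";
    }

    public static void main(String[] args) {
        // One event element: a description followed by three date elements.
        System.out.println(infer(Arrays.asList(
            Arrays.asList("description", "date", "date", "date"))));
        // Three date elements whose children come in different orders.
        System.out.println(infer(Arrays.asList(
            Arrays.asList("year", "month", "day"),
            Arrays.asList("day", "month", "year"),
            Arrays.asList("month", "day", "year"))));
    }
}
```

The child orders passed in main are assumptions chosen to mimic ISO, European, and US date conventions. A real generator also has to deal with optional children, mixed content, and attributes, which this sketch leaves out.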

The type attribute has enumerated values only because I tweaked some fields in the source code (DTDGenerator.java) and recompiled. MIN_ENUMERATION_INSTANCES represents the minimum number of times an attribute must appear before it can be treated as an enumeration type. In addition, an attribute is considered an enumeration only if the number of times it appears divided by the number of distinct values it takes is greater than or equal to MIN_ENUMERATION_RATIO. Normally, the value of MIN_ENUMERATION_INSTANCES is 10 (I switched it to 0), and the value of MIN_ENUMERATION_RATIO is 3 (now 1). These changes let me control what is considered an enumeration to suit the document. This is why I like working with open source code: it allows me to make changes to meet specific needs.
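The rule just described can be sketched in a few lines. This is an illustration of the two thresholds as stated in the text, not the code in DTDGenerator.java, which may apply further conditions; the class EnumerationHeuristic, the method isEnumeration, and its parameter names are hypothetical.

```java
// Hypothetical illustration of the two thresholds described above.
class EnumerationHeuristic {
    // Stock values as reported in the article.
    static final int DEFAULT_MIN_INSTANCES = 10;
    static final int DEFAULT_MIN_RATIO = 3;

    /**
     * @param instances      how many times the attribute appears
     * @param distinctValues how many different values it takes
     * @param minInstances   stands in for MIN_ENUMERATION_INSTANCES
     * @param minRatio       stands in for MIN_ENUMERATION_RATIO
     */
    static boolean isEnumeration(int instances, int distinctValues,
                                 int minInstances, int minRatio) {
        if (distinctValues <= 0) return false; // nothing observed
        // instances / distinctValues >= minRatio, written without division
        // so that integer rounding cannot affect the result.
        return instances >= minInstances
            && instances >= (long) minRatio * distinctValues;
    }

    public static void main(String[] args) {
        // In event.xml, type appears 3 times with 3 distinct values.
        System.out.println(isEnumeration(3, 3,
            DEFAULT_MIN_INSTANCES, DEFAULT_MIN_RATIO)); // false
        System.out.println(isEnumeration(3, 3, 0, 1));  // true
    }
}
```

With the stock settings, three occurrences fall short of the minimum of 10, so type would be declared without an enumeration. With the thresholds lowered to 0 and 1, three occurrences over three distinct values give a ratio of exactly 1, which passes.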

Now that we have a DTD, I'll use another tool, DTDinst, to convert it to a RELAX NG schema.
