
Normalizing XML, Part 2
When one of the many father-son carpentry projects of my childhood would make its inevitable leap from confidence to confusion, the elder Provost's face would acquire a strangely bemused quality as he pronounced the day's lesson: "Nothing is simple."
For better or worse, it's on that note that we resume our discussion of the applicability of concepts of data normalization to XML document design. In part one of this article, we observed that while XML's hierarchical model is somewhat at odds with the rectangular structure of relational data, the goals of data normalization as stated in the relational model are certainly worthwhile. We've also seen the usefulness of the normal forms of relational theory -- perhaps not applied literally, but posed as challenges to find equally strict guidelines for XML design. The basic trick of RDB normalization -- the foreign-key relationship -- has been duplicated for W3C XML Schema (WXS) using the key and keyref components to avoid redundancy in a model. Everything's just humming along.
But nothing is simple. In this second and final part, we'll look at some of the subtler issues of "normalized" XML data design, as well as complete our run through the normal forms to see how well they apply to XML.
When Not to Normalize
The rental-housing example from part one shows the basic technique of defining an XML association between complex types so that instances of one of those types can be referenced by multiple instances of the other. (See the schema Housing2.xsd and instance document Listings2.xml.) This addresses the goals of the second and third normal forms by eliminating redundant statements of fact in the XML document.
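As a reminder of what such an association looks like, here is a minimal sketch of the key/keyref technique. It is not the contents of Housing2.xsd: the element and attribute names (Listings, Landlord, Unit, id, landlord) and the constraint names are invented for illustration, and the sketch uses an xs: prefix and no target namespace so that the unprefixed refer value resolves to the key declared beside it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical names throughout; a sketch, not the Housing2.xsd schema. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Listings">
    <xs:complexType>
      <xs:sequence>
        <!-- Each landlord's details are stated exactly once... -->
        <xs:element name="Landlord" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Name" type="xs:string" />
              <xs:element name="Phone" type="xs:string" />
            </xs:sequence>
            <xs:attribute name="id" type="xs:string" use="required" />
          </xs:complexType>
        </xs:element>
        <!-- ...and any number of units may point at the same landlord. -->
        <xs:element name="Unit" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Address" type="xs:string" />
            </xs:sequence>
            <xs:attribute name="landlord" type="xs:string" use="required" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <!-- The "primary key": every Landlord must carry a unique id. -->
    <xs:key name="LandlordKey">
      <xs:selector xpath="Landlord" />
      <xs:field xpath="@id" />
    </xs:key>
    <!-- The "foreign key": every Unit's landlord value must match some LandlordKey. -->
    <xs:keyref name="UnitToLandlord" refer="LandlordKey">
      <xs:selector xpath="Unit" />
      <xs:field xpath="@landlord" />
    </xs:keyref>
  </xs:element>
</xs:schema>
```

Under this sketch, two Unit elements carrying the same landlord value share one Landlord record, and a validating parser should reject a Unit whose landlord value matches no Landlord id.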
It's easy to carry this sort of decomposition too far, however, and it's especially tempting for those with relational backgrounds to overuse XML association. In relational database design, foreign keys are commonly used for either of two purposes -- on the one hand, one-to-many relationships; and, on the other, many-to-many or many-to-one relationships, in which multiple objects share some other value, such that if the referenced value changes, it changes for all referencing objects. Because the statement of a foreign-key relationship alone does not distinguish these cases, SQL includes semantics for "cascading delete" to assert a truly compositional relationship between tables.
In XML, though, multiple cardinality can be managed through simple composition. (Here again is that fundamental difference in data shape we discussed in part one.) So, in XML, the mere fact of a one-to-many relationship does not in itself call for association through keys. A good rule of thumb for relational database designers: if you would have applied a cascading delete to a foreign-key relationship, then you're talking about composition in XML.
Here's a quick example: a product-order model including Order records and separate Item records:
| Table: Order | | | Table: Item | | |
|---|---|---|---|---|---|
| PK | OrderNumber | INTEGER | PK | SKU | VARCHAR(12) |
| | CustomerName | VARCHAR(64) | | Name | VARCHAR(32) |
| | Date | DATETIME | | Price | CURRENCY |
| | | | | Quantity | INTEGER |
| | | | FK | OrderNumber | INTEGER |
The two-table decomposition is made necessary by the multiplicity of Items per Order; this is not a matter of association but of composition. (The likely association is actually between Item and a third type, Product, using SKU as a key, yielding an attributed, many-to-many relationship from Order to Product.)
An XML document would express the items as children of the order,
as shown below. This model enforces the compositional
relationship in ways that a key/keyref association
would not -- in the latter case it would be possible for two
Orders to share a line Item.
<complexType name="Order">
  <sequence>
    <element name="OrderNumber" type="integer" />
    <element name="CustomerName" type="string" />
    <element name="Date" type="date" />
    <!-- Items are composed into the Order; maxOccurs allows many Items per Order. -->
    <element name="Item" maxOccurs="unbounded">
      <complexType>
        <sequence>
          <element name="SKU" type="string" />
          <element name="Name" type="string" />
          <element name="Price" type="decimal" />
          <element name="Quantity" type="integer" />
        </sequence>
      </complexType>
    </element>
  </sequence>
</complexType>
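To make the composition concrete, here is a sketch of an instance fragment that a type like this might describe. It rests on several assumptions: a global element named Order declared with the Order type, no target namespace, and Item permitted to repeat (maxOccurs="unbounded"). All values are invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance; every value here is made up. -->
<Order>
  <OrderNumber>1001</OrderNumber>
  <CustomerName>Pat Example</CustomerName>
  <Date>2024-01-15</Date>
  <!-- Each Item lives inside exactly one Order; no key is needed to say so. -->
  <Item>
    <SKU>HAMMER-16OZ</SKU>
    <Name>Claw hammer</Name>
    <Price>14.95</Price>
    <Quantity>1</Quantity>
  </Item>
  <Item>
    <SKU>NAIL-8D-1LB</SKU>
    <Name>8d common nails</Name>
    <Price>3.49</Price>
    <Quantity>2</Quantity>
  </Item>
</Order>
```

Deleting the Order element removes its Items with it, which is the document-level counterpart of the cascading delete. Because each Item is physically nested in one Order, there is no way for a second Order to claim the same line Item.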