Normalizing XML, Part 1
As regular readers of the XML Schema Clinic likely know, I tend to view the world of XML through object-oriented glasses. For this installment, though, we're reaching out to the relational data folks, switching lenses for one eye at least. The goal is to see what relational concepts we can usefully apply to XML. Can the normal forms that guide database design be applied meaningfully to XML document design?
Note that we're not talking about mapping relational data to XML. Instead, we assume that XML is the native language for data expression, and attempt to apply the concepts of normalization to schema design. The discussion is organized loosely around the progression of normal forms, first to fifth. As we'll see, these forms won't apply precisely to XML, but we can adhere to the law's spirit, if not its letter. It's possible to develop guidelines for designing W3C XML Schema (WXS) that achieve the goals of normalization:
In this first of two parts, we'll consider the first through third normal forms, and observe that while there are important differences between the XML and relational models, much of the thinking that commonly goes into RDB design can be applied to WXS design as well.
The first normal form in relational theory is so fundamental it's often forgotten. It states that every record in a table must have the same number of fields. Relational data is rectilinear; a table is a matrix of rows and columns.
| Table: Car | ||
|---|---|---|
| Make | VARCHAR(64) | |
| Model | VARCHAR(64) | |
| Year | INTEGER | |
| Color | VARCHAR(64) | |
| PK | VIN | VARCHAR(6) |
Of course, XML is hierarchical by nature. It can capture matrices easily, though, making the table a parent of the row and the row a parent of the field, as shown in the schema fragment below. (Fields could also be expressed as attributes of the row element.)
<element name="Car">
<complexType>
<sequence>
<element name="Make" type="string" />
<element name="Model" type="string" />
<element name="Year" type="integer" />
<element name="Color" type="string" />
<element name="VIN"><simpleType>
<restriction base="string">
<pattern value="[A-Z]{2}[0-9]{4}" />
</restriction>
</simpleType></element>
</sequence>
</complexType>
</element>
The tricky part is that XML can also render non-rectilinear "shapes" for data, thus fitting the type systems of most modern programming languages more naturally than relational data. An XML "record" can vary from first normal form in several ways, all of them legitimate means of data modeling:
maxOccurs attribute for particles.The last feature is the most apparent when looking at an XML document. An XML record is a tree, not a table. However, it's actually the second feature which affects the RDB normal forms. It may at first seem that it's all to the good that we can model multiple children (of simple or complex type) directly under a parent, without having to resort to multiple tables and foreign keys just to express a simple one-to-many relationship. This is certainly an excellent feature of XML, but as we look at applying normal forms one through five to XML, this deviation at the foundation will have consequences all the way up the structure.
What we'll see, specifically, is that second and third normal forms are affected only slightly because they deal explicitly with single-valued facts. Fourth and fifth normal forms address multi-valued facts, and it's here that XML's ability to express such facts robustly in hierarchies will force a reconsideration of the meaning and usefulness of normal forms for XML.
WXS and RDB both use keys as means of identifying records. Note the
"PK" notation in the Car table shown
earlier: this is the field whose value uniquely identifies a given Car
record. WXS has a similar feature, the key. Like RDB keys it
can be defined to include one or more fields, in the latter case creating
a composite key.
One difference, a subtle but ultimately important one, is that WXS keys
cannot be defined by the identified type. Instead, they must be defined
at some enclosing scope: specifically, as part of an element that
contains instances of the identified type. Here's another fragment of the
Car schema
that shows the corresponding key definition: the Car key is
defined as part of the Dealership element.
<element name="Dealership">
<complexType>
<sequence>
<element ref="cars:Car" minOccurs="0" maxOccurs="unbounded" />
</sequence>
</complexType>
<key name="CarKey">
<selector xpath="./cars:Car" />
<field xpath="cars:VIN" />
</key>
</element>
The choice to place the key definition on a parent,
grandparent, or other ancestor affects the scope over which key values are
expected to be unique. We'll observe the impact of this difference in
part two of this article.
I suppose that most design thinking for relational databases is concerned with the second and third normal forms. These are closely related and both assert that a fact expressed by a field in a table should relate only to the primary key (and to the whole key, not to one of several key fields). To do otherwise either establishes invalid correlations or results in redundant expressions of related data.
For example, if we record available rental housing, and include contact information, but we know that a realtor's name, phone, address, etc., will be the same every time that realtor occurs, then packing all this information into one table would be a poor choice and would break (in this case) third normal form:
| Table: HousingUnit | ||
|---|---|---|
| PK | UnitID | VARCHAR(6) |
| StreetAddress | VARCHAR(64) | |
| City | VARCHAR(24) | |
| State | VARCHAR(2) | |
| ContactName | VARCHAR(32) | |
| ContactPhone | VARCHAR(16) | |
| ContactEMail | VARCHAR(64) | |
The correct, normalized design is shown next. This really amounts to common-sense decomposition and is very similar in spirit to OO design in that it insists on single points of maintenance.
| Table: HousingUnit | Table: ContactInfo | |||||
|---|---|---|---|---|---|---|
| PK | UnitID | VARCHAR(6) | PK | Name | VARCHAR(32) | |
| StreetAddress | VARCHAR(64) | Phone | VARCHAR(16) | |||
| City | VARCHAR(24) | VARCHAR(64) | ||||
| State | VARCHAR(2) | |||||
| FK | ContactName | VARCHAR(32) | ||||
Second and third normal forms apply very well to XML. The fact that XML records are not constrained to a rectilinear shape is of little concern here. For instance, the contact-realtor information in our rental-housing example would likely be modeled as a single complex-type child element, but no matter since the point of these forms is to eliminate redundancy, and repeating a complex-type instance is no more healthy than repeating a series of single values.
The WXS technique that allows a schema to observe second and third
normal forms is the keyref, which establishes an association
between XML nodes. For our purposes a keyref functions very
much like a foreign key: it relates to a specific key, and
it asserts similar constraints on the values of referencing fields, so
that all associations between key-related types can be validated and
navigated.
This leads to our first detailed example of XML normalization. The
rental-housing example discussed above is expanded to a more complete
solution as shown in Housing1.xsd. Note that the HousingUnit
type includes the entire contact-information structure by composition:

<complexType name="HousingUnit">
<sequence>
<element name="bedrooms" type="integer" />
<element name="floor" type="integer" />
<element name="wheelchairAccessible" type="boolean" />
<element name="rent" type="positiveInteger" />
<element name="available" type="date" />
<element name="location" type="h:PhysicalAddress" />
<element name="amenities" minOccurs="0">
<complexType>
<sequence>
<element name="amenity" type="string"
minOccurs="0" maxOccurs="unbounded" />
</sequence>
</complexType>
</element>
<element name="contact" type="h:BusinessEntity" />
</sequence>
<attribute name="unitID" type="NCName" />
</complexType>
The problem can be seen in the sample database Listings1.xml. Some contacts are listed for multiple units, and all their information is repeated over and over. This is inefficient; but, what's worse, it threatens consistency. If Judy Trueblood's phone number changes, how can we be assured that it won't be updated in one place and left stale in another? The failure to normalize the XML document leaves us with multiple points of maintenance.
The correction is shown in Housing2.xsd, and in the diagram below. A separate
type is created for realtors, so that each can be recorded just once. The
housing unit can then refer to the realtor, rather than duplicating its
information. (The introduction of this feature as a choice
allows both old and new forms to be used as appropriate.)

<complexType name="HousingUnit">
<sequence>
<element name="bedrooms" type="integer" />
<element name="floor" type="integer" />
<element name="wheelchairAccessible" type="boolean" />
<element name="rent" type="positiveInteger" />
<element name="available" type="date" />
<element name="location" type="h:PhysicalAddress" />
<element name="amenities" minOccurs="0">
<complexType>
<sequence>
<element name="amenity" type="string"
minOccurs="0" maxOccurs="unbounded" />
</sequence>
</complexType>
</element>
<choice>
<element name="contact" type="h:BusinessEntity" />
<element name="realtor" type="string" />
</choice>
</sequence>
<attribute name="unitID" type="NCName" />
</complexType>
The parent element HousingUnitList now keeps a list of
housing units as well as realtors. The foreign-key relationship between
HousingUnit and Realtor is asserted here, using
keyref:
<element name="HousingUnitList">
<complexType>
<sequence>
<element name="HousingUnit" type="h:HousingUnit"
minOccurs="0" maxOccurs="unbounded" />
<element name="RealtorList">
<complexType>
<sequence>
<element name="Realtor" type="h:Realtor"
minOccurs="0" maxOccurs="unbounded" />
</sequence>
</complexType>
<key name="RealtorKey">
<selector xpath="./Realtor" />
<field xpath="name" />
</key>
</element>
</sequence>
</complexType>
<key name="UnitIDKey">
<selector xpath="Unit" />
<field xpath="@unitID" />
</key>
<keyref name="HousingUnitToRealtor" refer="h:RealtorKey">
<selector xpath="./Unit" />
<field xpath="realtor" />
</keyref>
</element>
|
More from XML Schema Clinic |
As a result of this change the redundancy in recording realtor information has been eliminated. Realtors are referenced only by their key values, and any associated information can be updated in one place. Thus, Judy Trueblood can be assured she won't be missing any calls. The revised instance document Listings2.xml shows this. (Incidentally, here is the XSLT that was used to migrate from one schema to the next: Migration.xsl.)
So the key concept of reducing redundancy through key association is alive and well in W3C XML Schema design. While I'd love to finish on this bright note, I must report that there are devils inhabiting the details. In part two of this article, I'll point them out and discuss the implications for WXS design, as well as addressing the subtler fourth and fifth normal forms.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.