Normalizing XML, Part 1

November 13, 2002

As regular readers of the XML Schema Clinic likely know, I tend to view the world of XML through object-oriented glasses. For this installment, though, we're reaching out to the relational data folks, switching lenses for one eye at least. The goal is to see what relational concepts we can usefully apply to XML. Can the normal forms that guide database design be applied meaningfully to XML document design?

Note that we're not talking about mapping relational data to XML. Instead, we assume that XML is the native language for data expression, and attempt to apply the concepts of normalization to schema design. The discussion is organized loosely around the progression of normal forms, first to fifth. As we'll see, these forms won't apply precisely to XML, but we can adhere to the law's spirit, if not its letter. It's possible to develop guidelines for designing W3C XML Schema (WXS) that achieve the goals of normalization:

Eliminate ambiguity in data expression
Minimize redundancy (some would say, "eliminate all redundancy")
Facilitate preservation of data consistency
Allow for rational maintenance of data

In this first of two parts, we'll consider the first through third normal forms, and observe that while there are important differences between the XML and relational models, much of the thinking that commonly goes into RDB design can be applied to WXS design as well.

XML Composition vs. First Normal Form

The first normal form in relational theory is so fundamental it's often forgotten. It states that every record in a table must have the same number of fields. Relational data is rectilinear; a table is a matrix of rows and columns.

Table: Car
	Make	VARCHAR(64)
	Model	VARCHAR(64)
	Year	INTEGER
	Color	VARCHAR(64)
PK	VIN	VARCHAR(6)

Of course, XML is hierarchical by nature. It can capture matrices easily, though, making the table a parent of the row and the row a parent of the field, as shown in the schema fragment below. (Fields could also be expressed as attributes of the row element.)

<element name="Car">
  <complexType>
    <sequence>
      <element name="Make" type="string" />
      <element name="Model" type="string" />
      <element name="Year" type="integer" />
      <element name="Color" type="string" />
      <element name="VIN"><simpleType>
        <restriction base="string">
          <pattern value="[A-Z]{2}[0-9]{4}" />
        </restriction>
      </simpleType></element>
    </sequence>
  </complexType>
</element>

The tricky part is that XML can also render non-rectilinear "shapes" for data, thus fitting the type systems of most modern programming languages more naturally than relational data. An XML "record" can vary from first normal form in several ways, all of them legitimate means of data modeling:

It can have varying numbers of fields (though this is not much different from allowing NULLs in RDB fields) and default values for attributes.
It can have multiple values for a field, through the maxOccurs attribute for particles.
It can have choices of field types instead of a straight sequence or conjunction.
Fields can be of complex type.

The last feature is the most apparent when looking at an XML document. An XML record is a tree, not a table. However, it's actually the second feature which affects the RDB normal forms. It may at first seem that it's all to the good that we can model multiple children (of simple or complex type) directly under a parent, without having to resort to multiple tables and foreign keys just to express a simple one-to-many relationship. This is certainly an excellent feature of XML, but as we look at applying normal forms one through five to XML, this deviation at the foundation will have consequences all the way up the structure.

What we'll see, specifically, is that second and third normal forms are affected only slightly because they deal explicitly with single-valued facts. Fourth and fifth normal forms address multi-valued facts, and it's here that XML's ability to express such facts robustly in hierarchies will force a reconsideration of the meaning and usefulness of normal forms for XML.

Primary Keys

WXS and RDB both use keys as means of identifying records. Note the "PK" notation in the Car table shown earlier: this is the field whose value uniquely identifies a given Car record. WXS has a similar feature, the key. Like RDB keys it can be defined to include one or more fields, in the latter case creating a composite key.

One difference, a subtle but ultimately important one, is that WXS keys cannot be defined by the identified type. Instead, they must be defined at some enclosing scope: specifically, as part of an element that contains instances of the identified type. Here's another fragment of the Car schema that shows the corresponding key definition: the Car key is defined as part of the Dealership element.

<element name="Dealership">
  <complexType>
    <sequence>
      <element ref="cars:Car" minOccurs="0" maxOccurs="unbounded" />
    </sequence>
  </complexType>
  <key name="CarKey">
    <selector xpath="./cars:Car" />
    <field xpath="cars:VIN" />
  </key>
</element>

The choice to place the key definition on a parent, grandparent, or other ancestor affects the scope over which key values are expected to be unique. We'll observe the impact of this difference in part two of this article.

XML Association and Second and Third Normal Forms

I suppose that most design thinking for relational databases is concerned with the second and third normal forms. These are closely related and both assert that a fact expressed by a field in a table should relate only to the primary key (and to the whole key, not to one of several key fields). To do otherwise either establishes invalid correlations or results in redundant expressions of related data.

For example, if we record available rental housing, and include contact information, but we know that a realtor's name, phone, address, etc., will be the same every time that realtor occurs, then packing all this information into one table would be a poor choice and would break (in this case) third normal form:

Table: HousingUnit
PK	UnitID	VARCHAR(6)
	StreetAddress	VARCHAR(64)
	City	VARCHAR(24)
	State	VARCHAR(2)
	ContactName	VARCHAR(32)
	ContactPhone	VARCHAR(16)
	ContactEMail	VARCHAR(64)

The correct, normalized design is shown next. This really amounts to common-sense decomposition and is very similar in spirit to OO design in that it insists on single points of maintenance.

Table: HousingUnit			Table: ContactInfo
PK	UnitID	VARCHAR(6)	PK	Name	VARCHAR(32)
	StreetAddress	VARCHAR(64)		Phone	VARCHAR(16)
	City	VARCHAR(24)		EMail	VARCHAR(64)
	State	VARCHAR(2)
FK	ContactName	VARCHAR(32)

Second and third normal forms apply very well to XML. The fact that XML records are not constrained to a rectilinear shape is of little concern here. For instance, the contact-realtor information in our rental-housing example would likely be modeled as a single complex-type child element, but no matter since the point of these forms is to eliminate redundancy, and repeating a complex-type instance is no more healthy than repeating a series of single values.

The WXS technique that allows a schema to observe second and third normal forms is the keyref, which establishes an association between XML nodes. For our purposes a keyref functions very much like a foreign key: it relates to a specific key, and it asserts similar constraints on the values of referencing fields, so that all associations between key-related types can be validated and navigated.

This leads to our first detailed example of XML normalization. The rental-housing example discussed above is expanded to a more complete solution as shown in Housing1.xsd. Note that the HousingUnit type includes the entire contact-information structure by composition:

<complexType name="HousingUnit">
  <sequence>
    <element name="bedrooms" type="integer" />
    <element name="floor" type="integer" />
    <element name="wheelchairAccessible" type="boolean" />
    <element name="rent" type="positiveInteger" />
    <element name="available" type="date" />
    <element name="location" type="h:PhysicalAddress" />
    <element name="amenities" minOccurs="0">
      <complexType>
        <sequence>
          <element name="amenity" type="string" 
            minOccurs="0" maxOccurs="unbounded" />
        </sequence>
      </complexType>
    </element>
    <element name="contact" type="h:BusinessEntity" />
  </sequence>
  <attribute name="unitID" type="NCName" />
</complexType>

The problem can be seen in the sample database Listings1.xml. Some contacts are listed for multiple units, and all their information is repeated over and over. This is inefficient; but, what's worse, it threatens consistency. If Judy Trueblood's phone number changes, how can we be assured that it won't be updated in one place and left stale in another? The failure to normalize the XML document leaves us with multiple points of maintenance.

The correction is shown in Housing2.xsd, and in the diagram below. A separate type is created for realtors, so that each can be recorded just once. The housing unit can then refer to the realtor, rather than duplicating its information. (The introduction of this feature as a choice allows both old and new forms to be used as appropriate.)

<complexType name="HousingUnit">
  <sequence>
    <element name="bedrooms" type="integer" />
    <element name="floor" type="integer" />
    <element name="wheelchairAccessible" type="boolean" />
    <element name="rent" type="positiveInteger" />
    <element name="available" type="date" />
    <element name="location" type="h:PhysicalAddress" />
    <element name="amenities" minOccurs="0">
      <complexType>
        <sequence>
          <element name="amenity" type="string" 
            minOccurs="0" maxOccurs="unbounded" />
        </sequence>
      </complexType>
    </element>
    <choice>
      <element name="contact" type="h:BusinessEntity" />
      <element name="realtor" type="string" />
    </choice>
  </sequence>
  <attribute name="unitID" type="NCName" />
</complexType>

The parent element HousingUnitList now keeps a list of housing units as well as realtors. The foreign-key relationship between HousingUnit and Realtor is asserted here, using keyref:

<element name="HousingUnitList">
  <complexType>
    <sequence>
      <element name="HousingUnit" type="h:HousingUnit" 
        minOccurs="0" maxOccurs="unbounded" />
      <element name="RealtorList">
        <complexType>
          <sequence>
            <element name="Realtor" type="h:Realtor"
              minOccurs="0" maxOccurs="unbounded" />
          </sequence>
        </complexType>
        <key name="RealtorKey">
          <selector xpath="./Realtor" />
          <field xpath="name" />
        </key>
      </element>
    </sequence>
  </complexType>
  <key name="UnitIDKey">
    <selector xpath="Unit" />
    <field xpath="@unitID" />
  </key>
  <keyref name="HousingUnitToRealtor" refer="h:RealtorKey">
    <selector xpath="./Unit" />
    <field xpath="realtor" />
  </keyref>
</element>

The Plot Thickens

So the key concept of reducing redundancy through key association is alive and well in W3C XML Schema design. While I'd love to finish on this bright note, I must report that there are devils inhabiting the details. In part two of this article, I'll point them out and discuss the implications for WXS design, as well as addressing the subtler fourth and fifth normal forms.