Normalizing XML, Part 2

December 4, 2002

When one of the many father-son carpentry projects of my childhood would make its inevitable leap from confidence to confusion, the elder Provost's face would acquire a strangely bemused quality as he pronounced the day's lesson: "Nothing is simple."

For better or worse, it's on that note that we resume our discussion of the applicability of concepts of data normalization to XML document design. In part one of this article, we observed that while XML's hierarchical model is somewhat at odds with the rectangular structure of relational data, the goals of data normalization as stated in the relational are certainly worthwhile. We've also seen the usefulness of the normal forms of relational theory -- perhaps not applied literally, but posed as challenges to find equally strict guidelines for XML design. The basic trick of RDB normalization -- the foreign-key relationship -- has been duplicated for W3C XML Schema (WXS) using the key and keyref components to avoid in a model. Everything's just humming along.

But nothing is simple. In this second and final part, we'll look at some of the subtler issues of "normalized" XML data design, as well as complete our run through the normal forms to see how well they apply to XML.

When Not to Normalize

The rental-housing example from part one shows the basic technique of defining an XML association between complex types so that instances of one of those types can be referenced by multiple instances of the other. (See the schema Housing2.xsd and instance document Listings2.xml.) This addresses the goals of the second and third normal forms by eliminating redundant statements of fact in the XML document.

It's easy to carry this sort of decomposition too far, however, and it's especially tempting for those with relational backgrounds to overuse XML association. In relational database design, foreign keys are commonly used for either of two purposes -- on the one hand, one-to-many relationships; and, on the other, many-to-many or many-to-one relationships, in which multiple objects share some other value, such that if the referenced value changes, it changes for all referencing objects. Because the statement of a foreign-key relationship alone does not distinguish these cases, SQL includes semantics for "cascading delete" to assert a truly compositional relationship between tables.

In XML, though, multiple cardinality can be managed through simple composition. (Here again is that fundamental difference in data shape we discussed in part one.) So, in XML, the mere fact of a one-to-many relationship does not in itself call for association through keys. A good rule of thumb for relational database designers: if you would have applied a cascading delete to a foreign-key relationship, then you're talking about composition in XML.

Here's a quick example: a product-order model including Order records and separate Item records:

Table: Order			Table: Item
PK	OrderNumber	INTEGER	PK	SKU	VARCHAR(12)
	CustomerName	VARCHAR(64)		Name	VARCHAR(32)
	Date	DATETIME		Price	CURRENCY
				Quantity	INTEGER
			FK	OrderNumber	INTEGER

The two-table decomposition is made necessary by the multiplicity of Items per Order; this is not a matter of association but of composition. (The likely association is actually between Item and a third type, Product, using SKU as a key, yielding an attributed, many-to-many relationship from Order to Product.)

An XML document would express the items as children of the order, as shown below. This model enforces the compositional relationship in ways that a key/keyref association would not -- in the latter case it would be possible for two Orders to share a line Item.

<complexType name="Order">

  <sequence>

    <element name="OrderNumber" type="integer" />

    <element name="CustomerName" type="string" />

    <element name="Date" type="date" />

    <element name="Item">

      <complexType>

        <sequence>

          <element name="SKU" type="string" />

          <element name="Name" type="string" />

          <element name="Price" type="decimal" />

          <element name="Quantity" type="integer" />

        </sequence>

      </complexType>

    </element>

  </sequence>

</complexType>

Scope of Uniqueness

Another important difference between relational database schemas and WXS concerns the scope of key uniqueness. A primary key in a relational database must be unique within its database instance. By contrast, a WXS key is defined for some element to govern uniqueness of key values within an instance of that element. Thus it is simple enough to assert uniqueness for an element or attribute at some scope smaller than the instance document. This allows "global" uniqueness to be enforced through progressively smaller scopes, such that the global identifier for a datum is a path consisting of several tokens, rather than a single value.

An airline, for example, might record its staffing schedule in a hierarchy from Airline to Flight to Date to Position. The last element would include a position name and the name of the employee filling that position. (Yes, there'd probably be an employee key, instead, but we have to stop somewhere.)

If we want to assert uniqueness over position name, we'd have to do so only for a certain flight on a certain date; that is, while we can't have two captains on the plane, we certainly need one for each plane that leaves the ground. So the "path" to a particular staffing fact would be expressed as //airline/flight/date/position/Employee. If this looks a lot like XPath, no wonder. This path-based addressing fits XML's hierarchical structure, and the ability to define WXS keys at subdocument scopes supports paths and relieves the document designer from the need to attach a global ID -- which seldom has any domain relevance -- to every datum, as is common practice in relational database design.

There is a downside to this facility, however. The trick is that a keyref can't be defined to traverse multiple scopes. It can't reference multiple keys, only one key through its refer attribute. So a keyref must work at the same scope as the referenced key in order to be effective. This poses some problems when defining associations.

Consider a simple workflow model, in which an Actor defines available inputs and outputs by name and type, and a Flow defines connections from Actor to Actor, specifying the wiring from source outputs to destination inputs. Not shown in the UML below is the encompassing element Process, which collects Flow and Actor instances to define some abstract workflow.

A Flow references two Actors; for each of these references a keyref is defined, as shown in this fragment of the total schema Workflow1.xsd:

<element name="process" type="work:Process" >

  <key name="ActorKey" >

    <selector xpath="./work:actor" />

    <field xpath="work:name" />

  </key>

  <key name="FlowKey" >

    <selector xpath="./work:flow" />

    <field xpath="work:sourceActor" />

    <field xpath="work:destinationActor" />

  </key>

  <keyref name="FlowSource" refer="work:ActorKey" >

    <selector xpath="./work:flow/work:sourceActor" />

    <field xpath="." />

  </keyref>

  <keyref name="FlowDestination" refer="work:ActorKey" >

    <selector xpath="./work:flow/work:destinationActor" />

    <field xpath="." />

  </keyref>

</element>

We encounter a problem at the next level of the hierarchy. How can we assert that a Connection references two Endpoints? Endpoint instances must be named uniquely, but only within each Actor instance, as shown above. If we try to reference this key from a parent scope (such as Process) or a sibling scope (Flow), there's no way to express that we want to reference an Endpoint by name within a particular Actor. (We might hope that the parser would be smart enough to narrow the scope automatically to the Actor instance referenced by the parent Flow, but this is asking too much, and certainly isn't supported in the WXS specification.) Owing to this limitation, the schema does not assert any association from Connection to Endpoint; if the input or output names in RequestMedicalProcedure1.xml were not accurate, validation would not catch the problem.

Possible workarounds include:

Breaking compositions in the referenced structure into associations, making the corresponding keys global in scope and thus easy to reference. For instance, Endpoints could be defined outside of, and referenced by, Actors. This gains a possibly-valid reference (e.g. from Connection to Endpoint) but loses the aforementioned advantage of composition.
A global ID could be defined for each referenced element. This preserves composition while allowing global key reference. This is the approach taken in Workflow2.xsd; note the new ID attribute, which must be managed manually or by the application or some authoring tool.
Leave the WXS domain to enforce and to navigate the desired association. For instance, this validating transform could enforce the missing workflow constraint. Application logic would also have to be written to help navigate from a Connection to the corresponding Endpoints, probably as DOM nodes.

XML Composition Obviates Fourth Normal Form

Having confronted some of the subtleties of dealing with second and third normal forms, we now proceed to fourth normal form, which is the first form to deal with multivalued facts. Fourth normal form prohibits overlapping multivalued facts in one record. That is to say, a table that attempts singlehandedly to implement multiple one-to-many relationships breaks fourth normal form and should be decomposed into one table for each such relationship. For example, if we want to associate multiple phone numbers and multiple e-mail addresses with individual people, we might be tempted to pack this information into one table, such as:

Table: ContactInfo
	PersonID	INTEGER
PK	PhoneNumber	VARCHAR(24)
PK	EMailAddress	VARCHAR(128)

This record structure would capture necessary information, but the table design leaves the relationship between phone number and e-mail address unclear. Even with our knowledge that they are unrelated, we'll have maintenance troubles. If a phone number is removed, the e-mail address in that same row must be preserved, so the PhoneNumber field would have to be left NULL -- even if another record might have a phone number and no e-mail address! And what's the proper primary-key definition? This example takes a safe approach, but does this reflect the real state of things? How would foreign keys reference this table?

All of these problems are manageable, but what looked like an elegant table design is now exposed as a kludgey solution. Clearly, the following design is better:

Table: PhoneNumbers				Table: EMailAddresses
	PersonID	INTEGER			PersonID	INTEGER
PK	PhoneNumber	VARCHAR(24)		PK	EMailAddress	VARCHAR(128)

Note that fourth normal form is interesting only if first normal form is strictly observed. In XML it isn't. As we discussed in part one of this article, XML can vary from first normal form in many ways. Most significant here is the fact that an XML record can easily manage independent multivalued facts, using composition. XML data, once again, is not strictly rectangular, and fourth normal form has no real meaning when applied to an XML tree. The schema for the contact-info model is therefore a little more natural in WXS; it might include additional personal information as well, which is presumably a bad idea in the relational database solution above:

<complexType name="ContactInfo">

  <sequence>

    <element name="Name" type="string" />

    <element name="SSID" type="myNS:MySSIDType" />

    <element name="FavoriteColor" type="myNS:MyColorType" />

    <element name="PhoneNumber" type="string" 

      minOccurs="0" maxOccurs="unbounded" />

    <element name="EMailAddress" type="string" 

      minOccurs="0" maxOccurs="unbounded" />

  </sequence>

</complexType>

Note that querying on structures such as PersonalInfo would be a bit tricky with SQL -- which, also likes rectangles. This is one of many scenarios motivating the development of XQuery, which can retrieve and parse tree-shaped result sets.