Normalizing XML, Part 2
by Will Provost
XML Composition Obviates Fourth Normal Form
Having confronted some of the subtleties of dealing with second and third normal forms, we now proceed to fourth normal form, which is the first form to deal with multivalued facts. Fourth normal form prohibits overlapping multivalued facts in one record. That is to say, a table that attempts singlehandedly to implement multiple one-to-many relationships breaks fourth normal form and should be decomposed into one table for each such relationship. For example, if we want to associate multiple phone numbers and multiple e-mail addresses with individual people, we might be tempted to pack this information into one table, such as:
Table: ContactInfo

| Key | Column | Type |
|---|---|---|
| | PersonID | INTEGER |
| PK | PhoneNumber | VARCHAR(24) |
| PK | EMailAddress | VARCHAR(128) |
This record structure would capture necessary information, but the table design leaves the relationship between phone number and e-mail address unclear. Even with our knowledge that they are unrelated, we'll have maintenance troubles. If a phone number is removed, the e-mail address in that same row must be preserved, so the PhoneNumber field would have to be left NULL -- even if another record might have a phone number and no e-mail address! And what's the proper primary-key definition? This example takes a safe approach, but does this reflect the real state of things? How would foreign keys reference this table?
All of these problems are manageable, but what looked like an elegant table design is now exposed as a kludgey solution. Clearly, the following design is better:
Table: PhoneNumbers

| Key | Column | Type |
|---|---|---|
| | PersonID | INTEGER |
| PK | PhoneNumber | VARCHAR(24) |

Table: EMailAddresses

| Key | Column | Type |
|---|---|---|
| | PersonID | INTEGER |
| PK | EMailAddress | VARCHAR(128) |
Note that fourth normal form is interesting only if first normal form is strictly observed. In XML it isn't. As we discussed in part one of this article, XML can vary from first normal form in many ways. Most significant here is the fact that an XML record can easily manage independent multivalued facts, using composition. XML data, once again, is not strictly rectangular, and fourth normal form has no real meaning when applied to an XML tree. The schema for the contact-info model is therefore a little more natural in WXS; it might include additional personal information as well, which is presumably a bad idea in the relational database solution above:
<complexType name="ContactInfo">
  <sequence>
    <element name="Name" type="string" />
    <element name="SSID" type="myNS:MySSIDType" />
    <element name="FavoriteColor" type="myNS:MyColorType" />
    <element name="PhoneNumber" type="string"
             minOccurs="0" maxOccurs="unbounded" />
    <element name="EMailAddress" type="string"
             minOccurs="0" maxOccurs="unbounded" />
  </sequence>
</complexType>
Note that querying on structures such as ContactInfo would be a bit tricky with SQL, which also likes rectangles. This is one of many scenarios motivating the development of XQuery, which can retrieve and parse tree-shaped result sets.
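For illustration, an instance of the ContactInfo type might look like the following sketch. The `Contact` element name and all values are assumptions made for this example. The article does not show `MySSIDType` or `MyColorType`, so the `SSID` and `FavoriteColor` values are placeholders that may not match those types.

```xml
<!-- Hypothetical instance of ContactInfo. The element name "Contact" is an
     assumption, and namespace declarations are omitted for brevity. -->
<Contact>
  <Name>Pat Example</Name>
  <!-- Placeholder values: MySSIDType and MyColorType are not defined here. -->
  <SSID>000-00-0000</SSID>
  <FavoriteColor>blue</FavoriteColor>
  <!-- Two independent multivalued facts sit side by side in one record. -->
  <PhoneNumber>555-0100</PhoneNumber>
  <PhoneNumber>555-0101</PhoneNumber>
  <EMailAddress>pat@example.com</EMailAddress>
</Contact>
```

Deleting one `PhoneNumber` element here leaves the `EMailAddress` element untouched, and nothing has to be set to NULL to preserve it.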
Fifth Normal Form Strikes Back!
So if fourth normal form is irrelevant, can we safely ignore fifth normal form? Not really. Where fourth normal form concerns unrelated multivalued facts, fifth normal form addresses the issue of how to handle multivalued facts that are related by some additional rule. Where there are cycles of relationships between more than two record types, fifth normal form enforces complete decomposition of those relationships.
The spirit of this rule certainly applies well to XML. Let's say we need to record information on musicians: what instruments they play and what styles of music they know. If instrument and style are independent, then this is a fourth-normal-form problem. But let's add the rule that only certain instruments are appropriate to certain musical styles. Do we list each pairing of instrument and style in a collection under a musician? This would be appropriate if the pairings were chosen by individual musicians, but if we're stating a general rule that excludes, say, rock and roll clarinet playing, then we should capture this rule in a separate tree (or matrix, really) and keep the relationships to instrument and style independent under each musician.
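The musician example might be laid out as in the sketch below. Every element and attribute name is hypothetical, as are the sample pairings. The point is only that the general rule about which instruments suit which styles is stated once, apart from the per-musician lists.

```xml
<!-- Hypothetical layout; every name below is invented for illustration. -->
<MusicData>
  <!-- General rule, stated once: which instruments suit which styles. -->
  <StyleInstruments>
    <Pairing style="Jazz" instrument="Clarinet" />
    <Pairing style="Jazz" instrument="Guitar" />
    <Pairing style="RockAndRoll" instrument="Guitar" />
    <!-- No RockAndRoll/Clarinet pairing, so that combination is excluded. -->
  </StyleInstruments>

  <!-- Per-musician facts: instruments and styles are listed independently. -->
  <Musician>
    <Name>Pat Example</Name>
    <Instrument>Clarinet</Instrument>
    <Instrument>Guitar</Instrument>
    <Style>Jazz</Style>
    <Style>RockAndRoll</Style>
  </Musician>
</MusicData>
```

In this layout, a musician's valid instrument and style combinations are derived by checking the two lists against the shared pairings, so the rule is never restated under each musician.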
The benefits of fifth normal form in storage efficiency are a little harder to quantify for XML than for relational databases, but they are certainly there, as well. The broader point of fifth normal form, as with all the others, is to avoid redundant statements of fact, and that is as valid for XML as for relational data.