Normalizing XML, Part 2
by Will Provost
XML Composition Obviates Fourth Normal Form
Having confronted some of the subtleties of dealing with second and third normal forms, we now proceed to fourth normal form, which is the first form to deal with multivalued facts. Fourth normal form prohibits overlapping multivalued facts in one record. That is to say, a table that attempts singlehandedly to implement multiple one-to-many relationships breaks fourth normal form and should be decomposed into one table for each such relationship. For example, if we want to associate multiple phone numbers and multiple e-mail addresses with individual people, we might be tempted to pack this information into one table, such as:
| Table: ContactInfo | | |
|---|---|---|
| | PersonID | INTEGER |
| PK | PhoneNumber | VARCHAR(24) |
| PK | EMailAddress | VARCHAR(128) |
This record structure would capture necessary information, but the table design leaves the relationship between phone number and e-mail address unclear. Even with our knowledge that they are unrelated, we'll have maintenance troubles. If a phone number is removed, the e-mail address in that same row must be preserved, so the PhoneNumber field would have to be left NULL -- even if another record might have a phone number and no e-mail address! And what's the proper primary-key definition? This example takes a safe approach, but does this reflect the real state of things? How would foreign keys reference this table?
All of these problems are manageable, but what looked like an elegant table design is now exposed as a kludgey solution. Clearly, the following design is better:
| Table: PhoneNumbers | | | Table: EMailAddresses | | |
|---|---|---|---|---|---|
| | PersonID | INTEGER | | PersonID | INTEGER |
| PK | PhoneNumber | VARCHAR(24) | PK | EMailAddress | VARCHAR(128) |
Note that fourth normal form is interesting only if first normal form is strictly observed. In XML it isn't. As we discussed in part one of this article, XML can vary from first normal form in many ways. Most significant here is the fact that an XML record can easily manage independent multivalued facts, using composition. XML data, once again, is not strictly rectangular, and fourth normal form has no real meaning when applied to an XML tree. The schema for the contact-info model is therefore a little more natural in W3C XML Schema (WXS); it might include additional personal information as well, which would presumably be a bad idea in the relational database solution above:
<complexType name="ContactInfo">
  <sequence>
    <element name="Name" type="string" />
    <element name="SSID" type="myNS:MySSIDType" />
    <element name="FavoriteColor" type="myNS:MyColorType" />
    <element name="PhoneNumber" type="string"
             minOccurs="0" maxOccurs="unbounded" />
    <element name="EMailAddress" type="string"
             minOccurs="0" maxOccurs="unbounded" />
  </sequence>
</complexType>
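To make the composition concrete, here is a sketch of what one instance of this type might look like. The `Contact` element name, the `urn:example:myNS` namespace URI, the assumption that local elements are namespace-qualified, and all of the values are hypothetical; the article does not show the definitions of `MySSIDType` or `MyColorType`, so the values given for those elements are placeholders only.

```xml
<!-- Hypothetical instance of ContactInfo. The Contact element, the
     namespace URI, and every value below are assumptions for illustration. -->
<Contact xmlns="urn:example:myNS">
  <Name>Pat Example</Name>
  <SSID>000-00-0000</SSID>
  <FavoriteColor>blue</FavoriteColor>
  <!-- Two phone numbers and one e-mail address: the two repeating
       elements vary independently, with no padding rows or NULLs. -->
  <PhoneNumber>555-0100</PhoneNumber>
  <PhoneNumber>555-0101</PhoneNumber>
  <EMailAddress>pat@example.com</EMailAddress>
</Contact>
```

Removing a phone number here simply removes one element; nothing about the e-mail addresses has to change.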
Note that querying on structures such as ContactInfo would be a bit tricky with SQL -- which also likes rectangles. This is one of many scenarios motivating the development of XQuery, which can retrieve and parse tree-shaped result sets.
Fifth Normal Form Strikes Back!
So if fourth normal form is irrelevant, can we safely ignore fifth normal form? Not really. Where fourth normal form concerns unrelated multivalued facts, fifth normal form addresses the issue of how to handle multivalued facts that are related by some additional rule. Where there are cycles of relationships between more than two record types, fifth normal form enforces complete decomposition of those relationships.
The spirit of this rule certainly applies well to XML. Let's say we need to record information on musicians: what instruments they play and what styles of music they know. If instrument and style are independent, then this is a problem of the fourth normal form. But let's add the rule that only certain instruments are appropriate to certain musical styles. Do we list each pairing of instrument and style in a collection under a musician? This would be appropriate if the pairings were chosen by individual musicians, but if we're stating a general rule that excludes, say, rock and roll clarinet playing, then we should capture this in a separate tree (or matrix, really) and keep the relationships to instrument and style independent under each musician.
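One way to picture this decomposition is the instance sketch below. Every element and attribute name in it is hypothetical, as is the sample data; it is meant only to show the shape of the idea, not a schema from this series.

```xml
<!-- Hypothetical document: all element and attribute names are assumptions. -->
<MusicData>
  <!-- The general rule, stated once: which instruments suit which styles.
       Rock and roll clarinet is excluded simply by being absent. -->
  <StyleInstrumentRules>
    <Allowed style="jazz" instrument="clarinet" />
    <Allowed style="jazz" instrument="guitar" />
    <Allowed style="rock and roll" instrument="guitar" />
  </StyleInstrumentRules>
  <!-- Each musician lists instruments and styles independently. -->
  <Musician name="Example Player">
    <Instrument>clarinet</Instrument>
    <Instrument>guitar</Instrument>
    <Style>jazz</Style>
    <Style>rock and roll</Style>
  </Musician>
</MusicData>
```

The pairings that apply to a given musician can then be derived by matching that musician's instruments and styles against the shared rules, rather than being restated under every musician.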
The benefits of fifth normal form in storage efficiency are a little harder to quantify for XML than for relational databases, but they are certainly there, as well. The broader point of fifth normal form, as with all the others, is to avoid redundant statements of fact, and that is as valid for XML as for relational data.
- why the namespace in
?
2003-10-30 16:51:24 Jiho Han
I'm a little lost with the whole namespace thing.
In Listing2.xml, the realtor elements inside HousingUnit elements are qualified with xmlns:h="http://www.mybigrealtyco.com/WS/Housing". Also RealtorList element is qualified with the same namespace URI. Why is that?
Aren't these elements all defined in the same schema?
- Author's reply
2003-11-05 12:24:19 Will Provost
No, you're right -- the xmlns:h bits are pointless. Sorry to have left them in, as they're a bit misleading. (I created Listing2.xml using an XSLT tool that generated these extra declarations and failed to notice them before posting the code.) Since everything is using the default namespace declaration at the root anyway, the document is valid, but the declarations could be removed.
- possible typo?
2003-01-27 11:05:18 Joe Hubert
The Housing2.xsd schema includes this snippet within the HousingUnitList element:
<key name="UnitIDKey">
  <selector xpath="Unit" />
  <field xpath="@unitID" />
</key>
Should the xpath be "HousingUnit" instead of "Unit"?
- possible typo?
2003-01-27 12:30:14 Will Provost
That's correct. In fact, in double-checking this I discovered that there are also some missing namespace prefixes; the result was that the instance document was being incorrectly passed because the field xpath was resolving to an empty set.
The correct lines for the keys and keyref are as follows:
          <key name="RealtorKey">
            <selector xpath="h:Realtor" />
            <field xpath="h:contact/h:name" />
          </key>
        </element>
      </sequence>
    </complexType>
    <key name="UnitIDKey">
      <selector xpath="h:HousingUnit" />
      <field xpath="@unitID" />
    </key>
    <keyref name="HousingUnitToRealtor" refer="h:RealtorKey">
      <selector xpath="h:HousingUnit" />
      <field xpath="h:realtor" />
    </keyref>
  </element>
</schema>
Good catch! Thanks.
