Normalizing XML, Part 2
by Will Provost
Scope of Uniqueness
Another important difference between relational database schemas and
WXS concerns the scope of key uniqueness. A primary key in a relational
database must be unique within its table, across the whole database
instance. By contrast, a WXS key is defined for some element to govern
uniqueness of key values within an instance of that element. Thus it is
simple enough to assert uniqueness for an element or attribute at some
scope smaller than the instance document. This allows "global"
uniqueness to be enforced through progressively smaller scopes, such
that the global identifier for a datum is a path consisting of several
tokens, rather than a single value.
An airline, for example, might record its staffing schedule in a
hierarchy from Airline to Flight to Date to Position. The last element
would include a position name and the name of the employee
filling that position. (Yes, there'd probably be an employee
key, instead, but we have to stop somewhere.)
If we want to assert uniqueness over position name, we'd have
to do so only for a certain flight on a certain date; that is,
while we can't have two captains on the plane, we certainly need
one for each plane that leaves the ground. So the
"path" to a particular staffing fact would be expressed
as
//airline/flight/date/position/Employee.
If this looks a lot like XPath, no wonder. This path-based
addressing fits XML's hierarchical structure, and the ability to
define WXS keys at subdocument scopes supports such paths
and relieves the document designer from the need to attach a
global ID -- which seldom has any domain relevance -- to every
datum, as is common practice in relational database design.
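To make the airline example concrete, here is a minimal sketch of how such nested scopes might be declared. The article does not give this schema: the namespace urn:example:airline, the air prefix, the type names, and the number and day attributes are all assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical schema: namespace, type names, and attributes are assumptions. -->
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:air="urn:example:airline"
        targetNamespace="urn:example:airline"
        elementFormDefault="qualified">

  <complexType name="Position">
    <sequence>
      <element name="name" type="string"/>
      <element name="employee" type="string"/>
    </sequence>
  </complexType>

  <complexType name="Date">
    <sequence>
      <element name="position" type="air:Position" maxOccurs="unbounded"/>
    </sequence>
    <attribute name="day" type="date" use="required"/>
  </complexType>

  <complexType name="Flight">
    <sequence>
      <element name="date" type="air:Date" maxOccurs="unbounded">
        <!-- Scope: one date of one flight. Position names are unique here only. -->
        <key name="PositionKey">
          <selector xpath="./air:position"/>
          <field xpath="air:name"/>
        </key>
      </element>
    </sequence>
    <attribute name="number" type="string" use="required"/>
  </complexType>

  <element name="airline">
    <complexType>
      <sequence>
        <element name="flight" type="air:Flight" maxOccurs="unbounded">
          <!-- Scope: one flight. Dates are unique within it. -->
          <key name="DateKey">
            <selector xpath="./air:date"/>
            <field xpath="@day"/>
          </key>
        </element>
      </sequence>
    </complexType>
    <!-- Scope: the whole airline. Flight numbers are unique within it. -->
    <key name="FlightKey">
      <selector xpath="./air:flight"/>
      <field xpath="@number"/>
    </key>
  </element>
</schema>
```

Under this sketch, two position elements named Captain may appear in the same document as long as they sit under different date elements. A second Captain under the same date would fail validation.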
There is a downside to this facility, however. The catch is
that a keyref can't be defined to traverse multiple
scopes: it can reference only one key, through its refer
attribute, never a chain of several. So a keyref must
work at the same scope as the referenced key in order
to be effective. This poses some problems when defining
associations.
Consider a simple workflow model, in which an
Actor defines available inputs and outputs by name
and type, and a Flow defines connections from
Actor to Actor, specifying the wiring
from source outputs to destination inputs. Not shown in the UML
below is the encompassing element Process, which
collects Flow and Actor instances to
define some abstract workflow.

A Flow references two Actors; for
each of these references a keyref is defined, as
shown in this fragment of the total schema Workflow1.xsd:
<element name="process" type="work:Process" >
  <key name="ActorKey" >
    <selector xpath="./work:actor" />
    <field xpath="work:name" />
  </key>
  <key name="FlowKey" >
    <selector xpath="./work:flow" />
    <field xpath="work:sourceActor" />
    <field xpath="work:destinationActor" />
  </key>
  <keyref name="FlowSource" refer="work:ActorKey" >
    <selector xpath="./work:flow/work:sourceActor" />
    <field xpath="." />
  </keyref>
  <keyref name="FlowDestination" refer="work:ActorKey" >
    <selector xpath="./work:flow/work:destinationActor" />
    <field xpath="." />
  </keyref>
</element>
We encounter a problem at the next level of the hierarchy. How
can we assert that a Connection references two
Endpoints? Endpoint instances must be named
uniquely, but only within each Actor instance, as
shown above. If we try to reference this key from a parent scope
(such as Process) or a sibling scope
(Flow), there's no way to express that we want to
reference an Endpoint by name within a particular
Actor. (We might hope that the parser would be
smart enough to narrow the scope automatically to the
Actor instance referenced by the parent
Flow, but this is asking too much, and certainly
isn't supported in the WXS specification.) Owing to this
limitation, the schema does not assert any association from
Connection to Endpoint; if the input or
output names in RequestMedicalProcedure1.xml were not accurate,
validation would not catch the problem.
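A small instance document may help show the gap. This is only a sketch: the namespace urn:example:workflow and the input, output, type, and connection element names are assumptions, since the article does not show that part of Workflow1.xsd, and this is not the content of RequestMedicalProcedure1.xml.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance: element names below actor and flow are assumed. -->
<process xmlns="urn:example:workflow">
  <actor>
    <name>Physician</name>
    <!-- Endpoint names are unique only within this actor. -->
    <output>
      <name>request</name>
      <type>Order</type>
    </output>
  </actor>
  <actor>
    <name>Scheduler</name>
    <!-- The same endpoint name is legal here: different actor, different scope. -->
    <input>
      <name>request</name>
      <type>Order</type>
    </input>
  </actor>
  <flow>
    <!-- These two values are checked by the FlowSource and FlowDestination keyrefs. -->
    <sourceActor>Physician</sourceActor>
    <destinationActor>Scheduler</destinationActor>
    <connection>
      <output>request</output>
      <!-- Misspelled: Scheduler has no such input, yet no keyref can say so. -->
      <input>reqest</input>
    </connection>
  </flow>
</process>
```

A keyref at process scope would have nothing usable to refer to: endpoint names are keyed separately within each actor, and the name request alone does not say which actor is meant.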
Possible workarounds include:
Breaking compositions in the referenced structure into associations, making the corresponding keys global in scope and thus easy to reference. For instance, Endpoints could be defined outside of, and referenced by, Actors. This gains a reference that can be validated (e.g. from Connection to Endpoint) but loses the aforementioned advantage of composition.
Defining a global ID for each referenced element. This preserves composition while allowing global key reference. This is the approach taken in Workflow2.xsd; note the new ID attribute, which must be managed manually or by the application or some authoring tool.
Leaving the WXS domain to enforce and to navigate the desired association. For instance, a validating transform could enforce the missing workflow constraint. Application logic would also have to be written to help navigate from a Connection to the corresponding Endpoints, probably as DOM nodes.
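The global-ID workaround could be sketched roughly as follows, as a fragment in the style of the earlier listing. It is not the content of Workflow2.xsd: the input, output, and connection element names are assumptions, as is the idea that each connection child carries the ID of the endpoint it refers to. The work prefix and the work:Process type are taken to be declared elsewhere in the schema, as in the original fragment.

```xml
<element name="process" type="work:Process">
  <!-- ActorKey, FlowKey, FlowSource and FlowDestination would remain as before. -->

  <!-- Hypothetical: every endpoint carries an ID unique across the whole process,
       so the key can live at process scope instead of actor scope. -->
  <key name="EndpointKey">
    <selector xpath="./work:actor/work:input | ./work:actor/work:output"/>
    <field xpath="@ID"/>
  </key>

  <!-- Key and keyref now share a scope, so the reference can be validated. -->
  <keyref name="ConnectionOutput" refer="work:EndpointKey">
    <selector xpath="./work:flow/work:connection/work:output"/>
    <field xpath="."/>
  </keyref>
  <keyref name="ConnectionInput" refer="work:EndpointKey">
    <selector xpath="./work:flow/work:connection/work:input"/>
    <field xpath="."/>
  </keyref>
</element>
```

Note what this sketch does and does not check. A connection that names a nonexistent endpoint ID is rejected, but nothing verifies that the referenced output belongs to the flow's source actor, or the input to its destination actor. That constraint would still fall to a transform or to application logic.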