Top Ten Tips to Using XPath and XPointer
by John E. Simpson
8. Remember to keep namespaces straight (in both XPath and XPointer applications).
While namespaces in XML continue to make for spirited discussion in some circles, they are, for the foreseeable future, an unavoidable feature of the XML landscape.
In XPath applications, you need to understand three terms: the qualified name, the local name, and the expanded name.
The qualified name (often abbreviated "qname") of an element or attribute
is the node's name, including the namespace prefix (if any), as it
appears in the document being accessed via XPath. If an element's name
in a source document is book:section, that's its qname.
A node's local name is the name as it appears in the source document,
shorn of any namespace prefix. (You might call this the un-qname.) For
an element in a document named book:section, the local name
is simply section.
The expanded name of an element or attribute is what most namespace-aware
applications really care about (unless you instruct them not to). It's a
combination of the URI associated with the namespace prefix, plus the
local name. Assume the book: namespace prefix is declared
like this:
xmlns:book="http://my.example.org/namespaces/book"
Then the expanded name of the book:section element is the combination
of the strings "http://my.example.org/namespaces/book" and
"section". Exactly how the application builds the expanded name is up to
the application's developer. Most seem to follow a de-facto standard of
enclosing the namespace URI in curly braces, { and
} characters, followed by the local name. Such an
application would thus represent the expanded name of the
book:section element as follows:
{http://my.example.org/namespaces/book}section
The important thing to you as a user of XPath isn't the exact algorithm for
building an expanded name (which in any case is directly accessible only
within the processor itself, not to XPath expressions). The important
thing is that the processor will in general use only the expanded name
if it needs to disambiguate element or attribute names. Consider a
document which has two elements, xsl:template and
transform:template. Their local names are identical; the
only way to tell if the element names really are identical is to
examine their namespace URIs as well. If both the xsl: and
the transform: prefix are bound to the namespace URI
"http://www.w3.org/1999/XSL/Transform", then the two elements have the
same "name" even though their prefixes are different.
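The comparison described above can be sketched in a few lines of Python. This is only an illustration, not part of any XPath processor: it assumes Python's standard xml.etree.ElementTree module, which happens to report element names in the curly-brace form, and the root wrapper element is invented for the example.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

# Hypothetical document: two different prefixes bound to the same namespace URI.
doc = f"""<root xmlns:xsl="{XSL_NS}" xmlns:transform="{XSL_NS}">
  <xsl:template/>
  <transform:template/>
</root>"""

root = ET.fromstring(doc)
first, second = list(root)

# ElementTree exposes the expanded name as {namespace-URI}local-name.
print(first.tag)                # {http://www.w3.org/1999/XSL/Transform}template
print(first.tag == second.tag)  # True: the prefixes differ, the names do not
```

The prefixes never reach the comparison at all; only the namespace URI and the local name do.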
One implication of all this is that using the XPath name() function
to return an element or attribute's "name" is a little deceptive: it
returns the qname. And no matter how unique its qname, a given element
or attribute may in fact have a name identical to others' simply
because their namespace URIs match, even when the prefixes are
different.
When coding XPointers, remember that the vocabulary -- and hence the namespaces
-- of the document containing the XPointers will probably be quite different
from those of the document(s) being pointed to. To be absolutely sure
that the XPointer processor keeps it all straight, identify the
namespace(s) for target document vocabularies using the
xmlns() XPointer scheme. For instance:
xmlns(book=http://my.example.org/namespaces/book)
xpointer(//book:section)
Note: the XPointer above is a single line, split across two for formatting reasons.
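The same idea, binding your own prefix to the target vocabulary's namespace URI, can be tried out in Python. Treat this as a rough analogy only: xml.etree.ElementTree does not implement XPointer, its path syntax is a limited subset of XPath, and the target document (with its b: prefix) is made up for the example.

```python
import xml.etree.ElementTree as ET

BOOK_NS = "http://my.example.org/namespaces/book"

# Hypothetical target document; note that it uses the prefix b:, not book:.
target = f"""<b:book xmlns:b="{BOOK_NS}">
  <b:section>One</b:section>
  <b:chapter><b:section>Two</b:section></b:chapter>
</b:book>"""

root = ET.fromstring(target)

# Plays the role of xmlns(book=...): the prefix used in the path is bound
# here, independently of whatever prefix the target document chose.
prefixes = {"book": BOOK_NS}
sections = root.findall(".//book:section", prefixes)
titles = [s.text for s in sections]
print(titles)  # ['One', 'Two']
```

The path finds both sections even though the string "book:" appears nowhere in the target document, because the match is made on the namespace URI.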
9. Don't forget processor efficiency in XPath and XPointer.
The authors of XML books and articles have it easy, in one respect: the XML documents they use for examples don't generally need to be very long or complex.
Real applications rarely have that luxury. Source documents may contain many thousands of elements, just to cite the obvious case; throw in a mixture of comments, PIs, and voluminous text nodes, and you may find the streets of your location paths paved with molasses. Controlling this is to a large extent out of your hands. You can't rejigger the processor's internals, after all. (On the other hand, some processors may allow you to use parameters or command-line arguments to encourage them to behave in ways optimized for particular source document structures.) But one XPath optimization is easy -- it just requires you to surrender a particularly lazy habit.
The habit in question is excessive use of the descendant-or-self::
axis when you know the name of the target element (the node test) which
follows it. It's particularly tempting to fall back on this habit
because of the XPath // shortcut (technically a shortcut
for the /descendant-or-self::node()/ location
step). Considering a document even as simple as this should make the
point:
<dictionary>
<letter>
<forms>
<form type="upper">A</form>
<form type="lower">a</form>
</forms>
<word>
<spelling>aardvark</spelling>
<part_of_speech>noun</part_of_speech>
<definition>a nocturnal mammal of southern Africa
with a tubular snout and a long tongue</definition>
</word>
</letter>
</dictionary>
Both of the following location paths locate the definition element:
//definition
/dictionary/letter/word/definition
The second is a much more direct route to the desired result. It
leads the processor down the tree with no side trips, right to the
definition element. The first, in contrast, takes a
leisurely stroll through all descendants of the root node -- picking up
each one in turn and mulling it over ("Hmm, is this descendant a
definition element...?") before proceeding even further
through the tree. This includes irrelevant detours into the
forms branch of the tree and to the spelling
and part_of_speech siblings of the definition
node.
Of course, for this extremely simple example document, the difference in processing time will be negligible. Turn this document into an entire dictionary, though, and the difference will be considerable. It's true that coding yard-long location paths into large documents can be both tedious and error-prone, certainly no one's idea of fun; but if huge gains in performance result from it, well, it's hard to argue in favor of fun.
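To make the comparison concrete, here is a minimal Python sketch that counts how many elements each approach has to look at in the sample document. It is a toy model, not a description of how any real XPath processor works: count_examined is a hypothetical helper written for this example, and the descendant count simply assumes a processor that inspects every descendant element once.

```python
import xml.etree.ElementTree as ET

doc = """<dictionary>
  <letter>
    <forms>
      <form type="upper">A</form>
      <form type="lower">a</form>
    </forms>
    <word>
      <spelling>aardvark</spelling>
      <part_of_speech>noun</part_of_speech>
      <definition>a nocturnal mammal of southern Africa
      with a tubular snout and a long tongue</definition>
    </word>
  </letter>
</dictionary>"""

root = ET.fromstring(doc)  # root is the dictionary element

def count_examined(start, steps):
    """Follow child steps by name, counting every child element inspected."""
    examined, current = 0, [start]
    for name in steps:
        matched = []
        for node in current:
            for child in node:
                examined += 1
                if child.tag == name:
                    matched.append(child)
        current = matched
    return examined, current

# //definition: every descendant element of the root is a candidate.
descendant_candidates = sum(1 for _ in root.iter()) - 1
# /dictionary/letter/word/definition: only children along the path.
path_examined, found = count_examined(root, ["letter", "word", "definition"])

print(descendant_candidates, path_examined)  # 8 6
```

In this tiny document the saving is just the two form elements inside the forms branch, which the explicit path never enters. The gap grows with the size of the branches the path is able to skip.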
As for XPointer, not only can you minimize (if not eliminate) your use of the
// shortcut; you can also fall back on alternative ways of
seeking content which aren't dependent on XPath at all. These are
so-called shorthand and child-sequence XPointers.
The former look like familiar (X)HTML named resources, as in:
xlink:href="somedoc.xml#someid"
where "someid" (the shorthand XPointer) matches the value of some ID-type attribute in the target document. (Of course, in order to use this kind of XPointer, the target document must have some ID-type attribute declared, via DTD or schema.)
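A shorthand pointer boils down to a lookup by ID, which the following Python sketch imitates. Several things here are assumptions made for illustration: the target document is invented, and because xml.etree.ElementTree does not read DTD or schema declarations, the code simply treats any attribute named id as if it had been declared ID-type.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

href = "somedoc.xml#someid"

# Split the reference into the resource and the shorthand pointer (the fragment).
resource, shorthand = urldefrag(href)

# Hypothetical target document standing in for somedoc.xml.
target = ET.fromstring(
    '<doc><part id="intro">Intro</part><part id="someid">Found me</part></doc>'
)

# Assumption: "id" attributes are ID-type. A real processor would rely on
# the DTD or schema to know which attributes qualify.
by_id = {el.get("id"): el for el in target.iter() if el.get("id") is not None}
match = by_id.get(shorthand)

print(resource, shorthand, match.text)  # somedoc.xml someid Found me
```

Once the index exists, each lookup is a single dictionary access, with no tree navigation involved.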
Child-sequence XPointers use the new element() XPointer scheme
to walk the processor down into the node tree without referencing
element names at all; the processor can simply count children. For instance,
element(/1/4/3/15)
locates (hold your breath) the fifteenth child of the third child of the fourth
child of the root element. This can foster huge performance gains in
processors equipped to handle the element() scheme: an
XPath-based XPointer processor needs potentially to read in the entire
target resource in order to ensure that it's gotten every last bit of
matching content, while a child sequence-smart processor can simply
stream through the target document, taking only the designated forks in
the road and ignoring all others. (The downside, of course, is that you
can access only elements this way, and are restricted to navigating only
in a manner equivalent to XPath's child:: axis.)
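The counting involved in a child sequence is simple enough to sketch in Python. This is a minimal illustration under stated assumptions, not an implementation of the element() scheme: resolve_child_sequence is a hypothetical helper, it handles only purely numeric sequences (no leading ID), and it works on an already parsed tree rather than streaming. The sample document is a compact version of the dictionary example from tip 9.

```python
import xml.etree.ElementTree as ET

doc = """<dictionary><letter>
  <forms><form type="upper">A</form><form type="lower">a</form></forms>
  <word>
    <spelling>aardvark</spelling>
    <part_of_speech>noun</part_of_speech>
    <definition>a nocturnal mammal of southern Africa</definition>
  </word>
</letter></dictionary>"""

def resolve_child_sequence(root, sequence):
    """Walk a numeric child sequence such as '/1/1/2/3'; None if it dead-ends."""
    steps = [int(step) for step in sequence.strip("/").split("/")]
    if steps[0] != 1:           # a document has exactly one document element
        return None
    node = root
    for position in steps[1:]:  # positions are 1-based and count elements only
        children = list(node)
        if not 1 <= position <= len(children):
            return None
        node = children[position - 1]
    return node

root = ET.fromstring(doc)
hit = resolve_child_sequence(root, "/1/1/2/3")
print(hit.tag)  # definition
```

Here /1/1/2/3 reads as: the document element, its first child (letter), that element's second child (word), and that element's third child (definition). No element name is ever compared along the way.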
10. Keep an eye out for spec changes.
The XPath 1.0 spec attained W3C Recommendation status in late 1999 and has been hugely successful in the three years since. But it has its shortcomings, and XPath 2.0 -- aimed at filling in the gaps -- is already on the horizon. You can find the version 2.0 Working Draft (WD) at http://www.w3.org/TR/xpath20/. The current list of known "incompatibilities" between XPath 1.0 and 2.0 appears as Appendix F, at http://www.w3.org/TR/xpath20/#id-backwards-compatibility. If you're going to be using XPath for a while, I encourage you to visit this list, in order to minimize the surprises you may have to deal with downstream.
For XPointer, the situation is a little more complicated. Until very recently, XPointer was a single spec (most recently attaining Candidate Recommendation status, in September of 2001). While to some observers it seemed as though it would be frozen there forever, the XML Linking Working Group made a huge change in July 2002: they split the one spec into four.
There's now a central "root" spec, called XPointer Framework and bumped backwards a little to WD status. This is the specification that outlines general XPointer syntax rules, levels of processor conformance, and so on.
There are also three new offshoot specs, defining the use of specific XPointer
schemes: XPointer element(), XPointer xmlns(),
and XPointer xpointer(). The first two of these are
Candidate Recommendations; the third (like the Framework) is back to WD
status. You can find these new specs at, respectively,
www.w3.org/TR/xptr-element/, www.w3.org/TR/xptr-xmlns/,
and
www.w3.org/TR/xptr-xpointer/.
