A Tour of 4Suite
by Uche Ogbuji
XPath and XPatterns
XPath is everywhere. It has established itself as the workhorse of
XML processing. The XPath engine is one of the most heavily developed
and exercised parts of 4Suite. Much of it is implemented in C for the
sake of performance, which is one of the key differences between the
XPath library in current 4Suite and the one in PyXML; the latter is
based on an older release of 4XPath and is written almost entirely in
Python. The easiest way to use the XPath library is through the
functions in Ft.Xml.XPath. Listing 3 defines a function for extracting
the title from any given XHTML 1.0 file, using XPath.
Listing 3: A function for extracting HTML titles
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Compile, Evaluate
from Ft.Xml.Domlette import NonvalidatingReader

XHTML_NS = "http://www.w3.org/1999/xhtml"

#compile the XPath for retrieving XHTML titles
TITLE_EXPR = Compile("string(/h:html/h:head/h:title)")

def extract_xhtml_title(uri):
    """Extract the title from the XHTML document at the given URI"""
    doc = NonvalidatingReader.parseUri(uri)
    #set up the context with the XHTML document node
    #and namespace mapping from the "h" prefix to the XHTML URI
    context = Context(doc, processorNss={"h": XHTML_NS})
    #Compute the XPath against the context
    title = TITLE_EXPR.evaluate(context)
    return title
The Context class is a very important one. During
XPath processing, it maintains a lot of state information, including
the context items defined in the XPath spec. The most important
item in the context is the context node, which I set to the document
node of the XHTML file. In this case, I also use the context to
hold the namespace mapping from the "h" prefix used in the expression
to the XHTML namespace. At the global level, I compile the XPath
expression, which is similar to compiling a regular expression using
re.compile(). The result is a parsed XPath object
with an evaluate method that takes either a plain node object
or a full context object. The return value is a Python equivalent
of one of the four XPath data types. Strings are returned as Python
Unicode objects, numbers as Python floats, booleans as instances of
a special boolean class, and node sets as Python lists of node
objects. The XPath expression above returns a string, which is
directly returned to the caller as the requested title.
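For readers who want to try the same idea without 4Suite installed, here is a minimal sketch that swaps in the standard library's xml.etree.ElementTree. This is not the 4Suite API: ElementTree supports only a subset of XPath (no string() function), so the sketch uses findtext with a prefix-to-namespace dictionary that plays the role of processorNss in Listing 3. The function name extract_xhtml_title_et and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Prefix-to-namespace mapping, analogous to processorNss in Listing 3
NSS = {"h": XHTML_NS}


def extract_xhtml_title_et(source):
    """Return the title text of an XHTML document passed in as a string.

    Hypothetical helper for illustration; not part of 4Suite.
    """
    # fromstring returns the root element (h:html), not a document node,
    # so the path below is relative to the html element
    root = ET.fromstring(source)
    # findtext returns the direct text of the first match, or the default
    return root.findtext("h:head/h:title", default="", namespaces=NSS)


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>A Tour of 4Suite</title></head>
  <body/>
</html>"""

title = extract_xhtml_title_et(SAMPLE)
print(title)
```

One difference worth noting: XPath's string() concatenates all descendant text of the title element, whereas findtext returns only the text directly inside it, so the two can disagree if the title contains child elements.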
XSLT defines XPattern, a variation on XPath which is used to declare rules for matching patterns in the XML source against which to fire XSLT templates. The XPattern implementation that 4Suite's XSLT library uses is also exposed as a library of its own. XPatterns are useful when the task is not so much to compute arbitrary information from a given node but, rather, to choose quickly from a collection of nodes the ones that meet some basic rules. This might seem a subtle difference. The following example might help illustrate it.
- XPath task: extract the class attribute from all the child elements of the context node
- XPattern task: given a list of nodes, sort them into piles of those that have a class attribute and those that have a title child
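The contrast between the two tasks can be sketched with the standard library's xml.etree.ElementTree rather than 4Suite. This is only an illustration under stated assumptions: the sample document, the RULES table and the matching_rules helper are hypothetical, and the "patterns" here are plain Python predicates keyed by pattern text, not compiled XPattern objects.

```python
import xml.etree.ElementTree as ET

DOC = """<spam>
  <e1 class="1"/>
  <e2><title>A</title></e2>
  <e3 class="2"><title>B</title></e3>
</spam>"""

root = ET.fromstring(DOC)

# XPath-style task: starting from a context node (root), compute a value,
# here the class attribute of every child element that has one
class_values = [el.get("class") for el in root.findall("*[@class]")]

# XPattern-style task: given nodes, decide which fixed rules each one meets.
# The keys are pattern text for display only; the tests are ordinary predicates.
RULES = {
    "*[@class]": lambda node: node.get("class") is not None,
    "*[title]": lambda node: node.find("title") is not None,
}


def matching_rules(node):
    """Return the names of all rules that the given element satisfies."""
    return [name for name, test in RULES.items() if test(node)]


# Iterating an Element yields only its child elements (no text nodes)
piles = {child.tag: matching_rules(child) for child in root}

print(class_values)
print(piles)
```

The first computation asks "what is the value?", while the second asks "which rules does this node satisfy?"; the second question is the one a pattern list is designed to answer quickly.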
The main API for XPattern processing in 4Suite is
Ft.Xml.Xslt.PatternList. Listing 4 is a code snippet
that takes a node and returns a list of patterns it matches.
Listing 4: Use XPatterns to quickly determine which patterns match which nodes
from Ft.Xml.Xslt import PatternList
from Ft.Xml.Domlette import NonvalidatingReader

#first pattern matches elements with a class attribute
#the second matches elements with a title child
PATTERNS = ["*[@class]", "*[title]"]

#Second parameter is a dictionary of prefix to namespace mappings
plist = PatternList(PATTERNS, {})

DOC = """<spam>
<e1 class="1"/>
<e2><title>A</title></e2>
<e3 class="2"><title>B</title></e3>
</spam>
"""

doc = NonvalidatingReader.parseString(DOC, "file:example4.xml")
for node in doc.documentElement.childNodes:
    #Don't forget that the white space text nodes before and after
    #e1, e2 and e3 elements are also child nodes of the spam element
    if node.nodeName[0] == "e":
        print plist.lookup(node)
The PatternList initializer takes my list of strings, which it
conveniently converts to a list of compiled XPattern objects. Such
objects have a match method that returns a boolean
value, but I don't use these methods directly in this example. The
PatternList initializer also takes a dictionary that makes up the
namespace mapping. This example uses no namespaces, so the
dictionary is empty. The lookup method is applied to a
selection of the children of the spam element (all the
nodes whose name starts with "e", which happens to be all the
element nodes). The output of listing 4 follows:
[*[attribute::class]]
[*[child::title]]
[*[attribute::class], *[child::title]]
The output is a list of the representations of the pattern objects that matched each node. Notice how the axis abbreviations have been expanded in the pattern object representation.
Sometimes the built-in facilities of XPath and XPattern aren't quite enough to meet your processing needs. Luckily it's easy to extend the function of these libraries using XPath user extension functions, which are written in Python. I don't cover extension functions in this article, but the resources section has pointers to useful information if you need this facility.
Python-XML Happenings
Here is a brief on significant new happenings relevant to Python-XML development, including significant software releases.
David Mertz announced the 1.0.4 release of gnosis XML tools. This package provides tools for converting Python objects to XML documents and vice versa, DTD to SQL conversions, and more.
Brian Quinlan announced Pyana 0.6.0. Pyana is a Python extension module for interfacing with the Xalan XSLT engine.
Eric van der Vlist announced XVIF 0.2.0. XVIF includes a full RELAX NG validator for Python and adds in an XML processing framework system Eric developed as a straw man for ISO DSDL. The new release adds a data typing framework and a partial WXS data types library. It also features improved internals and API.
Henry Thompson announced a new release of XSV, a Python implementation of W3C XML Schema (WXS) which also runs the W3C's on-line WXS validator service. This release features a major restructuring of the code.
Frank Tobin announced a lightweight Python module to help write out well-formed XML. xmlprinter is inspired by Perl's XML::Writer module.
Daniel Veillard announced the 1.0.21 release of libxslt, with improved Python bindings, among other things.
4Suite 0.12.0a3 is released, which is the version I introduce in this article. Among many other changes and improvements, it includes the latest XVIF.
Resources