XML.com: XML From the Inside Out

A Tour of 4Suite
by Uche Ogbuji

XPath and XPatterns

XPath is everywhere. It has established itself as the workhorse of XML processing. The XPath engine is one of the most heavily developed and exercised parts of 4Suite. Much of it is implemented in C for performance's sake, which is one of the key differences between the XPath library in current 4Suite and the one in PyXML, which is based on an older release of 4XPath and is almost entirely in Python. The easiest way to use the XPath library is through the functions in Ft.Xml.XPath. Listing 3 defines a function that uses XPath to extract the title from any given XHTML 1.0 file.

Listing 3: A function for extracting XHTML titles

from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Compile, Evaluate
from Ft.Xml.Domlette import NonvalidatingReader

XHTML_NS = "http://www.w3.org/1999/xhtml"

#compile the XPath for retrieving XHTML titles
TITLE_EXPR = Compile("string(/h:html/h:head/h:title)")

def extract_xhtml_title(uri):
    """Extract the title from the XHTML document at the given URI"""
    doc = NonvalidatingReader.parseUri(uri)
    #set up the context with the XHTML document node
    #and namespace mapping from the "h" prefix to the XHTML URI
    context = Context(doc, processorNss={"h": XHTML_NS})
    #Compute the XPath against the context
    title = TITLE_EXPR.evaluate(context)
    return title

The Context class is a very important one. During XPath processing, it maintains a lot of state information, including the context items defined in the XPath spec. The most important item in the context is the context node, which I set to the document node of the XHTML file. In this case, I also use the context to hold the namespace mapping from the "h" prefix used in the expression to the XHTML namespace. At the global level, I compile the XPath expression, which is similar to compiling a regular expression using re.compile(). The result is a parsed XPath object with an evaluate method that takes either a plain node object or a full context object. The return value is a Python equivalent of one of the four XPath data types. Strings are returned as Python Unicode objects, numbers as Python floats, booleans as instances of a special boolean class, and node sets as Python lists of node objects. The XPath expression above evaluates to a string, which is returned directly to the caller as the requested title.
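For readers without 4Suite at hand, the same idea (a prefix-to-namespace mapping supplied alongside a path expression) can be sketched with the standard library's xml.etree.ElementTree. This is a stand-in rather than the 4Suite API: ElementTree supports only a subset of XPath and has no string() function, so the sketch uses findtext instead. The function name extract_xhtml_title_etree and the SAMPLE document are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Prefix-to-namespace mapping, playing the role of processorNss in Listing 3
NSS = {"h": XHTML_NS}

def extract_xhtml_title_etree(text):
    """Return the title of an XHTML document given as a string, or ""."""
    root = ET.fromstring(text)  # root is the html element itself
    # The path is relative to the root element, so it starts at h:head
    return root.findtext("h:head/h:title", default="", namespaces=NSS)

# Hypothetical sample document, used only to exercise the function
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>A Tour of 4Suite</title></head>
  <body/>
</html>"""

title = extract_xhtml_title_etree(SAMPLE)
print(title)
```

As in Listing 3, the "h" prefix is local to the calling code: it only has to map to the XHTML namespace URI, whatever prefix (if any) the document itself uses.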

XSLT defines XPattern, a variation on XPath which is used to declare rules for matching patterns in the XML source against which to fire XSLT templates. The XPattern implementation that 4Suite's XSLT library uses is also exposed as a library of its own. XPatterns are useful when the task is not so much to compute arbitrary information from a given node but, rather, to choose quickly from a collection of nodes the ones that meet some basic rules. This might seem a subtle difference. The following example might help illustrate it.

  • XPath task: extract the class attribute from all the child elements of the context node
  • XPattern task: given a list of nodes, sort them into piles of those that have a class attribute and those that have a title child

The main API for XPattern processing in 4Suite is Ft.Xml.Xslt.PatternList. Listing 4 is a code snippet that takes a node and returns a list of patterns it matches.

Listing 4: Use XPatterns to quickly determine which patterns match which nodes

from Ft.Xml.Xslt import PatternList
from Ft.Xml.Domlette import NonvalidatingReader

#first pattern matches elements with a class attribute
#the second matches elements with a title child
PATTERNS = ["*[@class]", "*[title]"]

#Second parameter is a dictionary of prefix to namespace mappings
plist = PatternList(PATTERNS, {})

DOC = """<spam>
  <e1 class="1"/>
  <e2><title>A</title></e2>
  <e3 class="2"><title>B</title></e3>
</spam>
"""
doc = NonvalidatingReader.parseString(DOC, "file:example4.xml")
for node in doc.documentElement.childNodes:
    #Don't forget that the white space text nodes before and after
    #e1, e2 and e3 elements are also child nodes of the spam element
    if node.nodeName[0] == "e":
        print plist.lookup(node)  

The PatternList initializer takes my list of strings, which it conveniently converts to a list of compiled XPattern objects. Such objects have a match method that returns a boolean value, but I don't use these methods directly in this example. The PatternList initializer also takes a dictionary that makes up the namespace mapping. This example uses no namespaces, so the dictionary is empty. The lookup method is applied to a selection of the children of the spam element (all the nodes whose name starts with "e", which happens to be all the element nodes). The output of Listing 4 follows:

[*[attribute::class]]
[*[child::title]]
[*[attribute::class], *[child::title]]  

The output is a list of the representations of the pattern objects that matched each node. Notice how the axis abbreviations have been expanded in the pattern object representation.
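The "which patterns match this node" idea in Listing 4 can also be approximated without 4Suite. The following is a minimal sketch using the standard library's xml.etree.ElementTree, in which each pattern string is paired with a hand-written Python predicate. The predicates and the lookup function are hypothetical stand-ins for illustration. They are not compiled XPatterns, and this is not how PatternList is implemented.

```python
import xml.etree.ElementTree as ET

DOC = """<spam>
  <e1 class="1"/>
  <e2><title>A</title></e2>
  <e3 class="2"><title>B</title></e3>
</spam>
"""

# Hypothetical stand-ins for compiled patterns: each pattern string is
# paired with a plain Python predicate that tests a single element.
PATTERNS = [
    ("*[@class]", lambda elem: elem.get("class") is not None),
    ("*[title]", lambda elem: elem.find("title") is not None),
]

def lookup(elem):
    """Return the pattern strings whose predicate accepts elem."""
    return [name for name, matches in PATTERNS if matches(elem)]

root = ET.fromstring(DOC)
# Iterating over an element yields only its child elements, so the
# white space text between e1, e2 and e3 needs no filtering here.
results = {child.tag: lookup(child) for child in root}
for tag, matched in results.items():
    print(tag, matched)
```

Each element is tested against every pattern in turn, which mirrors the XPattern task described earlier: the question is which rules a given node satisfies, not what value can be computed from it.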

Sometimes the built-in facilities of XPath and XPattern aren't quite enough to meet your processing needs. Luckily it's easy to extend the function of these libraries using XPath user extension functions, which are written in Python. I don't cover extension functions in this article, but the resources section has pointers to useful information if you need this facility.

Python-XML Happenings

Here is a brief rundown of new happenings relevant to Python-XML development, including significant software releases.

    

David Mertz announced the 1.0.4 release of gnosis XML tools. This package provides tools for converting Python objects to XML documents and vice versa, DTD to SQL conversions, and more.

Brian Quinlan announced Pyana 0.6.0. Pyana is a Python extension module for interfacing with the Xalan XSLT engine.

Eric van der Vlist announced XVIF 0.2.0. XVIF includes a full RELAX NG validator for Python and adds in an XML processing framework system Eric developed as a straw man for ISO DSDL. The new release adds a data typing framework and a partial WXS data types library. It also features improved internals and API.

Henry Thompson announced a new release of XSV, a Python implementation of W3C XML Schema (WXS) which also runs the W3C's on-line WXS validator service. This release features a major restructuring of the code.

Frank Tobin announced a lightweight Python module to help write out well-formed XML. xmlprinter is inspired by Perl's XML::Writer module.

Daniel Veillard announced the 1.0.21 release of libxslt, with improved Python bindings, among other things.

4Suite 0.12.0a3, the version I introduce in this article, has been released. Among many other changes and improvements, it includes the latest XVIF.

Resources

  • For more information, see the 4Suite home page. Some other 4Suite developers and I hang out on the #4suite IRC channel on irc.freenode.net.
  • You can usually find details of various aspects of the 4Suite libraries at my Python/XML Akara and 4Suite Akara.
  • There is an official RELAX NG tutorial, and Eric van der Vlist makes available chapters of his in-progress book on RELAX NG. If you are interested in Eric's XVIF extensions to RELAX NG, which are also incorporated into 4Suite, see the XVIF home page.
  • I introduce XPath and XSLT, using 4Suite for the examples, in this tutorial, for which free registration is required. You can also try the zvon.org XPath Tutorial.
  • XPatterns are usually not covered separately, but you can learn more about XPatterns on any number of on-line XSLT tutorials and books. The W3C XSL page has many links to such resources.