XML.com: XML From the Inside Out
oreilly.comSafari Bookshelf.Conferences.

advertisement

Generating DOM Magic

Generating DOM Magic

January 08, 2003

Python 2.2 introduced generators, a special type of function which has more flexible flow-of-control than ordinary, procedural functions. Standard procedural functions start at the top and execute until returning to the caller, maintaining all along a local state for the subroutine (which comprises local variables and passed-in parameters). The function can have multiple return points, but for each invocation, it runs until a single return, and its local state then becomes unavailable. Generators, by contrast, can send control back to the caller and yet remain in a sort of suspended animation, maintaining local state. Such a temporary return is called "yielding". The caller can then revive the generator at any time, at which point operation continues until the next yield or until a final return.

Generators, as they appeared in Python 2.2, were born in Christian Tismer's Stackless Python as a feature made practical by continuations, the core construct of Stackless. Generators are especially suited to efficient operation on tree structures, which are, of course, the natural structure for XML processing. In this article I present examples and techniques for using generators for XML processing. If you're not already familiar with Python generators, you should read " Iterators and simple generators", by David Mertz. Since this article builds on a previous one of mine, " Streamlining DOM XML processing with Python", I recommend that all readers review it before continuing. At the end of this article there are some resources which will help you acquire other relevant background knowledge.

Implementing a DOM classic

Listing 1 presents two functions for iterating over all the nodes in a DOM tree in document order, both of which were introduced in "Streamlining DOM XML processing with Python".

Listing 1: generators for filtered or unfiltered iteration over a DOM tree
#Required in Python 2.2, and must be the first import
from __future__ import generators
from xml.dom import Node

def doc_order_iter(node):
    """
    Iterates over each node in document order,
    yielding each in turn, starting with the given node
    node - the starting point
           (subtree rooted at node will be iterated over document order)
    """
    #Document order returns the current node,
    #then each of its children in turn
    yield node
    for child in node.childNodes:
        #Create a generator for each child,
        #Over which to iterate
        for cn in doc_order_iter(child):
            yield cn
    return


def doc_order_iter_filter(node, filter_func):
    """
    Iterates over each node in document order,
    applying the filter function to each in turn,
    starting with the given node, and yielding each node in
    cases where the filter function computes true
    node - the starting point
           (subtree rooted at node will be iterated over document order)
    filter_func - a callable object taking a node and returning
                  true or false
    """
    if filter_func(node):
        yield node
    for child in node.childNodes:
        for cn in doc_order_iter_filter(child, filter_func):
            yield cn
    return  

With these generators in hand we can build several other useful functions simply and easily -- functions which will perform well, too. Users of the 4Suite Domlettes sometimes complain that they do not include the handy DOM utility methods getElementsByTagName and getElementsByTagNameNS. In fact, I warned about this omission in my last article A Python & XML Companion. There are so many trivial ways to implement these as functions, but I think the best is using the generator-iterators, as in listing 2. The generator functions presented here will work with any Python DOM implementation which conforms to the published API of xml.dom.Node in standard library reference.

Listing 2: selecting descendant elements by tag name
def get_elements_by_tag_name(node, name):
    """
    Returns an iterator (in document order) over all the element nodes
    that are descendants of the given one, and have the given tag name.
y           node will be returned if it has the right name
    name - the node name to check
    """
    return doc_order_iter_filter(node, lambda n: n.nodeType == \ 
       Node.ELEMENT_NODE and n.nodeName == name)


def get_elements_by_tag_name_ns(node, ns, local):
    """
    Returns an iterator (in document order) over all the element nodes
    that are descendants of the given one, and have the given namespace
    and local name.
    node - the starting point (subtree rooted at node will be searched)
           node will be returned if it has the right ns and local name
    ns - the namespace to check
    local - the local name to check
    """
    return doc_order_iter_filter(node, lambda n: n.nodeType == \ 
       Node.ELEMENT_NODE and n.namespaceURI == ns and n.localName == local)  

get_elements_by_tag_name uses the filtered iterator, filtering by the criteria that the node is an element and that its node name matches the one given. get_elements_by_tag_name_ns is a namespace-aware variation, which takes the namespace and local name for the comparison. The following snippet illustrates their operation by printing all element nodes named "line" (with no namespace) in a document.

DOC = """<?xml version='1.0' encoding='iso-8859-1'?>
<verse>
  <attribution>Louis MacNeice</attribution>
  <line>The Laird o' Phelps spent Hogmanay declaring he was sober,</line>
  <line>Counted his feet to prove the fact and found he had one foot over.</line>
  <line>Mrs. Carmichael had her fifth, looked at the job with repulsion,</line>
  <line>Said to the midwife "Take it away; I'm through with overproduction."</line>
</verse>
"""

from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseString(DOC, "http://dummy.ns")

print "Get elements by tag name:"
for node in get_elements_by_tag_name(doc, 'line'):
    print node
print "Get elements by tag name ns:"
for node in get_elements_by_tag_name_ns(doc, None, 'line'):
    print node  

I use None to represent "no namespace" when I invoke get_elements_by_tag_name_ns. Don't forget this important Python-XML convention. When I posted this generator-based method of writing a getElementsByTagName implementation, my colleague Mike Olson followed up with an XPath approach, presented in listing 3.

Listing 3: XPath-based function to get all elements by QName
from Ft.Xml.XPath import Evaluate, Context
from Ft.Xml.Domlette import GetAllNs

def get_elements_by_tag_name_ns_xpath(node, name):
    #I need the context node and a set of namespace mappings
    #In case they use a QName for name e.g. ns:local
    #Use the dictionary of all namespace mappings on the given node
    context = Context.Context(node, GetAllNs(node))
    #Now do an XPath evaluation of an expression that searches
    #The descendant-or-self axis for all nodes whose name
    #matches the given
    return Evaluate(".//" + name, context=context)  

This approach has some advantages. For one thing, you can pass a qualified name (QName) to this function if you want to check within a given namespace, rather than separately passing in the namespace URI and local name. Therefore, if you are looking for all elements with namespace URI http://spam.com and local name eggs, and the prefix sp is bound to that namespace URI in the source document, then you can invoke get_elements_by_tag_name_ns_xpath(node, "sp:eggs") rather than the more verbose get_elements_by_tag_name_ns(node, "http://spam.com", "eggs"). Another advantage is that the function is a two-liner. One line is needed to capture the namespace mappings from the given node, and it so happens that, since 4Suite has been updated in CVS, this line will no longer be necessary and the following variation will work even given a QName.

def get_elements_by_tag_name_ns_xpath(node, name)
    #The context= is no longer needed because we're passing in the node
    #from which the context will be assumed
    return Evaluate(".//" + name, node)  

With 4Suite 0.12.0a3 and earlier versions, this function will fail with an "unknown prefix" exception if you pass in a QName. Anyway, the shorter version means that you needn't even make it a function. You can just write Evaluate(node, ".//sp:eggs") in-line wherever you need the function. A final advantage is that the XPath approach works with Python 2.1, which does not support generators. You can also write the function much the same way as with generators but using raw recursion. Listing 5 is an example of this approach.

Listing 5: A version of the get-elements-by-tag-name function using raw recursion.
def get_elements_by_tag_name_ns_raw(node, ns, local):
    if node.nodeType == Node.ELEMENT_NODE and \
       node.namespaceURI == ns and node.localName == local:
        result = [node]
    else:
        result = []
    for child in node.childNodes:
        result.extend(get_elements_by_tag_name_ns_raw(child, ns, local))
    return result  

Pages: 1, 2

Next Pagearrow







close