
Generating DOM Magic

January 8, 2003

Uche Ogbuji

Python 2.2 introduced generators, a special type of function with more flexible flow of control than ordinary procedural functions. A standard procedural function starts at the top and executes until it returns to the caller, maintaining all along a local state for the subroutine (comprising local variables and passed-in parameters). The function can have multiple return points, but in each invocation it runs until a single return, after which its local state becomes unavailable. Generators, by contrast, can send control back to the caller and yet remain in a sort of suspended animation, maintaining local state. Such a temporary return is called "yielding". The caller can then revive the generator at any time, at which point operation continues until the next yield or until a final return.
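
Here is a minimal sketch of the mechanics (count_up_to is a made-up name, not from any library): each call to next() runs the generator until the next yield, and local state survives between calls.

from __future__ import generators

def count_up_to(limit):
    #Yields the integers from 1 to limit, suspending at each yield
    n = 1
    while n <= limit:
        yield n    #Temporary return; n and limit are preserved
        n += 1
    return         #Final return; the generator is now exhausted

counter = count_up_to(3)
print counter.next()    #Prints 1: runs to the first yield
print counter.next()    #Prints 2: resumes right after the yield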

Generators, as they appeared in Python 2.2, were born in Christian Tismer's Stackless Python as a feature made practical by continuations, the core construct of Stackless. Generators are especially suited to efficient operation on tree structures, which are, of course, the natural structure for XML processing. In this article I present examples and techniques for using generators for XML processing. If you're not already familiar with Python generators, you should read "Iterators and simple generators", by David Mertz. Since this article builds on a previous one of mine, "Streamlining DOM XML processing with Python", I recommend that all readers review it before continuing. At the end of this article there are some resources to help you acquire other relevant background knowledge.

Implementing a DOM classic

Listing 1 presents two functions for iterating over all the nodes in a DOM tree in document order, both of which were introduced in "Streamlining DOM XML processing with Python".

Listing 1: generators for filtered or unfiltered iteration over a DOM tree
#Required in Python 2.2, and must be the first import
from __future__ import generators
from xml.dom import Node


def doc_order_iter(node):
    """
    Iterates over each node in document order,
    yielding each in turn, starting with the given node
    node - the starting point
           (subtree rooted at node will be iterated over in document order)
    """
    #Document order returns the current node,
    #then each of its children in turn
    yield node
    for child in node.childNodes:
        #Create a generator for each child,
        #over which to iterate
        for cn in doc_order_iter(child):
            yield cn
    return


def doc_order_iter_filter(node, filter_func):
    """
    Iterates over each node in document order,
    applying the filter function to each in turn,
    starting with the given node, and yielding each node in
    cases where the filter function computes true
    node - the starting point
           (subtree rooted at node will be iterated over in document order)
    filter_func - a callable object taking a node and returning
                  true or false
    """
    if filter_func(node):
        yield node
    for child in node.childNodes:
        for cn in doc_order_iter_filter(child, filter_func):
            yield cn
    return

With these generators in hand we can build several other useful functions simply and easily -- functions which will perform well, too. Users of the 4Suite Domlettes sometimes complain that they do not include the handy DOM utility methods getElementsByTagName and getElementsByTagNameNS. In fact, I warned about this omission in my last article, "A Python & XML Companion". There are many trivial ways to implement these as functions, but I think the best uses the generator-iterators, as in Listing 2. The generator functions presented here will work with any Python DOM implementation which conforms to the published API of xml.dom.Node in the Python standard library reference.

Listing 2: selecting descendant elements by tag name
def get_elements_by_tag_name(node, name):
    """
    Returns an iterator (in document order) over all the element nodes
    that are descendants of the given one, and have the given tag name.
    node - the starting point (subtree rooted at node will be searched)
           node will be returned if it has the right name
    name - the node name to check
    """
    return doc_order_iter_filter(
        node,
        lambda n: n.nodeType == Node.ELEMENT_NODE and n.nodeName == name
    )


def get_elements_by_tag_name_ns(node, ns, local):
    """
    Returns an iterator (in document order) over all the element nodes
    that are descendants of the given one, and have the given namespace
    and local name.
    node - the starting point (subtree rooted at node will be searched)
           node will be returned if it has the right ns and local name
    ns - the namespace to check
    local - the local name to check
    """
    return doc_order_iter_filter(
        node,
        lambda n: (n.nodeType == Node.ELEMENT_NODE
                   and n.namespaceURI == ns and n.localName == local)
    )

get_elements_by_tag_name uses the filtered iterator, filtering by the criteria that the node is an element and that its node name matches the one given. get_elements_by_tag_name_ns is a namespace-aware variation, which takes the namespace and local name for the comparison. The following snippet illustrates their operation by printing all element nodes named "line" (with no namespace) in a document.

DOC = """<?xml version='1.0' encoding='iso-8859-1'?>

<verse>

  <attribution>Louis MacNeice</attribution>

  <line>The Laird o' Phelps spent Hogmanay declaring he was sober,</line>

  <line>Counted his feet to prove the fact and found he had one foot over.</line>

  <line>Mrs. Carmichael had her fifth, looked at the job with repulsion,</line>

  <line>Said to the midwife "Take it away; I'm through with overproduction."</line>

</verse>

"""



from Ft.Xml.Domlette import NonvalidatingReader

doc = NonvalidatingReader.parseString(DOC, "http://dummy.ns")



print "Get elements by tag name:"

for node in get_elements_by_tag_name(doc, 'line'):

    print node

print "Get elements by tag name ns:"

for node in get_elements_by_tag_name_ns(doc, None, 'line'):

    print node  
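
Incidentally, because these functions rely only on the published xml.dom.Node API, they work just as well with other conformant DOM implementations. As a quick sketch, the same search runs unchanged against the standard library's minidom:

from xml.dom import minidom
mdoc = minidom.parseString(DOC)
#The generator functions don't care which DOM implementation made the nodes
for node in get_elements_by_tag_name(mdoc, 'line'):
    print node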

I use None to represent "no namespace" when I invoke get_elements_by_tag_name_ns. Don't forget this important Python-XML convention. When I posted this generator-based method of writing a getElementsByTagName implementation, my colleague Mike Olson followed up with an XPath approach, presented in Listing 3.

Listing 3: XPath-based function to get all elements by QName
from Ft.Xml.XPath import Evaluate, Context
from Ft.Xml.Domlette import GetAllNs


def get_elements_by_tag_name_ns_xpath(node, name):
    #I need the context node and a set of namespace mappings
    #in case they use a QName for name, e.g. ns:local.
    #Use the dictionary of all namespace mappings on the given node
    context = Context.Context(node, GetAllNs(node))
    #Now do an XPath evaluation of an expression that searches
    #the descendant-or-self axis for all nodes whose name
    #matches the given one
    return Evaluate(".//" + name, context=context)

This approach has some advantages. For one thing, you can pass a qualified name (QName) to this function if you want to check within a given namespace, rather than separately passing in the namespace URI and local name. Therefore, if you are looking for all elements with namespace URI http://spam.com and local name eggs, and the prefix sp is bound to that namespace URI in the source document, then you can invoke get_elements_by_tag_name_ns_xpath(node, "sp:eggs") rather than the more verbose get_elements_by_tag_name_ns(node, "http://spam.com", "eggs"). Another advantage is that the function is a two-liner. One of those lines captures the namespace mappings from the given node, and as of recent updates in 4Suite CVS even that line is no longer necessary: the following variation works even when given a QName.

def get_elements_by_tag_name_ns_xpath(node, name):
    #The context= is no longer needed because we're passing in the node
    #from which the context will be assumed
    return Evaluate(".//" + name, node)

With 4Suite 0.12.0a3 and earlier versions, this function will fail with an "unknown prefix" exception if you pass in a QName. Anyway, the shorter version means that you needn't even make it a function. You can just write Evaluate(".//sp:eggs", node) in-line wherever you need it. A final advantage is that the XPath approach works with Python 2.1, which does not support generators. You can also write the function much the same way as with generators, but using raw recursion. Listing 4 is an example of this approach.

Listing 4: A version of the get-elements-by-tag-name function using raw recursion
def get_elements_by_tag_name_ns_raw(node, ns, local):
    if node.nodeType == Node.ELEMENT_NODE and \
       node.namespaceURI == ns and node.localName == local:
        result = [node]
    else:
        result = []
    for child in node.childNodes:
        result.extend(get_elements_by_tag_name_ns_raw(child, ns, local))
    return result

One disadvantage of the XPath approach is that it is much slower, because of the overall overhead and machinery of the XPath implementation. In theory, the raw recursion approach should also be less efficient than the generator approach, but by a much smaller margin. The expected slowness of raw recursion is due to two factors: first, Python's function call overhead, which is greater for a proper recursive call than for resuming a generator; and second, the result sublists built up in memory, which take extra space that the generator approach saves. To confirm these speed hypotheses, I wrote a small performance test script:

doc = NonvalidatingReader.parseUri(BIG_FILE_NAME)

import time
start = time.time()
for i in range(10):
    for node in get_elements_by_tag_name_ns(doc, None, NAME):
        pass
end = time.time()
print "Generator getElementsByTagNameNS: %.2f"%(end-start)

start = time.time()
for i in range(10):
    for node in get_elements_by_tag_name_ns_xpath(doc, NAME):
        pass
end = time.time()
print "XPath getElementsByTagNameNS: %.2f"%(end-start)

start = time.time()
for i in range(10):
    for node in get_elements_by_tag_name_ns_raw(doc, None, NAME):
        pass
end = time.time()
print "Raw recursion getElementsByTagNameNS: %.2f"%(end-start)

where BIG_FILE_NAME is an XML file to process and NAME is the name of an element to find throughout the file. A null namespace is assumed. I ran this against Elliotte Rusty Harold's 1998 baseball stats (stats.xml) and two other well-known large files, Hamlet (hamlet.xml) and the Old Testament (ot.xml) from John Bosak's Revised XML Document Collections. The results I found are tabulated below (all timings in seconds):

XML source   raw file size   elem to find   number found   10 gen runs   10 XPath runs   10 raw runs
hamlet.xml   288KB           LINE           4012           2.14          40.60           2.07
stats.xml    644KB           GAMES          1226           5.59          39.18           5.52
ot.xml       3412KB          v              23145          8.25          2275.06         7.83

You can see how badly the XPath version lags in this test, but I was quite surprised by how close the generator and raw recursion methods are. The test machine is a Dell Inspiron 8100 laptop, Pentium III 1.13GHz, 512MB RAM, Red Hat Linux 8.0, Python 2.2.1 (built myself), 4Suite CVS from 1 January 2003. Certainly none of the tests were big enough to swamp my memory with sublists in the raw recursion function, but I still expected a general performance hit.

More DOM tricks

There are many other utility functions you can create using the generator-iterator foundations I've laid. I often find that I don't need a list of all the subelements with a particular name; instead, I need the first one from that list in document order. In other words, I often use an idiom like get_elements_by_tag_name(node, name)[0]. This is very efficient using generators: the generator only does as much work as I ask of it. If the list of elements retrieved would be 23,145 elements long, the generator doesn't have to bother looking past the first one it finds. This is undoubtedly an area where the generator implementation beats raw recursion, as I was able to confirm with further experimentation. Since this is such a common idiom, though, it's nice to give it a convenient handle, and so I use the function in Listing 5.

Listing 5: Convenience function to get the first element by tag name below the given node in document order
def get_first_element_by_tag_name_ns(node, ns, local):
    """
    Returns the first element in document order with the given namespace
    and local name
    node - the starting point (subtree rooted at node will be searched)
           node will be returned if it has the right ns and local name
    ns - the namespace to check
    local - the local name to check
    """
    return get_elements_by_tag_name_ns(node, ns, local).next()

The first item must be retrieved with next() rather than [0]: generators don't allow random access, and next() is the only supported way to retrieve the next yielded value in order. Note that a generator's next() raises StopIteration when there is nothing left to yield, so if the document contains no matching element, this function raises an exception rather than returning a value.
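
If that exception is unwelcome, a defensive wrapper is easy to write. Here is one possible sketch (get_first_or_none is my own name for it, and returning None on no match is just one reasonable policy):

def get_first_or_none(node, ns, local):
    #Return the first matching element, or None if there is no match
    try:
        return get_first_element_by_tag_name_ns(node, ns, local)
    except StopIteration:
        return None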

A common XML processing operation is to retrieve only the character data within a given node, which is basically how conversion from node to string works in XPath. You may be familiar with this operation from common uses of xsl:value-of in XSLT. Listing 6 offers this operation on DOM nodes using the generator-iterator.

Listing 6: Function to return the XPath string value of a given DOM node
def string_value(node):
    """
    Return the XPath string value of the given node
    This basically consists of one string comprising
    all the character data in the node and its descendants
    node - the starting point
    """
    text_nodes = doc_order_iter_filter(
        node, lambda n: n.nodeType == Node.TEXT_NODE
    )
    return u''.join([ n.data for n in text_nodes ])

This function gets a generator-iterator as usual, one that filters out all but text nodes. Then it uses a list comprehension over the generator-iterator to create a list of the data strings from each node. Finally, it stitches this list together into a single string using join. It could be even more clever and use another generator to extract the data strings from the text nodes rather than a list comprehension, but that doesn't really gain much and would be less clear. Don't hesitate to use generators in list comprehensions, as this combination works in many very handy idioms.
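
For the curious, that less-clear generator variant might look something like the following sketch (string_value_gen and data_strings are hypothetical names of my own; the list() call is belt-and-braces for older join implementations that expect a proper sequence):

def string_value_gen(node):
    #Same result as string_value, but the data strings are pulled
    #through a nested generator rather than a list comprehension
    def data_strings(nodes):
        for n in nodes:
            yield n.data
    text_nodes = doc_order_iter_filter(
        node, lambda n: n.nodeType == Node.TEXT_NODE
    )
    return u''.join(list(data_strings(text_nodes)))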

Another common need is to search descendant elements for a particular substring among their text nodes; basically, a full-text search. Listing 7 defines a function for this purpose.

Listing 7: a full-text search function
def text_search(node, sought):
    """
    Return an iterator over descendant elements which contain a given
    substring in their text children
    node - the starting point (subtree rooted at node will be searched)
    sought - the substring to find
    """
    return doc_order_iter_filter(
        node,
        lambda n: n.nodeType == Node.ELEMENT_NODE
            and [ 1 for cn in n.childNodes
                    if cn.nodeType == Node.TEXT_NODE
                       and cn.data.find(sought) != -1
                ]
    )

In this function I use the generator-iterator-filter foundation. The filter passes a descendant node if it is an element and has a child node which is text and contains the substring. Again I use a list comprehension, but in a different way this time. It produces a list that is empty if no child text node contains the substring (tested using the find method on the data string), and a list of one or more 1 values if at least one does. Since an empty list is treated as boolean false, and a list with any items as boolean true, this has the effect of efficiently selecting those elements containing the wanted substring.
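
As a quick check against the DOC sample parsed earlier (the expected result is my own reading of that data, so treat it as a sketch):

for node in text_search(doc, "Hogmanay"):
    print node    #Only the first line element mentions Hogmanay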

Sometimes I only want to apply certain processing to the leaf element nodes of a document; that is, to those element nodes which have no element children. Listing 8 gets an iterator over such childless elements.

Listing 8: Function to return only leaf element nodes
def get_leaf_elements(node):
    """
    Return an iterator over elements that have no element children
    node - the starting point (subtree rooted at node will be searched)
    """
    return doc_order_iter_filter(
        node,
        lambda n: n.nodeType == Node.ELEMENT_NODE
            and not [ 1 for cn in n.childNodes
                        if cn.nodeType == Node.ELEMENT_NODE
                    ]
    )

This time the filter passes all element nodes that have no element children. The list comprehension approach used for this test is very similar to that in text_search.
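
Run against the earlier DOC sample, this should, if I read the data correctly, yield the attribution element and the four line elements, skipping the verse document element, which has element children:

for node in get_leaf_elements(doc):
    print node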

Conclusion

Many common DOM operations can be developed using the basic techniques presented in this article. The whole module, domtools.py, with the routines I've presented is available on my web site. In a future article I'll present generator idioms for SAX and other XML-processing tools. Please don't hesitate to comment with any generator-based tricks you've used for XML processing.



Python has added a great many new goodies lately. I think the aggregate effect has been to make the language better, but some people have complained about the changes. Though I've seen defenders and detractors for almost every language change, generators seem to be one of those changes that are universally acclaimed. Guido van Rossum showed his extremely sharp judgment by selecting the most practical and useful aspect of Stackless Python and making it available in Python.

December was a slow month for Python-XML news. Members of the community pursued their variants of holiday, and those of us in the Northern hemisphere found that in the short days it felt like, as John Donne said, "the worlds whole sap is sunke". One notable December discovery was PAX, the Pythonic API for XML, which allows developers to parse an XML file into a Pythonic representation (including use of Python 2.2 iterators). PAX also includes a transformation engine and is open source.

Resources

Before reading this article, I recommend becoming familiar with Python generators by reading "Iterators and simple generators", by David Mertz, and my previous article "Streamlining DOM XML processing with Python".

Christian Tismer's Stackless Python has been extensively covered on O'Reilly Network. You can gain some background on the birthplace of Python generators in "Introduction to Stackless Python" and "Programming Stackless Python", both by Cameron Laird. In "Stackless Reincarnate" Stephen Figgins discusses some recent developments in the Stackless world. Gordon McMillan also has an excellent Stackless Python page.

David Mertz's column on IBM developerWorks includes a series on Python generators: "Iterators and simple generators", "Generator-based state machines", and "Implementing 'weightless threads' with Python generators".

If you are really serious about understanding generators, read PEP 255 (the formal specification for generators).