
Generating DOM Magic
by Uche Ogbuji

One disadvantage of the XPath approach is that it would be much slower, because of the overhead of the XPath implementation's machinery. In theory, the raw recursion approach should also be less efficient than the generator approach, but by a much smaller margin. The expected slowness of raw recursion is due to two factors: first, Python's function call overhead, which is greater for a proper recursive call than for resuming a generator; and second, the extra memory spent building the result sublists, which the generator approach avoids. To confirm these speed hypotheses, I wrote a small performance test script:

from Ft.Xml.Domlette import NonvalidatingReader   # 4Suite's Domlette DOM
import time

# Parse the test document once, then time each approach over ten passes
doc = NonvalidatingReader.parseUri(BIG_FILE_NAME)
start = time.time()
for i in range(10):
    for node in get_elements_by_tag_name_ns(doc, None, NAME):
        pass
end = time.time()
print "Generator getElementsByTagNameNS: %.2f"%(end-start)

start = time.time()
for i in range(10):
    for node in get_elements_by_tag_name_ns_xpath(doc, NAME):
        pass
end = time.time()
print "XPath getElementsByTagNameNS: %.2f"%(end-start)

start = time.time()
for i in range(10):
    for node in get_elements_by_tag_name_ns_raw(doc, None, NAME):
        pass
end = time.time()
print "Raw recursion getElementsByTagNameNS: %.2f"%(end-start)  

where BIG_FILE_NAME is an XML file to process and NAME is the name of an element to find throughout the file. A null namespace is assumed. I ran this against Elliotte Rusty Harold's 1998 baseball stats (stats.xml) and two other well-known large files, Hamlet (hamlet.xml) and the Old Testament (ot.xml) from Jon Bosak's Revised XML Document Collections. The results I found are tabulated below (all timings in seconds):

XML source    raw file size  elem to find  number found  10 gen runs  10 XPath runs  10 raw runs
hamlet.xml    288KB          LINE          4012          2.14         40.60          2.07
stats.xml     644KB          GAMES         1226          5.59         39.18          5.52
ot.xml        3412KB         v             23145         8.25         2275.06        7.83

You can see how badly the XPath version lags in this test, but I was quite surprised by how close the generator and raw recursion methods are. The test machine is a Dell Inspiron 8100 laptop, Pentium III 1.13GHz, 512MB RAM, Red Hat Linux 8.0, Python 2.2.1 (built myself), 4Suite CVS from 1 January 2003. Certainly none of the tests were big enough to swamp my memory with sublists in the raw recursion function, but I still expected a general performance hit.
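
To make the contrast behind these numbers concrete, here is a rough sketch of the shape of the two implementations under test. This is my own illustration with invented function names, not the actual code from the first page of this article:

from __future__ import generators   # generators must be enabled in Python 2.2
from xml.dom import Node

def gen_matches_sketch(node, ns, local):
    # Yield the node itself if it matches, then recurse into the children
    if node.nodeType == Node.ELEMENT_NODE \
       and node.namespaceURI == ns and node.localName == local:
        yield node
    for child in node.childNodes:
        # Re-yield whatever the recursive generator produces, one at a time
        for match in gen_matches_sketch(child, ns, local):
            yield match

def raw_matches_sketch(node, ns, local):
    # Every recursive call builds and returns a sublist of matches,
    # which is the extra memory cost discussed above
    matches = []
    if node.nodeType == Node.ELEMENT_NODE \
       and node.namespaceURI == ns and node.localName == local:
        matches.append(node)
    for child in node.childNodes:
        matches.extend(raw_matches_sketch(child, ns, local))
    return matches

The generator version touches each match once and keeps nothing around; the raw version allocates a fresh list at every level of the recursion.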

More DOM tricks

There are many other utility functions you can create using the generator-iterator foundations I've laid. I often find that I don't need a list of all the subelements with a particular name; instead, I need the first one from that list in document order. In other words, I often use an idiom like get_elements_by_tag_name(node, name)[0]. This is very efficient using generators: the generator only does as much work as I ask it to. If the list of elements retrieved is 23,145 elements long, the generator doesn't have to bother looking past the first one it finds. This is undoubtedly an area where the generator implementation beats raw recursion, as I was able to confirm with further experimentation. Since this is such a common idiom, though, it's nice to give it a convenient handle, and so I use the function in Listing 4.

Listing 4: Convenience function to get the first element by tag name below the given node in document order
def get_first_element_by_tag_name_ns(node, ns, local):
    """
    Returns the first element in document order with the given namespace
    and local name
    node - the starting point (subtree rooted at node will be searched)
           node will be returned if it has the right ns and local name
    ns - the namespace to check
    local - the local name to check
    """
    return get_elements_by_tag_name_ns(node, ns, local).next()  

next() is how the first item must be retrieved, rather than [0], because generators don't allow random access; the only supported operation is retrieving the next yielded value in order. Note that if no element matches at all, next() raises a StopIteration exception.
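
Callers who aren't certain a match exists may therefore want to guard the call. A minimal sketch, assuming 4Suite's Domlette reader; the file and element names here are just stand-ins:

from Ft.Xml.Domlette import NonvalidatingReader

doc = NonvalidatingReader.parseUri("hamlet.xml")   # stand-in file name
try:
    first = get_first_element_by_tag_name_ns(doc, None, u"TITLE")
    print first.nodeName
except StopIteration:
    print "No matching element found"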

A common XML processing operation is to retrieve only the character data within a given node, which is basically how conversion from node to string works in XPath. You may be familiar with this operation from common uses of xsl:value-of in XSLT. Listing 5 offers this operation on DOM nodes using the generator-iterator.

Listing 5: Function to return the XPath string value of a given DOM node
def string_value(node):
    """
    Return the XPath string value of the given node
    This basically consists of one string comprising
    all the character data in the node and its descendants
    node - the starting point
    """
    text_nodes = doc_order_iter_filter(
        node, lambda n: n.nodeType == Node.TEXT_NODE
    )
    return u''.join([ n.data for n in text_nodes ])  

This function gets a generator-iterator, as usual, that filters out all but text nodes. It then uses a list comprehension over the generator-iterator to create a list of the data strings from each node. Finally, it stitches this list together into a single string using join. It could be even more clever and use another generator, rather than a list comprehension, to extract the data strings from the text nodes, but that wouldn't gain much and would be less clear. Don't hesitate to iterate over generators in list comprehensions; the combination makes for many very handy idioms.
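
As a quick illustration of string_value in action, using 4Suite's parseString (which takes the document string and a base URI); the document here is invented for the example:

from Ft.Xml.Domlette import NonvalidatingReader

SRC = "<monologue><line>To be, </line><line>or not to be</line></monologue>"
doc = NonvalidatingReader.parseString(SRC, "urn:dummy")
# Prints the concatenated character data: To be, or not to be
print string_value(doc)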

Another common need is to search descendant elements for a particular substring among their text nodes; basically, a full-text search. Listing 6 defines a function for this purpose.

Listing 6: A full-text search function
def text_search(node, sought):
    """
    Return an iterator over the descendant elements which contain a given
    substring in their text children
    node - the starting point (subtree rooted at node will be searched)
    sought - the substring to find
    """
    return doc_order_iter_filter(
        node, lambda n: n.nodeType == Node.ELEMENT_NODE \
            and [ 1 for cn in n.childNodes \
                    if cn.nodeType == Node.TEXT_NODE \
                       and cn.data.find(sought) != -1
                ]
        ) 

In this function I use the generator-iterator-filter foundation. The filter passes a descendant node if it is an element and has a text child node containing the substring. Again I use a list comprehension, but in a different way this time: the comprehension builds a list that is empty if no child text node contains the substring (tested using the find method on the data string), and that holds one or more 1 values if any do. Since an empty list is treated as boolean false, and a list with any items as boolean true, this has the effect of efficiently selecting just those elements that contain the wanted substring.
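
The list-as-boolean trick is easier to see in isolation first, then in use; the strings, file name, and search term below are stand-ins for the example:

# The filter's truth test in isolation: a non-empty list means true
lines = [u"to be or not to be", u"that is the question"]
print [ 1 for line in lines if line.find(u"question") != -1 ]   # [1], a true value
print [ 1 for line in lines if line.find(u"Yorick") != -1 ]     # [], a false value

# The search itself
from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseUri("hamlet.xml")
for elem in text_search(doc, u"Denmark"):
    print elem.nodeName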

Sometimes I only want to apply certain processing to leaf element nodes of a document; that is, to those element nodes which have no element children. Listing 7 gets an iterator over such childless elements.

Listing 7: Function to return only leaf element nodes
def get_leaf_elements(node):
    """
    Return an iterator over the elements that have no element children
    node - the starting point (subtree rooted at node will be searched)
    """
    return doc_order_iter_filter(
        node, lambda n: n.nodeType == Node.ELEMENT_NODE \
                  and not [ 1 for cn in n.childNodes \
                              if cn.nodeType == Node.ELEMENT_NODE
                          ]
    )  

This time the filter passes all element nodes that have no element children. The list comprehension approach used for this test is very similar to that in text_search.
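
As a quick sanity check (the file name is again a stand-in):

from Ft.Xml.Domlette import NonvalidatingReader

doc = NonvalidatingReader.parseUri("hamlet.xml")
# In a play marked up like hamlet.xml this prints leaf element names
# such as TITLE, SPEAKER and LINE, but not containers such as ACT or SCENE
for elem in get_leaf_elements(doc):
    print elem.nodeName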

Conclusion

Many common DOM operations can be developed using the basic techniques presented in this article. The whole module, domtools.py, with the routines I've presented is available on my web site. In a future article I'll present generator idioms for SAX and other XML-processing tools. Please don't hesitate to comment with any generator-based tricks you've used for XML processing.

    


Python has added a great many new goodies lately. I think the aggregate effect has been to make the language better, but some people have complained about the changes. Though I've seen defenders and detractors for almost every language change, generators seem to be one of those changes that are universally acclaimed. Guido van Rossum showed his extremely sharp judgment by selecting the most practical and useful aspect of Stackless Python and making it available in Python.

December was a slow month for Python-XML news. Members of the community pursued their variants of holiday, and those of us in the Northern hemisphere found that in the short days it felt like, as John Donne put it, "the worlds whole sap is sunke". One notable December discovery was PAX, the Pythonic API for XML, which allows developers to parse an XML file into a Pythonic representation (including use of Python 2.2 iterators). PAX also includes a transformation engine and is open source.

Resources

Before reading this article, I recommend becoming familiar with Python generators by reading "Iterators and simple generators" by David Mertz, as well as my previous article, "Streamlining DOM XML processing with Python".

Christian Tismer's Stackless Python has been extensively covered on O'Reilly Network. You can gain some background on the birthplace of Python generators in "Introduction to Stackless Python" and "Programming Stackless Python", both by Cameron Laird. In "Stackless Reincarnate", Stephen Figgins discusses some recent developments in the Stackless world. Gordon McMillan also has an excellent Stackless Python page.

David Mertz's column on IBM developerWorks includes a series on Python generators: "Iterators and simple generators", "Generator-based state machines", and "Implementing 'weightless threads' with Python generators".

If you are really serious about understanding generators, read PEP 255 (the formal specification for generators).