Generating DOM Magic
by Uche Ogbuji
One disadvantage of the XPath approach is that it would be much
slower because of the overall overhead and machinery of the XPath
implementation. In theory, the raw recursion approach should also be
less efficient than the generator approach, but by a much smaller
margin. The expected slowness of raw recursion is due to two factors: first, Python's function call overhead, which is lower for resuming a generator than for making a full recursive call; and second, the extra memory consumed in building intermediate result sublists, which the generator approach avoids entirely. To confirm these speed hypotheses, I wrote a small performance test script:
from Ft.Xml.Domlette import NonvalidatingReader
import time

doc = NonvalidatingReader.parseUri(BIG_FILE_NAME)

start = time.time()
for i in range(10):
    for node in get_elements_by_tag_name_ns(doc, None, NAME):
        pass
end = time.time()
print "Generator getElementsByTagNameNS: %.2f"%(end-start)

start = time.time()
for i in range(10):
    for node in get_elements_by_tag_name_ns_xpath(doc, NAME):
        pass
end = time.time()
print "XPath getElementsByTagNameNS: %.2f"%(end-start)

start = time.time()
for i in range(10):
    for node in get_elements_by_tag_name_ns_raw(doc, None, NAME):
        pass
end = time.time()
print "Raw recursion getElementsByTagNameNS: %.2f"%(end-start)
where BIG_FILE_NAME is an XML file to process and
NAME is the name of an element to find throughout the file.
A null namespace is assumed. I ran this against Elliotte
Rusty Harold's 1998 baseball stats (stats.xml) and two other
well-known large files, Hamlet (hamlet.xml) and the Old Testament
(ot.xml) from John
Bosak's Revised XML Document Collections. The results I found are
tabulated below (all timings in seconds):
| XML source | raw file size | element sought | number found | 10 generator runs | 10 XPath runs | 10 raw recursion runs |
| hamlet.xml | 288KB | LINE | 4012 | 2.14 | 40.60 | 2.07 |
| stats.xml | 644KB | GAMES | 1226 | 5.59 | 39.18 | 5.52 |
| ot.xml | 3412KB | v | 23145 | 8.25 | 2275.06 | 7.83 |
You can see how badly the XPath version lags in this test, but I was
quite surprised by how close the generator and raw recursion methods
are. The test machine is a Dell Inspiron 8100 laptop, Pentium III
1.13GHz, 512MB RAM, Red Hat Linux 8.0, Python
2.2.1 (built myself), 4Suite CVS from 1 January 2003. Certainly
none of the tests were big enough to swamp my memory with sublists in the
raw recursion function, but I still expected a general performance hit.
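As a refresher, since the two implementations under test were defined back on page 1, here is a minimal sketch of the two shapes being compared (the real definitions may differ in detail). The point is that the raw version builds a sublist at every level of the recursion, while the generator yields matches lazily:
from __future__ import generators  # needed for yield in Python 2.2
from xml.dom import Node

def get_elements_by_tag_name_ns(node, ns, local):
    # Generator version: yields each match lazily; no sublists built
    if node.nodeType == Node.ELEMENT_NODE \
       and node.namespaceURI == ns and node.localName == local:
        yield node
    for child in node.childNodes:
        for match in get_elements_by_tag_name_ns(child, ns, local):
            yield match

def get_elements_by_tag_name_ns_raw(node, ns, local):
    # Raw recursion version: accumulates a result list at every level
    result = []
    if node.nodeType == Node.ELEMENT_NODE \
       and node.namespaceURI == ns and node.localName == local:
        result.append(node)
    for child in node.childNodes:
        result.extend(get_elements_by_tag_name_ns_raw(child, ns, local))
    return result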
More DOM tricks
There are many other utility functions you can create using the
generator-iterator foundations I've laid. I often find that I don't need
a list of all the subelements with a particular name; instead, I need the
first one from that list in document order. In other words, I often use
an idiom like get_elements_by_tag_name(node, name)[0]. This
is very efficient with generators: a generator only does as much work as I ask of it. If the full list of matching elements would be 23,145 entries long, the generator never has to look past the first match.
This is undoubtedly an area where the generator implementation beats raw
recursion, as I was able to confirm with further experimentation. Since
this is such a common idiom, though, it's nice to give it a convenient
handle, and so I use the function in listing 4.
Listing 4: Convenience function to get the first element by tag name below the given node in document order
def get_first_element_by_tag_name_ns(node, ns, local):
    """
    Returns the first element in document order with the given namespace
    and local name
    node - the starting point (subtree rooted at node will be searched)
           node will be returned if it has the right ns and local name
    ns - the namespace to check
    local - the local name to check
    """
    return get_elements_by_tag_name_ns(node, ns, local).next()
next() is the way the first item must be retrieved, rather than [0], because generators don't allow random access; this method, which retrieves each yielded value in turn, is the only means of access they support.
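One caveat: if no matching element exists at all, the next() call raises StopIteration rather than returning a sentinel value. Here's a quick usage sketch; the document string and base URI are invented for illustration, and I'm assuming 4Suite's NonvalidatingReader.parseString, which takes a document string and a base URI:
from Ft.Xml.Domlette import NonvalidatingReader

DOC = "<labels><label><name>Uche</name></label></labels>"
doc = NonvalidatingReader.parseString(DOC, 'urn:bogus:dummy-uri')

try:
    name = get_first_element_by_tag_name_ns(doc, None, u'name')
    print name.firstChild.data          # prints: Uche
except StopIteration:
    print "No matching element found"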
A common XML processing operation is to retrieve only the
character data within a given node, which is basically how
conversion from node to string works in XPath. You may be familiar
with this operation from common uses of xsl:value-of in
XSLT. Listing 5 offers this operation on DOM nodes using the
generator-iterator.
Listing 5: Function to return the XPath string value of a given DOM node
def string_value(node):
    """
    Return the XPath string value of the given node
    This basically consists of one string comprising
    all the character data in the node and its descendants
    node - the starting point
    """
    text_nodes = doc_order_iter_filter(
        node, lambda n: n.nodeType == Node.TEXT_NODE
        )
    return u''.join([ n.data for n in text_nodes ])
This function obtains the usual generator-iterator, filtering out all but text nodes. It then uses a list comprehension over the generator-iterator to collect the data string from each node, and finally stitches the list together into a single string using join. It could be even more clever and use another generator to extract the data strings rather than a list comprehension, but that wouldn't gain much and would be less clear. Don't hesitate to use generators in list comprehensions; the combination enables many handy idioms.
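Here's a quick illustrative run over an invented scratch document (same parseString assumption as above):
from Ft.Xml.Domlette import NonvalidatingReader

DOC = "<msg>hello <b>cruel</b> world</msg>"
doc = NonvalidatingReader.parseString(DOC, 'urn:bogus:dummy-uri')

# Concatenates all descendant character data in document order
print string_value(doc)          # prints: hello cruel world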
Another common need is to search descendant elements for a particular
substring among their text nodes; basically, a full-text search.
Listing 6 defines a function for this purpose.
Listing 6: A full-text search function
def text_search(node, sought):
    """
    Return an iterator over the descendant elements which contain a
    given substring in their text children
    node - the starting point (subtree rooted at node will be searched)
    sought - the substring to find
    """
    return doc_order_iter_filter(
        node, lambda n: n.nodeType == Node.ELEMENT_NODE \
            and [ 1 for cn in n.childNodes \
                  if cn.nodeType == Node.TEXT_NODE \
                  and cn.data.find(sought) != -1
                ]
        )
In this function I use the generator-iterator-filter foundation.
The filter passes a descendant node if it is an element and has a
child node which is text and contains the substring. Again I use
a list comprehension, but in a different way this time. The comprehension yields an empty list if no text child contains the substring (tested using the find method on the data string), and a list of one or more 1 values if any does. Since an empty list is treated as boolean false, and a list with any items as boolean true, this has the effect of efficiently selecting exactly those elements containing the wanted substring.
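To see it in action, here's a sketch using an invented scratch document; only the first line element's direct text contains the substring:
from Ft.Xml.Domlette import NonvalidatingReader

DOC = "<poem><line>The worlds whole sap is sunke</line>" \
      "<line>Dead and enterr'd</line></poem>"
doc = NonvalidatingReader.parseString(DOC, 'urn:bogus:dummy-uri')

for elem in text_search(doc, u'sap'):
    print elem.nodeName          # prints: line (the first line element only)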
Sometimes I want to apply certain processing only to the leaf element nodes of a document; that is, to those element nodes which have no element children. Listing 7 returns an iterator over such childless elements.
Listing 7: Function to return only leaf element nodes
def get_leaf_elements(node):
    """
    Return an iterator over the elements that have no element children
    node - the starting point (subtree rooted at node will be searched)
    """
    return doc_order_iter_filter(
        node, lambda n: n.nodeType == Node.ELEMENT_NODE \
            and not [ 1 for cn in n.childNodes \
                      if cn.nodeType == Node.ELEMENT_NODE
                    ]
        )
This time the filter passes all element nodes that have no
element children. The list comprehension approach used for this test
is very similar to that in text_search.
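A quick sanity check with an invented document, in which only the c and d elements lack element children:
from Ft.Xml.Domlette import NonvalidatingReader

doc = NonvalidatingReader.parseString(
    "<a><b><c/></b><d/></a>", 'urn:bogus:dummy-uri')

print [ e.nodeName for e in get_leaf_elements(doc) ]
# prints: [u'c', u'd']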
Conclusion
Many common DOM operations can be developed using the basic techniques
presented in this article. The whole module, domtools.py,
with the routines I've presented is available
on my web
site. In a future article I'll present generator idioms for SAX and
other XML-processing tools. Please don't hesitate to comment with any
generator-based tricks you've used for XML processing.
Python has gained a great many new goodies lately. I think the
aggregate effect has been to make the language better, but some people
have complained about the changes. Though I've seen defenders and
detractors for almost every language change, generators seem to be one of
those changes that are universally acclaimed. Guido van Rossum showed his
extremely sharp judgment by selecting the most practical and useful aspect
of Stackless Python and making it available in Python.
December was a slow month for Python-XML news. Members of the community pursued their various holidays, and those of us in the Northern hemisphere found that in the short days it felt like, as John Donne said, "the worlds whole sap is sunke". One notable December
discovery was PAX,
the Pythonic API for XML, which allows developers to parse an XML file
into a Pythonic representation (including use of Python 2.2
iterators). PAX also includes a transformation engine and is open
source.