
Location, Location, Location
It is often useful to keep track of the location of some data in an XML file being processed. If you are parsing a file in order to perform sophisticated search and analysis tasks, you may want to know in which element or other such node a specific pattern was found (or even at what file location). XPath is the standard way to convey the location of an XML node. In the case of DOM, you might like to be able to compute an XPath expression selecting a specific node. In the case of SAX, you might want to have an XPath location for a current event, or you may want to get information on a current file location from the parser. In this article, I cover techniques for figuring out such location information. Along the way, I shall be providing some examples of marginally documented corners of Python's SAX libraries.
Locating DOM Nodes
Alexandre Conrad asked the question on XML-SIG: "In XPath, is there a way I can get the absolute path for a node?" You can compute a unique absolute path for any node by using positional predicates in the portion of the path that identifies a specific element and then by using name tests to find attributes or similar tricks for other node types. I'll illustrate using the the sample file in Listing 1.
Listing 1: Sample XML File (labels.xml) Containing Address Labels
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
<label added="2003-06-20">
<quote>
<emph>Midwinter Spring</emph> is its own season…
</quote>
<name>Thomas Eliot</name>
<address>
<street>3 Prufrock Lane</street>
<city>Stamford</city>
<state>CT</state>
</address>
</label>
<label added="2003-06-10">
<name>Ezra Pound</name>
<address>
<street>45 Usura Place</street>
<city>Hailey</city>
<state>ID</state>
</address>
</label>
</labels>
The following are examples of unique XPath absolute paths for some nodes in Listing 1.
/: the (abstract) root of the document/labels[1]: the root element (Actually just/labelsis sufficient, but the algorithm described in this article uses positional predicates for all element node tests, for consistency)/labels[1]/label[1]: the label element corresponding to Eliot/labels[1]/label[2]: the label element corresponding to Pound/labels[1]/label[1]/@added: attribute with value "2003-06-20"/labels[1]/label[1]/quote[1]/emph[1]/text()[1]: text node with value "Midwinter Spring"
Florian Bösch
posted code on XML-SIG for returning such absolute paths from a given node. Starting from his code, I developed a more complete function,
abs_path, presented in Listing 2 (listing2.py).
Listing 2 (listing2.py): abs_path Function for Returning a Unique Absolute XPath for a Given DOM Node Containing Address Labels
# -*- coding: iso-8859-1 -*-
from xml.dom import Node
#Mapping from node type to XPath node test function name
OTHER_NODES = {
Node.TEXT_NODE: 'text',
Node.COMMENT_NODE: 'comment',
Node.PROCESSING_INSTRUCTION_NODE: 'processing-instruction'
}
def abs_path(node):
#based on code developed by Florian Bösch on XML-SIG
#http://mail.python.org/pipermail/xml-sig/2004-August/010423.html
#Significantly enhanced to use Unicode properly, support more
#node types, use safer node type tests, etc.
"""
Return an XPath expression that provides a unique path to
the given node (supports elements, attributes, root nodes,
text nodes, comments and PIs) within a document
"""
if node.nodeType == Node.ELEMENT_NODE:
count = 1
#Count previous siblings with same node name
previous = node.previousSibling
while previous:
if previous.localName == node.localName: count += 1
previous = previous.previousSibling
step = u'%s[%i]' % (node.nodeName, count)
ancestor = node.parentNode
elif node.nodeType == Node.ATTRIBUTE_NODE:
step = u'@%s' % (node.nodeName)
ancestor = node.ownerElement
elif node.nodeType in OTHER_NODES:
#Text nodes, comments and PIs
count = 1
#Count previous siblings of the same node type
previous = node.previousSibling
while previous:
if previous.nodeType == node.nodeType: count += 1
previous = previous.previousSibling
test_func = OTHER_NODES[node.nodeType]
step = u'%s()[%i]' % (test_func, count)
ancestor = node.parentNode
elif not node.parentNode:
#Root node
step = u''
ancestor = node
else:
raise TypeError('Unsupported node type for abs_path')
if ancestor.parentNode:
return abs_path(ancestor) + u'/' + step
else:
return u'/' + step
Have a look at the first line, a comment with the new standard
instruction to specify that the Python source file is not a plain
ASCII file. (I mention Florian Bösch's name in acknowledgements in a
comment.) If you're not familiar with this instruction, introduced
for Python 2.3, see PEP 263. In
general, I coded the function for any Python 2.2 or more recent, and
I tested it with Python 2.3. I think the only construct that won't
work in 2.0 and 2.1 is node.nodeType in OTHER_NODES
(rather than node.nodeType in OTHER_NODES.keys()). The
code is straightforward, using simple recursion to build each step
of the absolute path while working up towards the root of the DOM
tree. The following session illustrates the use
of abs_path.
>>> from listing2 import abs_path
>>> #If you're using minidom
>>> from xml.dom import minidom
>>> doc = minidom.parse('labels.xml')
>>> #If you're using 4Suite's Domlette
>>> from Ft.Xml.Domlette import NonvalidatingReader
>>> from Ft.Lib import Uri
>>> file_uri = Uri.OsPathToUri('labels.xml', attemptAbsolute=1)
>>> doc = NonvalidatingReader.parseUri(file_uri)
>>> The rest is DOM-agnostic
>>> print abs_path(doc)
/
>>> print abs_path(doc.documentElement)
/labels[1]
>>> for node in doc.documentElement.childNodes: print abs_path(node)
/labels[1]/text()[1]
/labels[1]/label[1]
/labels[1]/text()[2]
/labels[1]/label[2]
/labels[1]/text()[3]
>>>
I include abs_path in my domtools module.
File Locations from SAX
Another approach to getting information about where you are in a
document is tracking the location of a current event while using a
streaming interface such as SAX. SAX provides for this using
the Locator interface. As explained in the xml.sax
standard library documentation, "SAX parsers are strongly encouraged
(though not absolutely required) to supply a locator: if [a parser]
does so, it must supply the locator to the application by invoking
[the method setDocumentLocator] before invoking any of the other
methods in the DocumentHandler interface." Every SAX driver I know of
that comes with Python or on PyXML supports locators. Listing 3
serves as an example of how useful SAX locators can be. It is a
simple, XML-aware, regular expressions search that is smart enough to
only search actual element content (as opposed to, say, tags, comments,
or attribute content), and it reports the position where any match was
found.
Listing 3 (listing3.py): SAX Code for Regex Search of Element Content
from xml import sax
import sys
import re
class content_regex(sax.ContentHandler):
"""
Search only the content of text for a given regex, and
report the file position of each match
"""
def __init__(self, search_str):
self.locator = None
#Compile the given regex for quick search
self.search_pat = re.compile(search_str)
pass
#Overridden DocumentHandler methods
def setDocumentLocator(self, locator):
#If the parser supports location info it invokes this event
#before any other methods in the DocumentHandler interface
self.locator = locator
return
def startDocument(self):
if self.locator is None:
raise RuntimeError('The parser does not support locators')
return
def characters(self, text):
#Use the locator to record where we are
line = self.locator.getLineNumber()
col = self.locator.getColumnNumber()
entity = self.locator.getSystemId()
results = self.search_pat.finditer(text)
for match in results:
#Display information for each match
print 'match "' + match.group() + '" at offset', match.pos,
print 'from line', line,', column', col,
print 'in entity', entity
return
if __name__ == "__main__":
parser = sax.make_parser()
file_to_search = sys.argv[1]
search_str = sys.argv[2]
handler = content_regex(search_str)
parser.setContentHandler(handler)
parser.parse(file_to_search)
The following session illustrates this code at work:
$ python listing3.py labels.xml "CT"
match"CT" at offset 0 from line 12 , column 13 in entity labels.xml
$ python listing3.py labels.xml "[0-9]+"
match "3" at offset 0 from line 10 , column 14 in entity labels.xml
match "45" at offset 0 from line 18 , column 14 in entity labels.xml
Pages: 1, 2 |