Menu

Location, Location, Location

November 24, 2004

Uche Ogbuji

It is often useful to keep track of the location of some data in an XML file being processed. If you are parsing a file in order to perform sophisticated search and analysis tasks, you may want to know in which element or other such node a specific pattern was found (or even at what file location). XPath is the standard way to convey the location of an XML node. In the case of DOM, you might like to be able to compute an XPath expression selecting a specific node. In the case of SAX, you might want to have an XPath location for a current event, or you may want to get information on a current file location from the parser. In this article, I cover techniques for figuring out such location information. Along the way, I shall be providing some examples of marginally documented corners of Python's SAX libraries.

Locating DOM Nodes

Alexandre Conrad asked the question on XML-SIG: "In XPath, is there a way I can get the absolute path for a node?" You can compute a unique absolute path for any node by using positional predicates in the portion of the path that identifies a specific element and then by using name tests to find attributes or similar tricks for other node types. I'll illustrate using the the sample file in Listing 1.

Listing 1: Sample XML File (labels.xml) Containing Address Labels

<?xml version="1.0" encoding="iso-8859-1"?>

<labels>

  <label added="2003-06-20">

    <quote>

      <emph>Midwinter Spring</emph> is its own season&#8230;

    </quote>

    <name>Thomas Eliot</name>

    <address>

      <street>3 Prufrock Lane</street>

      <city>Stamford</city>

      <state>CT</state>

    </address>

  </label>

  <label added="2003-06-10">

    <name>Ezra Pound</name>

    <address>

      <street>45 Usura Place</street>

      <city>Hailey</city>

      <state>ID</state>

    </address>

  </label>

</labels>

  

The following are examples of unique XPath absolute paths for some nodes in Listing 1.

  • /: the (abstract) root of the document
  • /labels[1]: the root element (Actually just /labels is sufficient, but the algorithm described in this article uses positional predicates for all element node tests, for consistency)
  • /labels[1]/label[1]: the label element corresponding to Eliot
  • /labels[1]/label[2]: the label element corresponding to Pound
  • /labels[1]/label[1]/@added: attribute with value "2003-06-20"
  • /labels[1]/label[1]/quote[1]/emph[1]/text()[1]: text node with value "Midwinter Spring"

Florian Bösch posted code on XML-SIG for returning such absolute paths from a given node. Starting from his code, I developed a more complete function, abs_path, presented in Listing 2 (listing2.py).

Listing 2 (listing2.py): abs_path Function for Returning a Unique Absolute XPath for a Given DOM Node Containing Address Labels

# -*- coding: iso-8859-1 -*-

from xml.dom import Node



#Mapping from node type to XPath node test function name

OTHER_NODES = {

    Node.TEXT_NODE: 'text',

    Node.COMMENT_NODE: 'comment',

    Node.PROCESSING_INSTRUCTION_NODE: 'processing-instruction'

    }



def abs_path(node):

    #based on code developed by Florian Bösch on XML-SIG

    #http://mail.python.org/pipermail/xml-sig/2004-August/010423.html

    #Significantly enhanced to use Unicode properly, support more

    #node types, use safer node type tests, etc.

    """

    Return an XPath expression that provides a unique path to

    the given node (supports elements, attributes, root nodes,

    text nodes, comments and PIs) within a document

    """

    if node.nodeType == Node.ELEMENT_NODE:

        count = 1

        #Count previous siblings with same node name

        previous = node.previousSibling

        while previous:

            if previous.localName == node.localName: count += 1

            previous = previous.previousSibling

        step = u'%s[%i]' % (node.nodeName, count)

        ancestor = node.parentNode

    elif node.nodeType == Node.ATTRIBUTE_NODE:

        step = u'@%s' % (node.nodeName)

        ancestor = node.ownerElement

    elif node.nodeType in OTHER_NODES:

        #Text nodes, comments and PIs

        count = 1

        #Count previous siblings of the same node type

        previous = node.previousSibling

        while previous:

            if previous.nodeType == node.nodeType: count += 1

            previous = previous.previousSibling

        test_func = OTHER_NODES[node.nodeType]

        step = u'%s()[%i]' % (test_func, count)

        ancestor = node.parentNode

    elif not node.parentNode:

        #Root node

        step = u''

        ancestor = node

    else:

        raise TypeError('Unsupported node type for abs_path')

    if ancestor.parentNode:

        return abs_path(ancestor) + u'/' + step

    else:

        return u'/' + step

  

Have a look at the first line, a comment with the new standard instruction to specify that the Python source file is not a plain ASCII file. (I mention Florian Bösch's name in acknowledgements in a comment.) If you're not familiar with this instruction, introduced for Python 2.3, see PEP 263. In general, I coded the function for any Python 2.2 or more recent, and I tested it with Python 2.3. I think the only construct that won't work in 2.0 and 2.1 is node.nodeType in OTHER_NODES (rather than node.nodeType in OTHER_NODES.keys()). The code is straightforward, using simple recursion to build each step of the absolute path while working up towards the root of the DOM tree. The following session illustrates the use of abs_path.


>>> from listing2 import abs_path

>>> #If you're using minidom

>>> from xml.dom import minidom

>>> doc = minidom.parse('labels.xml')

>>> #If you're using 4Suite's Domlette

>>> from Ft.Xml.Domlette import NonvalidatingReader

>>> from Ft.Lib import Uri

>>> file_uri = Uri.OsPathToUri('labels.xml', attemptAbsolute=1)

>>> doc = NonvalidatingReader.parseUri(file_uri)



>>> The rest is DOM-agnostic

>>> print abs_path(doc)

/

>>> print abs_path(doc.documentElement)

/labels[1]

>>> for node in doc.documentElement.childNodes: print abs_path(node)

/labels[1]/text()[1]

/labels[1]/label[1]

/labels[1]/text()[2]

/labels[1]/label[2]

/labels[1]/text()[3]

>>>

  

I include abs_path in my domtools module.

File Locations from SAX

Another approach to getting information about where you are in a document is tracking the location of a current event while using a streaming interface such as SAX. SAX provides for this using the Locator interface. As explained in the xml.sax standard library documentation, "SAX parsers are strongly encouraged (though not absolutely required) to supply a locator: if [a parser] does so, it must supply the locator to the application by invoking [the method setDocumentLocator] before invoking any of the other methods in the DocumentHandler interface." Every SAX driver I know of that comes with Python or on PyXML supports locators. Listing 3 serves as an example of how useful SAX locators can be. It is a simple, XML-aware, regular expressions search that is smart enough to only search actual element content (as opposed to, say, tags, comments, or attribute content), and it reports the position where any match was found.

Listing 3 (listing3.py): SAX Code for Regex Search of Element Content

from xml import sax

import sys

import re



class content_regex(sax.ContentHandler):

    """

    Search only the content of text for a given regex, and

    report the file position of each match

    """

    def __init__(self, search_str):

        self.locator = None

        #Compile the given regex for quick search

        self.search_pat = re.compile(search_str)

        pass



    #Overridden DocumentHandler methods

    def setDocumentLocator(self, locator):

        #If the parser supports location info it invokes this event

        #before any other methods in the DocumentHandler interface

        self.locator = locator

        return



    def startDocument(self):

        if self.locator is None:

            raise RuntimeError('The parser does not support locators')

        return



    def characters(self, text):

        #Use the locator to record where we are

        line = self.locator.getLineNumber()

        col = self.locator.getColumnNumber()

        entity = self.locator.getSystemId()

        results = self.search_pat.finditer(text)

        for match in results:

            #Display information for each match

            print 'match "' + match.group() + '" at offset', match.pos,

            print 'from line', line,', column', col,

            print 'in entity', entity

        return





if __name__ == "__main__":

    parser = sax.make_parser()

    file_to_search = sys.argv[1]

    search_str = sys.argv[2]



    handler = content_regex(search_str)

    parser.setContentHandler(handler)

    parser.parse(file_to_search)

  

The following session illustrates this code at work:


$ python listing3.py labels.xml "CT"

match"CT" at offset 0 from line 12 , column 13 in entity labels.xml

$ python listing3.py labels.xml "[0-9]+"

match "3" at offset 0 from line 10 , column 14 in entity labels.xml

match "45" at offset 0 from line 18 , column 14 in entity labels.xml

  

Absolute Paths from SAX

You can also compute context XPaths while in a SAX handler. Let's say you want a similar regex search as in Listing 3, but rather, you want it reporting the path to the element containing the match rather than the exact character offset. Listing 4 does the trick.

Listing 4 (listing4.py): SAX Code for Regex Search of Element Content, Reporting as XPath

from xml import sax

import sys

import re



TOP = -1



class content_regex(sax.ContentHandler):

    """

    Search only the content of text for a given regex, and

    report the parent element of each match

    """

    def __init__(self, search_str):

        #Compile the given regex for quick search

        self.search_pat = re.compile(search_str)

        #Keep track of the names at the level of the current element's

        #Children so correct positional predicates can be computed

        self.children = []

        #We have to keep a stack of the sibling names for

        #Each level so we don't lose context

        self.elem_name_stack = [[]]

        pass

 

    def startDocument(self):

        #Path components representing XPath steps to current element

        self.steps = [u'']

        return

 

    #Overridden DocumentHandler methods

    def startElement(self, name, attribs):

        #Update list for sibling element names

        self.elem_name_stack[TOP].append(name)

        #Count preceding siblings of the same name as current

        #(count starts at 1, as required by XPath, since we already

        #added the current name to the list)

        name_count = len([ 1 for sib_name in self.elem_name_stack[TOP]

                          if sib_name == name

                        ])

        #Update steps list, using the computed positional predicate

        self.steps.append(name+'['+str(name_count)+']')

        #Stack things up properly for the child elements

        self.elem_name_stack.append(self.children)

        self.children = []

        return



    def endElement(self, name):

        #Pop back up the stack to level of siblings

        self.elem_name_stack.pop()

        self.steps.pop()

        return



    def characters(self, text):

        results = self.search_pat.finditer(text)

        for match in results:

            #Display information for each match

            print 'match "' + match.group() + '" in element at path',

            #Generate path by putting together the current steps

            print u'/'.join(self.steps)

        return





if __name__ == "__main__":

    file_to_search = sys.argv[1]

    search_str = sys.argv[2]

    parser = sax.make_parser()

 

    handler = content_regex(search_str)

    parser.setContentHandler(handler)

    parser.parse(file_to_search)

  

As one expects with SAX, the state management code becomes a bit mind-numbing here, but you should be able to reuse the basics of the technique in similar SAX handlers. If you do, consider whether you need to watch out for fragmented text events. In this case, it doesn't matter if contiguous text is split into multiple events because all we are tracking is the path to the parent element (in Listing 3, it didn't matter because we only cared about the position information, which is independent of how the markup is chunked). If you need unfragmented text, just use a SAX filter to clean things up first.

The following session illustrates the operation of Listing 4:


$ python listing4.py labels.xml "ID"

match "ID" in element at path /labels[1]/label[2]/address[1]/state[1]

$ python listing4.py labels.xml "[0-9]+"

match "3" in element at path /labels[1]/label[1]/address[1]/street[1]

match "45" in element at path /labels[1]/label[2]/address[1]/street[1]

  

Ask the Columnist: Getting IDs from a SAX Parse

I often get questions from folks who read this column and want to ask about some point relating to XML processing in Python. Occasionally, I get a question that seems worth exploring in this column because it is of some broad interest. I shall start to discuss such questions here from time to time in an "ask the columnist" section. If you have a question you'd like me to tackle in an upcoming article, send me an email. Questions can range from the best techniques for some XML processing task to what packages I may have run across that can help with some particular problem. I'll pick the best questions that I can handle and address them, with any useful background or discussion.

I'll kick things off by tackling a very interesting question I came across on XML-SIG. Recently, Lloyd Kvam asked about APIs for listing all the ID type attributes in a document. Andrew Clover suggested some DOM techniques that might do the trick, but my first thought on the problem was to have a SAX handler or filter gather up all ID attributes during the parse. In order to properly identify attributes of type ID, it would require a SAX driver that could report attribute declarations. SAX 2 does provide an optional interface, DeclHandler which can report such things, and PyXML (but not plain Python) has some marginally documented support for it. To demonstrate this code, I developed a simple SAX handler that gathers a list of all ID and IDREF attribute declarations and instances. Listing 5 is a sample document with an internal DTD subset to properly declare some ID and IDREF attributes.

Listing 5: Sample XML File with Internal DTD Subset (labels-dtd.xml)

<?xml version="1.0" encoding="iso-8859-1"?>

<!DOCTYPE labels [

  <!ELEMENT labels (label*)>

  <!ELEMENT label (quote*, associate*, name, address)>

  <!ATTLIST label id ID #REQUIRED>

  <!ATTLIST label added CDATA #REQUIRED>



  <!ELEMENT quote (#PCDATA|emph)*>

  <!ELEMENT emph (#PCDATA)>

  <!ELEMENT associate EMPTY>

  <!ATTLIST associate ref IDREF #REQUIRED>



  <!ELEMENT name (#PCDATA)>

  <!ELEMENT address (street, city, state)>

  <!ELEMENT street (#PCDATA)>

  <!ELEMENT city (#PCDATA)>

  <!ELEMENT state (#PCDATA)>

]>

<labels>

  <label id="tse" added="2003-06-20">

    <quote>

      <emph>Midwinter Spring</emph> is its own season&#8230;

    </quote>

    <associate ref="ep"/>

    <name>Thomas Eliot</name>

    <address>

      <street>3 Prufrock Lane</street>

      <city>Stamford</city>

      <state>CT</state>

    </address>

  </label>

  <label id="ep" added="2003-06-10">

    <associate ref="tse"/>

    <name>Ezra Pound</name>

    <address>

      <street>45 Usura Place</street>

      <city>Hailey</city>

      <state>ID</state>

    </address>

  </label>

  <label id="lh" added="2004-11-01">

    <name>Langston Hughes</name>

    <address>

      <street>10 Bridge Tunnel</street>

      <city>Harlem</city>

      <state>NY</state>

    </address>

  </label>

</labels>

  

Listing 6 is the code for tracking these attributes.

Listing 6 (listing6.py): SAX Code to Gather up All ID and IDREF Attributes

from xml.sax import sax2exts

from xml import sax

from xml.sax import saxlib

import sys



ID_TYPE = u'ID'

IDREF_TYPE = u'IDREF'



class id_scanner(sax.ContentHandler, saxlib.DeclHandler):

    """



    """

    def __init__(self):

        #Both ID and IDREF list entries are of the form

        #(<element-name>, <attr-name>, <attr-value>)

        self.ids = []

        self.idrefs = []

        #List of declared attributes of types we care about

        self.idtypes = {}

        self.idreftypes = {}

        pass

 

    #Overridden DocumentHandler methods

    def startElement(self, name, attribs):

        for idattr in self.idtypes.get(name, []):

            if idattr in attribs.keys():

                value = attribs[idattr]

                self.ids.append((name, idattr, value))

        for idrefattr in self.idreftypes.get(name, []):

            if idrefattr in attribs.keys():

                value = attribs[idrefattr]

                self.idrefs.append((name, idrefattr, value))

        return



    def endElement(self, name):

        return



    def attributeDecl(self, elem, attr, type_, a_decl, a_def):

        if type_ == ID_TYPE:

            self.idtypes.setdefault(elem, []).append(attr)

        elif type_ == IDREF_TYPE:

            self.idreftypes.setdefault(elem, []).append(attr)

        return





if __name__ == "__main__":

    file_to_scan = sys.argv[1]

    handler = id_scanner()

    #One way to set up the parser: retrieve whatever can validate

    parser = sax2exts.XMLValParserFactory.make_parser()

    #But just to be sure we get xmlproc, be extra specific

    parser = sax.make_parser(['xml.sax.drivers2.drv_xmlproc'])

    parser.setProperty(

        "http://xml.org/sax/properties/declaration-handler",

        handler)

 

    parser.setContentHandler(handler)

    parser.parse(file_to_scan)



    print 'IDs', handler.ids

    print 'IDREFs', handler.idrefs

  

The two main new optional handler interfaces in SAX 2 are defined in the classes xml.sax.saxlib.DeclHandler and xml.sax.saxlib.LexicalHandler. They are not documented except in the source code, and xmlproc (part of PyXML, which is required for Listing 6) is the only Python I know of that supports them. You install your own handlers for these by setting the special parser property, as in the bottom of Listing 6. In this example, all I am concerned about are attribute declarations, so I subclass DeclHandler in order to specialize the attributeDecl method. See xml.sax.saxlib for more details of this and other less common handlers. The attribute declarations are called pretty much where one would expect: after ContentHandler.startDocument and DtdHandler.startDTD but before other events. In this phase, I prepare a mapping from each element to lists of ID- and IDREF-type attributes. In the more well-known ContentHandler.startElement, I can then use this mapping to identify and gather attributes of interest.

The following session illustrates the operation of Listing 6:


$ python listing6.py labels-dtd.xml

[(u'label', u'id', u'tse'), (u'label', u'id', u'ep'), (u'label', u'id', u'lh')]

[(u'associate', u'ref', u'ep'), (u'associate', u'ref', u'tse')]

  

News from the Community

How many "XMLObjects" does it take to screw in a lightbulb? Turns out that even I, who make it my business to pay attention to such things, came up short in my count. Philippe Normand, author of one of the "XMLObjects", lamented the name clash after the latest entrant emerged. Srijit Kumar Bhadra, an innocent bystander (and author of the Python/.NET/XML code bake-off I mentioned last month) also complained. The trigger for all this was Greg Luterman's announcement of XMLObject, "a Python module that simplifies the handling of XML streams by converting the data into objects." Of course, anyone who chooses as generic a name as "XMLObject" is just asking for name clashes.

Meanwhile, there is yet another entrant in the ongoing quest to develop a more Pythonic wrapper for libxml2. Victor Ng announced vlibxml2 release 0.1.177, a memory-managed XML library based on libxml2 for Python. Victor also contradicts himself in his misguided tirade on Python/XML processing. "[H]aving 74 differing ways of handling XML is a bad thing," he says; yet his "contribution to bring Python's XML support up to snuff" is to add one more. Never mind that it's a fourth Python binding for libxml2. He also quotes my comment: "There is no doubt that patience for non-Pythonic ways of processing XML has worn thin" for emphasis, forgetting that his own offering is a wrapper for a very non-Pythonic API. Of course, having 74 or 75 ways of handling XML is not a bad thing, and there is room for Python purity as well as loaners from other languages. We could count to 75 for Java tools or C tools, and the sky is not falling in those communities either. Victor's addition is welcome. The users will choose.

Matteo Merli announced Pixies 0.1, "a formatter that convert XSL-FO documents to PDF. It is written in Python and is particularly focused on the production of PDF files from DocBook documents." See the announcement.

Adam Souzis announced Rx4RDF and Rhizome 0.4.1. Rx4RDF can be used for querying, transforming, and updating RDF by specifying a "deterministic mapping" of the RDF model to the XML data model defined by XPath. Rhizome is a Wiki-like content management and delivery system built on Rx4RDF. Updates address indexing, performance, security, and bugs (especially with regard to character handling). See the announcement

Walter Dörwald announced XIST 2.6. "XIST is an extensible HTML/XML generator written in Python. XIST is also a DOM parser (built on top of SAX2) with a very simple and Pythonesque tree API." This release includes a very interesting addition: "a new API named XFind has been added for iterating through XML trees. XFind expressions look somewhat like XPath expressions but are pure Python expressions." See the announcement.

Also of interest are a few articles, one old and one new. "Taking Advantage of COM with Python," by Mike Owens appeared in last month's issue of Py. In it, he uses MSXML processing as an example of COM coding. I also just rediscovered Andrew Cooke's "Practical Python, XML and DOM," an older article, but one that is still well worth a skim. Finally, Fredrik Lundh wrote a piece about using ElementTree to parse XML gradually, as it arrives in a stream.

Thanks to Daily Python-URL, as ever, for some of these links and references.