XML.com: XML From the Inside Out
oreilly.comSafari Bookshelf.Conferences.


Decomposition, Process, Recomposition

Decomposition, Process, Recomposition

July 28, 2004

One unfortunate side-effect of the success of XML is that some users tend to just throw all their familiar data-processing tasks into XML shapes without pausing much to consider the strengths and weaknesses of XML.

The most common effect of this lack of attentiveness often leads people to very dangerous practices such as acting as if XML is an ASCII or even byte-oriented format. I have warned sharply about the ugly effects of such confusion in a few previous articles in this column.

Another common effect I've seen is the tendency to create multi-megabyte or even gigabyte monolithic XML files. XML is so flexible for data representation because of its nature as an annotated hierarchy. But this very nature also makes efficient processing quite difficult, especially with regards to scaling according to number of nodes.

So my first line of advice has always been: don't go processing gigabyte files in XML formats. I have been working with XML for about 8 years. I have used XML in numerous ways with numerous tools for numerous purposes. I have never come across a natural need for an XML file more than a few megabytes. There may be terabytes of raw data in a system, but if you're following XML (and classic data design) best practices, this should be structured into a number of files in some sensible way.

If you find yourself with a monster XML file -- perhaps you receive it that way from another source outside your control -- the best way to handle it is through the classic problem-solving technique, divide and conquer. Decompose the problem into smaller bits, solve each little bit to create a part solution, then combine the partial solutions into the overall solution.

Certainly this requires the problem to have certain mathematical properties, but in almost every case of monster XML files I've seen, an elegant decomposition is available. In this article I present some tools and techniques for processing large XML files by decomposition.

DOM in Bite-size Bits

I'll pretend Listing 1 is the gigantic file I have to cut down to size in order to process.

Listing 1: Sample XML file (labels.xml) containing Address Labels
<?xml version="1.0" encoding="iso-8859-1"?>
  <label added="2003-06-20">
      <!-- Mixed content -->
      <emph>Midwinter Spring</emph> is its own season&#8230;
    <name>Thomas Eliot</name>
      <street>3 Prufrock Lane</street>
  <label added="2003-06-10">
    <name>Ezra Pound</name>
      <street>45 Usura Place</street>


To get into the spirit of the problem, imagine that rather than the two label elements, this file had a billion. For many tasks you might have in mind with such a huge file, you could simply use SAX. Say you wanted to print out a listing of all the names in the file. It's not hard to do this in SAX since all you need to worry about at any point is whether you happen to be visiting character data within a name element.

If you have an operation in mind that needs more awareness of context or state, then it's time to sharpen up your state machine chops, which, even for SAX gurus, eventually becomes a painful task. Node trees such as DOM are nice in cases where random node access makes managing state and context a cinch, but loading the entire billion-record file into memory to perform node operations is probably beyond the capabilities of all but the highest-end computers, even if, like cDomlette, it's an implementation that uses a variety of tricks to lower the memory footprint.

You might be able to make some headway by using a technique such as the one I discussed in Building Dictionaries With SAX, but perhaps the best of both worlds would be a tool that uses SAX to build little DOMs, so that you can process each, discard each, and move on to the next record.

In my current example, it would be great to have a SAX tool that creates a DOM for each label element in turn. This way you have the full capability of DOM available for processing each label, but you never use much more than the memory needed for a DOM comprising a single label. If you had such a tool, then you could very efficiently solve problems that can be broken down to a subproblem for each label. Such a tool is available in libxml's xmlTextReader (see Using libxml in Python), but libxml's implementation can be very idiosyncratic. Besides, some people dislike installing large third-party packages.

I happened to write a SAX tool for such a purpose that works with the built-in Python SAX and Minidom facilities and sticks to the standard library Python SAX and SOM idioms. See Listing 2 for sax2dom_chunker.py, a module that uses SAX to create little chunks of DOM according to a set of element paths you give it.

Listing 2: Creates DOM Nodes using SAX

A SAX handler that takes a set of element paths and
creates a series of DOM chunks matching the element paths
for individual processing.

Designed for Python 2.2. or greater

Copyright 2004 Fourthought Inc, USA.
This work is licensed under Creative Commons Attribution 1.0
For details: http://creativecommons.org/licenses/by/1.0/

from xml import sax
import xml.dom.minidom

DUMMY_DOCELEM = u'dummy'
TOP = -1

class _state_machine:
    A simple state machine specialized for DOM chunking from SAX
    A state is "live" when it represents the successful completion
    of a path.
    This is generally a signal to the handler using this state machine
    to start creating the DOM fragment from the subset of SAX
    events until we transit to a non-live state
    def __init__(self, trim_to_paths):
        if not trim_to_paths:
            self.event = self.event_nop
            self.is_live = self.is_live_nop
        self._state_table = {START_STATE: {}}
        self._live_states = []
        #Use the given trim paths to generate a state table
        newest_state = START_STATE
        for path in trim_to_paths:
            last_state = START_STATE
            for segment in path:
                start_event = (1, segment[0], segment[1])
                end_event = (0, segment[0], segment[1])
                if self._state_table[last_state].has_key(start_event):
                    top_state = \
                    newest_state += 1
                    top_state = newest_state
                    self._state_table[top_state] = {}
                self._state_table[last_state][start_event] = \
                self._state_table[top_state][end_event] = \
                last_state = top_state
        self._state = START_STATE
        self.chunk_completed = 0

    def event(self, is_start, ns, local):
        Register an event and effect ant state transitions
        found in the state table
        #We only have a chunk ready for the handler in
        #the explicit case below
        self.chunk_completed = 0
        lookup_from = self._state_table[self._state]
        if lookup_from.has_key((is_start, ns, local)):
             new_state = lookup_from[(is_start, ns, local)]
             #If we have completed a chunk, we set a flag for
             #The chunker
             if (self._state in self._live_states and
                 new_state not in self._live_states):
                 self.chunk_completed = 1
             self._state = new_state
        return self._state

    def is_live(self):
        1 if the curent state is considered live, else 0
        return self._state in self._live_states

    def event_nop(self, is_start, ns, local):

    def is_live_nop(self):
        return 1

class sax2dom_chunker(sax.ContentHandler):
    Note: ignores nodes prior to the document element, such as PIs and
    text nodes This filter is only designed to work if you set
    features sax.handler.feature_namespaces and
    sax.handler.feature_namespace_prefixes to 1 on the parser you use.
    It will not work on drivers that do not support these features.
    The default drv_expat works fine in this case, but of course has
    but very limited DTD processing.  It also collapses CDATA sections
    into plain text

    trim_to_paths - a list of lists of tuples.  Each tuple is of
        the form (namespace, local-name), providing one segment
        in a path of contained elements
          [ (None, u'monty'), (None, u'python') ],
          [ (None, u'monty'), (None, u'spam'), ('urn:dummy', u'eggs') ]
        If None (the default, a DOM node will be created representing
        the entire tree.

    chunk_consumer - a callable object taking a DOM node.  It will be
        invoked as each DOM chunk is prepared.
    domimpl - DOM implemention to build, e.g. mindom (the default)
        or cDomlette or pxdom (if you have the right third-party
        packages installed).
    owner_doc - for advanced uses, if you want to use an existing
        DOM document object as the owner of all created nodes.
    def __init__(self,
        self._impl = domimpl
        if owner_doc:
            self._owner_doc = owner_doc
            dt = self._impl.createDocumentType(DUMMY_DOCELEM, None, '')
            self._owner_doc = self._impl.createDocument(
                DUMMY_DOCELEM, DUMMY_DOCELEM, dt)
        #Create a docfrag to hold all the generated nodes.
        root_node = self._owner_doc.createDocumentFragment()
        self._nodeStack = [ root_node ]
        self.state_machine = _state_machine(trim_to_paths)
        self._chunk_consumer = chunk_consumer

    def get_root_node(self):
        Only useful if the user does not register trim paths
        If so, then after SAX processing the user can call this
        method to retrieve resulting DOm representing the entire
        return self._nodeStack[0]

    #Overridden DocumentHandler methods
    def startElementNS(self, name, qname, attribs):
        self.state_machine.event(1, name[0], name[1])
        if not self.state_machine.is_live():
        (ns, local) = name
        new_element = self._owner_doc.createElementNS(ns, qname)

        for ((attr_ns, lname), value) in attribs.items():
            if attr_ns is not None:
                attr_qname = attribs.getQNameByName((attr_ns, lname))
                attr_qname = lname
            attr = self._owner_doc.createAttributeNS(
                attr_ns, attr_qname)
            attr_qname = attribs.getQNameByName((attr_ns, lname))
            attr.value = value


    def endElementNS(self, name, qname):
        self.state_machine.event(0, name[0], name[1])
        if not self.state_machine.is_live():
            if (self._chunk_consumer and
                #Complete the element being closed because it
                #Is the last bit of a DOM to be fed to the consumer
                new_element = self._nodeStack[TOP]
                del self._nodeStack[TOP]
                #Feed the consumer
                #Start all over with a new doc frag so the old
                #One's memory can be reclaimed
                root_node = self._owner_doc.createDocumentFragment()
                self._nodeStack = [ root_node ]
        new_element = self._nodeStack[TOP]
        del self._nodeStack[TOP]

    def processingInstruction(self, target, data):
        if self.state_machine.is_live():
            pi = self._owner_doc.createProcessingInstruction(
                target, data)

    #Overridden LexicalHandler methods
    def comment(self, text):
        if self.state_machine.is_live():
            new_comment = self._owner_doc.createComment(text)

    def characters(self, chars):
        if self.state_machine.is_live():
            new_text = self._owner_doc.createTextNode(chars)

The idea here is that you provide a source XML file and a list of element paths, each of which defines subsets of the document that defines the boundaries of chunking. For example, to process labels, you'd give the chunker a list [ (None, u'labels'), (None, u'label') ] in order to generate DOM chunks representing each label element within the top-level labels element. You would also give the chunker a callback function (or any callable object) that would be invoked with each DOM chunk as it is generated.

Class _state_machine is a specialized-state machine class for the chunker used to encapsulate the complex state logic and keep the SAX handler simple. The sax2dom_chunker class is a SAX handler that operates on an XML document and can either feed the resulting DOM chunks to a consumer routine or build and hold on to a node representing the entire document. You can use any SAX filters you please with it as a way of increasing the sophistication of the processing.

As an example, I have not included code to consolidate multiple contiguous text events into single ones, so the resulting DOM could be de-normalized. It also doesn't strip whitespace between elements, which you might decide you don't care about and don't want in your DOM. In either case you can use a filter upstream from the chunker. For the first case, I covered such a filter in Building Dictionaries With SAX.

Pages: 1, 2

Next Pagearrow