
Decomposition, Process, Recomposition
One unfortunate side-effect of the success of XML is that some users tend to just throw all their familiar data-processing tasks into XML shapes without pausing much to consider the strengths and weaknesses of XML.
This lack of attentiveness most often leads people to dangerous practices, such as treating XML as if it were an ASCII or even a byte-oriented format. I have warned sharply about the ugly effects of such confusion in a few previous articles in this column.
Another common effect I've seen is the tendency to create multi-megabyte or even gigabyte monolithic XML files. XML is so flexible for data representation because of its nature as an annotated hierarchy. But this very nature also makes efficient processing quite difficult, especially with regard to scaling with the number of nodes.
So my first line of advice has always been: don't go processing gigabyte files in XML formats. I have been working with XML for about 8 years. I have used XML in numerous ways with numerous tools for numerous purposes. I have never come across a natural need for an XML file larger than a few megabytes. There may be terabytes of raw data in a system, but if you're following XML (and classic data design) best practices, this data should be structured into a number of files in some sensible way.
If you find yourself with a monster XML file -- perhaps you receive it that way from another source outside your control -- the best way to handle it is through the classic problem-solving technique, divide and conquer. Decompose the problem into smaller bits, solve each little bit to create a partial solution, then combine the partial solutions into the overall solution.
Certainly this requires the problem to have certain mathematical properties, but in almost every case of monster XML files I've seen, an elegant decomposition is available. In this article I present some tools and techniques for processing large XML files by decomposition.
DOM in Bite-size Bits
I'll pretend Listing 1 is the gigantic file I have to cut down to size in order to process.
Listing 1: Sample XML file (labels.xml) containing Address Labels
<?xml version="1.0" encoding="iso-8859-1"?>
<labels>
  <label added="2003-06-20">
    <quote>
      <!-- Mixed content -->
      <emph>Midwinter Spring</emph> is its own season…
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label added="2003-06-10">
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>
To get into the spirit of the problem, imagine that rather than
the two label elements, this file had a billion. For
many tasks you might have in mind with such a huge file, you could
simply use SAX. Say you wanted to print out a listing of all the
names in the file. It's not hard to do this in SAX since all you
need to worry about at any point is whether you happen to be
visiting character data within a name element.
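For instance, a minimal sketch of such a SAX handler might look like the following; the class name and the driver stanza at the bottom are my own illustrative choices, not anything prescribed by SAX.

from xml import sax

class name_printer(sax.ContentHandler):
    """
    Print the text of each name element as it is encountered
    """
    def __init__(self):
        self._in_name = 0    #Are we currently within a name element?
        self._text = []
        return

    def startElement(self, name, attrs):
        if name == u'name':
            self._in_name = 1
            self._text = []
        return

    def characters(self, content):
        #Character data for one element may arrive in several events
        if self._in_name:
            self._text.append(content)
        return

    def endElement(self, name):
        if name == u'name':
            print u''.join(self._text)
            self._in_name = 0
        return

if __name__ == '__main__':
    sax.parse('labels.xml', name_printer())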
If you have an operation in mind that needs more awareness of context or state, then it's time to sharpen up your state-machine chops, which, even for SAX gurus, eventually becomes a painful task. Node trees such as DOM are nice because random node access makes managing state and context a cinch, but loading the entire billion-record file into memory to perform node operations is probably beyond the capabilities of all but the highest-end computers, even with an implementation such as cDomlette that uses a variety of tricks to lower the memory footprint.
You might be able to make some headway by using a technique such as the one I discussed in Building Dictionaries With SAX, but perhaps the best of both worlds would be a tool that uses SAX to build little DOMs, so that you can process each, discard each, and move on to the next record.
In my current example, it would be great to have a SAX tool that
creates a DOM for each label element in turn. This
way you have the full capability of DOM available for processing
each label, but you never use much more than the memory needed for
a DOM comprising a single label. If you had such a tool, then you
could very efficiently solve problems that can be broken down to a
subproblem for each label. Such a tool is available in libxml's
xmlTextReader (see Using libxml in Python), but libxml's implementation can be very
idiosyncratic. Besides, some people dislike installing large
third-party packages.
I happened to write a SAX tool for such a purpose that works with
the built-in Python SAX and Minidom facilities and sticks to the
standard library Python SAX and DOM idioms. See Listing 2 for
sax2dom_chunker.py, a module that uses SAX to create little chunks
of DOM according to a set of element paths you give it.
"""
sax2dom_chunker.py
A SAX handler that takes a set of element paths and
creates a series of DOM chunks matching the element paths
for individual processing.
Designed for Python 2.2. or greater
Copyright 2004 Fourthought Inc, USA.
This work is licensed under Creative Commons Attribution 1.0
For details: http://creativecommons.org/licenses/by/1.0/
"""
from xml import sax
from xml.dom import XML_NAMESPACE, XMLNS_NAMESPACE, EMPTY_NAMESPACE
import xml.dom.minidom
DUMMY_DOCELEM = u'dummy'
START_STATE = 0
TOP = -1
class _state_machine:
    """
    A simple state machine specialized for DOM chunking from SAX
    A state is "live" when it represents the successful completion
    of a path.
    This is generally a signal to the handler using this state machine
    to start creating the DOM fragment from the subset of SAX
    events until we transit to a non-live state
    """
    def __init__(self, trim_to_paths):
        if not trim_to_paths:
            self.event = self.event_nop
            self.is_live = self.is_live_nop
            return
        self._state_table = {START_STATE: {}}
        self._live_states = []
        #Use the given trim paths to generate a state table
        newest_state = START_STATE
        for path in trim_to_paths:
            last_state = START_STATE
            for segment in path:
                start_event = (1, segment[0], segment[1])
                end_event = (0, segment[0], segment[1])
                if self._state_table[last_state].has_key(start_event):
                    top_state = \
                        self._state_table[last_state][start_event]
                else:
                    newest_state += 1
                    top_state = newest_state
                    self._state_table[top_state] = {}
                self._state_table[last_state][start_event] = \
                    top_state
                self._state_table[top_state][end_event] = \
                    last_state
                last_state = top_state
            self._live_states.append(top_state)
        self._state = START_STATE
        self.chunk_completed = 0
        return

    def event(self, is_start, ns, local):
        """
        Register an event and effect any state transitions
        found in the state table
        """
        #We only have a chunk ready for the handler in
        #the explicit case below
        self.chunk_completed = 0
        lookup_from = self._state_table[self._state]
        if lookup_from.has_key((is_start, ns, local)):
            new_state = lookup_from[(is_start, ns, local)]
            #If we have completed a chunk, we set a flag for
            #the chunker
            if (self._state in self._live_states and
                new_state not in self._live_states):
                self.chunk_completed = 1
            self._state = new_state
        return self._state

    def is_live(self):
        """
        1 if the current state is considered live, else 0
        """
        return self._state in self._live_states

    def event_nop(self, is_start, ns, local):
        pass

    def is_live_nop(self):
        return 1
class sax2dom_chunker(sax.ContentHandler):
    """
    Note: ignores nodes prior to the document element, such as PIs and
    text nodes.  This filter is only designed to work if you set
    features sax.handler.feature_namespaces and
    sax.handler.feature_namespace_prefixes to 1 on the parser you use.
    It will not work on drivers that do not support these features.
    The default drv_expat works fine in this case, but of course has
    only very limited DTD processing.  It also collapses CDATA sections
    into plain text.
    trim_to_paths - a list of lists of tuples.  Each tuple is of
        the form (namespace, local-name), providing one segment
        in a path of contained elements, e.g.
        [
            [ (None, u'monty'), (None, u'python') ],
            [ (None, u'monty'), (None, u'spam'), ('urn:dummy', u'eggs') ]
        ]
        If None (the default), a DOM node will be created representing
        the entire tree.
    chunk_consumer - a callable object taking a DOM node.  It will be
        invoked as each DOM chunk is prepared.
    domimpl - DOM implementation to build, e.g. minidom (the default)
        or cDomlette or pxdom (if you have the right third-party
        packages installed).
    owner_doc - for advanced uses, if you want to use an existing
        DOM document object as the owner of all created nodes.
    """
    def __init__(self,
                 trim_to_paths=None,
                 chunk_consumer=None,
                 domimpl=xml.dom.minidom.getDOMImplementation(),
                 owner_doc=None
                 ):
        self._impl = domimpl
        if owner_doc:
            self._owner_doc = owner_doc
        else:
            dt = self._impl.createDocumentType(DUMMY_DOCELEM, None, '')
            self._owner_doc = self._impl.createDocument(
                DUMMY_DOCELEM, DUMMY_DOCELEM, dt)
        #Create a docfrag to hold all the generated nodes.
        root_node = self._owner_doc.createDocumentFragment()
        self._nodeStack = [ root_node ]
        self.state_machine = _state_machine(trim_to_paths)
        self._chunk_consumer = chunk_consumer
        return

    def get_root_node(self):
        """
        Only useful if the user does not register trim paths.
        If so, then after SAX processing the user can call this
        method to retrieve the resulting DOM representing the entire
        document
        """
        return self._nodeStack[0]

    #Overridden ContentHandler methods
    def startElementNS(self, name, qname, attribs):
        self.state_machine.event(1, name[0], name[1])
        if not self.state_machine.is_live():
            return
        (ns, local) = name
        new_element = self._owner_doc.createElementNS(ns, qname)
        for ((attr_ns, lname), value) in attribs.items():
            if attr_ns is not None:
                attr_qname = attribs.getQNameByName((attr_ns, lname))
            else:
                attr_qname = lname
            attr = self._owner_doc.createAttributeNS(
                attr_ns, attr_qname)
            attr.value = value
            new_element.setAttributeNodeNS(attr)
        self._nodeStack.append(new_element)
        return

    def endElementNS(self, name, qname):
        self.state_machine.event(0, name[0], name[1])
        if not self.state_machine.is_live():
            if (self._chunk_consumer and
                self.state_machine.chunk_completed):
                #Complete the element being closed because it
                #is the last bit of a DOM to be fed to the consumer
                new_element = self._nodeStack[TOP]
                del self._nodeStack[TOP]
                self._nodeStack[TOP].appendChild(new_element)
                #Feed the consumer
                self._chunk_consumer(self._nodeStack[0])
                #Start all over with a new doc frag so the old
                #one's memory can be reclaimed
                root_node = self._owner_doc.createDocumentFragment()
                self._nodeStack = [ root_node ]
            return
        new_element = self._nodeStack[TOP]
        del self._nodeStack[TOP]
        self._nodeStack[TOP].appendChild(new_element)
        return

    def processingInstruction(self, target, data):
        if self.state_machine.is_live():
            pi = self._owner_doc.createProcessingInstruction(
                target, data)
            self._nodeStack[TOP].appendChild(pi)
        return

    #Overridden LexicalHandler methods
    def comment(self, text):
        if self.state_machine.is_live():
            new_comment = self._owner_doc.createComment(text)
            self._nodeStack[TOP].appendChild(new_comment)
        return

    def characters(self, chars):
        if self.state_machine.is_live():
            new_text = self._owner_doc.createTextNode(chars)
            self._nodeStack[TOP].appendChild(new_text)
        return
The idea here is that you provide a source XML file and a list of element paths, each of which defines a subset of the document and so marks the boundaries of chunking. For example, to process labels, you'd give the chunker the list [ [ (None, u'labels'), (None, u'label') ] ] in order to generate DOM chunks representing each label element within the top-level labels element. You would also give the chunker a callback function (or any callable object) that would be invoked with each DOM chunk as it is generated.
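To make the calling convention concrete, here is a rough sketch of a driver script. The print_label consumer and the use of labels.xml are illustrative assumptions on my part; the two parser features are set because the module's docstring requires them.

from xml import sax
from sax2dom_chunker import sax2dom_chunker

def print_label(docfrag):
    #Each chunk arrives as a document fragment whose first child
    #is one complete label element
    label = docfrag.firstChild
    for name in label.getElementsByTagNameNS(None, u'name'):
        print name.firstChild.data
    return

chunker = sax2dom_chunker(
    trim_to_paths=[ [ (None, u'labels'), (None, u'label') ] ],
    chunk_consumer=print_label
    )

parser = sax.make_parser()
#The chunker requires both of these features, per its docstring
parser.setFeature(sax.handler.feature_namespaces, 1)
parser.setFeature(sax.handler.feature_namespace_prefixes, 1)
parser.setContentHandler(chunker)
parser.parse('labels.xml')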
Class _state_machine is a specialized state-machine class for the chunker, used to encapsulate the complex state logic and keep the SAX handler simple. The sax2dom_chunker
class is a SAX handler that operates on an XML document and can
either feed the resulting DOM chunks to a consumer routine or build
and hold on to a node representing the entire document. You can use
any SAX filters you please with it as a way of increasing the
sophistication of the processing.
As an example, I have not included code to consolidate multiple contiguous text events into single ones, so the resulting DOM could be de-normalized. It also doesn't strip whitespace between elements, which you might decide you don't care about and don't want in your DOM. In either case you can use a filter upstream from the chunker. For the first case, I covered such a filter in Building Dictionaries With SAX.
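For the second case, a sketch of such an upstream filter might look like the following; the class name is my own, and note that this crude version discards all whitespace-only text events, including any that might be significant within mixed content.

from xml.sax import saxutils

class strip_ws_filter(saxutils.XMLFilterBase):
    """
    A SAX filter that discards character events consisting solely
    of whitespace, so the downstream handler never sees
    whitespace-only text
    """
    def characters(self, content):
        if content.strip():
            saxutils.XMLFilterBase.characters(self, content)
        return

You would chain it between the parser and the chunker -- for example, ws_filter = strip_ws_filter(sax.make_parser()), then set the chunker as the filter's content handler and call parse on the filter rather than on the parser directly.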