Decomposition, Process, Recomposition
by Uche Ogbuji
Using the DOM Chunker
Listing 3 is an example of how to use the chunker to process each label as an isolated DOM chunk. The goal is to print out all the cities in which someone named "Eliot" lives.
Listing 3: Uses the Chunker in Listing 2 to Process the Labels File Label by Label
"""
Print out all the cities in which someone named "Eliot" lives
"""
from xml import sax
import sax2dom_chunker

SEARCH_FOR_STRING = 'eliot'

def process_label(docfrag):
    #Invoked for each label
    label = docfrag.firstChild
    name = label.getElementsByTagNameNS(None, 'name')[0]
    city = label.getElementsByTagNameNS(None, 'city')[0]
    #For simplicity, pretend the above nodes are normalized, and have
    #no child elements
    name_text = name.firstChild.data
    city_text = city.firstChild.data
    if name_text.lower().find(SEARCH_FOR_STRING) != -1:
        print city_text.encode('utf-8')
    return

#The path for chunking by label element
#Equivalent to the XPattern /labels/label
LABEL_PATH = [ (None, u'labels'), (None, u'label') ]

#Create an instance of the chunker
handler = sax2dom_chunker.sax2dom_chunker(
    trim_to_paths=[LABEL_PATH],
    chunk_consumer=process_label
    )

parser = sax.make_parser()
#The chunker is the SAX handler
parser.setContentHandler(handler)
#The chunker *requires* these features
parser.setFeature(sax.handler.feature_namespaces, 1)
parser.setFeature(sax.handler.feature_namespace_prefixes, 1)
parser.parse('labels.xml')
The result of running python listing3.py labels.xml is simply "Stamford" printed to the console. Again, the neat trick that comes from using the chunker is that even though we take advantage of the relatively friendly DOM facility (it would be friendlier still if I were using XPaths, but that would once again require a third-party tool), the program does not take much more memory for a billion-label file than for a two-label file, because it only ever works on one label-sized chunk at a time.

Listing 4 is a similar script that shows how you can create chunks according to multiple paths.
"""
Print out all people and their street address
"""
from xml import sax
import sax2dom_chunker

#Paths for chunking by label element
#Equivalent to the XPattern
# ( /labels/label/name|/labels/label/address/street )
PATHS = [
    [ (None, u'labels'), (None, u'label'), (None, u'name') ],
    [ (None, u'labels'), (None, u'label'), (None, u'address'),
      (None, u'street') ]
    ]

def process_chunk(docfrag):
    #Invoked for each name or address. We're getting the leaf element
    #of the path itself (in a doc frag wrapper) so just print its text
    #content
    text = docfrag.firstChild.firstChild.data
    print text.encode('utf-8')
    return

#Create an instance of the chunker
handler = sax2dom_chunker.sax2dom_chunker(
    trim_to_paths=PATHS,
    chunk_consumer=process_chunk
    )

parser = sax.make_parser()
#The chunker is the SAX handler
parser.setContentHandler(handler)
#The chunker *requires* these features
parser.setFeature(sax.handler.feature_namespaces, 1)
parser.setFeature(sax.handler.feature_namespace_prefixes, 1)
parser.parse('labels.xml')
The output is in Listing 5.
Listing 5: Output of "python listing4.py labels.xml"
Thomas Eliot
3 Prufrock Lane
Ezra Pound
45 Usura Place
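Readers without Listing 2 at hand can still try the two consumers. What follows is a minimal sketch, not the sax2dom_chunker module itself: MiniChunker is a hypothetical stand-in that mimics only the trim_to_paths and chunk_consumer arguments, ignores attributes, and always builds its chunks with minidom. The sample document is a guess at the shape of labels.xml (the placement of city inside address and the second label's city are assumptions), and the code is written for Python 3, unlike the listings above. It turns on only feature_namespaces, which is all this stand-in needs.

```python
import io
from xml import sax
from xml.dom import minidom
from xml.sax.handler import feature_namespaces


class MiniChunker(sax.ContentHandler):
    """Hypothetical stand-in for the chunker: builds a small DOM only for
    elements whose path from the root is listed in trim_to_paths."""

    def __init__(self, trim_to_paths, chunk_consumer):
        sax.ContentHandler.__init__(self)
        self._paths = [list(path) for path in trim_to_paths]
        self._consumer = chunk_consumer
        self._stack = []   # (namespace, local name) of every open element
        self._node = None  # element being built, or None outside a chunk
        self._doc = minidom.getDOMImplementation().createDocument(
            None, None, None)

    def startElementNS(self, name, qname, attrs):
        self._stack.append(name)
        if self._node is not None:
            parent = self._node                          # already in a chunk
        elif self._stack in self._paths:
            parent = self._doc.createDocumentFragment()  # a new chunk starts
        else:
            return                                       # build nothing here
        # Attributes are ignored to keep the sketch short
        elem = self._doc.createElementNS(name[0], qname or name[1])
        parent.appendChild(elem)
        self._node = elem

    def characters(self, content):
        if self._node is not None:
            self._node.appendChild(self._doc.createTextNode(content))

    def endElementNS(self, name, qname):
        self._stack.pop()
        if self._node is None:
            return
        parent = self._node.parentNode
        if parent.nodeType == minidom.Node.DOCUMENT_FRAGMENT_NODE:
            self._consumer(parent)  # chunk complete: hand it over, drop it
            self._node = None
        else:
            self._node = parent


def run_chunker(xml_bytes, paths, consumer):
    parser = sax.make_parser()
    parser.setContentHandler(MiniChunker(paths, consumer))
    parser.setFeature(feature_namespaces, 1)
    parser.parse(io.BytesIO(xml_bytes))


# Assumed sample data; the real labels.xml appears earlier in the article
LABELS = b"""<labels>
  <label>
    <name>Thomas Eliot</name>
    <address><street>3 Prufrock Lane</street><city>Stamford</city></address>
  </label>
  <label>
    <name>Ezra Pound</name>
    <address><street>45 Usura Place</street><city>Hailey</city></address>
  </label>
</labels>"""

cities = []
def process_label(docfrag):
    # Same logic as Listing 3, collecting results instead of printing
    label = docfrag.firstChild
    name = label.getElementsByTagNameNS(None, 'name')[0]
    city = label.getElementsByTagNameNS(None, 'city')[0]
    if 'eliot' in name.firstChild.data.lower():
        cities.append(city.firstChild.data)

texts = []
def process_chunk(docfrag):
    # Same logic as Listing 4
    texts.append(docfrag.firstChild.firstChild.data)

run_chunker(LABELS, [[(None, 'labels'), (None, 'label')]], process_label)
run_chunker(LABELS, [
    [(None, 'labels'), (None, 'label'), (None, 'name')],
    [(None, 'labels'), (None, 'label'), (None, 'address'), (None, 'street')],
], process_chunk)
print(cities)
print(texts)
```

With this sample input, cities ends up holding the single city from Listing 3's run, and texts holds the four lines of Listing 5 in document order. Each fragment goes out of scope as soon as its consumer returns, which is the property that keeps memory use flat.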
If you have 4Suite installed, you can save even more memory (and gain some speed) by using the cDomlette implementation. You can also use Andrew Clover's pxdom (to gain W3C DOM conformance, but not performance). In fact, you can use any DOM that follows Python-library DOM conventions by passing its implementation object to the chunker. As an example, to use cDomlette you would replace the following snippet of code in Listing 3:
#Create an instance of the chunker
handler = sax2dom_chunker.sax2dom_chunker(
    trim_to_paths=[LABEL_PATH],
    chunk_consumer=process_label
    )
With the following snippet:
from Ft.Xml.Domlette import implementation

#Create an instance of the chunker
handler = sax2dom_chunker.sax2dom_chunker(
    domimpl=implementation,
    trim_to_paths=[LABEL_PATH],
    chunk_consumer=process_label
    )
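A swap like this is possible whenever the code that builds nodes asks only a DOM implementation object for them. The sketch below illustrates that contract with the standard library's minidom. build_chunk is a hypothetical helper, not part of sax2dom_chunker, and it rests on the assumption that the chunker limits itself to the standard factory methods (createDocument, createDocumentFragment, createElementNS, createTextNode). It is written for Python 3.

```python
from xml.dom import minidom


def build_chunk(domimpl, local_name, text):
    """Build a one-element document fragment using only standard DOM
    factory methods, so any conforming implementation can be passed in."""
    doc = domimpl.createDocument(None, None, None)
    frag = doc.createDocumentFragment()
    elem = doc.createElementNS(None, local_name)
    elem.appendChild(doc.createTextNode(text))
    frag.appendChild(elem)
    return frag


# minidom is used here only because it ships with Python
frag = build_chunk(minidom.getDOMImplementation(), 'city', 'Stamford')
print(frag.firstChild.localName, frag.firstChild.firstChild.data)
```

Any other implementation object exposing the same methods could be passed in place of minidom's, and a consumer such as process_label would not need to change, since it too uses only standard DOM calls.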
Wrap Up
This technique is much more generally applicable than the SAX dictionary generator I presented earlier. It could be made more general still if the chunker state processing could be adapted to use the full power of XPattern, so that you could process, say, the first 10 records in a billion-record file and not create DOM chunks for the rest (/labels/label[position() <= 10]).
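Until such a chunker exists, the "first ten records" case can be approximated by hand. The sketch below uses plain SAX rather than the chunker and does not evaluate XPattern at all: it hard-codes the path, counts label records as they close, and abandons the parse by raising an exception once it has enough. FirstNames, StopParsing, and first_names are hypothetical names, the generated test document is made up, and the code is written for Python 3.

```python
from xml import sax


class StopParsing(Exception):
    """Hypothetical signal raised to abandon the parse early."""


class FirstNames(sax.ContentHandler):
    """Collect the name of the first `limit` label records, then stop."""

    def __init__(self, limit):
        sax.ContentHandler.__init__(self)
        self.limit = limit
        self.names = []
        self._path = []   # names of the currently open elements
        self._text = []   # pieces of text seen inside the current name

    def startElement(self, name, attrs):
        self._path.append(name)

    def characters(self, content):
        if self._path == ['labels', 'label', 'name']:
            self._text.append(content)

    def endElement(self, name):
        if self._path == ['labels', 'label']:
            # A label record just closed: keep its name and check the quota
            self.names.append(''.join(self._text))
            self._text = []
            if len(self.names) == self.limit:
                raise StopParsing()
        self._path.pop()


def first_names(xml_bytes, limit):
    handler = FirstNames(limit)
    try:
        sax.parseString(xml_bytes, handler)
    except StopParsing:
        pass  # expected: the rest of the document is never read
    return handler.names


# Made-up input with 25 records
records = ''.join('<label><name>Person %d</name></label>' % i
                  for i in range(25))
doc = ('<labels>%s</labels>' % records).encode('utf-8')
print(first_names(doc, 10))
```

Raising from a handler is a blunt instrument: it works for "the first N" because nothing after the cutoff matters, but a predicate that selects records scattered through the file would need the handler to keep parsing and simply skip the unwanted ones.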
I intend to write a more powerful chunker along these lines, but it would use the XPattern parsing facilities in 4Suite. The chunker I presented in this article works without third-party packages and its element path implementation probably covers the 80% use-case for such decomposition.
In this month's news from the field, Adam Souzis announced Rx4RDF and Rhizome 0.3.0, an update to the RDF-based web application framework and Wiki toolkit. Changes include documentation and security improvements and Wiki feature enhancements. See the full announcement.
Walter Dörwald announced XIST 2.5. Billed as an "object-oriented XSLT", XIST uses an easily extensible, DOM-like view of source and target XML documents to generate HTML. "Every XML element type corresponds to a Python class, and these Python classes provide a conversion method to transform the XML tree (e.g., into HTML)." This release features some API improvements, schema validation (apparently not based on any portable schema specification), support for Holger Krekel's XPython (see below), bug fixes, and more. See the announcement.
Holger Krekel presented at EuroPython 2004 a concept he calls XPython, a new templating syntax for XML and HTML generation that would use extensions to Python syntax to embed templates closely into code. He currently has an experimental implementation consisting of a "300 line patch to C Python."
In other EuroPython 2004 news, Martijn Faassen is working on lxml -- "a sane Python wrapper for libxml." Currently available only in CVS, lxml addresses the concern that the official libxml bindings are not Pythonic (too close to the underlying C), are not Python Unicode aware (UTF-8 only), are unsafe (can cause core dumps), require manual memory management, and are poorly documented. lxml is written using Pyrex and for now supports only the most basic element tree APIs.