A Tour of 4Suite
Mike Olson and I began the 4Suite project in 1998 with the release of 4DOM, and it quickly picked up an XPath and XSLT implementation. It has grown to include Python implementations of many other XML technologies, and it now provides a large library of Python APIs for XML as well as an XML server and repository system. In this article and the next, I'll introduce just the basic Python library portion of 4Suite, which includes facilities for XML parsing (complementing PyXML), RELAX NG, XPath, XPatterns, XSLT, RDF, XUpdate and more. If you are unfamiliar with any of these technologies, see the resources section at the end where I provide relevant pointers. Finally, after reviewing 4Suite, I'll summarize events in the Python-XML world since the last article.
In the general case, the only prerequisite for 4Suite is Python 2.1 or more recent. PyXML is required only if you wish to parse XML in DTD validation mode, or if your Python install does not have pyexpat built in (many Python distributions do include it). If you need to install PyXML for these reasons, see this column's previous article.
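A quick way to find out whether your interpreter has pyexpat built in is simply to try importing it. The following is a minimal sketch; the helper name has_pyexpat is invented here for illustration and is not part of Python or 4Suite.

```python
def has_pyexpat():
    """Return True if the pyexpat extension module can be imported."""
    try:
        import pyexpat  # the C extension behind xml.parsers.expat
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    if has_pyexpat():
        print("pyexpat is available; PyXML is needed only for DTD validation")
    else:
        print("pyexpat is missing; install PyXML before using 4Suite")
```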
You can get 4Suite from the project download page or from
SourceForge. Get the latest 0.12.0 release. I highly recommend
it over the older 0.11.1, even though 0.12.0 is still in
testing. There has been a full redesign and many important changes
which, in effect, increase stability. Windows users can just
download and run the Windows executables. On other platforms (or
for Windows power users), building and installing 4Suite is a matter
of the standard distutils magic. After unpacking, change
to the generated directory and run python setup.py
install.
One useful option to the setup command is
--without-docs. By default, the 4Suite build generates
a large amount of documentation, and this can take a long time on
some machines. It may be convenient for you to download the
provided documentation packages separately and to use python
setup.py install --without-docs to speed things up. 4Suite
power users who install from CVS versions will find the opposite:
documentation is not built by default, and the
--with-docs option is needed to build it.
Parsing in 4Suite revolves around two protocols: readers and input
sources. Input sources, usually based on the class
Ft.Xml.InputSource.InputSource, are similar to input
source objects in Python/SAX or DOM Level 3 Load and Save. They
embody a stream of bytes that make up an XML document or the like,
encapsulating the base URI associated with the data and some parsing
preferences such as whether to process XIncludes. Reader objects
actually provide methods for the XML parsing and are usually based on
the classes Ft.Xml.Domlette.ValidatingReaderBase and
Ft.Xml.Domlette.NonvalidatingReaderBase. Most users only
need to worry about using singleton instances of these readers, which
are provided for convenience. Parsing XML is as simple as the
examples in listing 1, which parse XML obtained from a file, from a
Web server, and then from a simple string.
#NonvalidatingReader is a global singleton
from Ft.Xml.Domlette import NonvalidatingReader
#Parse XML from the Web...
doc = NonvalidatingReader.parseUri("http://xmlhack.com/read.php?item=1560")
#From the file system using an absolute path...
doc = NonvalidatingReader.parseUri("file:/tmp/spam.xml")
#From the file system, using a relative path...
doc = NonvalidatingReader.parseUri("file:spam.xml")
#from a string
doc = NonvalidatingReader.parseString(
"<spam xmlns:x='http://spam.com'>eggs</spam>",
"http://spam.com/base"
)
Notice the second parameter in the call to
parseString. This is a base URI to use for the string.
In 4Suite, the base URI of any source of XML is a very important
property. 4Suite uses it internally to manage the XML resources being
processed, so you should provide a sensible and unique base URI
for each XML source you parse, even for sources such as strings
and file-like objects, which might not have naturally associated
URIs. Remember that URIs are a superset of URLs. For most common
uses, using plain URLs, including file URLs, is perfectly good
enough. In the parseUri method call, the URI from
which the XML is parsed is naturally assumed as the base URI of the
resulting parsed XML. When using any other parsing method, you
should provide the URI explicitly, as in the example above. If you
wish to use DTD validation while parsing, replace the
NonvalidatingReader references in the example with
ValidatingReader.
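To see why the base URI matters, consider how a relative reference inside a document (an XInclude href, for instance) gets resolved: it is combined with the base URI of the document that contains it. The sketch below is only an illustration of that idea. It uses urljoin from the standard library on a current Python 3 interpreter rather than 4Suite's own resolver, and the helper name resolve and the relative references are invented for the example.

```python
from urllib.parse import urljoin

# The base URI passed as the second argument to parseString in listing 1
BASE = "http://spam.com/base"


def resolve(base_uri, relative_ref):
    """Resolve a reference found inside a document against its base URI."""
    return urljoin(base_uri, relative_ref)


print(resolve(BASE, "eggs.xml"))       # a sibling of "base"
print(resolve(BASE, "more/eggs.xml"))  # below the same directory
print(resolve(BASE, "/eggs.xml"))      # relative to the server root
```

Without a base URI there is nothing to resolve such references against, which is why a string or file-like object needs one supplied by hand.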
There are many options, elaborations, and nuances to the parsing
tools I've introduced here. You can configure almost all aspects of
the parsing. The doc object obtained from the various
parsing methods in listing 1 is a DOM node instance from either
the cDomlette or FtMinidom implementations. cDomlette
is a very fast and compact DOM written in C, and is the default on
platforms that support it; FtMinidom is an enhanced version of
Python's minidom. You can perform most DOM operations on either
type of node.
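As a rough illustration of the kind of DOM operations meant here, the following sketch walks the same small document used in listing 1. It uses Python's standard xml.dom.minidom as a stand-in, on a current Python 3 interpreter. That the basic DOM properties shown behave the same way on Domlette nodes is an assumption, not something this code checks.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

doc = parseString("<spam xmlns:x='http://spam.com'>eggs</spam>")

root = doc.documentElement  # the spam element
text = root.firstChild      # its only child, a text node

print(root.nodeName)                    # name of the element
print(text.nodeType == Node.TEXT_NODE)  # text nodes report TEXT_NODE
print(text.data)                        # character data of the text node
```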
If DTDs don't suit your needs, 4Suite provides another option: RELAX NG. 4Suite incorporates Eric van der Vlist's XVIF implementation, which is basically RELAX NG with some very useful extensions. RELAX NG validation is not built into the default readers, but it is easy enough to do as a separate step, as shown in listing 2.
#RELAX NG schema file
RNG = """<?xml version='1.0' encoding='UTF-8'?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
<start>
<element name="memo">
<element name="title">
<text/>
</element>
<element name="date">
<attribute name="form">
<text/>
</attribute>
<text/>
</element>
<element name="to">
<text/>
</element>
<element name="body">
<text/>
</element>
</element>
</start>
</grammar>
"""
#Instance document
DOC = """<?xml version='1.0' encoding='UTF-8'?>
<memo>
<title>With Usura Hath no Man a House of Good Stone</title>
<date form="ISO-8601">1936-04-03</date>
<to>The Art World</to>
<body>
It has come to our attention that the basis for art production
Has shifted from keen patronage to vulgar commercial measure.
Management is concerned this will erode the lasting value of the age's works.
</body>
</memo>"""
from Ft.Xml.Xvif import RelaxNgValidator
from Ft.Xml import InputSource
factory = InputSource.DefaultFactory
rng_isrc = factory.fromString(RNG, "file:example2.rng")
xml_isrc = factory.fromString(DOC, "file:example2.xml")
validator = RelaxNgValidator(rng_isrc)
result = validator.isValid(xml_isrc)
if result:
    print "Valid"
else:
    print "Invalid"
The RELAX NG APIs, like many in 4Suite, take input source objects
-- though they usually have convenience APIs to pass in strings,
URIs, or even prepared DOM nodes. Rather than use a reader object
directly to parse the XML strings, I create input sources based on
each. I do so using an input source factory, which has methods for
generating input sources from string, URI, and so on. The
Ft.Xml.Xvif.RelaxNgValidator class represents a RELAX
NG schema, which is read from the input source given in the
initializer. The validator can then be used to validate any number
of XML instance documents, in this case using the
isValid method. If you want more detail than a
yes-or-no to validity, you can use the validate method,
which returns a special object with some validation details.
Andrew Kuchling also has a partial RELAX NG implementation for
Python. It's in the PyXML project's CVS repository but is not
distributed with the PyXML package yet. It supports less of the RELAX
NG standard than XVIF, but it is still useful. If you want to try it,
grab the sandbox module of PyXML using the following
commands, or their equivalent in your CVS environment of choice:
cvs -d:pserver:anonymous@cvs.pyxml.sourceforge.net:/cvsroot/pyxml login
cvs -z3 -d:pserver:anonymous@cvs.pyxml.sourceforge.net:/cvsroot/pyxml co sandbox
Look in the directory sandbox/relaxng. It is not
clear right now whether the two RELAX NG implementations will ever
merge, or whether they will continue to develop separately as mutual
alternatives.
XPath is everywhere. It's established itself as the workhorse of
XML processing. The XPath engine is one of the parts of 4Suite that
has had the most development and exercise. Much of it is
implemented in C for performance's sake, and this is one of the key
differences between the XPath library in current 4Suite and that in
PyXML, which is based on an older release of 4XPath, and is almost
entirely in Python. The easiest way to use the XPath library is
through the functions in Ft.Xml.XPath. Listing 3
defines a function for extracting the title from any given XHTML 1.0
file, using XPath.
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Compile, Evaluate
from Ft.Xml.Domlette import NonvalidatingReader
XHTML_NS = "http://www.w3.org/1999/xhtml"
#compile the XPath for retrieving XHTML titles
TITLE_EXPR = Compile("string(/h:html/h:head/h:title)")
def extract_xhtml_title(uri):
    """Extract the title from the XHTML document at the given URI"""
    doc = NonvalidatingReader.parseUri(uri)
    #set up the context with the XHTML document node
    #and namespace mapping from the "h" prefix to the XHTML URI
    context = Context(doc, processorNss={"h": XHTML_NS})
    #Compute the XPath against the context
    title = TITLE_EXPR.evaluate(context)
    return title
The Context class is a very important one. During
XPath processing, it maintains a lot of state information, including
the context items defined in the XPath spec. The most important
item in the context is the context node, which I set to the document
node of the XHTML file. In this case, I also use the context to
hold the namespace mapping from the "h" prefix, which I use in the
expression, to the XHTML namespace. At the global level, I compile the XPath object,
which is similar to compiling a regular expression using
re.compile(). The result is a parsed XPath object
which has an evaluate method taking a plain node object
or a full context object. The return value is a Python equivalent
of one of the four XPath data types. Strings are returned as Python
Unicode objects, numbers as Python floats, booleans as instances of
a special boolean class, and node sets as Python lists of node
objects. The XPath expression above returns a string, which is
directly returned to the caller as the requested title.
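If you want to experiment with the same idea without 4Suite installed, here is a comparable sketch using only the standard library on a current Python 3 interpreter. It is a stand-in, not the 4Suite API: ElementTree supports only a subset of XPath, its paths are relative to the root element rather than the document node, and this version takes the XHTML as a string instead of a URI. The sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Prefix-to-namespace mapping, playing the role of processorNss in listing 3
NSS = {"h": XHTML_NS}


def extract_xhtml_title(xhtml_text):
    """Return the title of an XHTML document given as a string."""
    root = ET.fromstring(xhtml_text)  # root is the html element
    # The path is relative to html; "" is returned if there is no title
    return root.findtext("h:head/h:title", default="", namespaces=NSS)


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Spam and eggs</title></head>
  <body><p>Lovely spam</p></body>
</html>"""

print(extract_xhtml_title(SAMPLE))
```

As with string() in XPath, a document with no title yields an empty string rather than an error.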
XSLT defines XPattern, a variation on XPath which is used to declare rules for matching patterns in the XML source against which to fire XSLT templates. The XPattern implementation that 4Suite's XSLT library uses is also exposed as a library of its own. XPatterns are useful when the task is not so much to compute arbitrary information from a given node but, rather, to choose quickly from a collection of nodes the ones that meet some basic rules. This might seem a subtle difference. The following example might help illustrate it.
A typical XPath task: get the class attribute from all the child elements of the context node. A typical XPattern task: from a collection of nodes, pick out those that have a class attribute and those that have a title child.
The main API for XPattern processing in 4Suite is
Ft.Xml.Xslt.PatternList. Listing 4 is a code snippet
that takes a node and returns a list of patterns it matches.
from Ft.Xml.Xslt import PatternList
from Ft.Xml.Domlette import NonvalidatingReader
#first pattern matches elements with a class attribute
#the second matches elements with a title child
PATTERNS = ["*[@class]", "*[title]"]
#Second parameter is a dictionary of prefix to namespace mappings
plist = PatternList(PATTERNS, {})
DOC = """<spam>
<e1 class="1"/>
<e2><title>A</title></e2>
<e3 class="2"><title>B</title></e3>
</spam>
"""
doc = NonvalidatingReader.parseString(DOC, "file:example4.xml")
for node in doc.documentElement.childNodes:
    #Don't forget that the white space text nodes before and after
    #e1, e2 and e3 elements are also child nodes of the spam element
    if node.nodeName[0] == "e":
        print plist.lookup(node)
The PatternList initializer takes my list of strings, which it
conveniently converts to a list of compiled XPattern objects. Such
objects have a match method that returns a boolean
value, but I don't use these methods directly in this example. The
PatternList initializer also takes a dictionary that makes up the
namespace mapping. In this example, we use no namespaces, so the
dictionary is empty. The lookup method is applied to a
selection of the children of the spam element (all the
nodes whose name starts with "e", which happens to be all the
element nodes). The output of listing 4 follows:
[*[attribute::class]]
[*[child::title]]
[*[attribute::class], *[child::title]]
The output is a list of the representations of the pattern objects that matched each node. Notice how the axis abbreviations have been expanded in the pattern object representation.
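To make the idea of pattern lookup concrete, here is a sketch of the same matching logic written by hand against the standard library's minidom, on a current Python 3 interpreter. It does not parse XPattern syntax at all: each pattern string is paired with a Python function that stands in for it, and the names has_class_attribute, has_title_child and lookup are invented for the example. It is not how PatternList is implemented.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def has_class_attribute(element):
    """Hand-written stand-in for the pattern *[@class]."""
    return element.hasAttribute("class")


def has_title_child(element):
    """Hand-written stand-in for the pattern *[title]."""
    return any(
        child.nodeType == Node.ELEMENT_NODE and child.nodeName == "title"
        for child in element.childNodes
    )


# Each entry pairs a pattern's text with the test standing in for it
PATTERNS = [("*[@class]", has_class_attribute), ("*[title]", has_title_child)]


def lookup(node):
    """Return the text of every pattern that the given node matches."""
    if node.nodeType != Node.ELEMENT_NODE:
        return []  # "*" matches only elements, never white space text
    return [text for text, test in PATTERNS if test(node)]


DOC = """<spam>
<e1 class="1"/>
<e2><title>A</title></e2>
<e3 class="2"><title>B</title></e3>
</spam>
"""

doc = parseString(DOC)
for node in doc.documentElement.childNodes:
    if node.nodeType == Node.ELEMENT_NODE:
        print(node.nodeName, lookup(node))
```

The point of the arrangement is the direction of the question: rather than asking one node to compute something, you hand each node to a fixed set of rules and see which ones apply.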
Sometimes the built-in facilities of XPath and XPattern aren't quite enough to meet your processing needs. Luckily it's easy to extend the function of these libraries using XPath user extension functions, which are written in Python. I don't cover extension functions in this article, but the resources section has pointers to useful information if you need this facility.
Here is a brief rundown of new happenings relevant to Python-XML development, including significant software releases.
David Mertz announced the 1.0.4 release of gnosis XML tools. This package provides tools for converting Python objects to XML documents and vice versa, DTD to SQL conversions, and more.
Brian Quinlan announced Pyana 0.6.0. Pyana is a Python extension module for interfacing with the Xalan XSLT engine.
Eric van der Vlist announced XVIF 0.2.0. XVIF includes a full RELAX NG validator for Python and adds in an XML processing framework system Eric developed as a straw man for ISO DSDL. The new release adds a data typing framework and a partial WXS data types library. It also features improved internals and API.
Henry Thompson announced a new release of XSV, a Python implementation of W3C XML Schema (WXS) which also runs the W3C's on-line WXS validator service. This release features a major restructuring of the code.
Frank Tobin announced a lightweight Python module to help write out well-formed XML. xmlprinter is inspired by Perl's XML::Writer module.
Daniel Veillard announced the 1.0.21 release of libxslt, with improved Python bindings, among other things.
4Suite 0.12.0a3, the version I introduce in this article, has been released. Among many other changes and improvements, it includes the latest XVIF.
Resources
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.