XML.com: XML From the Inside Out

A Tour of 4Suite

October 16, 2002

Mike Olson and I began the 4Suite project in 1998 with the release of 4DOM, and it quickly picked up an XPath and XSLT implementation. It has grown to include Python implementations of many other XML technologies, and it now provides a large library of Python APIs for XML as well as an XML server and repository system. In this article and the next, I'll introduce just the basic Python library portion of 4Suite, which includes facilities for XML parsing (complementing PyXML), RELAX NG, XPath, XPatterns, XSLT, RDF, XUpdate and more. If you are unfamiliar with any of these technologies, see the resources section at the end where I provide relevant pointers. Finally, after reviewing 4Suite, I'll summarize events in the Python-XML world since the last article.

Getting and installing 4Suite

In the general case, the only prerequisite for 4Suite is Python 2.1 or more recent. PyXML is required if you wish to parse XML in DTD validation mode, or if your Python install does not have pyexpat built in (many Python distributions do). If you need to install PyXML for these reasons, see this column's previous article.
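A quick way to find out whether your Python build includes the expat bindings is simply to try importing them. The snippet below is a minimal sketch that uses only the standard library; the helper name have_pyexpat is made up for this illustration and is not part of 4Suite or PyXML.

```python
def have_pyexpat():
    """Return True if the standard expat bindings can be imported."""
    try:
        # xml.parsers.expat is the standard-library wrapper around pyexpat
        import xml.parsers.expat
    except ImportError:
        return False
    return True


if have_pyexpat():
    print("pyexpat is built in; PyXML is only needed for DTD validation")
else:
    print("pyexpat is missing; install PyXML before installing 4Suite")
```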

You can get 4Suite from the project download page or from SourceForge. Get the latest 0.12.0 release. I highly recommend it over the older 0.11.1, even though 0.12.0 is still in testing. There has been a full redesign, along with many important changes that, in effect, increase stability. Windows users can just download and run the Windows executables. On other platforms (or for Windows power users), building and installing 4Suite is a matter of the standard distutils magic. After unpacking, change to the generated directory and run python setup.py install.

One useful option to the setup command is --without-docs. By default, the 4Suite build generates a large amount of documentation, and this can take a long time on some machines. It may be convenient for you to download the provided documentation packages separately and to use python setup.py install --without-docs to speed things up. 4Suite power users who install from CVS versions will find the opposite: documentation is not built by default, and the --with-docs option is needed to build it.

Basic parsing

Parsing in 4Suite revolves around two protocols: readers and input sources. Input sources, usually based on the class Ft.Xml.InputSource.InputSource, are similar to input source objects in Python/SAX or DOM Level 3 Load and Save. They embody a stream of bytes that make up an XML document or the like, encapsulating the base URI associated with the data and some parsing preferences such as whether to process XIncludes. Reader objects actually provide methods for the XML parsing and are usually based on the classes Ft.Xml.Domlette.ValidatingReaderBase and Ft.Xml.Domlette.NonvalidatingReaderBase. Most users only need to worry about using singleton instances of these readers, which are provided for convenience. Parsing XML is as simple as the examples in listing 1, which parse XML obtained from a file, from a Web server, and then from a simple string.

Listing 1: Several examples of XML parsing

#NonvalidatingReader is a global singleton
from Ft.Xml.Domlette import NonvalidatingReader
#Parse XML from the Web...
doc = NonvalidatingReader.parseUri("http://xmlhack.com/read.php?item=1560")
#From the file system using an absolute path...
doc = NonvalidatingReader.parseUri("file:/tmp/spam.xml")
#From the file system, using a relative path...
doc = NonvalidatingReader.parseUri("file:spam.xml")
#from a string
doc = NonvalidatingReader.parseString(
        "<spam xmlns:x='http://spam.com'>eggs</spam>",
        "http://spam.com/base"
)  

Notice the second parameter in the call to parseString. This is a base URI to use for the string. In 4Suite, the base URI of any source of XML is a very important property. It is used internally to manage the XML resources being processed, so you should provide a sensible and unique base URI for each XML source you parse, even for sources such as strings and file-like objects, which might not have naturally associated URIs. Remember that URIs are a superset of URLs. For most common uses, plain URLs, including file URLs, are good enough. In the parseUri method call, the URI from which the XML is parsed is naturally assumed to be the base URI of the resulting parsed XML. When using any other parsing method, you should provide the URI explicitly, as in the example above. If you wish to use DTD validation while parsing, replace the NonvalidatingReader references in the example with ValidatingReader.
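To see why a base URI matters, consider how a relative reference found inside a document gets turned into a full address. The sketch below shows the general resolution rule using urljoin from the Python 3 standard library. It is not 4Suite's own resolver, and the helper name resolve and the relative reference eggs.xml are assumptions made up for illustration; only the base URI is taken from listing 1.

```python
from urllib.parse import urljoin

# The base URI handed to parseString in listing 1
BASE_URI = "http://spam.com/base"


def resolve(relative_ref, base_uri=BASE_URI):
    """Resolve a (possibly relative) reference against a base URI."""
    return urljoin(base_uri, relative_ref)


# A relative reference replaces the last path segment of the base
print(resolve("eggs.xml"))

# An absolute reference ignores the base entirely
print(resolve("http://example.org/other.xml"))
```

Without a base URI, a reference such as eggs.xml has nothing to be resolved against, which is why a string or file-like source needs one supplied by hand.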

There are many options, elaborations, and nuances to the parsing tools I've introduced here. You can configure almost all aspects of the parsing. The doc object obtained from the various parsing methods in listing 1 is a DOM node instance from either the cDomlette or FtMinidom implementations. cDomlette is a very fast and compact DOM written in C, and is the default on platforms that support it; FtMinidom is an enhanced version of Python's minidom. You can perform most DOM operations on either type of node.
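As a rough idea of what such DOM operations look like, here is a sketch that uses the standard library's minidom as a stand-in, on the assumption that nodes returned by the 4Suite readers expose the same standard DOM properties. It parses the string from listing 1 and walks from the document node to the text inside the root element.

```python
from xml.dom import minidom

# Stand-in for NonvalidatingReader.parseString; minidom takes no base URI
doc = minidom.parseString("<spam xmlns:x='http://spam.com'>eggs</spam>")

# The document node's single element child is the root element
root = doc.documentElement
print(root.nodeName)         # spam

# The root element has one child, a text node holding the content
text_node = root.firstChild
print(text_node.data)        # eggs
```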

RELAX NG

If DTDs don't suit your needs, 4Suite provides another option: RELAX NG. 4Suite incorporates Eric van der Vlist's XVIF implementation, which is basically RELAX NG with some very useful extensions. RELAX NG validation is not built into the default readers, but it is easy enough to do as a separate step, as shown in listing 2.

Listing 2: Using RELAX NG

#RELAX NG schema file
RNG = """<?xml version='1.0' encoding='UTF-8'?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="memo">
      <element name="title">
        <text/>
      </element>
      <element name="date">
        <attribute name="form">
          <text/>
        </attribute>
        <text/>
      </element>
      <element name="to">
        <text/>
      </element>
      <element name="body">
        <text/>
      </element>
    </element>
  </start>
</grammar>
"""

#Instance document
DOC = """<?xml version='1.0' encoding='UTF-8'?>
<memo>
<title>With Usura Hath no Man a House of Good Stone</title>
<date form="ISO-8601">1936-04-03</date>
<to>The Art World</to>
<body>
It has come to our attention that the basis for art production
Has shifted from keen patronage to vulgar commercial measure.
Management is concerned this will erode the lasting value of the age's works.
</body>
</memo>"""


from Ft.Xml.Xvif import RelaxNgValidator
from Ft.Xml import InputSource
factory = InputSource.DefaultFactory
rng_isrc = factory.fromString(RNG, "file:example2.rng")
xml_isrc = factory.fromString(DOC, "file:example2.xml")

validator = RelaxNgValidator(rng_isrc)
result = validator.isValid(xml_isrc)
if result:
    print "Valid"
else:
    print "Invalid"

The RELAX NG APIs, like many in 4Suite, take input source objects, though they usually have convenience APIs to pass in strings, URIs, or even prepared DOM nodes. Rather than use a reader object directly to parse the XML strings, I create input sources based on each. I do so using an input source factory, which has methods for generating input sources from a string, a URI, and so on. The Ft.Xml.Xvif.RelaxNgValidator class represents a RELAX NG schema, which is read from the input source given in the initializer. The validator can then be used to validate any number of XML instance documents, in this case using the isValid method. If you want more detail than a yes-or-no to validity, you can use the validate method, which returns a special object with some validation details.
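To make concrete what the schema in listing 2 asks of a document, the sketch below spells out the same constraints by hand using ElementTree from the standard library. This is not RELAX NG validation and has nothing to do with how XVIF works internally; the function name looks_like_memo is hypothetical, and the check only approximates the schema (for instance, it does not reject stray text between the child elements).

```python
import xml.etree.ElementTree as ET


def looks_like_memo(xml_text):
    """Hand-written approximation of the memo schema in listing 2."""
    root = ET.fromstring(xml_text)
    # The start pattern is a single memo element with no attributes
    if root.tag != "memo" or root.attrib:
        return False
    # Its children must be exactly these elements, in this order
    children = list(root)
    if [child.tag for child in children] != ["title", "date", "to", "body"]:
        return False
    for child in children:
        # Each child holds text only, so it may not contain elements
        if len(child) != 0:
            return False
        # Only date carries an attribute, and it must be form
        expected = {"form"} if child.tag == "date" else set()
        if set(child.attrib) != expected:
            return False
    return True


good = ("<memo><title>t</title><date form='ISO-8601'>1936-04-03</date>"
        "<to>The Art World</to><body>b</body></memo>")
bad = "<memo><title>t</title><date>1936-04-03</date><to>x</to><body>b</body></memo>"
print(looks_like_memo(good), looks_like_memo(bad))
```

The point of a schema language is that you write the grammar once, declaratively, instead of maintaining checks like these in code.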

A RELAX NG alternative

Andrew Kuchling also has a partial RELAX NG implementation for Python. It's in the PyXML project's CVS repository but is not distributed with the PyXML package yet. It supports less of the RELAX NG standard than XVIF, but it is still useful. If you want to try it, grab the sandbox module of PyXML using the following commands, or their equivalent in your CVS environment of choice:

cvs -d:pserver:anonymous@cvs.pyxml.sourceforge.net:/cvsroot/pyxml login
cvs -z3 -d:pserver:anonymous@cvs.pyxml.sourceforge.net:/cvsroot/pyxml co sandbox  

Look in the directory sandbox/relaxng. It is not clear right now whether the two RELAX NG implementations will ever merge, or whether they will continue to develop separately as mutual alternatives.
