XML.com: XML From the Inside Out

Appreciating Libxslt

August 03, 2005

The two best-known XSLT processors are probably the Apache project's Xalan (available in both Java and C++ versions) and the Java-based Saxon, written by XSLT 2.0 specification editor Michael Kay. If those are the only two XSLT processors you currently use, it's worth checking out Daniel Veillard's libxslt. Its origins in the GNOME project (shared with libxml2, the XML processor that it uses) give it a higher profile in the Linux world, but the Windows and Macintosh ports are easy to install and use. Libxml2 and libxslt don't need much memory, and being written in C makes them very fast. Libxslt also offers some neat features (some actually provided by libxml2, but available to the libxslt user) with no equivalent that I've seen in either of the two better-known processors. Both libraries provide interfaces in C, Python, Perl, and many other languages, making it easier to embed libxslt in other programs. I managed to write a short Python script that passes an XML input document through a series of XSLT stylesheets with no disk I/O between the initial reading of the source document and the final writing of the result document.

Before looking at the use of libxslt from an API, let's look at several features of the binary distribution of libxml2.

Parsing Ill-Formed HTML

Both the xmllint command-line libxml2 utility and the xsltproc command-line libxslt utility offer an -html switch that indicates that the input document is HTML and not necessarily well-formed XML. While xmllint outputs a well-formed version of the parsed HTML document, xsltproc lets you apply a stylesheet to the well-formed version that libxml2 creates for it under the covers. For example, the following command line (carriage return added for display here) applies my addids.xsl stylesheet, which adds unique IDs to block-level HTML elements, to the text version of Google's news page and saves the result in a file called gnews.xml:


xsltproc -html -o gnews.xml http://www.snee.com/xml/xslt/addids.xsl 
  http://news.google.com/news?ned=tus

Typical HTML from the public web, including this Google page, may have "features" that bring on warning and error messages from xsltproc. Still, anything that can grab such HTML, squeeze it into well-formed XML, and pass that along to a stylesheet can be an invaluable tool for web scraping. In a future column I'll look at using John Cowan's tagsoup processor, which is particularly forgiving about odd HTML, as part of an XSLT application.
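To give a feel for what "squeezing HTML into well-formed XML" involves, here is a toy sketch in Python. It is not libxml2's HTML parser or its recovery algorithm: it uses the html.parser and xml.etree modules from Python's standard library, and the class name, the wrapper element name, and the sample markup are all invented for illustration. It handles only three kinds of sloppiness (unclosed elements, end-tag-less elements such as br, and stray end tags).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take an end tag.
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}

class SoupToTree(HTMLParser):
    """Build a well-formed element tree from loosely written HTML."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")   # hypothetical wrapper element
        self.open_elements = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes such as <input disabled> get an empty value.
        attributes = {name: value or "" for name, value in attrs}
        element = ET.SubElement(self.open_elements[-1], tag, attributes)
        if tag not in VOID_TAGS:
            self.open_elements.append(element)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element, which also closes
        # anything left open inside it; stray end tags are ignored.
        for index in range(len(self.open_elements) - 1, 0, -1):
            if self.open_elements[index].tag == tag:
                del self.open_elements[index:]
                break

    def handle_data(self, data):
        parent = self.open_elements[-1]
        if len(parent):                       # text after a child element
            parent[-1].tail = (parent[-1].tail or "") + data
        else:                                 # text before any child
            parent.text = (parent.text or "") + data

def html_to_xml(html_text):
    """Return a well-formed XML serialization of the given HTML text."""
    parser = SoupToTree()
    parser.feed(html_text)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")

# Unclosed <i>, a <br> with no end tag, and a stray </div>.
messy = "<p>Headlines<br><b>World <i>news</b></p></div>"
print(html_to_xml(messy))
```

The printed result parses cleanly as XML, which is the property a stylesheet needs. A real recovery parser such as the one in libxml2 copes with far more than this sketch does.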

XInclude

The W3C XInclude specification provides a standardized way for an XML document to specify that all or part of another document should be included at some point. Libxml2 provides the best support for XInclude that I know of, including the ability to use the W3C XPointer specification to include a very granular subset of the included document.

Since libxml2 supports this, it's available to libxslt users. If your source document includes an xi:include element and you add -xinclude to the command line that invokes xsltproc, the document or subdocument specified by the xi:include element will be included in the source tree, and the stylesheet passed to xsltproc will act on the included content as well.

While this isn't an XSLT feature per se, its availability to libxslt users can simplify your stylesheets. To implement something like this before XInclude was available, I made up my own markup to indicate that one document should be inserted into another, and then wrote a template rule that called the document() function to make the insertion upon finding that markup. (See my earlier column Reading Multiple Input Documents for an introduction to this function.) When doing the same thing with XInclude, I'm using a W3C standard, I don't have to make up new markup, my stylesheet has less work to do, and I've got XPointer available to use as well. I've written in more detail about using XInclude and XPointer together with libxslt or libxml2 in a weblog posting.
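To make the inclusion mechanism concrete, here is a minimal sketch of XInclude expansion. It uses the ElementInclude module from Python's standard library rather than libxml2, so it covers only plain href inclusion and has no XPointer support. The document names, their content, and the in-memory loader are assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical included documents, held in memory so that the
# sketch needs no files on disk.
DOCUMENTS = {
    "chapter1.xml": "<chapter><title>Getting Started</title></chapter>",
}

# Hypothetical source document with one xi:include element.
BOOK = """<book xmlns:xi="http://www.w3.org/2001/XInclude">
  <title>Sample Book</title>
  <xi:include href="chapter1.xml"/>
</book>"""

def load(href, parse, encoding=None):
    """Resolve an xi:include href against the DOCUMENTS table."""
    text = DOCUMENTS[href]
    if parse == "xml":
        return ET.fromstring(text)   # included as a parsed subtree
    return text                      # parse="text": included as character data

def expand(xml_text):
    """Parse xml_text and replace each xi:include with what it points to."""
    root = ET.fromstring(xml_text)
    ElementInclude.include(root, loader=load)
    return root

book = expand(BOOK)
print(ET.tostring(book, encoding="unicode"))
```

After expansion, the chapter element sits where the xi:include element was. A stylesheet applied to the expanded tree sees one document, which is what adding -xinclude to the xsltproc command line arranges.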

Embedding Libxslt in Your Own Applications

Being an open source project written in C, libxslt is popular for use with embedded systems that can't spare enough memory for a Java runtime engine. Its footprint is even smaller than that of Xalan C++, and it gives greater control over memory if you don't have much to go around. The libxslt tutorial on XMLSoft's website is actually a tutorial on embedding it in C applications.

For using libxslt from other programming environments, a variety of interfaces are available, including Perl, PHP, Tcl/tk, Ada, and Ruby. I first tried out the Python interface while looking for a simple way to run an XSLT stylesheet against a source document, run another against the result document, and continue for a reasonably arbitrary number of stylesheets, without reading from or writing to the disk for intermediary versions of the document being pipelined through the series of stylesheets.

The following, which is based on the basic.py script shown at The XSLT C Library for GNOME: Python and Bindings, does this. After parsing the source document into the variable sourceDoc, a for loop reads in each stylesheet named on the command line, applies it to sourceDoc, and sets sourceDoc equal to the result of the transformation, in case the loop will continue to apply another stylesheet to sourceDoc. It's pretty brief: without the white space, comments, and if statement that checks for a lack of command-line arguments, it's only fourteen lines of code.


# xsltprocs.py: send an XML source document through a
# pipeline of multiple XSLT stylesheets.

import sys
import libxml2
import libxslt

if len(sys.argv) < 3:
    print("Pipeline an XML document through a series ")
    print("of XSLT stylesheets. Usage:\n")
    print("  xsltprocs.py infile.xml stylesheet1.xsl [stylesheet2.xsl...]")
    sys.exit(0)

sourceXMLFile = sys.argv[1]
sourceDoc = libxml2.parseFile(sourceXMLFile)

for xslFile in sys.argv[2:]:
    # Read in stylesheet.
    styleDoc = libxml2.parseFile(xslFile)
    style = libxslt.parseStylesheetDoc(styleDoc)
    # Apply stylesheet to sourceDoc, save in result.
    result = style.applyStylesheet(sourceDoc, None)
    # Free the stylesheet (which also frees styleDoc) and the
    # document just transformed; neither is needed again.
    style.freeStylesheet()
    sourceDoc.freeDoc()
    # Result becomes new sourceDoc in case we send it
    sourceDoc = result   # through another stylesheet.

print(result)

result.freeDoc()

I named it xsltprocs.py in honor of the libxslt xsltproc command-line binary tool mentioned earlier. If you wanted to apply a stylesheet named stageA.xsl to the document mySource.xml, then apply the stylesheet stageB.xsl to the result of the first transformation, and stageC.xsl to the result of the second transformation, with the final result stored in result.xml, the following command line would do it:


python xsltprocs.py mySource.xml stageA.xsl stageB.xsl stageC.xsl > result.xml

Who Uses Libxslt for Production Work?

    

According to Veillard, both the GNOME and KDE projects use DocBook for their documentation and libxslt to generate the HTML rendered by their help tools, and the latest releases of Apple Safari use libxml2 and libxslt. He also cites banking systems, online airline ticketing systems, and embedded systems as users of libxslt.

You may have libxslt available without even knowing it: it's the default XSLT engine in PHP5, and my host provider, like doubtless many other Linux-based host providers, has libxslt in a directory in the default path when I log in with shell access.

Which XSLT processor should you use for XSLT 1.0 processing? Considering that libxslt, Xalan, and Saxon-B are all free, there's no reason not to keep all of them around. For production applications, the host language of your own development is an important factor, and you'll want to test each processor with your data to see which has the best performance on your system. Libxslt has enough nice features that it could be a strong contender for the XSLT processor on your next application.


