An Introduction to Streaming Transformations for XML
February 26, 2003
This article introduces Streaming Transformations for XML (STX), a template-based XML transformation language that operates on streams of SAX events. STX resembles XSLT 1.0, the tree-driven transformation language for XML, but STX offers unique features and advantages for some applications.
XSLT's popularity has grown over the past three years, both aiding and riding on the adoption of XML. In comparison to API-level programming with Document Object Model (DOM), XSLT provides a loosely-typed, declarative environment tailored for tree-oriented transformation of XML documents, which has achieved wide adoption as a general-purpose XML manipulation tool despite the proscription:
...XSLT is not intended as a completely general-purpose XML transformation language. Rather it is designed primarily for the kinds of transformations that are needed when XSLT is used as part of XSL. — XSLT 1.0
DOM provides an overall view of an XML document through tree traversal and manipulation. DOM is heavyweight, however, in that it typically imposes a memory footprint of around five times the size of the underlying XML text for simple documents. DOM also imposes a significant time overhead for creating the necessary objects.
SAX provides a sequential view of an XML document through a stream of events. SAX-based programs typically maintain some amount of state information that encapsulates already-received events, but SAX processing requires a negligible amount of memory (typically only the representation of the current event and the buffer for the parser).
SAX is the event-oriented sibling of the DOM API. (See the sidebar for a short discussion of DOM versus SAX.) STX provides a streaming analog for XSLT by adopting some of the now familiar concepts from XSLT (e.g., matching based on templates and an XPath 1.0-like expression language) but using SAX as the underlying interface to the XML document. In line with the proscription about XSLT, STX is neither a general purpose XML transformation language, nor is it an attempt to improve, extend, or replace XSLT.
Like SAX, STX is a completely free, grassroots effort by the XML community, initiated by Petr Cimprich; the specification and a mailing list are hosted on SourceForge. The current version of the STX specification contains a list of other contributors. There are currently two STX processor implementations:
This section illustrates some of the relative strengths of STX through an example. The scenario is one of a nursery that stocks a selection of trees and flowers and receives orders in a simple XML format, illustrated below. (The full text of the example order document is available in XML.) Every order consists of the name, an optional quantity which defaults to 1, and a price per unit quantity.
<?xml version="1.0"?> <order-list> <order> <name>Achillea Millefolium</name> <quantity>50</quantity> <price>2.93</price> </order> ... <order> <name>Camelia Japonica</name> <quantity>20</quantity> <price>39.90</price> </order> <order> <name>Ginkgo Biloba</name> <price>19.90</price> </order> ... </order-list>
The initial goal is to create a stylesheet that computes the total price (the sum over all items of the the quantity multiplied by the unit price). This goal could be accomplished with XSLT, using recursive invocation of named templates with parameter passing.
The output from the transformation is, as expected, an XHTML document containing the order formatted into a table and the total cost computed.
Running the Example with Joost
To execute the transformation with Joost under JDK 1.4 (the SAX parser, a version of Crimson, bundled with JDK 1.4 is less than perfect; Ã;†lfred2 or Xerces-J2 both make better choices for general use XML parsers), download binary distributions of Joost and log4j and then run:
java -cp joost.jar;log4j-1.2.7.jar net.sf.joost.Main plants.xml order-sum.stx
The process should complete almost instanteously. On a reasonable PC, Joost will process around 1Mb per second through the example stylesheet. This was estimated by timing the processing of a 250,000 line-item order.
Running the stylesheet with STX::XML
To execute the transformation with STX::XML under Perl, install the STX::XML module and the prerequisite Perl modules XML::SAX and XML::NamespaceSupport from CPAN. (The modules XML::SAX::Expat and XML::SAX::Writer are also recommended but not required.) Then execute the stxcmd.pl script (located in the STX::XML installation directory) as follows:
A quick glance over the example stylesheet should remind the reader of an XSLT stylesheet:
|STX stylesheets are well-formed XML and can thus be edited with any text or XML editor.|
|STX stylesheets consist of a mixture of instructions and declarations in the STX namespace, literal result elements, and content.|
Furthering the resemblance, the STX namespace (http://stx.sourceforge.net/2002/ns) shares many elements (by local name) with the XSLT 1.0 namespace (http://www.w3.org/1999/XSL/Transform), and these elements provide similar functions to their XSLT 1.0 counterparts:
|The transform element encloses the stylesheet. (STX does not support the "literal result element as stylesheet" syntax described in section 2.3 of the XSLT 1.0 specification.)|
|STX provides variables, both global and local. (STX and XSLT 1.0 treat variables differently, however.)|
|STX uses declarative, rule-based matching to drive the flow of stylesheet execution. (STX matching is different from XSLT 1.0 in some important ways; read on.)|
|STX has a value-of element to place the value of an expression on the output.|
|STX uses a loosely-typed expression syntax derived from [XPath2.0], so the basic syntax is similar to the version of XPath ([XPath1.0]) used in XSLT 1.0.|
STX also has choose, if, param, and with-param that behave similarly to their XSLT 1.0 counterparts.
The most important difference between XSLT and STX is the difference between XPath 1.0 nodes and SAX events. The fundamental, atomic unit of operation in XPath 1.0 is a node that exists in the context of its containing document and inter-node relationships (parents, siblings, children, and so on). In contrast, the fundamental unit of operation in SAX is an event that exists without any additional context.
Programming at the level of SAX events is non-trivial and typically involves tracking the context of an event, directing the event to an appropriate handler based on its context, and performing an operation. To this end, an STX processor adds a thin layer of context to a stream of SAX events to create a notion of node robust enough to support a useful subset of XPath expressions -- called "STXPath" -- without requiring the creation of a DOM or similar tree-like data structure. The available context for a node in STXPath includes:
Current node. The current event along with all of its intrinsic information is accessible. The STX processor takes care of combining sequential characters() events into a single node.
Ancestor stack. The ancestor stack consists of the ancestors of the current node (i.e., the current node plus any startElement() events that haven't yet had the matching endElement() occur) together with their contexts.
Position counter. The position counter tracks the index of the current node among its siblings with respect to node type and node name particles (local name, prefix, and qname). The position index is exposed via the position() function and.
Lookahead. The STX processor always reads one node ahead so that the value of an immediately subsequent characters() event (if any) is included in the context for an element and made available as the value of the node.
STX does not include an analog of the apply-templates instruction from XSLT, since an STX processor receives SAX events in document order as opposed to being able to look for nodes in an object model. STX provides several different process-* instructions to control the handling of subsequent SAX events (with an optional group="..." specification; see below for more about groups).
Figure 1. A schematic for process-*
One way to think about the process-* instructions is as valves that let a certain set of events pass through part of the stylesheet, returning control to the current template once those events have been consumed.
The process-children instruction causes the STX processor to receive and handle the child nodes, if any, of the current node, and then return.
The process-attributes instruction causes the STX processor to handle the attributes on the current node, if any.
The process-siblings instruction causes the STX processor to handle the siblings of the current node with optional while="..." and until="..." expressions that allow it to return before all of the siblings have been consumed. (Note that the children of the original node and any consumed siblings will not be back, since STX processors can only process any given node once.) Combined with the ability to create separate startElement() and endElement() events (below), this gives STX powerful capabilities to restructure XML content.
The process-self instruction causes the STX processor to re-handle the current node, possibly using a different template group.
In contrast to side-effect-free XSLT, variables in STX are mutable and can be freely reassigned at any point just like variables in programming languages like Java, C, Python, and Perl. The example stylesheet uses this to maintain the sum during the execution of the stylesheet and to re-populate the name, quantity, and price variables for each successive order.
The example STX stylesheet illustrates some of the similarities and differences between STX and XSLT 1.0, but STX has a host of additional features.
Discrete startElement() and endElement()generation. Because an STX processor operates synchronously, it can generate literal events to the output, and the start-element and end-element will create the corresponding events on the output. This feature is useful when adding structure to a flat document; for example, here is a stylesheet that groups the line items in the plants example XML document by the first letter of the plant name, and here is the HTML output.
Template groups. STX allows templates to be formally grouped together under group elements. Groups can expose or conceal contained templates from matching, and shadowing of group-level variables is controlled at the template level with the new-scope="..." attribute..
Event buffers. STX provides FIFO event buffers that can be declared in stylesheets or within groups and follow the same hierarchical rules as variables. Buffers are populated by wrapping a portion of a template in a result-buffer element. Once the buffer is filled, a call to process-buffer routes the events in the buffer back through the input, where they are processed as though they had occurred in the source document.
Buffers are useful for shuffling document fragments. (For example, Oliver Becker implemented a bubble sort algorithm in STX.)
We have demonstrated that STX is an approachable means for transforming streaming XML with two usable implementations and an active, community-driven development effort.