Inside Sablotron: Virtual XML Documents

March 13, 2002

Petr Cimprich

Despite the growing popularity of streaming XML processing, many applications still need or prefer to hold an entire XML tree in memory for processing. The internal representation can either stick to the Document Object Model (DOM) or use any other convenient form. DOM-like optimized structures allow fast access to documents through the DOM API methods. On the other hand, a binary representation optimized for the DOM isn't well suited to other kinds of processing, such as XPath and XSLT. The reason is a mismatch between the DOM and XPath data models: the DOM's "everything-is-a-node" approach is inefficient for XPath and slows down the resolution of queries. This is why XPath and XSLT processors usually use their own internal representations rather than the DOM.

Whatever internal representation is used, one still needs a convenient interface to access it. The interface needn't be published, as it is typically used for internal purposes only; however, it's hard to imagine a maintainable and extensible XML processor implementation without a well-defined interface to its temporary storage. Besides the fast native interface optimized for specific purposes, the processor can also provide other, standard interfaces that allow access to documents once they've been parsed. This is true of Sablotron, an Open Source XML processor project I'm currently involved in. I use it here to illustrate the possibilities of XML processors, some of which are not deployed yet. But back to internals and interfaces: Sablotron uses its own in-memory objects and a set of interface functions optimized for XPath and XSLT, but parsed documents can also be accessed and modified via a subset of the standard DOM API.

Taking it virtual

We have so far considered interfaces as a way to deal with parsed trees stored in memory, but we can envision an interface to structures other than an in-memory tree. An XML processor can be made to register a handler that supplies user-defined implementations of the functions of its internal interface. This handler would then be recognized as an external analogue of a parsed XML document. We need not care about its real nature; as long as there is an interface defined via callback functions, we are able to process the virtual document in a manner similar to the internal ones. The processor design requires one more thin layer to be defined: each task must be expressed in terms of generalized interface functions, which call either internal functions for internal documents or the handler functions for external documents. Once the complete functionality of the processor is implemented on top of the generalized interface, we are able to involve data managed by external handlers in XPath, XSLT, XQuery, DOM operations, and so on. Even DOM access to virtual documents from XSLT extension functions is possible in principle.
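The thin layer described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not Sablotron code: the names Node, Handler, Processor, and select are hypothetical, and the "query" is only a chain of child-element name tests rather than real XPath. The point is that select is written once against the generalized functions name and children, which dispatch to internal objects or to the registered callbacks.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """Element node of an internal (parsed, in-memory) document."""
    name: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Handler:
    """User-registered callbacks; node handles are opaque to the processor."""
    get_name: Callable[[object], str]
    get_children: Callable[[object], list]


class Processor:
    def __init__(self, handler: Optional[Handler] = None):
        self.handler = handler

    # Generalized interface: every task is expressed through these two calls.
    def name(self, node) -> str:
        if isinstance(node, Node):            # internal document
            return node.name
        return self.handler.get_name(node)    # external (virtual) document

    def children(self, node) -> list:
        if isinstance(node, Node):
            return node.children
        return self.handler.get_children(node)

    def select(self, node, steps: List[str]) -> list:
        """Follow child steps by element name, e.g. ['book', 'title']."""
        current = [node]
        for step in steps:
            current = [c for n in current
                       for c in self.children(n) if self.name(c) == step]
        return current


# An internal document built from Node objects.
internal = Node("catalog", [Node("book", [Node("title")]),
                            Node("book", [Node("title")])])

# A virtual document: integer handles into a table the processor never sees.
TABLE = {0: ("catalog", [1, 2]), 1: ("book", [3]), 2: ("book", [4]),
         3: ("title", []), 4: ("title", [])}
handler = Handler(get_name=lambda h: TABLE[h][0],
                  get_children=lambda h: TABLE[h][1])

proc = Processor(handler)
internal_hits = proc.select(internal, ["book", "title"])
virtual_hits = proc.select(0, ["book", "title"])
```

The same select call serves both documents; only the two dispatching methods know which kind of node they were given.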

Architecture of Virtual XML Documents

Now we are able to use virtual XML documents wherever common in-memory parsed documents can be used. The question remains: what is it good for? What kind of handlers could be used and why? We can implement handlers working as interfaces to other DOM processors, which would allow access to documents parsed and cached by third-party software. This sounds interesting, but the only result of this experiment would be to slow down processing.

On the other hand, some pretty useful external handlers could be employed. Consider a handler providing access to XML documents stored in an RDBMS or a native XML database. We would be able to perform XPath queries on those documents or transform them directly in the database, without extracting whole documents. In the case of large XML documents, we can expect a significant acceleration of XPath queries and template-driven XSLT transformations. Moreover, we could work with persistent storage using standard XML technologies; good news for developers already familiar with these standards. This kind of handler would certainly be a useful feature, especially for XML-enabled RDBMSs.
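To make the database case concrete, here is a small sketch of such a handler, again in Python and not tied to any real processor API. It assumes a hypothetical table named nodes with one row per element (id, parent, name, text) in an in-memory SQLite database; the callback names children_named and get_text are likewise invented for illustration. Each step of the path becomes an indexed lookup, so only the rows on the path are ever read.

```python
import sqlite3

# Hypothetical storage layout: one row per element node.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes ("
             "id INTEGER PRIMARY KEY, parent INTEGER, name TEXT, text TEXT)")
conn.execute("CREATE INDEX nodes_by_parent ON nodes (parent, name)")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?)", [
    (1, None, "catalog", None),
    (2, 1, "book", None),
    (3, 2, "title", "XPath Basics"),
    (4, 1, "book", None),
    (5, 4, "title", "XSLT in Depth"),
    (6, 1, "magazine", None),
    (7, 6, "title", "XML Monthly"),
])


def children_named(node_id, name):
    """Callback: ids of the child elements of node_id with the given name."""
    rows = conn.execute(
        "SELECT id FROM nodes WHERE parent = ? AND name = ? ORDER BY id",
        (node_id, name))
    return [row[0] for row in rows]


def get_text(node_id):
    """Callback: text content stored for a node."""
    return conn.execute("SELECT text FROM nodes WHERE id = ?",
                        (node_id,)).fetchone()[0]


def select(node_id, steps):
    """Follow child steps by name; the document is never materialized."""
    current = [node_id]
    for step in steps:
        current = [c for n in current for c in children_named(n, step)]
    return current


# Roughly the path /catalog/book/title, starting at the root element (id 1).
titles = [get_text(n) for n in select(1, ["book", "title"])]
```

Pushing the name test into the SQL statement is a design choice of this sketch: it lets the database index do the filtering instead of fetching every child and testing it in the processor.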

Virtual XML documents generated dynamically from multiple sources or stored in several files are another field of interest for our approach. External handlers also make it possible to process documents too large to be stored in memory; nodes aren't accessed before they are needed (if at all). Implementing convenient handlers isn't always trivial, but it often pays off in the long run. The benefits of XPath, XSLT, DOM, and possibly other ways of dealing with arbitrary XML tree representations are well worth implementing a few callback functions.
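The "not accessed before needed" property can be sketched as follows. This Python example is illustrative only: SOURCES stands in for several files, the virtual root element report and the class Part are hypothetical, and the sketch assumes each part's name equals the root element name of its source so that the name test can be answered without parsing.

```python
import xml.etree.ElementTree as ET

# Stand-ins for several separate files that together form one virtual document.
SOURCES = {
    "sales": "<sales><total>120</total></sales>",
    "staff": "<staff><person>Ann</person><person>Bob</person></staff>",
}
loaded = []  # records which sources were actually parsed


class Part:
    """One source of the virtual document, parsed on first access only."""

    def __init__(self, key):
        self.key = key
        self._root = None

    def root(self):
        if self._root is None:
            loaded.append(self.key)
            self._root = ET.fromstring(SOURCES[self.key])
        return self._root


ROOT = object()                       # handle for the virtual root element
PARTS = [Part(key) for key in SOURCES]


def get_name(node):
    if node is ROOT:
        return "report"
    if isinstance(node, Part):
        return node.key               # assumed equal to the source's root tag
    return node.tag


def get_children(node):
    if node is ROOT:
        return PARTS
    if isinstance(node, Part):
        return list(node.root())      # parsing happens here, on demand
    return list(node)


def select(node, steps):
    """Follow child steps by element name through the callbacks."""
    current = [node]
    for step in steps:
        current = [c for n in current
                   for c in get_children(n) if get_name(c) == step]
    return current


totals = [e.text for e in select(ROOT, ["sales", "total"])]
```

After this query, loaded contains only "sales"; the staff source was never parsed because no step of the path reached into it.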

Working with Sablotron, we have experimented with external handlers to see whether the approach really works. Sablotron enables users to register a handler, a set of callback functions, and to evaluate XPath queries on virtual documents accessed via those functions. A low-level interface implemented in C makes it possible to define callbacks, a query context (namespace declarations and variable bindings), and the current node for the query. This feature (called SXP, Sablotron XPath Processor) is well tested and can be used in production systems. The processor core is also ready for the support of external documents to be extended to XSLT. However, since the DOM interface to Sablotron isn't implemented using the generalized interface, a DOM API for external documents isn't currently available.
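The three ingredients named above (callbacks, a query context, and a current node) can be pictured with the following sketch. It is emphatically not the SXP C API: Callbacks, QueryContext, and evaluate are hypothetical names, the code is Python rather than C, and the path syntax is limited to child steps with optional prefixes and an optional leading variable reference.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Callbacks:
    """Handler functions supplied by the user (hypothetical set)."""
    children: Callable[[object], list]
    local_name: Callable[[object], str]
    namespace_uri: Callable[[object], str]


@dataclass
class QueryContext:
    """Namespace declarations and variable bindings for a query."""
    namespaces: Dict[str, str] = field(default_factory=dict)
    variables: Dict[str, list] = field(default_factory=dict)


def evaluate(path, current, ctx, cb):
    """Evaluate child steps such as 'b:book/b:title' or '$picked/b:title'."""
    steps = path.split("/")
    if steps[0].startswith("$"):
        # Start from the node-set bound to the variable.
        nodes = list(ctx.variables[steps[0][1:]])
        steps = steps[1:]
    else:
        # Start from the current node of the query.
        nodes = [current]
    for step in steps:
        prefix, _, local = step.rpartition(":")
        uri = ctx.namespaces[prefix] if prefix else ""
        nodes = [c for n in nodes for c in cb.children(n)
                 if cb.local_name(c) == local and cb.namespace_uri(c) == uri]
    return nodes


# Virtual document: handle -> (namespace URI, local name, child handles).
DOC = {0: ("urn:books", "catalog", [1, 2]),
       1: ("urn:books", "book", [3]),
       2: ("urn:other", "book", [4]),
       3: ("urn:books", "title", []),
       4: ("urn:books", "title", [])}

cb = Callbacks(children=lambda n: DOC[n][2],
               local_name=lambda n: DOC[n][1],
               namespace_uri=lambda n: DOC[n][0])
ctx = QueryContext(namespaces={"b": "urn:books"}, variables={"picked": [2]})

from_current = evaluate("b:book/b:title", 0, ctx, cb)
from_variable = evaluate("$picked/b:title", 0, ctx, cb)
```

The prefix b is resolved through the context, so the book element in the other namespace (handle 2) is skipped by the first query and reached only through the variable binding in the second.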

In summary, since the interface that serves as the base for XPath querying and XSLT transformations can be replaced with user-defined callback functions, external handlers can be used to pass an arbitrary XML representation to XPath/XSLT directly. What this approach promises is a notable increase in speed and decrease in memory consumption compared to building whole documents. If you would like to experiment with this, I invite you to try out Sablotron. I'm not currently aware of any other XML processor supporting external handlers; information on a similar effort, or on your experiences with XPath/XSLT/DOM via callbacks, is welcome.