The Collected Works of SAX
July 18, 2001
Now more than three years old, SAX (Simple API for XML) is the oldest and most stable XML API in widespread use today. Yet despite its obvious utility, it can be quite daunting to programmers making their first foray into manipulating XML documents. It's no surprise, then, that many appear to prefer using the DOM API in their early coding efforts, despite its many quirks and additional overhead.
A likely reason is that most tutorials introduce XML as a hierarchical data structure, which makes the DOM tree structure conceptually easier to understand initially. This is true even for Java programmers who might be expected to be more comfortable with event-oriented architectures, given their prevalence in Java APIs, with Swing being the obvious instance.
Another factor that leads developers to DOM (or variants like JDOM and dom4j), despite SAX's efficiency, is the additional programming effort required to develop a SAX application. Such effort includes writing appropriate callback handlers, employing a state machine, and so on. In contrast, building a DOM is simple, and manipulating it is relatively simple too. So any effort to reduce some of SAX's additional overhead should be well received.
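To make that effort concrete, here is a minimal sketch of the kind of handler a SAX application has to write. The class name TitleCollector and the "title" element are hypothetical, chosen only for illustration; the code assumes a JAXP-capable Java runtime. Even this small task needs three callbacks, a text buffer, and a flag that tracks where in the document the parser currently is.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: collect the text of every <title> element.
public class TitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inTitle = false; // the (one-bit) state machine

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) {
            inTitle = true;
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one run of text in several calls, so buffer it.
        if (inTitle) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            inTitle = false;
            titles.add(text.toString());
        }
    }

    public List<String> getTitles() {
        return titles;
    }

    // Convenience wrapper: parse a string and return the collected titles.
    public static List<String> parse(String xml) throws Exception {
        TitleCollector handler = new TitleCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getTitles();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<books><book><title>SAX &amp; DOM</title></book></books>"));
    }
}
```

The equivalent DOM code would be a single lookup of the "title" elements on an in-memory tree, which is the convenience gap described above.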
In a recent thread on XML-DEV, David Brownell invited developers to share their SAX "pet peeves", with the hope that a list of improvements might be compiled for a new backwards-compatible release of the API. Starting the discussion with a few feature and property suggestions of his own, Brownell also observed that a library of SAX utility code might be a useful addition.
I'm also tempted to put "no utility library" on that list. Stuff like an XML writer gets used by most folk that focus on SAX, and something to efficiently test whether characters are legal parts of XML (1.0 :) names. And there's lots of other stuff which, were it more generally available, would make it easier for folk to leverage SAX.
People responded to Brownell's posting with other proposals for improving SAX. Rob Lugt's proposal included allowing for automatic concatenation of character content and an alternate way of handling namespace mappings. Both of these would remove some burden from SAX applications and might be implemented as SAX filters or otherwise layered over the API.
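As an illustration of how character concatenation could be layered over the API as a filter, the following is a rough sketch and not Lugt's actual proposal. The class name CoalescingFilter and the helper textChunks are hypothetical; XMLFilterImpl is the standard SAX2 helper class. The filter buffers successive characters() calls and passes them downstream as one call just before the next element, processing instruction, or end-of-document event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: merges adjacent characters() events into one.
public class CoalescingFilter extends XMLFilterImpl {
    private final StringBuilder pending = new StringBuilder();

    public CoalescingFilter(XMLReader parent) {
        super(parent);
    }

    // Forward any buffered text downstream as a single characters() call.
    private void flush() throws SAXException {
        if (pending.length() > 0) {
            char[] chars = pending.toString().toCharArray();
            pending.setLength(0);
            super.characters(chars, 0, chars.length);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        pending.append(ch, start, length); // hold back until the next non-text event
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        flush();
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        flush();
        super.endElement(uri, localName, qName);
    }

    @Override
    public void processingInstruction(String target, String data) throws SAXException {
        flush();
        super.processingInstruction(target, data);
    }

    @Override
    public void endDocument() throws SAXException {
        flush();
        super.endDocument();
    }

    // Demo helper: returns one string per characters() call seen downstream.
    public static List<String> textChunks(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CoalescingFilter filter = new CoalescingFilter(parser);
        List<String> chunks = new ArrayList<>();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                chunks.add(new String(ch, start, length));
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return chunks;
    }
}
```

With a filter like this in place, an application handler can treat each characters() call as a complete run of text instead of keeping its own buffer. Comments are not reported through ContentHandler, so text on either side of a comment is merged as well; whether that is desirable depends on the application.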
It would be nice to see a standardized interface for supplying an object responsible for locating resources associated with a namespace. This is badly needed for parsers that support schema validation. Currently, parsers I've looked at either rely upon the "schemaLocation" attribute, or employ entirely proprietary interfaces requiring the writing of non-portable code.
Jonathan Borden has already produced a prototype API for RDDL that includes a SAX filter, but, as Brennan subsequently noted, none of the available parsers provide modular support for this functionality.
Also high on the list of suggestions was support for XML Schemas. Jeff Rafter wanted to see support for the XML infoset and, layered above that, the Post-Schema-Validation-Infoset (PSVI).
- An interface which exposes Infoset items per the Infoset spec names... (This is basically what SAX does already...)
- An interface which layers onto the Infoset interface to expose PSVI -- this can be worked so that multiple schema languages can report their own PSVI...but obviously geared to XML Schemas. Something along this line would probably be very similar to a DTDHandler/DeclHandler and would be useful when someone wants to know the type of an attribute, for instance.
Generally speaking, SAX-filter-based schema validators would also be useful. David Brownell mentioned that he's already produced a DTD validating filter as part of the XML pipeline API included in the xmlconf project on Sourceforge.
Simon St. Laurent was also keen to see the formation of a standard library for SAX. He posted a list of useful suggestions.
...I'd love to see something along the lines of a SAX standard library. David Megginson's published an XMLWriter and a DataWriter and a fair number of SAX-based tools, but there are lots of possibilities which could be really useful.
- A router filter which sends events to different handlers based on namespace URI
- A suppression filter which obliterates markup (and perhaps content) from particular elements, attributes, or namespaces
- A division filter which sends the same event to multiple handlers, possibly even multiple threads
- More configurable Writer classes, including things like the RTF output handler for XT Eric van der Vlist announced today
- Reader classes for a few categories of non-XML input -- comma-separated might be a good place to start
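To give a flavor of what such components might look like, here is a minimal sketch of the first item, a router that dispatches events by namespace URI. It is an assumption-laden illustration, not code from St. Laurent or any existing library: the names NamespaceRouter, route, and elementsByNamespace are hypothetical, and events for unregistered namespaces are simply dropped.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical router: sends element and text events to a handler chosen by namespace URI.
public class NamespaceRouter extends DefaultHandler {
    private static final ContentHandler IGNORE = new DefaultHandler(); // no-op sink
    private final Map<String, ContentHandler> routes = new HashMap<>();
    private final Deque<ContentHandler> open = new ArrayDeque<>(); // handler per open element

    public void route(String namespaceUri, ContentHandler handler) {
        routes.put(namespaceUri, handler);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        ContentHandler target = routes.getOrDefault(uri, IGNORE);
        open.push(target);
        target.startElement(uri, localName, qName, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Text carries no namespace, so it follows the innermost open element.
        ContentHandler target = open.peek();
        if (target != null) {
            target.characters(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        open.pop().endElement(uri, localName, qName);
    }

    // Demo helper: record the local names routed to each registered namespace.
    public static Map<String, List<String>> elementsByNamespace(String xml, String... uris)
            throws Exception {
        NamespaceRouter router = new NamespaceRouter();
        Map<String, List<String>> seen = new LinkedHashMap<>();
        for (String uri : uris) {
            List<String> names = new ArrayList<>();
            seen.put(uri, names);
            router.route(uri, new DefaultHandler() {
                @Override
                public void startElement(String u, String local, String q, Attributes a) {
                    names.add(local);
                }
            });
        }
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed for namespace URIs and local names
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), router);
        return seen;
    }
}
```

A division filter could be built the same way by forwarding each event to every registered handler instead of choosing one, and a suppression filter by routing unwanted namespaces to the no-op handler.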
Forming a "collected works" (to borrow David Brownell's preferred term) of SAX-based utility code is certainly an interesting prospect. Despite SAX's maturity, there's no single collection of supporting pieces. Much of what is currently available is scattered across several open source projects under a number of different licenses. It's also likely that there is a lot of SAX1 code which could be dusted off and brought up to date with SAX2.
SAX has its own Sourceforge project now, so there is a central location at which donated code could accumulate and be managed. Because there are many applications built around SAX, it's likely that a collected works could be compiled very quickly if the development community is willing to make some small efforts, including generalizing existing code. The Deviant suggests that interested developers submit their own proposals or code to XML-DEV for review. The list has a history of organizing very quickly around practical projects, and this is one that could definitely have immediate benefits.