XML Namespace Processing in Apache

December 15, 2004

The Apache 2 filter architecture serves to transform Apache from a mere web server into a powerful application platform. Applications that previously required a dedicated backend, typically Java-based, can now easily be implemented within the web server itself, with very substantial improvements in system performance.

Figure 1: Apache 2 introduces a new Data axis, enabling a new range of powerful applications.

A few XML applications have become very well established, most obviously XSLT transformation. But the biggest advantages are to be gained by making full use of Apache's data pipelining and eliminating all network latency arising from the application. This is possible with a SAX-based parser that supports a parseChunk-style push API. Such an API is provided by XML libraries including expat and libxml2, which are also the fastest XML processors available anywhere according to the xmlbench results. This enables lean markup-processing applications to be built with Apache. The overhead for processing XML (or indeed HTML) with SAX is low--at least comparable to server-side include processing--and doesn't escalate as document size grows. This makes it an attractive alternative to XSLT and DOM-based techniques, as well as to expensive add-ons (like PHP) or back ends (like Tomcat/Java-based systems) where scalability is a concern.
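To make the push model concrete, here is a minimal sketch in plain C. It is a toy stand-in, not expat or libxml2: the toy_push_parser type and its functions are hypothetical names invented for this illustration. The point it shows is that the caller feeds data chunk by chunk as it arrives, and the parser keeps just enough state to cope with markup split across chunk boundaries, so memory use does not grow with document size.

```c
#include <stddef.h>

/* Hypothetical toy type: parser state that survives between chunks. */
typedef struct {
    int pending_lt;   /* 1 if the last character seen was '<' */
    int start_tags;   /* number of start tags counted so far */
} toy_push_parser;

static void toy_init(toy_push_parser *p) {
    p->pending_lt = 0;
    p->start_tags = 0;
}

/* Feed one chunk of input. Chunks may end anywhere, even between '<'
 * and the tag name. Limitation of the toy: it does not skip comments
 * or CDATA sections, so a '<' inside them is treated as markup. */
static void toy_parse_chunk(toy_push_parser *p, const char *buf, size_t len) {
    size_t i;
    for (i = 0; i < len; ++i) {
        char c = buf[i];
        if (p->pending_lt) {
            /* "</", "<!" and "<?" open end tags, declarations and PIs;
             * anything else after '<' is counted as a start tag. */
            if (c != '/' && c != '!' && c != '?')
                p->start_tags++;
            p->pending_lt = 0;
        } else if (c == '<') {
            p->pending_lt = 1;
        }
    }
}
```

A filter built this way would call the feed function once per block of data it receives, rather than buffering the whole document before parsing.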

The downside of this approach is that it is a relatively low-level API for building applications, requiring knowledge both of the Apache API and of SAX processing--a less common skillset than Java/XML, XSLT, or PHP. Hence, there are few working applications taking advantage of this approach. Probably the best-known is my mod_proxy_html, which rewrites URLs into a proxy's address space and is an essential component of a reverse proxy. This has proven, unexpectedly, to be my most popular module and is being incorporated into an increasing number of OS distributions like FreeBSD, Gentoo, and Debian.

The Apache Namespace API

Figure 2.

Figure 2 shows a namespace filter. It dispatches events to handlers keyed on namespace. A module may implement a namespace handler and hook it into the namespace filter.

The Apache Namespace Framework offers a simpler, higher-level API. The XML namespace module mod_xmlns implements XML namespace support in Apache, using a lean, fully pipelined SAX2 parser built on expat. The API is designed specifically for namespace-based processing and offers two main advantages. First, the task of implementing a namespace handler is reduced to implementing a small number of SAX-like handlers. Second, it provides a modular framework in which different namespace processors can be introduced on a mix-and-match basis, without reference to each other.
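The dispatch idea can be sketched in a few lines of plain C. This is an illustration only, not the mod_xmlns implementation: the toy_ns_handler, toy_ns_registry, and toy_* function names are hypothetical, and the handler signatures are simplified. It shows a registry keyed on namespace URI, with events for unregistered namespaces left for the caller to pass through.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified handler table (the real xmlns struct differs). */
typedef struct {
    int (*start_element)(void *ctx, const char *elt);
    int (*end_element)(void *ctx, const char *elt);
} toy_ns_handler;

typedef struct {
    const char *uri;                 /* namespace URI used as the key */
    const toy_ns_handler *handler;
} toy_ns_entry;

#define TOY_MAX_NS 8

typedef struct {
    toy_ns_entry entries[TOY_MAX_NS];
    size_t count;
} toy_ns_registry;

/* Register a handler for a namespace; returns -1 if the table is full. */
static int toy_register(toy_ns_registry *r, const char *uri,
                        const toy_ns_handler *h) {
    if (r->count >= TOY_MAX_NS)
        return -1;
    r->entries[r->count].uri = uri;
    r->entries[r->count].handler = h;
    r->count++;
    return 0;
}

static const toy_ns_handler *toy_lookup(const toy_ns_registry *r,
                                        const char *uri) {
    size_t i;
    for (i = 0; i < r->count; ++i)
        if (strcmp(r->entries[i].uri, uri) == 0)
            return r->entries[i].handler;
    return NULL;
}

/* Dispatch a start-element event. Returns 1 if a handler took it,
 * 0 if the caller should pass the element through unchanged. */
static int toy_dispatch_start(const toy_ns_registry *r, void *ctx,
                              const char *uri, const char *elt) {
    const toy_ns_handler *h = toy_lookup(r, uri);
    if (h == NULL || h->start_element == NULL)
        return 0;
    h->start_element(ctx, elt);
    return 1;
}

/* Example handler: counts the start tags it is given. */
static int toy_count_start(void *ctx, const char *elt) {
    (void)elt;
    ++*(int *)ctx;
    return 0;
}

static const toy_ns_handler toy_counting_handler = { toy_count_start, NULL };
```

Because handlers are found only through the registry, two namespace modules never need to know about each other, which is what makes the mix-and-match configuration possible.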

As a simple measure of how much module development is simplified, we can compare modules performing similar tasks with and without it. This is not entirely comparing like with like, but it gives a rough comparison:

Server-side includes:

mod_include (own parser): 3841 lines
mod_xhtml (xmlns parser): 1518 lines

Link rewriting for reverse proxy:

mod_proxy_html (own parser): 1025 lines
mod_proxy_xml (xmlns parser): 342 lines

Example Applications

Before introducing the details of the namespace API, let's get a sense of it by reviewing a few existing applications of namespace processing. Bear in mind that these can be combined arbitrarily according to needs.

XHTML Appendix C Compatibility Rules

There are now three different namespace handlers for XHTML (xmlns="http://www.w3.org/1999/xhtml"). The first is a trivial demonstrator that serves to ensure XHTML pages follow the W3C Appendix C compatibility rules and "work" as text/html with HTML browsers.

Server Side Includes and Edge Side Includes

Apache's mod_include implements processing directives in HTML comments. mod_include directives take the form
<!--#directive var="value" -->
which maps trivially to a namespace handler
<ssi:directive var="value"/>
in an XML context. mod_xhtml implements server-side includes both as a comment handler and as a separate namespace, leaving users free to choose the form they prefer. This enables SSI to be combined freely with (other) namespace-based applications, without the overhead of parsing the document a second time with mod_include.
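The mapping between the two forms is mechanical, as the following plain C sketch suggests. It is not taken from mod_xhtml: ssi_comment_to_element is a hypothetical helper, and it assumes the comment body starts with "#" (the usual SSI comment syntax) and that the attribute text is already safe to copy into XML unchanged.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: rewrite the body of an SSI comment, e.g.
 *     #include virtual="/footer.html"
 * as an element in an "ssi" namespace, e.g.
 *     <ssi:include virtual="/footer.html"/>
 * Returns 0 on success, -1 if the comment is not an SSI directive or
 * the output buffer is too small. */
static int ssi_comment_to_element(const char *comment, char *out,
                                  size_t outlen) {
    static const char prefix[] = "<ssi:";
    static const char suffix[] = "/>";
    size_t len;

    if (comment == NULL || comment[0] != '#')
        return -1;
    comment++;                        /* skip the '#' marker */
    len = strlen(comment);

    /* Drop whitespace that preceded the closing "-->". */
    while (len > 0 && (comment[len - 1] == ' ' || comment[len - 1] == '\t' ||
                       comment[len - 1] == '\n' || comment[len - 1] == '\r'))
        len--;
    if (len == 0)
        return -1;

    /* prefix + directive text + suffix + terminating NUL must fit. */
    if ((sizeof(prefix) - 1) + len + sizeof(suffix) > outlen)
        return -1;

    memcpy(out, prefix, sizeof(prefix) - 1);
    memcpy(out + sizeof(prefix) - 1, comment, len);
    memcpy(out + sizeof(prefix) - 1 + len, suffix, sizeof(suffix));
    return 0;
}
```

In practice a handler would act on the directive directly rather than rewrite it; the sketch only shows why one implementation can serve both syntaxes.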

ESI implements a set of processing directives that mix a namespace with <!--esi ... --> comment-based directives. A prototype ESI parser was published last year.

Scholarly Publication

The Apache Tutor site specializes in intermediate to advanced tutorials for applications development with Apache. It includes an online editor and a facility for users to add comments (annotations), which are presented as margin notes. Articles are stored in an XML format, using a custom namespace for application-specific information, including article structure, ownership and permissions, revision and locking information, and annotations. The actual article contents are held as XHTML, and articles are served to browsers presented using namespace handlers.

Reverse Proxy

mod_proxy_html cannot be implemented as a namespace handler because it would not be acceptable to fail when the input is not well-formed XML. However, a similar module mod_proxy_xml has recently been released. Like mod_proxy_html, it serves to rewrite links into a proxy's address space. It implements proxy namespace handlers for XHTML and for WML and reduces the problem of implementing reverse-proxying for another namespace to one of writing a single function. As noted above, it is a good deal shorter and simpler than mod_proxy_html, due in large part to using the simpler namespace API.

SQL and Forms Handling

The most recent namespace handler module, mod_sql, implements a handler for including SQL queries in XML pages. Working with a forms-parsing module and the Apache DBD API, it takes full advantage of Apache's threaded MPMs with connection pooling to provide a vastly more efficient and scalable means of SQL access than traditional markup-based options such as PHP.

Implementing a Namespace Handler

Prerequisites

Before deploying a namespace handler, we need to install a namespace filter module to export the API. There are currently two options: mod_xmlns is the original implementation and is a minimal module, while mod_publisher supports the API along with a wide range of other markup options, as well as accepting HTML input and a wide range of character encodings for I18N.

An important purpose of the public namespace API is to ensure that namespace modules are both source- and binary-compatible with either of the namespace parser modules and with any future implementations.

Implementation

Any Apache module may implement a namespace in two simple steps:

Create an xmlns object
Register it with Apache

The xmlns object is a struct comprising up to six elements, any of which may be null:

SAX-like StartElement handler
SAX-like EndElement handler
SAX-like StartNamespaceDecl handler
SAX-like EndNamespaceDecl handler
Comment directive identifier
SAX-like Comment handler

An additional element, an Attribute handler, is likely to be added in a future update but is not yet supported.

Normally, the most important are the StartElement and EndElement handlers, which will be called for every element in the namespace we are handling. These are declared as:


    int (*StartElement)(xmlns_public*, parsedname*, xmlns_attr_t*) ;
    int (*EndElement)(xmlns_public*, parsedname*) ;

The StartNamespaceDecl and EndNamespaceDecl handlers are suitable for any necessary initialization and cleanup. The comment handlers are not strictly part of namespace handling, but comments are often used to emulate namespaces in technologies such as SSI (<!--#... -->) and ESI (<!--esi ... -->), and they benefit from similar treatment.

Example: `mod_xhtml`

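(1) The simple XHTML Appendix C handler is:

static xmlns xmlns_xhtml_10 = {
	xhtml_start ,	/* StartElement */
	xhtml_end ,	/* EndElement */
	NULL ,		/* StartNamespaceDecl */
	NULL ,		/* EndNamespaceDecl */
	NULL ,		/* Comment directive identifier */
	NULL		/* Comment handler */
} ;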

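(2) The handler for XHTML with SSI is:

static xmlns xmlns_xhtml_ssi = {
	xhtml_start ,	/* StartElement */
	xhtml_end ,	/* EndElement */
	ssi_init ,	/* StartNamespaceDecl */
	ssi_term ,	/* EndNamespaceDecl */
	"#" ,		/* Comment directive identifier */
	ssi_comment	/* Comment handler */
} ;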

mod_xhtml registers these handlers with Apache using the ap_provider API:


static const char* XHTML10 = "http://www.w3.org/1999/xhtml" ;

static void xhtml_hooks(apr_pool_t* pool) {
  ap_register_provider(pool, "xmlns", XHTML10, "1.0", &xmlns_xhtml_10) ;
  ap_register_provider(pool, "xmlns", XHTML10, "ssi", &xmlns_xhtml_ssi) ;
}

Now it is up to server administrators to configure their choice of handler for http://www.w3.org/1999/xhtml, depending on whether SSI support is required:


XMLNSUseNamespace	http://www.w3.org/1999/xhtml on 1.0
	or
XMLNSUseNamespace	http://www.w3.org/1999/xhtml on ssi

To deactivate any processing of XHTML, change "on" to "off" in the above.

The SAX APIs (SAX, SAX2, and variants) in C supply a void* userdata argument to each handler, for the use of applications maintaining state between callbacks. The Apache Namespace API supplies instead its own xmlns_public* argument, with accessor functions for the userdata. The definition of this is


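typedef struct xmlns_public {
  ap_filter_t* f ;              /* filter handle, as passed to all filters */
  apr_bucket_brigade* bb ;      /* bucket brigade, as passed to all filters */
} xmlns_public ;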

Apache programmers will instantly recognize f and bb as the standard pointers passed to all filters. They are required here primarily for use as descriptors when a handler needs to generate output (e.g. using ap_fputs, ap_fprintf and family), and f also provides access to other Apache data such as the request_rec and pool. The parser's own data is private to the namespace parser, but a conventional void* pointer for the application's use is provided by the accessors:


  void* xmlns_get_appdata(xmlns_public* ctx, const void* id) ;
  void xmlns_set_appdata(xmlns_public* ctx, const void* id, void* value) ;

The id parameter is arbitrary and is used internally as a key to retrieve the context for our namespace handler, from among (potentially) many different namespaces.
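One way to picture the role of the id key is the following plain C sketch. It is not the mod_xmlns implementation: toy_ctx, toy_get_appdata, and toy_set_appdata are hypothetical stand-ins that mimic the shape of the accessors, and the fixed-size table is an assumption made to keep the example short. A common C idiom, assumed here, is for each handler to use the address of one of its own static objects as a key that is guaranteed unique.

```c
#include <stddef.h>

#define TOY_MAX_APPDATA 8

/* Hypothetical stand-in for the parser context: a small table of
 * (id, value) pairs, one per namespace handler. */
typedef struct {
    const void *id[TOY_MAX_APPDATA];
    void *value[TOY_MAX_APPDATA];
    size_t count;
} toy_ctx;

/* Return the value stored under id, or NULL if none has been set. */
static void *toy_get_appdata(const toy_ctx *ctx, const void *id) {
    size_t i;
    for (i = 0; i < ctx->count; ++i)
        if (ctx->id[i] == id)
            return ctx->value[i];
    return NULL;
}

/* Store value under id, replacing any earlier value.
 * Returns -1 if the table is full. */
static int toy_set_appdata(toy_ctx *ctx, const void *id, void *value) {
    size_t i;
    for (i = 0; i < ctx->count; ++i) {
        if (ctx->id[i] == id) {
            ctx->value[i] = value;
            return 0;
        }
    }
    if (ctx->count >= TOY_MAX_APPDATA)
        return -1;
    ctx->id[ctx->count] = id;
    ctx->value[ctx->count] = value;
    ctx->count++;
    return 0;
}

/* Each handler uses the address of its own static as its key. */
static const char toy_ssi_key = 0;
static const char toy_sql_key = 0;
```

With distinct keys, two handlers active on the same document can each keep their own state between callbacks without interfering with one another.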

Also at the core of the API are the parsedname struct and the accessors for the Attributes in a StartElement handler:


typedef struct {
  int nparts ;                  /* number of fields defined in this struct:
                                        1: Only elt is defined
                                        2: elt and ns are defined
                                        3: elt, ns and prefix are defined
                                */
  const xml_char_t* ns ;        /* xmlns prefix for this namespace */
  size_t nslen ;                /* Length of ns */
  const xml_char_t* elt ;       /* Element name */
  size_t eltlen ;               /* Length of elt */
  const xml_char_t* prefix ;    /* Currently-defined prefix for namespace */
  size_t prefixlen ;            /* Length of prefix */
} parsedname ;
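Because each string in parsedname carries an explicit length, a handler should probably not assume the strings are NUL-terminated. The sketch below shows length-aware comparison helpers under that assumption. The struct is repeated so the example compiles on its own; the typedef of xml_char_t as char and the helper names (name_equals, parsedname_elt_is, parsedname_ns_is) are assumptions for illustration, not part of the published API.

```c
#include <stddef.h>
#include <string.h>

typedef char xml_char_t;   /* assumption: stand-in for the module's typedef */

/* Repeated from the article so the sketch is self-contained. */
typedef struct {
    int nparts;
    const xml_char_t *ns;
    size_t nslen;
    const xml_char_t *elt;
    size_t eltlen;
    const xml_char_t *prefix;
    size_t prefixlen;
} parsedname;

/* Compare a counted string against a NUL-terminated literal. */
static int name_equals(const xml_char_t *s, size_t len, const char *lit) {
    return strlen(lit) == len && memcmp(s, lit, len) == 0;
}

/* True if the element name is exactly elt. */
static int parsedname_elt_is(const parsedname *n, const char *elt) {
    return n->nparts >= 1 && name_equals(n->elt, n->eltlen, elt);
}

/* True if the ns field is defined (nparts >= 2) and equals ns. */
static int parsedname_ns_is(const parsedname *n, const char *ns) {
    return n->nparts >= 2 && name_equals(n->ns, n->nslen, ns);
}
```

A StartElement handler could then branch on tests such as parsedname_elt_is(name, "br") without copying or terminating the name first.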

The attribute pointer xmlns_attr_t is opaque, but accessors are provided:


  const xml_char_t* xmlns_get_attr_name(const xmlns_attr_t*, int) ;
  const xml_char_t* xmlns_get_attr_val(const xmlns_attr_t*, int) ;
  int xmlns_get_attr_parsed(const xmlns_attr_t*, int, parsedname*) ;

Conclusion

Namespace support in Apache is an evolving technology, with a small but growing number of applications. It is now in search of a broader developer community to create new applications and turn it into a mainstream platform for XML applications on the web.

The advantages it offers include:

A far lower processing overhead than traditional XML-based systems
A fully pipelined architecture eliminating network latency
A small and simple API for programming
Flexible configuration with mix-and-match for server admins and webmasters
Potential to develop an extensive library of off-the-shelf namespace handlers both for new applications (like mod_annot) and alternative implementations of existing applications (like SSI), and potentially extending to far bigger projects such as a native-C implementation of JSP2 (with mod_gcj).

References

The main reference for namespace processing is http://apache.webthing.com/. This includes full documentation of the API and all the modules discussed in this article, along with others of relevance. Most of the software is available for free download under the GNU General Public License (GPL). Brief documentation of the API is also available at apache.org.