XML Namespace Processing in Apache
The Apache 2 filter architecture serves to transform Apache from a mere web server into a powerful application platform. Applications that previously required a dedicated backend, typically Java-based, can now easily be implemented within the web server itself, with very substantial improvements in system performance.

Figure 1.
A few XML applications have become very well established, most obviously XSLT transformation. But the biggest advantages are to be gained by making full use of Apache's data pipelining and eliminating all network latency arising from the application. This is possible with a SAX-based parser that supports a parseChunk push API. Such an API is provided by XML libraries including expat and libxml2, which are also the fastest XML processors available according to the xmlbench results. This enables mean and lean markup-processing applications to be built with Apache. The overhead for processing XML (or indeed HTML) with SAX is low--at least comparable to server-side include processing--and doesn't escalate as document size grows. This makes it an attractive alternative to XSLT and DOM-based techniques, as well as to expensive add-ons (like PHP) or back ends (like Tomcat/Java-based systems) where scalability is a concern.
The downside of this approach is that it is a relatively low-level API
for building applications, requiring knowledge both of the Apache API
and of SAX processing--a less common skillset than Java/XML, XSLT, or
PHP. Hence, there are few working applications taking advantage of
this approach. Probably the best-known is
my mod_proxy_html, which rewrites URLs into a proxy's
address space and is an essential component of a reverse proxy. This
has proven, unexpectedly, to be my most popular module and is being
incorporated into an increasing number of OS distributions like
FreeBSD, Gentoo, and Debian.

Figure 2.
Figure 2 shows a namespace filter. It dispatches events to handlers keyed on namespace. A module may implement a namespace handler and hook it into the namespace filter.
The Apache Namespace Framework offers a simple, higher-level API.
The XML namespace module mod_xmlns implements an API for
XML namespace support in Apache, using a mean and lean fully pipelined
SAX2 parser implemented with expat. This is a simpler, higher-level
API specifically for namespace-based processing, offering various
advantages. Firstly, the task of implementing a namespace handler is
reduced to implementing a small number of SAX-like handlers.
Secondly, it provides a modular framework in which different namespace
processors can be introduced on a mix-and-match basis, without
reference to each other.
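The idea of dispatching parser events to handlers keyed on namespace can be sketched in plain C. This is a minimal illustration only, not the mod_xmlns implementation: the demo_* names, the handler signature, and the -1 "no handler" return value are all assumptions made for the sketch, and it uses no Apache headers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a namespace handler: called for each
 * element start event and given the element's local name. */
typedef int (*demo_start_fn)(const char *local_name, void *state);

/* One dispatch-table entry: a namespace URI and its handler. */
typedef struct {
    const char    *ns_uri;
    demo_start_fn  start;
} demo_ns_entry;

/* Find the handler registered for ns_uri and invoke it.
 * Returns the handler's result, or -1 when no handler matches. */
int demo_dispatch_start(const demo_ns_entry *table, size_t n,
                        const char *ns_uri, const char *local_name,
                        void *state)
{
    for (size_t i = 0; i < n; ++i) {
        if (table[i].start != NULL && strcmp(table[i].ns_uri, ns_uri) == 0) {
            return table[i].start(local_name, state);
        }
    }
    return -1;
}

/* Example handler: counts the elements it is given. */
int demo_count_elements(const char *local_name, void *state)
{
    (void)local_name;
    ++*(int *)state;
    return 0;
}
```

Because each entry is independent, handlers for different namespaces can be added to or removed from the table without knowing about each other, which is the mix-and-match property described above.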
As a simple measure of how much module development is simplified, we can compare modules performing similar tasks with and without it. This is not entirely comparing like with like, but it gives a rough comparison:
mod_include (own parser): 3841 lines
mod_xhtml (xmlns parser): 1518 lines
mod_proxy_html (own parser): 1025 lines
mod_proxy_xml (xmlns parser): 342 lines
Before introducing the details of the namespace API, let's get a sense of it by reviewing a few existing applications of namespace processing. Bear in mind that these can be combined arbitrarily according to needs.
There are now three different namespace handlers for XHTML (xmlns="http://www.w3.org/1999/xhtml"). The first is a trivial demonstrator that serves to ensure XHTML pages follow the W3C Appendix C compatibility rules and "work" as text/html with HTML browsers.
Apache's mod_include implements processing
directives in HTML comments. mod_include directives take the form
<!--#directive var="value" ...-->, which maps trivially to a namespace
handler <ssi:directive var="value"/> in an XML context. mod_xhtml implements
server side includes both as a comment handler and as a separate
namespace, leaving it to users which form they prefer. This enables
SSI to be combined freely with (other) namespace-based applications,
without the overhead of parsing it a second time with mod_include.
ESI implements a set of processing directives that mix a namespace
with <!--esi ...--> comment-based directives. A
prototype ESI parser was published last year.
The Apache Tutor site specializes in intermediate to advanced tutorials for applications development with Apache. It includes an online editor and a facility for users to add comments (annotations), which are presented as margin notes. Articles are stored in an XML format, using a custom namespace for application-specific information, including article structure, ownership and permissions, revision and locking information, and annotations. The actual article contents are held as XHTML, and articles are served to browsers presented using namespace handlers.
mod_proxy_html cannot be implemented as a namespace handler because
it would not be acceptable to fail when the input is not well-formed
XML. However, a similar module mod_proxy_xml has recently been
released. Like mod_proxy_html, it serves to rewrite links into a
proxy's address space. It implements proxy namespace handlers for
XHTML and for WML and reduces the problem of implementing
reverse-proxying for another namespace to one of writing a single
function. As noted above, it is a good deal shorter and simpler than
mod_proxy_html, due in large part to using the simpler namespace API.
The most recent namespace handler module, mod_sql, implements a handler
for including SQL queries in XML pages. Working with a forms-parsing
module and the Apache
DBD API, it takes full advantage of Apache's threaded MPMs with
connection pooling to provide a vastly more efficient and scalable
means of SQL access than traditional markup-based options such as PHP.
Before deploying a namespace handler, we need to install a
namespace filter module to export the API. There are currently two
options: mod_xmlns is the original implementation and is a minimal
module, while mod_publisher supports the API along with a wide range
of other markup options, as well as accepting HTML input and a wide
range of character encodings for I18N.
An important purpose of the public namespace API is to ensure that namespace modules are both source- and binary-compatible with either of the namespace parser modules and with any future implementations.
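Any Apache module may implement a namespace in two simple steps: define an xmlns object holding its handlers, and register that object with Apache under the namespace URI.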
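The xmlns object is a struct comprising up to six elements, any of which may be null: a StartElement handler, an EndElement handler, a StartNamespaceDecl handler, an EndNamespaceDecl handler, a comment identifier (the prefix that marks a comment as a directive), and a comment handler.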
An additional element, an Attribute handler, is likely to be added in a future update but is not yet supported.
Normally, the most important are the StartElement and EndElement handlers, which will be called for every element in the namespace we are handling. These are declared as:
int (*StartElement)(xmlns_public*, parsedname*, xmlns_attr_t*) ;
int (*EndElement)(xmlns_public*, parsedname*) ;
The StartNamespaceDecl and EndNamespaceDecl handlers are suitable for any
necessary initialization and cleanup. The comment handlers are not
strictly part of namespace handling, but are often used to emulate
namespaces in technologies such as SSI (<!--#
... -->) and ESI (<!--esi ... -->), and
they benefit from similar treatment.
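As a rough illustration of what a pair of StartElement and EndElement handlers might look like, here is a sketch that tracks element nesting depth. The three API types are declared as opaque stand-ins so that the sketch compiles without the Apache or mod_xmlns headers. The demo_* names, the file-scope counters, and the choice to return 0 are assumptions for illustration; the text does not say what the return value means.

```c
#include <stddef.h>

/* Opaque stand-ins; the real definitions come from the namespace API. */
typedef struct xmlns_public xmlns_public;
typedef struct parsedname parsedname;
typedef struct xmlns_attr_t xmlns_attr_t;

/* Handler signatures as declared in the text. */
typedef int (*start_element_fn)(xmlns_public *, parsedname *, xmlns_attr_t *);
typedef int (*end_element_fn)(xmlns_public *, parsedname *);

/* Hypothetical state: current nesting depth and deepest level seen. */
int demo_depth = 0;
int demo_max_depth = 0;

/* Called once per opening tag in the handled namespace. */
int demo_start(xmlns_public *ctx, parsedname *name, xmlns_attr_t *attrs)
{
    (void)ctx; (void)name; (void)attrs;
    if (++demo_depth > demo_max_depth) {
        demo_max_depth = demo_depth;
    }
    return 0;
}

/* Called once per closing tag in the handled namespace. */
int demo_end(xmlns_public *ctx, parsedname *name)
{
    (void)ctx; (void)name;
    --demo_depth;
    return 0;
}

/* Replays the events a parser would emit for <a><b/></a>. */
void demo_replay(start_element_fn start, end_element_fn end)
{
    start(NULL, NULL, NULL);  /* <a>  */
    start(NULL, NULL, NULL);  /* <b>  */
    end(NULL, NULL);          /* </b> */
    end(NULL, NULL);          /* </a> */
}
```

A real handler would keep its state per request rather than in file-scope variables, since a threaded server may run many filters at once.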
In mod_xhtml, (1) the simple XHTML Appendix C handler is
static xmlns xmlns_xhtml_10 = {
    xhtml_start ,
    xhtml_end ,
    NULL ,
    NULL ,
    NULL ,
    NULL
} ;
(2) the handler for XHTML with SSI is
static xmlns xmlns_xhtml_ssi = {
    xhtml_start ,
    xhtml_end ,
    ssi_init ,
    ssi_term ,
    "#" ,
    ssi_comment
} ;
mod_xhtml registers these handlers with Apache using the ap_provider API:
static const char* XHTML10 = "http://www.w3.org/1999/xhtml" ;
static void xhtml_hooks(apr_pool_t* pool) {
    ap_register_provider(pool, "xmlns", XHTML10, "1.0", &xmlns_xhtml_10) ;
    ap_register_provider(pool, "xmlns", XHTML10, "ssi", &xmlns_xhtml_ssi) ;
}
Now it is up to server administrators to configure their choice of handler for http://www.w3.org/1999/xhtml, depending on whether SSI support is required:
XMLNSUseNamespace http://www.w3.org/1999/xhtml on 1.0
or
XMLNSUseNamespace http://www.w3.org/1999/xhtml on ssi
To deactivate any processing of XHTML, change "on" to "off" in the above.
The SAX APIs (SAX, SAX2, and variants) in C supply a void*
userdata argument to each handler, for the use of applications
maintaining state between callbacks. The Apache Namespace API
supplies instead its own xmlns_public* argument, with accessor
functions for the userdata. The definition of this is
typedef struct xmlns_public {
    ap_filter_t* f ;
    apr_bucket_brigade* bb ;
} xmlns_public ;
Apache programmers will instantly recognize f
and bb as the standard pointers passed to all filters.
They are required here primarily for use as descriptors when a handler
needs to generate output (e.g. using
ap_fputs, ap_fprintf and family), and f also provides
access to other Apache data such as the request_rec and pool.
Any further parser data is private to the namespace parser, but a conventional
void* pointer for the application's use is provided by the accessors:
void* xmlns_get_appdata(xmlns_public* ctx, const void* id) ;
void xmlns_set_appdata(xmlns_public* ctx, const void* id, void* value) ;
The id parameter is arbitrary and is used internally as a key to retrieve the context for our namespace handler,
from among (potentially) many different namespaces.
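To make the role of the id parameter concrete, the following sketch shows one way a keyed store of this kind could behave. It is not the mod_xmlns implementation, whose storage is private: demo_ctx, DEMO_MAX_SLOTS, and the demo_* functions are hypothetical, and the sketch returns an int from the setter (unlike the void accessor above) purely so that a full table can be reported.

```c
#include <stddef.h>

#define DEMO_MAX_SLOTS 8

/* Hypothetical context holding (id, value) pairs. */
typedef struct {
    const void *ids[DEMO_MAX_SLOTS];
    void       *values[DEMO_MAX_SLOTS];
    size_t      used;
} demo_ctx;

/* Return the value stored under id, or NULL if there is none. */
void *demo_get_appdata(const demo_ctx *ctx, const void *id)
{
    for (size_t i = 0; i < ctx->used; ++i) {
        if (ctx->ids[i] == id) {
            return ctx->values[i];
        }
    }
    return NULL;
}

/* Store value under id, replacing any earlier value for that id.
 * Returns 0 on success, -1 when the table is full. */
int demo_set_appdata(demo_ctx *ctx, const void *id, void *value)
{
    for (size_t i = 0; i < ctx->used; ++i) {
        if (ctx->ids[i] == id) {
            ctx->values[i] = value;
            return 0;
        }
    }
    if (ctx->used == DEMO_MAX_SLOTS) {
        return -1;
    }
    ctx->ids[ctx->used] = id;
    ctx->values[ctx->used] = value;
    ctx->used++;
    return 0;
}
```

Since the key is compared by pointer identity, a handler can use the address of any object it owns (for example its own handler struct) as an id that no other namespace handler will collide with.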
Also at the core of the API are the parsedname struct and the accessors for the Attributes in a StartElement handler:
typedef struct {
    int nparts ;                /* number of fields defined in this struct:
                                   1: Only elt is defined
                                   2: elt and ns are defined
                                   3: elt, ns and prefix are defined
                                */
    const xml_char_t* ns ;      /* xmlns prefix for this namespace */
    size_t nslen ;              /* Length of ns */
    const xml_char_t* elt ;     /* Element name */
    size_t eltlen ;             /* Length of elt */
    const xml_char_t* prefix ;  /* Currently-defined prefix for namespace */
    size_t prefixlen ;          /* Length of prefix */
} parsedname ;
The attribute pointer xmlns_attr_t is opaque, but accessors
are provided:
const xml_char_t* xmlns_get_attr_name(const xmlns_attr_t*, int) ;
const xml_char_t* xmlns_get_attr_val(const xmlns_attr_t*, int) ;
int xmlns_get_attr_parsed(const xmlns_attr_t*, int, parsedname*) ;
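Because parsedname carries an explicit length for each string, a handler would presumably compare names by length rather than rely on NUL termination. The sketch below illustrates that with two hypothetical helpers, demo_elt_is and demo_ns_is. The struct is copied from the text; the typedef of xml_char_t as char is an assumption made so that the sketch compiles on its own.

```c
#include <stddef.h>
#include <string.h>

typedef char xml_char_t;  /* assumption: the real typedef may differ */

/* Copied from the text above. */
typedef struct {
    int nparts;
    const xml_char_t *ns;
    size_t nslen;
    const xml_char_t *elt;
    size_t eltlen;
    const xml_char_t *prefix;
    size_t prefixlen;
} parsedname;

/* Returns 1 when the element name equals literal, else 0.
 * Uses eltlen, so elt need not be NUL-terminated. */
int demo_elt_is(const parsedname *name, const char *literal)
{
    size_t len = strlen(literal);
    return name->nparts >= 1
        && name->eltlen == len
        && memcmp(name->elt, literal, len) == 0;
}

/* Returns 1 when the ns field is defined and equals expected, else 0. */
int demo_ns_is(const parsedname *name, const char *expected)
{
    size_t len = strlen(expected);
    return name->nparts >= 2
        && name->nslen == len
        && memcmp(name->ns, expected, len) == 0;
}
```

Checking nparts first matters because, per the comment in the struct, ns and prefix are only defined when nparts is at least 2 or 3 respectively.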
Namespace support in Apache is an evolving technology, with a small but growing number of applications. It is now in search of a broader developer community to create new applications and turn it into a mainstream platform for XML applications on the web.
The advantages it offers include:
mod_annot) and
alternative implementations of existing applications (like SSI),
and potentially extending to far bigger projects such as a native-C
implementation of JSP2 (with mod_gcj).
The main reference for namespace processing is http://apache.webthing.com/. This includes full documentation of the API and all the modules discussed in this article, along with others of relevance. Most of the software is available for free download under the GNU General Public License (GPL). Brief documentation of the API is also available at apache.org.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.