XML Namespace Processing in Apache
December 15, 2004
The Apache 2 filter architecture serves to transform Apache from a mere web server into a powerful application platform. Applications that previously required a dedicated backend, typically Java-based, can now easily be implemented within the web server itself, with very substantial improvements in system performance.
Figure 1.
A few XML applications have become very well established, most obviously XSLT transformation. But the biggest advantages are to be gained by making full use of Apache's data pipelining and eliminating all network latency arising from the application. This is possible with a SAX-based parser that supports a parseChunk-style push API, in which the caller feeds the parser each block of data as it arrives. Such an API is provided by XML libraries including expat and libxml2, which are also the fastest XML processors available according to the xmlbench results. This enables mean and lean markup-processing applications to be built with Apache. The overhead for processing XML (or indeed HTML) with SAX is low, at least comparable to server-side include processing, and doesn't escalate as document size grows. This makes it an attractive alternative to XSLT and DOM-based techniques, as well as to expensive add-ons (like PHP) or back ends (like Tomcat/Java-based systems) where scalability is a concern.
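To make the push model concrete, here is a toy sketch in plain C. It is not expat or libxml2, and every name in it (demo_push_counter and friends) is made up for illustration. It only shows the property that matters for a filter: the parser state lives in a struct between calls, so a tag split across two chunks is handled without buffering the document.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy "parser": counts start tags in a stream fed in chunks.
   It ignores CDATA, quoted '>' in attributes and other XML subtleties. */
typedef struct {
    int in_tag;               /* currently between '<' and '>' */
    int at_tag_start;         /* next character is the first one after '<' */
    unsigned long start_tags; /* start tags seen so far */
} demo_push_counter;

static void demo_push_init(demo_push_counter *pc)
{
    pc->in_tag = 0;
    pc->at_tag_start = 0;
    pc->start_tags = 0;
}

/* Feed one chunk. State is carried in *pc, so chunk boundaries may fall
   anywhere, including in the middle of a tag. */
static void demo_push_chunk(demo_push_counter *pc, const char *buf, size_t len)
{
    assert(pc != NULL && (buf != NULL || len == 0));
    for (size_t i = 0; i < len; ++i) {
        char c = buf[i];
        if (!pc->in_tag) {
            if (c == '<') {
                pc->in_tag = 1;
                pc->at_tag_start = 1;
            }
        } else if (pc->at_tag_start) {
            pc->at_tag_start = 0;
            if (c == '>')
                pc->in_tag = 0;      /* degenerate "<>" */
            else if (c != '/' && c != '!' && c != '?')
                pc->start_tags++;    /* not an end tag, comment or PI */
        } else if (c == '>') {
            pc->in_tag = 0;
        }
    }
}
```

Memory use is constant however large the document is, which is the point of the comparison with DOM-based techniques.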
The downside of this approach is that it is a relatively low-level API for building applications, requiring knowledge both of the Apache API and of SAX processing--a less common skillset than Java/XML, XSLT, or PHP. Hence, there are few working applications taking advantage of this approach. Probably the best-known is my mod_proxy_html, which rewrites URLs into a proxy's address space and is an essential component of a reverse proxy. This has proven, unexpectedly, to be my most popular module and is being incorporated into an increasing number of OS distributions like FreeBSD, Gentoo, and Debian.
The Apache Namespace API
Figure 2.
Figure 2 shows a namespace filter. It dispatches events to handlers keyed on namespace. A module may implement a namespace handler and hook it into the namespace filter.
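The dispatch idea can be sketched in a few lines of standalone C. This is an assumption-laden illustration and not mod_xmlns code: the demo_ names, the second namespace URI, and the "return 0 means pass through" convention are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical handler type: receives the local element name. */
typedef int (*demo_start_fn)(const char *local_name);

typedef struct {
    const char   *ns_uri;  /* key: namespace URI */
    demo_start_fn start;   /* handler for elements in that namespace */
} demo_ns_entry;

static int demo_xhtml_seen = 0;
static int demo_ssi_seen = 0;

static int demo_on_xhtml(const char *local_name)
{
    (void)local_name;
    demo_xhtml_seen++;
    return 1;
}

static int demo_on_ssi(const char *local_name)
{
    (void)local_name;
    demo_ssi_seen++;
    return 1;
}

/* Each module would contribute one row; the rows know nothing of each other. */
static const demo_ns_entry demo_table[] = {
    { "http://www.w3.org/1999/xhtml", demo_on_xhtml },
    { "http://example.org/ssi",       demo_on_ssi   },  /* made-up URI */
};

/* Returns 1 if a handler took the event, 0 if no handler is registered
   for the namespace (assumed here to mean "pass through unchanged"). */
static int demo_dispatch_start(const char *ns_uri, const char *local_name)
{
    assert(ns_uri != NULL);
    for (size_t i = 0; i < sizeof demo_table / sizeof demo_table[0]; ++i) {
        if (strcmp(demo_table[i].ns_uri, ns_uri) == 0)
            return demo_table[i].start(local_name);
    }
    return 0;
}
```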
The Apache Namespace Framework offers a simple, higher-level API. The XML namespace module mod_xmlns implements an API for XML namespace support in Apache, using a mean and lean fully pipelined SAX2 parser implemented with expat. The API is designed specifically for namespace-based processing and offers two main advantages. Firstly, the task of implementing a namespace handler is reduced to implementing a small number of SAX-like handlers. Secondly, it provides a modular framework in which different namespace processors can be introduced on a mix-and-match basis, without reference to each other.
As a simple measure of how much module development is simplified, we can compare modules performing similar tasks with and without it. This is not entirely comparing like with like, but it gives a rough comparison:
- Server-side includes:
  - mod_include (own parser): 3841 lines
  - mod_xhtml (xmlns parser): 1518 lines
- Link rewriting for reverse proxy:
  - mod_proxy_html (own parser): 1025 lines
  - mod_proxy_xml (xmlns parser): 342 lines
Example Applications
Before introducing the details of the namespace API, let's get a sense of it by reviewing a few existing applications of namespace processing. Bear in mind that these can be combined arbitrarily according to needs.
XHTML Appendix C Compatibility Rules
There are now three different namespace handlers for XHTML (xmlns="http://www.w3.org/1999/xhtml"). The first is a trivial demonstrator that serves to ensure XHTML pages follow the W3C Appendix C compatibility rules and "work" as text/html with HTML browsers.
Server Side Includes and Edge Side Includes
Apache's mod_include implements processing directives in HTML comments. mod_include directives take the form <!--#directive var="value" ...-->, which maps trivially to a namespace handler <ssi:directive var="value"/> in an XML context. mod_xhtml implements server-side includes both as a comment handler and as a separate namespace, leaving it to users which form they prefer. This enables SSI to be combined freely with (other) namespace-based applications, without the overhead of parsing it a second time with mod_include.
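The correspondence between the two forms is mechanical enough to write down. The following standalone sketch is not taken from mod_xhtml (which handles both forms directly and does not rewrite one into the other). The function name and return conventions are hypothetical, and it assumes the caller has already stripped the <!-- and --> delimiters.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* body: the text between "<!--" and "-->", e.g. "#include virtual=\"/x\" ".
   Writes the equivalent <ssi:.../> element to out.
   Returns 0 on success, -1 if body is not a directive or out is too small. */
static int demo_ssi_to_element(const char *body, char *out, size_t outlen)
{
    size_t len;
    int n;

    assert(body != NULL && out != NULL);
    if (body[0] != '#')
        return -1;                 /* an ordinary comment, not SSI */
    body++;                        /* skip the '#' marker */

    len = strlen(body);
    while (len > 0 && isspace((unsigned char)body[len - 1]))
        len--;                     /* drop whitespace before "-->" */
    if (len == 0)
        return -1;

    n = snprintf(out, outlen, "<ssi:%.*s/>", (int)len, body);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```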
ESI implements a set of processing directives that mix a namespace with <!--esi ...--> comment-based directives. A prototype ESI parser was published last year.
Scholarly Publication
The Apache Tutor site specializes in intermediate to advanced tutorials for applications development with Apache. It includes an online editor and a facility for users to add comments (annotations), which are presented as margin notes. Articles are stored in an XML format, using a custom namespace for application-specific information, including article structure, ownership and permissions, revision and locking information, and annotations. The actual article contents are held as XHTML, and articles are served to browsers presented using namespace handlers.
Reverse Proxy
mod_proxy_html
cannot be implemented as a namespace handler because it would
not be acceptable to fail when the input is not well-formed XML. However, a similar
module
mod_proxy_xml
has recently been released. Like mod_proxy_html
,
it serves to rewrite links into a proxy's address space. It implements proxy namespace
handlers for XHTML and for WML and reduces the problem of implementing reverse-proxying
for
another namespace to one of writing a single function. As noted above, it is a good
deal
shorter and simpler than mod_proxy_html
, due in large part to using the simpler
namespace API.
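The core of link rewriting is a prefix substitution, which can be sketched as follows. This is an illustration under assumptions and not code from mod_proxy_xml or mod_proxy_html: the demo_urlmap type, the function name, and the example hostnames are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mapping: backend URL prefix -> path in the proxy's space. */
typedef struct {
    const char *from;
    const char *to;
} demo_urlmap;

/* Rewrites url into out using the first matching prefix.
   Returns 1 if rewritten, 0 if copied unchanged, -1 if out is too small. */
static int demo_rewrite_link(const demo_urlmap *map, size_t nmap,
                             const char *url, char *out, size_t outlen)
{
    int n;

    assert(url != NULL && out != NULL);
    for (size_t i = 0; i < nmap; ++i) {
        size_t flen = strlen(map[i].from);
        if (strncmp(url, map[i].from, flen) == 0) {
            n = snprintf(out, outlen, "%s%s", map[i].to, url + flen);
            return (n < 0 || (size_t)n >= outlen) ? -1 : 1;
        }
    }
    n = snprintf(out, outlen, "%s", url);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

A namespace handler for a new markup language then only needs to know which attributes of which elements carry links, and pass their values through a function like this.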
SQL and Forms Handling
The most recent namespace handler module, mod_sql, implements a handler for including SQL queries in XML pages. Working with a forms parsing module and the Apache DBD API, it takes full advantage of Apache's threaded MPMs with connection pooling to provide a vastly more efficient and scalable means of SQL access than traditional markup-based options such as PHP.
Implementing a Namespace Handler
Prerequisites
Before deploying a namespace handler, we need to install a namespace filter module to export the API. There are currently two options: mod_xmlns is the original implementation and is a minimal module, while mod_publisher supports the API along with a wide range of other markup options, as well as accepting HTML input and a wide range of character encodings for I18N.
An important purpose of the public namespace API is to ensure that namespace modules are both source- and binary-compatible with either of the namespace parser modules and with any future implementations.
Implementation
Any Apache module may implement a namespace in two simple steps:
- Create an xmlns object
- Register it with Apache
The xmlns object is a struct comprising up to six elements, any of which may be null:
- SAX-like StartElement handler
- SAX-like EndElement handler
- SAX-like StartNamespaceDecl handler
- SAX-like EndNamespaceDecl handler
- Comment directive identifier
- SAX-like Comment handler
An additional element, an Attribute handler, is likely to be added in a future update but is not yet supported.
Normally, the most important are the StartElement and EndElement handlers, which will be called for every element in the namespace we are handling. These are declared as:
    int (*StartElement)(xmlns_public*, parsedname*, xmlns_attr_t*) ;
    int (*EndElement)(xmlns_public*, parsedname*) ;
The StartNamespaceDecl and EndNamespaceDecl are suitable for any necessary initialization and cleanup. The comment handlers are not strictly part of namespace handling, but are often used to emulate namespaces in technologies such as SSI (<!--# ... -->) and ESI (<!--esi ... -->), and they benefit from similar treatment.
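To give a feel for a StartElement/EndElement pair, here is a standalone sketch in the spirit of the Appendix C handler. It uses the type names from the declarations above, but as stand-ins defined locally so that it compiles outside Apache: the stand-in xmlns_public holds a plain buffer where the real struct carries filter pointers, parsedname is cut down to two fields, and the meaning of the int return value is not covered here, so DEMO_OK is a placeholder. Names are treated as counted strings rather than NUL-terminated ones, which is an assumption suggested by the length fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef char xml_char_t;                                      /* stand-in */
typedef struct { char out[256]; size_t used; } xmlns_public;  /* stand-in */
typedef struct { const xml_char_t *elt; size_t eltlen; } parsedname; /* subset */
typedef struct xmlns_attr_t xmlns_attr_t;        /* opaque and unused here */

enum { DEMO_OK = 0 };   /* placeholder return value, semantics assumed */

/* Append n bytes to the output buffer, truncating if it would overflow. */
static void demo_emit(xmlns_public *ctx, const char *s, size_t n)
{
    size_t room = sizeof ctx->out - 1 - ctx->used;
    if (n > room)
        n = room;
    memcpy(ctx->out + ctx->used, s, n);
    ctx->used += n;
    ctx->out[ctx->used] = '\0';
}

/* A few elements that HTML declares EMPTY (partial list). */
static int demo_is_empty(const parsedname *name)
{
    static const char *const empty[] = { "br", "hr", "img", "input", "link", "meta" };
    for (size_t i = 0; i < sizeof empty / sizeof empty[0]; ++i) {
        if (strlen(empty[i]) == name->eltlen
            && memcmp(empty[i], name->elt, name->eltlen) == 0)
            return 1;
    }
    return 0;
}

/* Empty elements are written as <br />, everything else as <p>. */
static int demo_appc_start(xmlns_public *ctx, parsedname *name, xmlns_attr_t *attrs)
{
    (void)attrs;                      /* attributes ignored in this sketch */
    assert(ctx != NULL && name != NULL);
    demo_emit(ctx, "<", 1);
    demo_emit(ctx, name->elt, name->eltlen);
    if (demo_is_empty(name))
        demo_emit(ctx, " />", 3);
    else
        demo_emit(ctx, ">", 1);
    return DEMO_OK;
}

/* Empty elements were already closed in the start handler. */
static int demo_appc_end(xmlns_public *ctx, parsedname *name)
{
    assert(ctx != NULL && name != NULL);
    if (!demo_is_empty(name)) {
        demo_emit(ctx, "</", 2);
        demo_emit(ctx, name->elt, name->eltlen);
        demo_emit(ctx, ">", 1);
    }
    return DEMO_OK;
}
```

In a real handler the output would go down the filter chain rather than into a fixed buffer.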
Example: mod_xhtml
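(1) The simple XHTML Appendix C handler is:
    static xmlns xmlns_xhtml10 = {
        xhtml_start ,   /* StartElement */
        xhtml_end ,     /* EndElement */
        NULL ,          /* StartNamespaceDecl */
        NULL ,          /* EndNamespaceDecl */
        NULL ,          /* Comment directive identifier */
        NULL            /* Comment handler */
    } ;
(2) The handler for XHTML with SSI is:
    static xmlns xmlns_xhtml_ssi = {
        xhtml_start ,   /* StartElement */
        xhtml_end ,     /* EndElement */
        ssi_init ,      /* StartNamespaceDecl */
        ssi_term ,      /* EndNamespaceDecl */
        "#" ,           /* Comment directive identifier */
        ssi_comment     /* Comment handler */
    } ;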
mod_xhtml registers these handlers with Apache using the ap_provider API:
    static const char* XHTML10 = "http://www.w3.org/1999/xhtml" ;

    static void xhtml_hooks(apr_pool_t* pool) {
        ap_register_provider(pool, "xmlns", XHTML10, "1.0", &xmlns_xhtml10) ;
        ap_register_provider(pool, "xmlns", XHTML10, "ssi", &xmlns_xhtml_ssi) ;
    }
Now it is up to server administrators to configure their choice of handler for http://www.w3.org/1999/xhtml, depending on whether SSI support is required:
    XMLNSUseNamespace http://www.w3.org/1999/xhtml on 1.0
or
    XMLNSUseNamespace http://www.w3.org/1999/xhtml on ssi
To deactivate any processing of XHTML, change "on" to "off" in the above.
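The reason two handlers can coexist for one namespace is that providers are keyed on a (group, name, version) triple: the group is "xmlns", the name is the namespace URI, and the version string is what the administrator selects. The standalone sketch below mimics that lookup with a fixed-size table. The demo_ functions are hypothetical stand-ins written for this illustration. In Apache itself the corresponding calls are ap_register_provider and ap_lookup_provider.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_PROVIDERS 8

typedef struct {
    const char *group;      /* e.g. "xmlns" */
    const char *name;       /* e.g. the namespace URI */
    const char *version;    /* e.g. "1.0" or "ssi" */
    const void *provider;   /* the handler object */
} demo_provider;

static demo_provider demo_registry[DEMO_MAX_PROVIDERS];
static size_t demo_nproviders = 0;

/* Returns 0 on success, -1 if the table is full. */
static int demo_register_provider(const char *group, const char *name,
                                  const char *version, const void *provider)
{
    assert(group != NULL && name != NULL && version != NULL);
    if (demo_nproviders == DEMO_MAX_PROVIDERS)
        return -1;
    demo_registry[demo_nproviders].group = group;
    demo_registry[demo_nproviders].name = name;
    demo_registry[demo_nproviders].version = version;
    demo_registry[demo_nproviders].provider = provider;
    demo_nproviders++;
    return 0;
}

/* Returns the provider registered under the exact triple, or NULL. */
static const void *demo_lookup_provider(const char *group, const char *name,
                                        const char *version)
{
    assert(group != NULL && name != NULL && version != NULL);
    for (size_t i = 0; i < demo_nproviders; ++i) {
        const demo_provider *p = &demo_registry[i];
        if (strcmp(p->group, group) == 0 && strcmp(p->name, name) == 0
            && strcmp(p->version, version) == 0)
            return p->provider;
    }
    return NULL;
}
```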
The SAX APIs (SAX, SAX2, and variants) in C supply a void* userdata argument to each handler, for the use of applications maintaining state between callbacks. The Apache Namespace API supplies instead its own xmlns_public* argument, with accessor functions for the userdata. The definition of this is
    typedef struct xmlns_public {
        ap_filter_t* f ;
        apr_bucket_brigade* bb ;
    } xmlns_public ;
Apache programmers will instantly recognize f and bb as the standard pointers passed to all filters. They are required here primarily for use as descriptors when a handler needs to generate output (e.g. using ap_fputs, ap_fprintf and family), and f also provides access to other Apache data such as the request_rec and pool. The data field is private to the namespace parser, but a conventional void* pointer for the application's use is provided by the accessors:
    void* xmlns_get_appdata(xmlns_public* ctx, const void* id) ;
    void xmlns_set_appdata(xmlns_public* ctx, const void* id, void* value) ;
The id parameter is arbitrary and is used internally as a key to retrieve the context for our namespace handler, from among (potentially) many different namespaces.
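One way to picture the id mechanism is as a small table of (id, value) pairs, with each handler using the address of one of its own static objects as a key that cannot collide with anyone else's. The sketch below is a hypothetical stand-in written for illustration and not the mod_xmlns implementation. Unlike the real xmlns_set_appdata, the demo setter returns a status so that a full table is reported.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_SLOTS 8

typedef struct { const void *id; void *value; } demo_slot;
typedef struct { demo_slot slots[DEMO_MAX_SLOTS]; size_t used; } demo_ctx;

/* Returns the value stored under id, or NULL if none has been set. */
static void *demo_get_appdata(const demo_ctx *ctx, const void *id)
{
    assert(ctx != NULL);
    for (size_t i = 0; i < ctx->used; ++i) {
        if (ctx->slots[i].id == id)
            return ctx->slots[i].value;
    }
    return NULL;
}

/* Stores value under id, replacing any earlier value.
   Returns 0 on success, -1 if the table is full. */
static int demo_set_appdata(demo_ctx *ctx, const void *id, void *value)
{
    assert(ctx != NULL);
    for (size_t i = 0; i < ctx->used; ++i) {
        if (ctx->slots[i].id == id) {
            ctx->slots[i].value = value;
            return 0;
        }
    }
    if (ctx->used == DEMO_MAX_SLOTS)
        return -1;
    ctx->slots[ctx->used].id = id;
    ctx->slots[ctx->used].value = value;
    ctx->used++;
    return 0;
}

/* Each handler keys its state on the address of a static of its own. */
static const char demo_ssi_key = 0;
static const char demo_sql_key = 0;
```

Only the addresses of the two key objects matter, not their contents.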
Also at the core of the API are the parsedname struct and the accessors for the Attributes in a StartElement handler:
    typedef struct {
        int nparts ;                /* number of fields defined in this struct:
                                       1: Only elt is defined
                                       2: elt and ns are defined
                                       3: elt, ns and prefix are defined */
        const xml_char_t* ns ;      /* xmlns prefix for this namespace */
        size_t nslen ;              /* Length of ns */
        const xml_char_t* elt ;     /* Element name */
        size_t eltlen ;             /* Length of elt */
        const xml_char_t* prefix ;  /* Currently-defined prefix for namespace */
        size_t prefixlen ;          /* Length of prefix */
    } parsedname ;
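The explicit length fields suggest that the parts are slices of a larger string rather than separately NUL-terminated copies. As a sketch of how such a struct might be filled, the code below splits a name in the form expat produces in namespace mode with triplets enabled, "uri SEP local SEP prefix". Whether mod_xmlns does exactly this is an assumption, and demo_parse_name and the '|' separator in the tests are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef char xml_char_t;   /* stand-in so the sketch compiles on its own */

typedef struct {
    int nparts;
    const xml_char_t *ns;     size_t nslen;
    const xml_char_t *elt;    size_t eltlen;
    const xml_char_t *prefix; size_t prefixlen;
} parsedname;

/* Splits "local", "uri|local" or "uri|local|prefix" (for sep == '|').
   The parts point into name and are delimited by the length fields. */
static void demo_parse_name(const xml_char_t *name, xml_char_t sep, parsedname *out)
{
    const xml_char_t *s1, *s2;

    assert(name != NULL && out != NULL && sep != '\0');
    *out = (parsedname){0};

    s1 = strchr(name, sep);
    if (s1 == NULL) {                      /* no namespace: "local" */
        out->nparts = 1;
        out->elt = name;
        out->eltlen = strlen(name);
        return;
    }
    out->ns = name;
    out->nslen = (size_t)(s1 - name);
    out->elt = s1 + 1;

    s2 = strchr(out->elt, sep);
    if (s2 == NULL) {                      /* "uri|local" */
        out->nparts = 2;
        out->eltlen = strlen(out->elt);
        return;
    }
    out->nparts = 3;                       /* "uri|local|prefix" */
    out->eltlen = (size_t)(s2 - out->elt);
    out->prefix = s2 + 1;
    out->prefixlen = strlen(out->prefix);
}
```

Because nothing is copied, a handler comparing element names should use the lengths (for example with memcmp) and not strcmp.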
The attribute pointer xmlns_attr_t is opaque, but accessors are provided:
    const xml_char_t* xmlns_get_attr_name(const xmlns_attr_t*, int) ;
    const xml_char_t* xmlns_get_attr_val(const xmlns_attr_t*, int) ;
    int xmlns_get_attr_parsed(const xmlns_attr_t*, int, parsedname*) ;
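A typical use of index-based accessors is to walk the attributes looking for one by name. The sketch below supplies its own stand-in bodies for xmlns_get_attr_name and xmlns_get_attr_val so that it compiles outside Apache, and it makes two assumptions that are not confirmed here: that the opaque type wraps an expat-style name/value array, and that the accessors return NULL once the index runs past the last attribute. demo_find_attr is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef char xml_char_t;   /* stand-in */

/* Stand-in for the opaque type: an expat-style array
   { name, value, name, value, ..., NULL }. */
struct xmlns_attr_t { const xml_char_t *const *atts; };
typedef struct xmlns_attr_t xmlns_attr_t;

/* Stand-in accessor: name of attribute i, or NULL if out of range (assumed). */
static const xml_char_t *xmlns_get_attr_name(const xmlns_attr_t *a, int i)
{
    assert(a != NULL);
    for (int k = 0; a->atts[2 * k] != NULL; ++k) {
        if (k == i)
            return a->atts[2 * k];
    }
    return NULL;
}

/* Stand-in accessor: value of attribute i, or NULL if out of range (assumed). */
static const xml_char_t *xmlns_get_attr_val(const xmlns_attr_t *a, int i)
{
    assert(a != NULL);
    for (int k = 0; a->atts[2 * k] != NULL; ++k) {
        if (k == i)
            return a->atts[2 * k + 1];
    }
    return NULL;
}

/* Hypothetical helper: value of the attribute called wanted, or NULL. */
static const xml_char_t *demo_find_attr(const xmlns_attr_t *attrs, const char *wanted)
{
    const xml_char_t *name;
    for (int i = 0; (name = xmlns_get_attr_name(attrs, i)) != NULL; ++i) {
        if (strcmp(name, wanted) == 0)
            return xmlns_get_attr_val(attrs, i);
    }
    return NULL;
}
```

A link-rewriting handler, for instance, would look up href or src in this way inside its StartElement function.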
Conclusion
Namespace support in Apache is an evolving technology, with a small but growing number of applications. It is now in search of a broader developer community to create new applications and turn it into a mainstream platform for XML applications on the web.
The advantages it offers include:
- A far lower processing overhead than traditional XML-based systems
- A fully pipelined architecture eliminating network latency
- A small and simple API for programming
- Flexible configuration with mix-and-match for server admins and webmasters
- Potential to develop an extensive library of off-the-shelf namespace handlers both for new applications (like mod_annot) and alternative implementations of existing applications (like SSI), and potentially extending to far bigger projects such as a native-C implementation of JSP2 (with mod_gcj).
References
The main reference for namespace processing is http://apache.webthing.com/. This includes full documentation of the API and all the modules discussed in this article, along with others of relevance. Most of the software is available for free download under the GNU General Public License (GPL). Brief documentation of the API is also available at apache.org.