Using XML Catalogs with JAXP
March 3, 2004
XML documents often refer to other documents that an XML processor has to retrieve in order to make sense of the main document. These external resources, typically referred to by URIs, may be local files; or they may be remote, distributed across the web. In an ideal world the difference would be invisible, since it would be as cheap to access a remote resource as a local one. However, in the real world network failures do occur, and it is wise to design applications that take this into account.
XML Catalogs offer a way to manage local copies of public DTDs, schemas, or indeed any XML resource that exists outside of the referring XML instance document. Rather than modifying the XML instance document to refer directly to a local copy, you leave the reference to the remote resource and write an XML Catalog that maps remote references to local resources. Your application then installs a resolver, whose job it is to consult the catalog whenever an external resource is needed. The Apache xml-commons project's Resolver package, from Norman Walsh, is a collection of Java classes for working with XML Catalogs. This article looks at how to use the Resolver classes with JAXP by working through three XML processing examples that cover the main capabilities of XML Catalogs.
XML Catalogs is currently an OASIS Committee Specification, which is a draft specification on track to becoming an OASIS Standard. It is a direct descendant of work done on catalogs for SGML systems, the current standard being the OASIS Technical Resolution TR9401 plain-text catalog format. This standard can also be used for XML applications; indeed the xml-commons Resolver supports TR9401 catalogs too, although they are not covered in this article.
Example 1: Offline Validation of XHTML Pages
For the first example, let's look at a common situation where XML Catalogs are useful: in providing a local copy of a DTD. Suppose you want to check that a page is valid XHTML -- before you put it on your website, for example. Here's a sample XHTML page to be checked:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>OMELETTE</title>
  </head>
  <body>
    <h1>Omelette by Elizabeth David</h1>
    <h2>Ingredients</h2>
    <ul>
      <li>3-4 eggs</li>
      <li>1/2 oz. butter</li>
      <li>Salt and pepper</li>
    </ul>
    <h2>Method</h2>
    <p>Beat the eggs...</p>
  </body>
</html>
The obvious way to perform the check from Java would be to use an event-based parser, such as the JAXP SAX parser shown here:
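import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Create a namespace-aware, validating SAX parser
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
SAXParser parser = factory.newSAXParser();

// Report problems through the error handler, then parse the page
XMLReader reader = parser.getXMLReader();
reader.setErrorHandler(new DefaultErrorHandler());
reader.parse(inputSource);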
DefaultErrorHandler is an implementation of org.xml.sax.ErrorHandler that prints warnings to standard error, and throws exceptions when errors or fatal errors occur during parsing. Since the parser is validating the XHTML document against the declared DOCTYPE, it will retrieve the DTD from W3C's site at http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. (It is worth noting as an aside that the DTD may be retrieved even if the parser is not validating, as the XML specification explains.) For some applications this might not be a problem, but others might not have the luxury of a permanent net connection -- a J2ME Connected Limited Device Configuration, for instance. Even if a net connection is available it might be slow, making the page checker unacceptably slow; or the resource might not be available (if W3C's site is down), causing the page checker to break.
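The DefaultErrorHandler class is described above but not listed. As a rough illustration of that behavior (warnings printed to standard error, errors and fatal errors rethrown), a handler along the following lines would do. The class body, including the describe() helper, is an assumption made for this sketch and not necessarily the code that ships in the article's download.

```java
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/** Illustrative sketch only: prints warnings, rethrows errors and fatal errors. */
public class DefaultErrorHandler implements ErrorHandler {

    @Override
    public void warning(SAXParseException e) {
        // Warnings do not stop the parse; just report them.
        System.err.println("Warning: " + describe(e));
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        // The parser could continue after a validity error, but we treat it as a failure.
        throw e;
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        // Well-formedness errors always abort the parse.
        throw e;
    }

    /** Formats the location and message of a parse problem (hypothetical helper). */
    public static String describe(SAXParseException e) {
        return e.getSystemId() + ":" + e.getLineNumber() + ":" + e.getColumnNumber()
                + " " + e.getMessage();
    }
}
```

Rethrowing from error() matters for a page checker: by default a SAX parser carries on after a validity error, so without it an invalid page could appear to pass.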
We can solve all these potential problems by using a catalog. A catalog is made up of one or more catalog entry files. Here is the simplest catalog entry file, called catalog.xml, that can be used to resolve the public identifier for an XHTML document to a local copy of the XHTML 1.0 DTD:
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
          uri="catalog/xhtml1-strict.dtd"/>
</catalog>
A catalog entry file is made up of a number of catalog entries. This one has a single public entry that describes a mapping between the public identifier of an entity -- in this case -//W3C//DTD XHTML 1.0 Strict//EN -- and a preferred URI to locate the entity -- in this case the file catalog/xhtml1-strict.dtd relative to catalog.xml. You need to download the DTD (and the external entity files that the XHTML DTD references) manually and put them in the correct local directory; the catalog simply provides the mapping and does not cache resources automatically.
To plug the catalog into our application we need to use the Apache xml-commons project's Resolver component. For a JAXP application, the key class is org.apache.xml.resolver.tools.CatalogResolver, an implementation of org.xml.sax.EntityResolver, which as the name suggests is the interface JAXP parsers use to customize handling of external entities. To register the resolver, call the setEntityResolver() method on the SAX XMLReader instance, passing in a new instance of CatalogResolver. (Similarly, in the case of a JAXP DOM parser, the CatalogResolver is set on the DocumentBuilder using the setEntityResolver() method.)
import org.apache.xml.resolver.tools.CatalogResolver;

SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
SAXParser parser = factory.newSAXParser();
XMLReader reader = parser.getXMLReader();
// Consult the XML Catalog whenever an external entity is needed
reader.setEntityResolver(new CatalogResolver());
reader.setErrorHandler(new DefaultErrorHandler());
reader.parse(inputSource);
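To make the role of the resolver concrete, here is a minimal sketch of a hand-written EntityResolver registered on a DocumentBuilder, the DOM counterpart of the SAX registration. It uses only JDK classes. The class name MapEntityResolverDemo and its in-memory map are hypothetical stand-ins for the lookup that CatalogResolver performs against real catalog entry files.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class MapEntityResolverDemo {

    /** Hypothetical stand-in for CatalogResolver: maps public IDs to DTD text held in memory. */
    public static EntityResolver resolverFor(Map<String, String> dtdByPublicId) {
        return (publicId, systemId) -> {
            String dtd = dtdByPublicId.get(publicId);
            if (dtd == null) {
                // No mapping: the parser falls back to opening the system identifier.
                return null;
            }
            InputSource source = new InputSource(new StringReader(dtd));
            source.setPublicId(publicId);
            source.setSystemId(systemId);
            return source;
        };
    }

    /** Parses the document with a DOM parser that consults the given resolver. */
    public static Document parse(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

Returning null from resolveEntity() is the signal for "not mine": the parser then resolves the system identifier itself, which is exactly the network fetch a catalog is meant to avoid.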
But how does the CatalogResolver find XML Catalog entry files? One way to configure this is by setting the system property xml.catalog.files to a semicolon-separated list of catalog entry files, for example by passing a command-line property such as -Dxml.catalog.files=/catalog/catalog.xml to the Java Virtual Machine. However, using an absolute path is best avoided since it restricts the portability of your application. Web applications, for instance, should be written in such a way as not to depend on where they are deployed on the filesystem, as this is typically out of their control.
A better way to specify catalogs is to provide a properties file with a relative path to the catalog entry files. CatalogResolver uses a CatalogManager class that automatically looks for a properties file called CatalogManager.properties on the classpath. The following properties file achieves the same effect as setting the system property xml.catalog.files:
# Catalogs are relative to this properties file
relative-catalogs=false
# Catalog list
catalogs=catalog.xml
Notice that the property relative-catalogs is set to false, which may seem a little counterintuitive. If relative-catalogs is set to true then the filenames that appear in the catalogs property are left unchanged, so a relative path will be relative to the current directory of the JVM. On the other hand, if set to false, relative paths are made absolute with respect to the CatalogManager.properties file. The full list of properties and their behavior is described in the API documentation for CatalogManager.
Finally, we can run the page checker application offline since the EntityResolver will use the local catalog to load the DTD. To prove that no net connection is required, I have written a JUnit test that runs with a security manager that blocks all net access. This test, along with all the other examples in this article, is available in the download.
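A security manager is one way to prove that nothing touches the network. The sketch below swaps in a simpler technique, which is not the one used in the article's download: it wraps a resolver so that any identifier without a local mapping fails the parse instead of falling back to a remote fetch. The names StrictResolver, strict() and parsesOffline() are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class StrictResolver {

    /** Wraps a resolver so that unmapped identifiers abort the parse. */
    public static EntityResolver strict(EntityResolver delegate) {
        return (publicId, systemId) -> {
            InputSource local = delegate.resolveEntity(publicId, systemId);
            if (local == null) {
                // Returning null would let the parser open the system identifier itself.
                throw new SAXException("No local copy for " + systemId);
            }
            return local;
        };
    }

    /**
     * Returns true if the document parses using only locally resolved entities.
     * Note that a malformed document also yields false.
     */
    public static boolean parsesOffline(String xml, EntityResolver delegate) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(strict(delegate));
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }
}
```

In a unit test, the delegate would be the resolver under test; a failure then points directly at the identifier that is missing from the catalog.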
Example 2: W3C XML Schema Validation
In the same way that an XML document may associate itself with a DTD via the DOCTYPE declaration, an XML document may associate itself with a W3C XML Schema using a schema location hint. This example looks at how to validate a document against a schema specified in this way.
A schema location hint is an xsi:schemaLocation attribute on an element -- typically the root -- whose value is a list of namespace URIs and URIs for the schemas to validate elements in those namespaces. Alternatively, if the elements are not in a namespace, a schema location hint is an xsi:noNamespaceSchemaLocation attribute whose value is a URI for the schema. The xsi prefix is bound to the http://www.w3.org/2001/XMLSchema-instance namespace URI.
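The value of xsi:schemaLocation is easiest to read as whitespace-separated pairs: a namespace URI followed by the URI of the schema for that namespace. The following sketch splits such a value into a map; the class name SchemaLocationHint is hypothetical, and a real parser performs this step internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaLocationHint {

    /** Splits an xsi:schemaLocation value into namespace URI -> schema URI pairs. */
    public static Map<String, String> parse(String schemaLocation) {
        Map<String, String> pairs = new LinkedHashMap<>();
        String trimmed = schemaLocation.trim();
        if (trimmed.isEmpty()) {
            return pairs;
        }
        // Any run of whitespace (including line breaks) separates the URIs.
        String[] tokens = trimmed.split("\\s+");
        if (tokens.length % 2 != 0) {
            throw new IllegalArgumentException("Expected namespace/location pairs");
        }
        for (int i = 0; i < tokens.length; i += 2) {
            pairs.put(tokens[i], tokens[i + 1]);
        }
        return pairs;
    }
}
```

It is the second URI of each pair, the schema location, that a catalog entry can later redirect to a local copy.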
For example, here is an XML instance document that describes a recipe, and declares itself to be valid with respect to the schema located at http://tiling.org/xmlcatalogs/schemas/recipe.xsd in the http://tiling.org/xmlcatalogs/namespaces/recipe namespace:
<?xml version="1.0" encoding="UTF-8"?>
<recipe xmlns="http://tiling.org/xmlcatalogs/namespaces/recipe"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation=
          "http://tiling.org/xmlcatalogs/namespaces/recipe
           http://tiling.org/xmlcatalogs/schemas/recipe.xsd">
  <author>Elizabeth David</author>
  <name>Omelette</name>
  <ingredient>3-4 eggs</ingredient>
  <ingredient>1/2 oz. butter</ingredient>
  <ingredient>Salt and pepper</ingredient>
  <method>Beat the eggs...</method>
</recipe>
Although the schema location is not explicitly marked as a system identifier, we can use a catalog with a system element to associate the schema with a local copy.
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <system systemId="http://tiling.org/xmlcatalogs/schemas/recipe.xsd"
          uri="catalog/recipe.xsd"/>
</catalog>
Then we can use the same JAXP SAX code as before, with one important change, to validate the XML instance document using the local schema: we need to tell JAXP which schema language to use when performing validation. In this case it is W3C XML Schema, which is configured by setting a property on the SAXParser, as shown below. Note that if the JAXP parser you are using does not implement specification version 1.2 or later, then attempting to set the property will fail by throwing a SAXNotRecognizedException. (It is worth mentioning in passing that for a DOM parser you set the same property name and value by calling the setAttribute() method on the DocumentBuilderFactory, which reports an unsupported attribute with an IllegalArgumentException.)
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
SAXParser parser = factory.newSAXParser();
// Validate against W3C XML Schema rather than a DTD (requires JAXP 1.2)
parser.setProperty(
    "http://java.sun.com/xml/jaxp/properties/schemaLanguage",
    "http://www.w3.org/2001/XMLSchema"
);
XMLReader reader = parser.getXMLReader();
reader.setEntityResolver(new CatalogResolver());
reader.setErrorHandler(new DefaultErrorHandler());
reader.parse(inputSource);
Another benefit that catalogs offer, in addition to protection from network failure, is the ability to substitute a public resource with a local one that better fits your particular application's needs. For example, in the case of schema validation, it might be useful to validate against a local schema that imposes stronger constraints than the public one. Another way of achieving this effect -- but only in the case of schema validation -- is by explicitly instructing the parser to validate against a given schema, effectively overriding the schema location hint. Just set the property http://java.sun.com/xml/jaxp/properties/schemaSource to a value specifying the schema to use. This is explained in detail in the JAXP 1.2 maintenance specification.
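As a sketch of the schemaSource approach, the following uses the DOM route, setting both properties through DocumentBuilderFactory.setAttribute() and supplying the schema as an InputStream, which is one of the value types the JAXP 1.2 specification allows. The class name SchemaSourceDemo, the isValid() helper and the inline error handler are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class SchemaSourceDemo {
    static final String SCHEMA_LANGUAGE = "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String SCHEMA_SOURCE = "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    /** Validates xml against the given schema text, ignoring any schema location hint. */
    public static boolean isValid(String xml, String xsd) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        // The schema language must be set before the schema source.
        factory.setAttribute(SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
        factory.setAttribute(SCHEMA_SOURCE,
                new ByteArrayInputStream(xsd.getBytes(StandardCharsets.UTF_8)));
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Without a throwing handler, validity errors are reported but do not fail the parse.
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) throws SAXParseException { throw e; }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            return false;
        }
    }
}
```

Because the stream can be read only once, this sketch builds a fresh factory per call; an application that validates many documents would more likely pass a file or URI instead.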
Example 3: Remote Stylesheet Inclusions
For the third example of catalogs in action, we turn to XSLT transforms and see how one stylesheet can include or import another. The xsl:include instruction, which the XSLT processor replaces with the contents of the referenced stylesheet, allows stylesheet authors to split stylesheets into modular documents. For example, the following skeleton stylesheet for transforming the recipe XML file in the previous section into XHTML includes a set of public XSLT utilities.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0"
                xmlns="http://www.w3.org/1999/xhtml"
                xmlns:r="http://tiling.org/xmlcatalogs/namespaces/recipe"
                exclude-result-prefixes="r">
  <xsl:include href="http://tiling.org/xmlcatalogs/xslt/utils.xslt"/>
  ...
  <xsl:template match="r:recipe">
    ...
  </xsl:template>
</xsl:stylesheet>
This time the catalog uses a uri element to specify the match for the included file reference:
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <uri name="http://tiling.org/xmlcatalogs/xslt/utils.xslt"
       uri="catalog/utils.xslt"/>
</catalog>
JAXP provides an interface called javax.xml.transform.URIResolver that allows applications to intercept calls to the xsl:include and xsl:import instructions, and the document() function. CatalogResolver implements this interface too, using the URI mappings from its catalog to resolve resources. So in the transform code we simply call the setURIResolver() method on the TransformerFactory, passing in an instance of CatalogResolver. Then we can create a new Transformer instance, and it will be set up to use the local file utils.xslt.
TransformerFactory factory = TransformerFactory.newInstance();
// Resolve xsl:include, xsl:import and document() references via the catalog
factory.setURIResolver(new CatalogResolver());
Transformer transformer = factory.newTransformer(stylesheetSource);
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
transformer.transform(inputStreamSource, result);
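To show what the URIResolver hook does without relying on the Resolver package, here is a minimal JDK-only sketch in which a hand-written resolver serves the included stylesheet from memory. MapUriResolverDemo and its map of stylesheet text are hypothetical stand-ins for CatalogResolver and the files it maps to.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class MapUriResolverDemo {

    /** Hypothetical stand-in for CatalogResolver: serves stylesheet text keyed by href. */
    public static URIResolver resolverFor(Map<String, String> stylesheetByHref) {
        return (href, base) -> {
            String text = stylesheetByHref.get(href);
            if (text == null) {
                // No mapping: let the XSLT processor resolve the URI itself.
                return null;
            }
            return new StreamSource(new StringReader(text), href);
        };
    }

    /** Compiles the stylesheet (resolving includes through the resolver) and runs it. */
    public static String transform(String stylesheet, String input, URIResolver resolver)
            throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        factory.setURIResolver(resolver);
        Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter writer = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(input)), new StreamResult(writer));
        return writer.toString();
    }
}
```

The resolver has to be set on the factory before newTransformer() is called, because includes and imports are resolved while the stylesheet is being compiled.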
Developing More Complex Catalogs
XML Catalogs offer several other useful features. For instance, you can delegate a match to another catalog; and you can chain catalogs together using the nextCatalog element. Also useful is the ability to map a set of mirrored resources using a single rewrite entry, as the following catalog entry file illustrates.
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteSystem systemIdStartString="http://tiling.org/xmlcatalogs/schemas/"
                 rewritePrefix="catalog/"/>
</catalog>
The rewriteSystem entry instructs the resolver to replace the start string for any matching system identifier with the given prefix. In this case, all schemas whose system identifiers begin with the string http://tiling.org/xmlcatalogs/schemas/ are mirrored in the local directory catalog/ relative to the catalog entry file.
The XML Catalogs we have seen so far have each consisted of just a single entry file with a single entry. An XML Catalog can be made up of a list of catalog entry files, each considered in turn, although subsequent files are not consulted if a match is found in an earlier file. Within each catalog entry file there are rules that govern resolution -- for a full list, see the specification. For example, system entries are considered for matching before rewriteSystem entries.
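The ordering just described can be modeled in a few lines. The sketch below is a deliberately simplified picture of system identifier resolution within one catalog entry file: exact system entries are tried first, then rewriteSystem entries, with the longest matching start string winning as the specification requires. It ignores identifier normalization, resolution of relative URIs against the catalog file's location, delegation and nextCatalog. The class SystemIdLookup and the path catalog/special/recipe.xsd in the usage example are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class SystemIdLookup {
    private final Map<String, String> systemEntries = new LinkedHashMap<>();
    private final Map<String, String> rewriteEntries = new LinkedHashMap<>();

    /** Registers the equivalent of a system entry. */
    public SystemIdLookup system(String systemId, String uri) {
        systemEntries.put(systemId, uri);
        return this;
    }

    /** Registers the equivalent of a rewriteSystem entry. */
    public SystemIdLookup rewriteSystem(String startString, String rewritePrefix) {
        rewriteEntries.put(startString, rewritePrefix);
        return this;
    }

    /** Resolves a system identifier, or returns empty if no entry matches. */
    public Optional<String> resolve(String systemId) {
        // 1. Exact system entries take precedence.
        String exact = systemEntries.get(systemId);
        if (exact != null) {
            return Optional.of(exact);
        }
        // 2. Otherwise use the rewriteSystem entry with the longest matching start string.
        String best = null;
        for (String start : rewriteEntries.keySet()) {
            if (systemId.startsWith(start) && (best == null || start.length() > best.length())) {
                best = start;
            }
        }
        if (best == null) {
            return Optional.empty();
        }
        return Optional.of(rewriteEntries.get(best) + systemId.substring(best.length()));
    }
}
```

For instance, with a system entry mapping http://tiling.org/xmlcatalogs/schemas/recipe.xsd to catalog/special/recipe.xsd and the rewriteSystem entry from the catalog above, recipe.xsd resolves through the system entry while every other schema under that prefix is rewritten into catalog/.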
When developing larger catalogs an identifier may not be resolved to the URI you expect. It can pay to write unit tests that test resolution, perhaps by restricting net access (like the examples that accompany this article). Even with tests, however, diagnostic tools can be useful. The simplest way to see what is going on during resolution is to set the CatalogManager property verbosity to a non-zero number: the higher the number the more information you get.
You can manually try resolution from the command line using the resolver application that is supplied in the Resolver package. The following session shows resolution of an XHTML DOCTYPE, such as the one in the first example at the beginning of this article.
$ java -jar lib/resolver.jar -c catalog.xml \
    -p "-//W3C//DTD XHTML 1.0 Strict//EN" \
    -s http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd doctype
Cannot find CatalogManager.properties
Resolve DOCTYPE (name, publicid, systemid):
  public id: -//W3C//DTD XHTML 1.0 Strict//EN
  system id: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
Result: file:/tom/workspace/xmlcatalogs/catalog/xhtml1-strict.dtd
Conclusion
Using XML Catalogs to manage a local store of external resources can make your JAXP applications more robust and faster by removing the dependency on the network. Furthermore, XML Catalogs is a standard with ever-increasing support -- for example, the recently released Ant 1.6 supports XML Catalogs -- so it is easy to reuse your catalog entry files.