Using XML Catalogs with JAXP

March 3, 2004

Tom White

XML documents often refer to other documents that an XML processor has to retrieve in order to make sense of the main document. These external resources, typically referred to by URIs, may be local files; or they may be remote, distributed across the web. In an ideal world the difference would be invisible, since it would be as cheap to access a remote resource as a local one. However, in the real world network failures do occur, and it is wise to design applications that take this into account.

XML Catalogs offer a way to manage local copies of public DTDs, schemas, or indeed any XML resource that exists outside of the referring XML instance document. Rather than modifying the XML instance document to refer directly to a local copy, you leave the reference to the remote resource and write an XML Catalog that maps remote references to local resources. Your application then installs a resolver, whose job it is to consult the catalog whenever an external resource is needed. The Apache xml-commons project's Resolver package, from Norman Walsh, is a collection of Java classes for working with XML Catalogs. This article looks at how to use the Resolver classes with JAXP by working through three XML processing examples that cover the main capabilities of XML Catalogs.

XML Catalogs is currently an OASIS Committee Specification, which is a draft specification on track to becoming an OASIS Standard. It is a direct descendant of work done on catalogs for SGML systems, the current standard being the OASIS Technical Resolution TR9401 plain-text catalog format. This standard can also be used for XML applications; indeed the xml-commons Resolver supports TR9401 catalogs too, although they are not covered in this article.

Example 1: Offline Validation of XHTML Pages

For the first example, let's look at a common situation where XML Catalogs are useful: in providing a local copy of a DTD. Suppose you want to check that a page is valid XHTML -- before you put it on your website, for example. Here's a sample XHTML page to be checked:


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
  xml:lang="en" lang="en">
  <head>
    <title>OMELETTE</title>
  </head>
  <body>
    <h1>Omelette by Elizabeth David</h1>
    <h2>Ingredients</h2>
    <ul>
      <li>3-4 eggs</li>
      <li>1/2 oz. butter</li>
      <li>Salt and pepper</li>
    </ul>
    <h2>Method</h2>
    <p>Beat the eggs...</p>
  </body>
</html> 

The obvious way to perform the check from Java would be to use an event-based parser, such as the JAXP SAX parser shown here:

SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
SAXParser parser = factory.newSAXParser();
XMLReader reader = parser.getXMLReader();
reader.setErrorHandler(new DefaultErrorHandler());        
reader.parse(inputSource);

DefaultErrorHandler is an implementation of org.xml.sax.ErrorHandler that prints warnings to standard error, and throws exceptions when errors or fatal errors occur during parsing. Since the parser is validating the XHTML document against the declared DOCTYPE, it will retrieve the DTD from W3C's site at http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. (It is worth noting as an aside that the DTD may be retrieved even if the parser is not validating, as the XML specification explains.) For some applications this might not be a problem, but others might not have the luxury of a permanent net connection -- a J2ME Connected Limited Device Configuration, for instance. Even if a net connection is available it might be slow, making the page checker unacceptably slow; or the resource might not be available (if W3C's site is down), causing the page checker to break.
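
The DefaultErrorHandler class itself is not listed in this article. As a rough idea of what a handler with the behavior described above might look like, here is a minimal sketch; the describe() helper and the message format are assumptions made for illustration, not the code from the download.

```java
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

/** Sketch of an ErrorHandler that logs warnings and rethrows errors. */
final class DefaultErrorHandler implements ErrorHandler {

    @Override
    public void warning(SAXParseException e) {
        // Warnings are reported on standard error but do not stop the parse.
        System.err.println("Warning: " + describe(e));
    }

    @Override
    public void error(SAXParseException e) throws SAXParseException {
        // Validity errors are recoverable by default; rethrow so the check fails.
        throw e;
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXParseException {
        // Well-formedness errors always abort the parse.
        throw e;
    }

    /** Formats an exception as systemId:line:column: message (illustrative format). */
    static String describe(SAXParseException e) {
        return e.getSystemId() + ":" + e.getLineNumber() + ":"
            + e.getColumnNumber() + ": " + e.getMessage();
    }
}
```

Throwing from error() matters for a page checker: SAX treats validity errors as recoverable, so a handler that ignored them would let an invalid page pass silently.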

We can solve all these potential problems by using a catalog. A catalog is made up of one or more catalog entry files. Here is the simplest catalog entry file, called catalog.xml, that can be used to resolve the public identifier for an XHTML document to a local copy of the XHTML 1.0 DTD:

<?xml version="1.0" encoding="UTF-8"?>
<catalog
  xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
    uri="catalog/xhtml1-strict.dtd"/>

</catalog>
    

A catalog entry file is made up of a number of catalog entries. This one has a single public entry that describes a mapping between the public identifier of an entity -- in this case -//W3C//DTD XHTML 1.0 Strict//EN -- and a preferred URI to locate the entity -- in this case the file catalog/xhtml1-strict.dtd relative to catalog.xml. You need to manually download the DTD (and the referenced external entity files for XHTML) and put them in the correct local directory; the catalog simply provides the mapping and does not provide automatic caching facilities.

To plug the catalog into our application we need to use the Apache xml-commons project's Resolver component. For a JAXP application, the key class is org.apache.xml.resolver.tools.CatalogResolver, an implementation of org.xml.sax.EntityResolver, which as the name suggests is the interface JAXP parsers use to customize handling of external entities. To register the resolver, call the setEntityResolver() method on the SAX XMLReader instance, passing in a new instance of CatalogResolver. (Similarly, in the case of a JAXP DOM parser, the CatalogResolver is set on the DocumentBuilder using the setEntityResolver() method.)

SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
SAXParser parser = factory.newSAXParser();
XMLReader reader = parser.getXMLReader();
reader.setEntityResolver(new CatalogResolver());
reader.setErrorHandler(new DefaultErrorHandler());        
reader.parse(inputSource);
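
For the DOM case mentioned above, the shape of the code is much the same. The sketch below runs with only the JDK, so it substitutes a tiny hand-written EntityResolver for CatalogResolver; the public identifier, the in-memory DTD, and the class name are all hypothetical stand-ins. In a real application the line marked below would pass a new CatalogResolver instead.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class DomResolverSketch {

    // Hypothetical public identifier and DTD, standing in for XHTML's.
    static final String PUBLIC_ID = "-//EXAMPLE//DTD Note 1.0//EN";
    static final String LOCAL_DTD = "<!ELEMENT note (#PCDATA)>";

    /** Stand-in for CatalogResolver: maps one public identifier to a local DTD. */
    static final EntityResolver LOCAL_RESOLVER = (publicId, systemId) -> {
        if (PUBLIC_ID.equals(publicId)) {
            InputSource local = new InputSource(new StringReader(LOCAL_DTD));
            local.setPublicId(publicId);
            local.setSystemId(systemId);
            return local;
        }
        return null; // no mapping: let the parser use its default behavior
    };

    static Document parse(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(LOCAL_RESOLVER); // a CatalogResolver would go here
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) {
                System.err.println("Warning: " + e.getMessage());
            }
            public void error(SAXParseException e) throws SAXException {
                throw e;
            }
            public void fatalError(SAXParseException e) throws SAXException {
                throw e;
            }
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

Because the resolver hands back an InputSource that already carries a character stream, the parser reads the DTD from that stream and never opens the remote system identifier.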
    

But how does the CatalogResolver find XML Catalog entry files? One way to configure this is to set the system property xml.catalog.files to a semicolon-separated list of catalog entry files, for example by passing -Dxml.catalog.files=/catalog/catalog.xml to the Java Virtual Machine on the command line. However, using an absolute path is best avoided since it restricts the portability of your application. Web applications, for instance, should be written in such a way as not to depend on where they are deployed on the filesystem, as this is typically out of their control.
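
The same property can also be set from code rather than on the command line. The following is only a sketch: the class name and the second catalog path are hypothetical, and it assumes the property is set before the first CatalogResolver is created.

```java
import java.util.Arrays;
import java.util.List;

final class CatalogFilesProperty {

    static final String PROPERTY = "xml.catalog.files";

    /** Joins catalog entry file paths with semicolons, the separator described above. */
    static String join(List<String> catalogFiles) {
        return String.join(";", catalogFiles);
    }

    public static void main(String[] args) {
        // Hypothetical paths; equivalent in effect to -Dxml.catalog.files=... on the command line.
        List<String> files = Arrays.asList("/catalog/catalog.xml", "/catalog/extra.xml");
        System.setProperty(PROPERTY, join(files));
        System.out.println(PROPERTY + "=" + System.getProperty(PROPERTY));
    }
}
```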

A better way to specify catalogs is to provide a properties file with a relative path to the catalog entry files. CatalogResolver uses a CatalogManager class that automatically looks for a properties file called CatalogManager.properties on the classpath. The following properties file achieves the same effect as setting the system property xml.catalog.files:

# Catalogs are relative to this properties file
relative-catalogs=false
# Catalog list
catalogs=catalog.xml
    

Notice that the property relative-catalogs is set to false, which may seem a little counterintuitive. If relative-catalogs is set to true then the filenames that appear in the catalogs property are left unchanged, so a relative path will be relative to the current directory of the JVM. On the other hand, if it is set to false, relative paths are made absolute with respect to the CatalogManager.properties file. The full list of properties and their behavior is described in the API documentation for CatalogManager.
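
To make the two cases concrete, the rule can be modeled with plain URI resolution. This is an illustration of the behavior described above and not the Resolver's own code; the class and method names are invented for the sketch.

```java
import java.net.URI;
import java.nio.file.Paths;

final class CatalogPathSketch {

    /**
     * Works out where a catalog entry file named in the catalogs property
     * would be looked up, given the location of CatalogManager.properties.
     */
    static URI locate(String catalogPath, URI propertiesFile, boolean relativeCatalogs) {
        URI path = URI.create(catalogPath);
        if (path.isAbsolute()) {
            return path; // already a full URI, nothing to resolve
        }
        if (relativeCatalogs) {
            // Left unchanged by the manager, so it ends up relative
            // to the current directory of the JVM.
            return Paths.get("").toAbsolutePath().toUri().resolve(path);
        }
        // Made absolute with respect to the properties file.
        return propertiesFile.resolve(path);
    }
}
```

With relative-catalogs=false and a properties file at file:/app/classes/CatalogManager.properties (a made-up location), the entry catalogs=catalog.xml resolves to file:/app/classes/catalog.xml wherever the JVM happens to be started from.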

Finally, we can run the page checker application offline since the EntityResolver will use the local catalog to load the DTD. To prove that no net connection is required, I have written a JUnit test that runs with a security manager that blocks all net access. This test, along with all the other examples in this article, is available in the download.

Example 2: W3C XML Schema Validation

In the same way that an XML document may associate itself with a DTD via the DOCTYPE declaration, an XML document may associate itself with a W3C XML Schema using a schema location hint. This example looks at how to validate a document against a schema specified in this way.

A schema location hint is an xsi:schemaLocation attribute on an element -- typically the root -- whose value is a list of namespace URIs and URIs for the schemas to validate elements in those namespaces. Alternatively, if the elements are not in a namespace, a schema location hint is an xsi:noNamespaceSchemaLocation attribute whose value is a URI for the schema. The xsi prefix is bound to the http://www.w3.org/2001/XMLSchema-instance namespace URI.

For example, here is an XML instance document that describes a recipe, and declares itself to be valid with respect to the schema located at http://tiling.org/xmlcatalogs/schemas/recipe.xsd in the http://tiling.org/xmlcatalogs/namespaces/recipe namespace:

<?xml version="1.0" encoding="UTF-8"?>
<recipe
  xmlns="http://tiling.org/xmlcatalogs/namespaces/recipe"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation=
    "http://tiling.org/xmlcatalogs/namespaces/recipe
    http://tiling.org/xmlcatalogs/schemas/recipe.xsd">

  <author>Elizabeth David</author>
  <name>Omelette</name>
  <ingredient>3-4 eggs</ingredient>
  <ingredient>1/2 oz. butter</ingredient>
  <ingredient>Salt and pepper</ingredient>
  <method>Beat the eggs...</method>
</recipe> 
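
The value of xsi:schemaLocation is simply a whitespace-separated list of pairs: a namespace URI followed by the URI of a schema for it. A small sketch of how such a value breaks down (the class name is hypothetical, and this is not how any particular parser implements it):

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SchemaLocationHint {

    /** Splits an xsi:schemaLocation value into namespace URI -> schema URI pairs. */
    static Map<String, String> parse(String schemaLocation) {
        Map<String, String> pairs = new LinkedHashMap<>();
        String trimmed = schemaLocation.trim();
        if (trimmed.isEmpty()) {
            return pairs;
        }
        String[] tokens = trimmed.split("\\s+");
        if (tokens.length % 2 != 0) {
            throw new IllegalArgumentException(
                "schemaLocation must contain namespace/location pairs");
        }
        for (int i = 0; i < tokens.length; i += 2) {
            pairs.put(tokens[i], tokens[i + 1]); // namespace URI, then schema URI
        }
        return pairs;
    }
}
```

For the recipe document, the only pair maps the recipe namespace to http://tiling.org/xmlcatalogs/schemas/recipe.xsd, and it is that second URI which the catalog needs to match.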

Although the schema location is not explicitly marked as a system identifier, we can use a catalog with a system element to associate the schema with a local copy.

<?xml version="1.0" encoding="UTF-8"?>
<catalog
  xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <system 
    systemId="http://tiling.org/xmlcatalogs/schemas/recipe.xsd"
    uri="catalog/recipe.xsd"/>

</catalog> 

Then we can use the same JAXP SAX code as before -- with one important change -- to validate the XML instance document using the local schema. The only change needed is to tell JAXP which schema language to use when performing validation. In this case it is W3C XML Schema, which is configured by setting a property on the SAXParser, as shown below. Note that if the JAXP parser you are using does not implement version 1.2 or later of the JAXP specification, then attempting to set the property will fail by throwing an IllegalArgumentException. (It is worth mentioning in passing that for a DOM parser you set the same property name and value by calling the setAttribute() method on the DocumentBuilderFactory.)

SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
SAXParser parser = factory.newSAXParser();
parser.setProperty(
    "http://java.sun.com/xml/jaxp/properties/schemaLanguage",
    "http://www.w3.org/2001/XMLSchema"
);
XMLReader reader = parser.getXMLReader();
reader.setEntityResolver(new CatalogResolver());
reader.setErrorHandler(new DefaultErrorHandler());
reader.parse(inputSource);
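
The DOM variant mentioned in passing might be configured along these lines. This is a sketch only; the class name and the choice to rethrow as ParserConfigurationException are assumptions of the example.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

final class DomSchemaLanguage {

    static final String SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    static DocumentBuilder newBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        try {
            // Same property name and value as for the SAXParser.
            factory.setAttribute(SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
        } catch (IllegalArgumentException e) {
            // Raised by parsers that do not recognize the attribute.
            throw new ParserConfigurationException(
                "Parser does not support the schemaLanguage attribute: " + e.getMessage());
        }
        // As in the SAX version, an entity resolver and an error handler
        // would be registered on the builder before parsing.
        return factory.newDocumentBuilder();
    }
}
```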

Another benefit that catalogs offer, in addition to protection from network failure, is the ability to replace a public resource with a local one that better fits your particular application's needs. For example, in the case of schema validation, it might be useful to validate against a local schema that imposes stronger constraints than the public one. Another way of achieving this effect -- but only in the case of schema validation -- is by explicitly instructing the parser to validate against a given schema, effectively overriding the schema location hint. Just set the property http://java.sun.com/xml/jaxp/properties/schemaSource to a value specifying the schema to use. This is explained in detail in the JAXP 1.2 maintenance specification.
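
A minimal sketch of the schemaSource approach follows. So that it runs without any files, the schema is held in a string and passed as an InputStream; the schema itself is a hypothetical, cut-down stand-in that only allows a single name element, not the recipe schema used in this article.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class SchemaSourceSketch {

    static final String NS = "http://tiling.org/xmlcatalogs/namespaces/recipe";

    // Hypothetical local schema: a recipe holds exactly one name element.
    static final String LOCAL_SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='" + NS + "' elementFormDefault='qualified'>"
        + "<xs:element name='recipe'><xs:complexType><xs:sequence>"
        + "<xs:element name='name' type='xs:string'/>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Validates xml against LOCAL_SCHEMA; throws SAXException if it is invalid. */
    static void validate(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        SAXParser parser = factory.newSAXParser();
        // The schema language must be set before the schema source.
        parser.setProperty(
            "http://java.sun.com/xml/jaxp/properties/schemaLanguage",
            "http://www.w3.org/2001/XMLSchema");
        parser.setProperty(
            "http://java.sun.com/xml/jaxp/properties/schemaSource",
            new ByteArrayInputStream(LOCAL_SCHEMA.getBytes(StandardCharsets.UTF_8)));
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void error(SAXParseException e) throws SAXException {
                throw e; // treat validity errors as failures
            }
        });
    }
}
```

Unlike a catalog, this fixes the schema in the application code, so it suits cases where the application rather than the document decides what counts as valid.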

Example 3: Remote Stylesheet Inclusions

For the third example of catalogs in action, we turn to XSLT transforms and see how one stylesheet can include or import another. The xsl:include instruction, which the XSLT processor replaces with the contents of the referenced stylesheet, allows stylesheet authors to split stylesheets into modular documents. For example, the following skeleton stylesheet for transforming the recipe XML file in the previous section into XHTML includes a set of public XSLT utilities.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0"
  xmlns="http://www.w3.org/1999/xhtml"
  xmlns:r="http://tiling.org/xmlcatalogs/namespaces/recipe"
  exclude-result-prefixes="r">

  <xsl:include
    href="http://tiling.org/xmlcatalogs/xslt/utils.xslt"/>
    
  ...
  
  <xsl:template match="r:recipe">
  ...    
  </xsl:template>

</xsl:stylesheet>

This time the catalog uses a uri element to specify the match for the included file reference:

<?xml version="1.0" encoding="UTF-8"?>
<catalog
  xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <uri name="http://tiling.org/xmlcatalogs/xslt/utils.xslt"
    uri="catalog/utils.xslt"/>

</catalog>

JAXP provides an interface called javax.xml.transform.URIResolver that allows applications to intercept calls to the xsl:include and xsl:import instructions, and the document() function. CatalogResolver implements this interface too, using the URI mappings from its catalog to resolve resources. So in the transform code we simply call the setURIResolver() method on the TransformerFactory, passing in an instance of CatalogResolver. Then we can create a new Transformer instance, and it will be set up to use the local file utils.xslt.

TransformerFactory factory = TransformerFactory.newInstance();
factory.setURIResolver(new CatalogResolver());
Transformer transformer = factory.newTransformer(stylesheetSource);
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
transformer.transform(inputStreamSource, result);
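
To see the URIResolver hook in isolation, the sketch below replaces CatalogResolver with a hand-written resolver and keeps both stylesheets in memory, so it runs with only the JDK. The contents of the utilities stylesheet, the greeting template, and the class name are all hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class IncludeResolverSketch {

    static final String XSL = "http://www.w3.org/1999/XSL/Transform";
    static final String UTILS_URI = "http://tiling.org/xmlcatalogs/xslt/utils.xslt";

    // Hypothetical contents for the local copy of the utilities stylesheet.
    static final String LOCAL_UTILS =
        "<xsl:stylesheet xmlns:xsl='" + XSL + "' version='1.0'>"
        + "<xsl:template name='greeting'>from local utils</xsl:template>"
        + "</xsl:stylesheet>";

    // Main stylesheet: includes the remote URI and calls a template defined there.
    static final String MAIN =
        "<xsl:stylesheet xmlns:xsl='" + XSL + "' version='1.0'>"
        + "<xsl:include href='" + UTILS_URI + "'/>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'><xsl:call-template name='greeting'/></xsl:template>"
        + "</xsl:stylesheet>";

    /** Stand-in for CatalogResolver: serves the local copy for one known href. */
    static final URIResolver LOCAL_RESOLVER = (href, base) ->
        UTILS_URI.equals(href)
            ? new StreamSource(new StringReader(LOCAL_UTILS), href)
            : null; // null lets the processor resolve the URI itself

    static String transform(String xml) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        // Must be set before newTransformer(), which is when includes are resolved.
        factory.setURIResolver(LOCAL_RESOLVER);
        Transformer transformer =
            factory.newTransformer(new StreamSource(new StringReader(MAIN)));
        StringWriter writer = new StringWriter();
        transformer.transform(
            new StreamSource(new StringReader(xml)), new StreamResult(writer));
        return writer.toString();
    }
}
```

Note that the resolver is set on the factory, not on the Transformer: xsl:include and xsl:import are processed when the stylesheet is compiled, whereas a resolver set on the Transformer only affects document() calls made during the transformation.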

Developing More Complex Catalogs

XML Catalogs offer several other useful features. For instance, you can delegate a match to another catalog; and you can chain catalogs together using the nextCatalog element. Also useful is the ability to map a set of mirrored resources using a single rewrite entry, as the following catalog entry file illustrates.

<?xml version="1.0" encoding="UTF-8"?>
<catalog
  xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <rewriteSystem
    systemIdStartString="http://tiling.org/xmlcatalogs/schemas/"
    rewritePrefix="catalog/"/>
</catalog>

The rewriteSystem instructs the resolver to replace the start string for any matching system identifier with the given prefix. In this case, all schemas that begin with the string http://tiling.org/xmlcatalogs/schemas/ are mirrored in the local directory catalog/ relative to the catalog entry file.

The XML Catalogs we have seen so far have each consisted of just a single entry file with a single entry. An XML Catalog can be made up of a list of catalog entry files, each considered in turn, although subsequent files are not consulted if a match is found in an earlier file. Within each catalog entry file there are rules that govern resolution -- for a full list, see the specification. For example, system entries are considered for matching before rewriteSystem entries.
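
The interplay between system and rewriteSystem entries can be modeled in a few lines. This is a simplified illustration of the matching rules within a single catalog entry file, not the Resolver's implementation: it skips identifier normalization and does not resolve results against the catalog's base URI. The class name and the sample mappings are hypothetical, and the longest-prefix rule follows the specification's treatment of multiple matching rewriteSystem entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SystemResolutionSketch {

    private final Map<String, String> systemEntries = new LinkedHashMap<>();
    private final Map<String, String> rewriteEntries = new LinkedHashMap<>();

    void addSystem(String systemId, String uri) {
        systemEntries.put(systemId, uri);
    }

    void addRewriteSystem(String systemIdStartString, String rewritePrefix) {
        rewriteEntries.put(systemIdStartString, rewritePrefix);
    }

    /** Returns the mapped URI, or null when no entry matches. */
    String resolveSystem(String systemId) {
        // 1. Exact system entries are considered first.
        String exact = systemEntries.get(systemId);
        if (exact != null) {
            return exact;
        }
        // 2. Then rewriteSystem entries; the longest matching start string wins.
        String best = null;
        for (String start : rewriteEntries.keySet()) {
            if (systemId.startsWith(start)
                    && (best == null || start.length() > best.length())) {
                best = start;
            }
        }
        if (best == null) {
            return null;
        }
        // Replace the matched start string with the rewrite prefix.
        return rewriteEntries.get(best) + systemId.substring(best.length());
    }
}
```

With the rewriteSystem entry shown earlier, http://tiling.org/xmlcatalogs/schemas/recipe.xsd becomes catalog/recipe.xsd, unless a system entry for that exact identifier is also present, in which case the system entry's URI is used.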

When developing larger catalogs an identifier may not be resolved to the URI you expect. It can pay to write unit tests that test resolution, perhaps by restricting net access (like the examples that accompany this article). Even with tests, however, diagnostic tools can be useful. The simplest way to see what is going on during resolution is to set the CatalogManager property verbosity to a non-zero number: the higher the number the more information you get.

You can manually try resolution from the command line using the resolver application that is supplied in the Resolver package. The following session shows resolution of an XHTML DOCTYPE, such as the one in the first example at the beginning of this article.

$ java -jar lib/resolver.jar -c catalog.xml \
  -p "-//W3C//DTD XHTML 1.0 Strict//EN" \
  -s http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd doctype
Cannot find CatalogManager.properties
Resolve DOCTYPE (name, publicid, systemid):
  public id: -//W3C//DTD XHTML 1.0 Strict//EN
  system id: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
Result: file:/tom/workspace/xmlcatalogs/catalog/xhtml1-strict.dtd

Conclusion

Using XML Catalogs to manage a local store of external resources can make your JAXP applications more robust and faster by removing the dependency on the network. Furthermore, XML Catalogs is a standard with ever-increasing support -- for example, the recently released Ant 1.6 supports XML Catalogs -- so it is easy to reuse your catalog entry files.

Resources