XML.com: XML From the Inside Out


Top Ten SAX2 Tips

December 05, 2001

If you write XML processing code in Java, or indeed most popular programming languages, you will be familiar with SAX, the Simple API for XML. SAX is the best API available for streaming XML processing, for processing documents as they're being read. SAX is the most flexible API in terms of data structures: since it doesn't force you to use a particular in-memory representation, you can choose the one you like best. SAX has great support for the XML Infoset (particularly in SAX2, the second version of SAX) and better DTD support than other widely available APIs. SAX2 is now part of JDK 1.4 and will soon be available to even more Java developers.

In this article, I'll highlight some points that can make your SAX programming in Java more portable, robust, and expressive. Some of these points are just advice, some address common programming problems, and some address SAX techniques that offer unique power to developers. A new book from O'Reilly, SAX2, addresses these topics and more. It details all the SAX APIs and explains each feature in more detail than this short article provides.

1. Keep it Simple

Despite its name, the Simple API for XML is often more complicated than it first appears. SAX has grown to accommodate a lot of the flexibility needed by the tools and applications that process XML, but when you start out with SAX you should first focus on its underlying simplicity.

Related Reading: SAX2, by David Brownell. January 2002 (est.), 240 pages (est.), $29.95 (est.)

Think of SAX2 (including its standardized extensions) as basically including: one parser API, two handler interfaces for content, two handler interfaces for DTD declarations (the best support of any current Java parser API), and a bunch of other classes and interfaces. Many applications can ignore most of that and start with just a few classes and interfaces:

  • XMLReader is the basic parser interface, and you get a parser object using XMLReaderFactory.

  • DefaultHandler has no-op implementations of the most popular handler methods, which you can just override.

  • Attributes wraps up the attributes of the elements reported to you.

You can write useful tools with just those APIs, overriding only three methods in the DefaultHandler class: startElement() when the parser reports the beginning of an element and its attributes, characters() to handle character data inside such elements, and endElement() when the parser reports the end of the element.
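As a minimal sketch of that starting point, the class below overrides just those three methods and counts what it sees. The class name ElementCounter, its fields, and the count() helper are made up for this illustration. XMLReaderFactory is marked deprecated in JDK 9 and later, so the helper suppresses that warning.

```java
import java.io.IOException;
import java.io.StringReader;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: counts the events reported by the parser.
class ElementCounter extends DefaultHandler {
    int elements;    // startElement() calls seen
    int closed;      // endElement() calls seen
    int textLength;  // characters reported, summed over all characters() calls

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements++;
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        // One run of text may arrive in several calls, so add up the lengths.
        textLength += length;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        closed++;
    }

    // Parses the given document text with the system default SAX2 parser.
    @SuppressWarnings("deprecation")
    static ElementCounter count(String xml) {
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            ElementCounter handler = new ElementCounter();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
            return handler;
        } catch (SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ElementCounter c = count("<poses><pose>Scorpion</pose><pose>Crow</pose></poses>");
        System.out.println(c.elements + " elements, " + c.textLength + " characters");
    }
}
```

Everything else in DefaultHandler stays a no-op here, which is the point: the handler only grows when the application needs more.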

You shouldn't use any other functionality until your application requires it. Some of the tips which follow describe common reasons to use more features. Good error handling is right at the top of the list of such reasons, and if you ever process documents with DTDs, smarter handling of external entity resolution won't be far behind.

2. Buffer characters() calls

Just because a bunch of text looks to you like it's one long set of characters doesn't mean that's how a SAX parser will report it. You need to explicitly group characters that your application thinks belong together. For example, consider this XML fragment:

<asana>Vrichikasana &mdash; Scorpion</asana>

Certainly you'll see callbacks for the element boundaries, and for the various characters. But how many callbacks will you see for those characters? It'd be legal (but annoying) for the parser to report one character per callback. More typically, parsers would report the characters before and after the mdash entity reference using one callback each, and also report the entity reference (plus its contents, whatever they are). Some parsers that don't report the entity reference would make only one characters() callback for the whole thing, assuming the entity is the standard ISO entity for Unicode character U+2014. There are even more legal ways to report that simple block of text. Your event handler needs to work with all of them.

The solution is to buffer up all the characters you receive in the characters() callback. You could just append to a String, but that's not particularly efficient. It's easier to use a StringBuffer, since one StringBuffer.append() signature is an exact match for the parameters in this callback, and it's easy to turn those into a String later:

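import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

class MyHandler implements ContentHandler {
	private StringBuffer chars = new StringBuffer ();

	// Append each chunk as it arrives; the parser may split text anywhere
	public void characters (char buf [], int offset, int length)
	throws SAXException
		{ chars.append (buf, offset, length); }

	// Return the text buffered so far, and reset the buffer
	private String getCharacters ()
	{
		String retval = chars.toString ();
		chars.setLength (0);
		return retval;
	}

	// ... lots more in this class!
}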

And now the interesting question is: when to collect that set of buffered characters to do something interesting with it? The answer depends on what your application is doing, but it'll usually be in endElement() or startElement(). Sometimes you'll collect the characters when there's a processingInstruction(), or, more rarely, when a comment() is reported. As a rule, avoid treating CDATA sections or entity expansions as if characters inside them were somehow special. Such boundaries are primarily for authoring convenience, and they shouldn't matter except to editor applications.

One scenario that's easy to handle is what's sometimes called "data elements" -- which contain text only and no other elements. (Their DTD content model might be (#PCDATA).) When you know that's what you're working with, collect the element's data in endElement(). That transparently ignores things like comments and PIs that might have been inside the element, as well as any entity or CDATA section boundaries found there. It's harder to give general rules for other kinds of content model, which is in part why many people like to specify the data style of element rather than allowing "mixed content" or using unrestricted content models like ANY. When a startElement() call needs to indicate the end of some text, your code can get complicated.
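Here is one way the "data element" case might be sketched. The class name DataElementCollector and its collect() helper are hypothetical, and the sketch assumes that any element with no child elements counts as a data element. The sample document in main() declares its own mdash entity in an internal DTD subset so that it is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: collects the text of each "data element".
class DataElementCollector extends DefaultHandler {
    private final StringBuffer chars = new StringBuffer();
    private boolean leaf;  // true while the current element has had no children
    final List<String> values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        chars.setLength(0);  // drop text (usually whitespace) seen before this element
        leaf = true;
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        chars.append(buf, offset, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Assumption of this sketch: an element without child elements is a data element.
        if (leaf) {
            values.add(chars.toString());
        }
        chars.setLength(0);
        leaf = false;
    }

    // XMLReaderFactory is deprecated in JDK 9 and later, hence the annotation.
    @SuppressWarnings("deprecation")
    static List<String> collect(String xml) {
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            DataElementCollector handler = new DataElementCollector();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
            return handler.values;
        } catch (SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE poses [<!ENTITY mdash '&#x2014;'>]>"
            + "<poses><asana>Vrichikasana &mdash; <!-- aka -->Scorpion</asana>"
            + "<asana><![CDATA[Crow & Crane]]></asana></poses>";
        for (String value : collect(xml)) {
            System.out.println(value);
        }
    }
}
```

The comment, the entity reference, and the CDATA section all vanish from the collected strings, however the parser chooses to split up its characters() calls.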

Remember that if you're using DTDs, you'll likely get some calls to ignorableWhitespace() to report characters in "element content" models. I usually like to just discard all such characters, since they're known to be semantically meaningless. But sometimes that's not an option, and the solution is instead to call characters() with the ignorable whitespace characters. The parameters are the same; you don't even need to reorder them.

public void ignorableWhitespace (char buf [], int offset, int length)
throws SAXException
	{ characters (buf, offset, length); }

If you used only element content models and text-only content models, it'd be easy to get all the useful text from a valid XML document. It would be the content of "data elements" that you'd get when endElement() is called or in attribute values from startElement(). The rest would be ignorable whitespace, which you'd ignore.

3. Use XMLReaderFactory for Bootstrapping

Don't hardwire your code to use a particular SAX2 parser or to rely on features of a particular parser. Good SAX-based systems build almost everything as layers over the parser rather than using nonstandard features. In fact, the best way to bootstrap a SAX2 parser hides which parser you're using. It's a single call to a helper class:

XMLReader parser = XMLReaderFactory.createXMLReader ();

That gives you the "system default" parser. Which parser is that? You can control that. The most reliable way is to specify the parser name on the command line, using the org.xml.sax.driver system property and the name of your parser to establish a particular JVM-wide default. You can do it like this:

java -Dorg.xml.sax.driver=gnu.xml.aelfred2.XmlReader MyMainClass arg ...
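To see which parser a given configuration actually selects, a small check along these lines may help. It is only a sketch: the class name Bootstrap and the defaultReader() helper are invented for the example, and on JDK 9 and later XMLReaderFactory is marked deprecated, which is why the warning is suppressed.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: reports which SAX2 parser the factory picked.
class Bootstrap {
    // Returns the system default parser without naming any parser class.
    @SuppressWarnings("deprecation")
    static XMLReader defaultReader() {
        try {
            return XMLReaderFactory.createXMLReader();
        } catch (SAXException e) {
            throw new IllegalStateException("no SAX2 parser available", e);
        }
    }

    public static void main(String[] args) {
        // Null unless the property was set, for example with -D on the command line.
        String configured = System.getProperty("org.xml.sax.driver");
        XMLReader parser = defaultReader();
        System.out.println("org.xml.sax.driver = " + configured);
        System.out.println("parser class       = " + parser.getClass().getName());
    }
}
```

Running it with and without the -D option shows whether the property is taking effect, and the application code never mentions a parser class name.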

Some current SAX2 distributions (SAX2 r2pre3 at this writing but not JDK 1.4) include easier ways to control the SAX2 default. One way is through a system resource that's accessed through your class loader: the META-INF/services/org.xml.sax.driver resource. That's sensitive to your class loader configuration; in some cases that may be a feature. Such recent distributions also expect redistributions (from parser suppliers) to include a compiled-in "last gasp" default, which handles the case where none of the other configuration mechanisms have been set up.

The following table gives the names of some widely used SAX2 parsers. You should avoid hardwiring such names into your source code; instead use the parser configuration mechanisms to keep your code free of parser dependencies. All of these are optionally validating, except the one labeled non-validating, and most do quite well on most XML conformance tests.

Parser        Class Name
Ælfred2       gnu.xml.aelfred2.SAXDriver (non-validating), or else gnu.xml.aelfred2.XmlReader
Xerces Java   org.apache.xerces.parsers.SAXParser

If you're still using a SAX1 parser, and setting the org.xml.sax.parser system property to point to that parser, the XMLReaderFactory will fall back to that class if it can't find a native SAX2 parser implementation. You should probably upgrade to a more current implementation, but meanwhile you can continue to use your old one. It will be automagically wrapped in a ParserAdapter by the SAX2 factory.

4. Check for empty Namespace URI Strings

Namespaces have caused a lot of grief for XML developers. At first the use of namespace URIs as purely abstract identifiers caused the confusion, since they looked like URLs that would be used to fetch something (but nobody knew what). But it didn't stop there. Even today reasonable people (along with the applications and tools that they build) have very different perspectives on what it means to be in a particular namespace. It seems to be a rare month in which significant misunderstandings don't crop up in some area of namespace handling.

There's only one basic thing that programmers can do with any namespace URI: compare it to another one as a string. But not every name in an XML document has a namespace URI, and names in namespaces need to be handled differently from names that aren't in a namespace. (You can rely on either the qName or the localName to have a value, but not both. Either name will be an empty string in some cases.) You might be tempted to write code that assumes every XML element or attribute name is in a namespace, but that just doesn't match real world data. One day you'll get a document that's not quite as clean as you expect, and your code will break.

Which means that when you're writing SAX2 code to look at element or attribute names, you have to figure out whether there's even a namespace name. When there isn't, the namespace URI is always passed as an empty string. Once you know which kind of name you're working with, you can figure out how to handle the element or attribute in question. Inline code to do name-based dispatching should look something like the following (for elements); notice that it doesn't even know there's such a thing as a namespace prefix:

// error(String) is an application-defined helper, not part of SAX
public void startElement (
	String uri, String localName,
	String qName, Attributes atts
) throws SAXException
{
	// Handle elements not in any namespace
	if ("".equals (uri)) {
		// these only have "qName"
		if ("dolce".equals (qName)) {
			// ... handle "dolce"
		} else if ("vita".equals (qName)) {
			// ... handle "vita"

		// ... and all other supported "no namespace" elements
		} else
			error ("unrecognized element name: " + qName);

	// Then handle each supported namespace separately
	} else if ("http://www.example.com/namespaces/ns1".equals (uri)) {
		// these have a "localName" with no prefix
		if ("free".equals (localName)) {
			// ... handle "free"
		} else if ("open".equals (localName)) {
			// ... handle "open"

		// ... and all other supported NS1 elements
		} else
			error ("unrecognized NS1 element name: " + localName);

	// ... and similarly for all other supported element namespaces
	} else
		error ("unrecognized element namespace: " + uri);
}

Attributes might not need that kind of handling. Applications often "know about" particular attributes, access them by name, and just ignore any unrecognized attributes. If you're accessing attribute values in that way, just make sure you use the right naming convention, either Attributes.getValue(uri,local) or Attributes.getValue(qName), and you should have no problems.
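A short sketch of name-based attribute access follows. The class name AttributeLookup, its fields, and the sample attribute names are invented for illustration. Lookup by qName is shown too, though whether it finds anything can depend on whether the parser is reporting qNames for attributes.

```java
import java.io.IOException;
import java.io.StringReader;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: fetches attributes the application "knows about".
class AttributeLookup extends DefaultHandler {
    static final String NS1 = "http://www.example.com/namespaces/ns1";

    String id;            // attribute that is not in any namespace
    String level;         // attribute in the NS1 namespace
    String levelByQName;  // same attribute, looked up by its prefixed name

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        id = atts.getValue("", "id");            // empty URI means "no namespace"
        level = atts.getValue(NS1, "level");     // the prefix plays no part here
        levelByQName = atts.getValue("n:level"); // ties the code to one prefix
        // Attributes the application doesn't recognize are simply never asked for.
    }

    // XMLReaderFactory is deprecated in JDK 9 and later, hence the annotation.
    @SuppressWarnings("deprecation")
    static AttributeLookup parse(String xml) {
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            AttributeLookup handler = new AttributeLookup();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
            return handler;
        } catch (SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        AttributeLookup a = parse("<pose xmlns:n='" + NS1 + "' id='7' n:level='hard' extra='x'/>");
        System.out.println(a.id + " " + a.level + " " + a.levelByQName);
    }
}
```

The (uri, local) form keeps working if a document author picks a different prefix for the same namespace, which is one reason to prefer it for attributes that live in a namespace.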

Otherwise you'll be scanning all of an element's attributes. You'll need to check whether each attribute is in a namespace, just like you checked whether its element was in a namespace. If it's not in a namespace, you probably know a bit more about the attribute than in the case of an element that's not in a namespace. It's either going to be associated with that element's type or, if you've enabled reporting of namespace prefixes, it'll be a namespace declaration. (That's required by the Namespaces in XML specification, but DOM and the XML Infoset have chosen instead to put such declarations into a namespace.) Your code might look something like this:

Attributes	atts = ...;
int		length = atts.getLength ();

for (int i = 0; i < length; i++) {
	String	uri = atts.getURI (i);

	if ("".equals (uri)) {
		String	qName = atts.getQName (i);

		// ... then dispatch based on qName
		// including error based on unrecognized name
		// "xmlns" and "xmlns:*" declarations would appear here

	} else if ("http://www.example.com/namespaces/ns1".equals (uri)) {
		String	localName = atts.getLocalName (i);

		// ... then dispatch based on "localName"
		// including error based on unrecognized name

	// ... and similarly for all other supported attribute namespaces
	} else
		error ("unrecognized attribute namespace: " + uri);
}

If your code uses idioms like those shown above, it'll be handling namespaces correctly. Otherwise, you're likely to run into a document or parser that confuses your code. Don't try to ignore namespaces completely. If your code wants a simpler "pre-namespaces" view of the world, at least make sure the namespace URI is always empty and report errors for all elements and attributes where that's not true.
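For code that does want the "pre-namespaces" view, the check might be sketched like this. The class name NoNamespaceGuard and its accepts() helper are hypothetical. Note that accepts() also returns false for documents that aren't well formed, since those fail with a SAXException too.

```java
import java.io.IOException;
import java.io.StringReader;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: rejects any name that is in a namespace.
class NoNamespaceGuard extends DefaultHandler {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (!"".equals(uri)) {
            throw new SAXException("element in unexpected namespace: " + uri);
        }
        for (int i = 0; i < atts.getLength(); i++) {
            if (!"".equals(atts.getURI(i))) {
                throw new SAXException("attribute in unexpected namespace: " + atts.getURI(i));
            }
        }
        // From here on, the handler can safely work with qName alone.
    }

    // True if the document parsed and used no namespaces at all.
    // XMLReaderFactory is deprecated in JDK 9 and later, hence the annotation.
    @SuppressWarnings("deprecation")
    static boolean accepts(String xml) {
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            parser.setContentHandler(new NoNamespaceGuard());
            parser.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;  // a namespace was used, or the document was malformed
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("<a b='1'><c/></a>"));
        System.out.println(accepts("<a xmlns='http://www.example.com/namespaces/ns1'/>"));
    }
}
```

A real application would probably report the problem through its normal error handling rather than a bare boolean, but the test itself is just the empty-string comparison.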
