XML.com: XML From the Inside Out

Top Ten SAX2 Tips
by David Brownell

5. Provide an ErrorHandler, especially when you're validating

If you've never set up a parser to validate, like

parser.setFeature (
	"http://xml.org/sax/features/validation",
	true);

and then been surprised when you didn't get any error reports, congratulations! You're in the minority. Most developers have forgotten (and usually more than once) that by default, validity errors are ignored by SAX parsers. You need to call XMLReader.setErrorHandler() with some useful error handler to make validity errors have any effect at all. That handler needs to do something interesting with the validity errors reported to it using error() calls.

It's worth having a good utility class that you can reuse and reconfigure to handle this particular situation. It'll be handy even when you're not validating. Such a class might look like

class MyErrorHandler implements ErrorHandler
{
	private void print (String label, SAXParseException e)
	{
		System.err.println ("** " + label + ": " + e.getMessage ());
		System.err.println ("   URI  = " + e.getSystemId ());
		System.err.println ("   line = " + e.getLineNumber ());
	}

	boolean reportErrors = true;
	boolean abortOnErrors = false;

	// for recoverable errors, like validity problems
	public void error (SAXParseException e)
	throws SAXException
	{
		if (reportErrors)
			print ("error", e);
		if (abortOnErrors)
			throw e;
	}

	// ... plus similar for fatalError(), warning()
	// ... and maybe more interesting configuration support
}

A SAX ErrorHandler should know two policies for each of its three fault classes: whether to report such faults, and whether such faults should terminate parsing. Various mechanisms can be used to report the fault, such as logging, adding text to a Swing message window, or just printing. At this time, SAX doesn't provide a portable way to identify particular failure modes, so the handler can't really ask "why did it fail?".
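Here's one way the "two policies per fault class" idea might be fleshed out. Treat it as a sketch rather than a reference implementation: PolicyErrorHandler, ErrorPolicyDemo, and the errorCount field are names made up for this example, and SAXParserFactory is used only as a convenient way to get a validating XMLReader.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical handler: a "report" flag and an "abort" flag per fault class.
class PolicyErrorHandler implements ErrorHandler {
    boolean reportWarnings = true, abortOnWarnings = false;
    boolean reportErrors = true, abortOnErrors = false;
    boolean reportFatal = true, abortOnFatal = true;

    int errorCount = 0;   // recoverable (for example, validity) errors seen

    private void print(String label, SAXParseException e) {
        System.err.println("** " + label + ": " + e.getMessage());
        System.err.println("   URI  = " + e.getSystemId());
        System.err.println("   line = " + e.getLineNumber());
    }

    public void warning(SAXParseException e) throws SAXException {
        if (reportWarnings) print("warning", e);
        if (abortOnWarnings) throw e;
    }

    public void error(SAXParseException e) throws SAXException {
        errorCount++;
        if (reportErrors) print("error", e);
        if (abortOnErrors) throw e;
    }

    public void fatalError(SAXParseException e) throws SAXException {
        if (reportFatal) print("fatal", e);
        if (abortOnFatal) throw e;
    }
}

class ErrorPolicyDemo {
    // Well-formed but invalid: <doc> is declared EMPTY yet has a child.
    static final String INVALID =
        "<!DOCTYPE doc [<!ELEMENT doc EMPTY>]><doc><extra/></doc>";

    static void validate(String xml, ErrorHandler handler) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setErrorHandler(handler);   // without this, validity errors go unnoticed
        reader.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        PolicyErrorHandler handler = new PolicyErrorHandler();
        validate(INVALID, handler);
        System.out.println("validity errors: " + handler.errorCount);
    }
}
```

Flipping abortOnErrors to true turns the first validity error into an exception thrown out of parse(), which is usually what a batch job wants; an interactive tool would more likely leave it false and keep collecting.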

6. Share your ErrorHandler between the XMLReader and your own Handlers

When your application uses the same ErrorHandler in its own handlers and for the parser, it creates an integrated stream of fault information. That's useful in its own right, but the best part is that all the errors (and warnings) can then be handled according to the same policy and mechanism. You can easily change how faults are handled by switching or reconfiguring that ErrorHandler object. In most cases, the SAX fault classifications are fine, since having more than fatalError(), error(), and warning() will rarely be helpful. Here's how you might set this up for a simple handler:

public class MyHandler implements ContentHandler
{
	// doesn't matter if this stays as null, since
	// SAXParseException constructors don't care
	private Locator locator;

	public void setDocumentLocator (Locator l)
		{ locator = l; }

	// application and SAX errors should use the same handler
	private ErrorHandler eh = new DefaultHandler ();

	public void setErrorHandler (ErrorHandler e)
		{ eh = (e == null) ? new DefaultHandler () : e; }

	// simpler is usually better ...
	public final void error (String message) throws SAXException
		{ eh.error (new SAXParseException (message, locator)); }
	public final void warning (String message) throws SAXException
		{ eh.warning (new SAXParseException (message, locator)); }
	public final void fatalError (String message) throws SAXException
	{	
		SAXParseException e = new SAXParseException (message, locator);
		eh.fatalError (e);
		// in case eh tries to continue:  we can't, and won't
		throw e;
	}

	// the real application code would use error(String) and friends
	// to report errors, something like this:

	public void endElement (String uri, String localName, String qName)
	throws SAXException
	{
		// ... branch to figure out which element's processing to do ...

		if (processData (getCharacters ()) != true) {
			error ("bad '" + localName + "' content");
			// recover from it (clean up state) and
			return;
		}
		// ... now repackage and save all the object's state
	}

	// ... lots more code
}

Then you should initialize both the XMLReader and your content handlers (including any that process DTD content) to use the same ErrorHandler. The SAX ErrorHandler interface is flexible enough to use as a general error handling policy interface in much of your XML code. In fact, you may have noticed that the javax.xml.parsers.DocumentBuilder class uses one to simplify error reporting when building a DOM Document.

If you want, your application can subclass SAXParseException to provide some application-specific exception information, which might be understood by that error handler. It might use information about what happened to make more enlightened decisions about how to handle the problem.
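To make the wiring concrete, here is a small sketch in which one ErrorHandler instance is given to both the XMLReader and an application handler. Everything named here is an assumption for illustration: CollectingErrorHandler, AmountHandler, SharedHandlerDemo, and the rule that an amount element must contain an integer are all invented, and the handler extends DefaultHandler only to keep the example short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Collects every fault, whether the parser or the application reported it.
class CollectingErrorHandler implements ErrorHandler {
    final List<String> faults = new ArrayList<>();

    public void warning(SAXParseException e) { faults.add("warning: " + e.getMessage()); }
    public void error(SAXParseException e) { faults.add("error: " + e.getMessage()); }
    public void fatalError(SAXParseException e) throws SAXException {
        faults.add("fatal: " + e.getMessage());
        throw e;
    }
}

// Hypothetical application rule: <amount> must contain an integer.
class AmountHandler extends DefaultHandler {
    private Locator locator;
    private ErrorHandler eh = new DefaultHandler();
    private final StringBuilder text = new StringBuilder();

    void setErrorHandler(ErrorHandler e) { eh = (e == null) ? new DefaultHandler() : e; }

    @Override
    public void setDocumentLocator(Locator l) { locator = l; }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0);   // start collecting this element's text
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (!"amount".equals(qName)) return;
        try {
            Integer.parseInt(text.toString().trim());
        } catch (NumberFormatException nfe) {
            // application faults go through the same handler as parser faults
            eh.error(new SAXParseException("bad 'amount' content: " + text, locator));
        }
    }
}

class SharedHandlerDemo {
    static List<String> parse(String xml) throws Exception {
        CollectingErrorHandler shared = new CollectingErrorHandler();
        AmountHandler app = new AmountHandler();
        app.setErrorHandler(shared);

        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(app);
        reader.setErrorHandler(shared);    // same object, so one fault stream
        reader.parse(new InputSource(new StringReader(xml)));
        return shared.faults;
    }
}
```

Because the application's SAXParseException is built with the Locator, its line and column information look just like the parser's own reports.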

7. Track Context with a Stack

Once developers get past the initial milestone of learning how SAX parser callbacks map to the input text, the next step is to figure out how to turn such a stream of callbacks into application data. Certainly SAX is low overhead, and no other API is likely to get less in the way. At the same time, SAX is not exactly going out of its way to package things neatly. It's the very fact that SAX doesn't pick data structures for you that makes it so powerful. That can take getting used to, particularly if you're used to thinking in terms of structures that someone else designed.

A good place to start is to make a ContentHandler implementation that keeps important information in a stack. For example, you could define a class that records an element name (with its namespace, if any) and uses the AttributesImpl class to snapshot its associated attributes. If you create those entries in startElement() and stack them, any callback could use that information before endElement() popped the stack. Certain attributes, including xml:base, xml:lang, and xml:space, are in a sense "inherited", and you might need to walk up that stack to find such a value while processing other event callbacks.

Such stack entries are also convenient places to collect application-specific information about an element's children. For example, you might be unmarshaling a series of data elements, converting them from strings into more specialized data types as you parse. You'd store those converted values in members of that special stack entry, reporting application level errors when they're detected. Periodically you could transform such entries (or subtrees of entries) into custom data structures that might no longer reflect the way XML text happened to encode that data.
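A minimal sketch of such a stack follows, assuming a namespace-aware parser. StackHandler, its Entry class, the inherited() helper, and the seen list are invented names; the seen list exists only so the example has something observable to return.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class StackHandler extends DefaultHandler {
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    // One entry per open element.
    static final class Entry {
        final String uri, localName;
        final Attributes atts;
        Entry(String uri, String localName, Attributes atts) {
            this.uri = uri;
            this.localName = localName;
            // snapshot: the parser may reuse its Attributes object after the callback
            this.atts = new AttributesImpl(atts);
        }
    }

    private final Deque<Entry> stack = new ArrayDeque<>();
    final List<String> seen = new ArrayList<>();   // "element=language" records

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        stack.push(new Entry(uri, localName, atts));
        seen.add(localName + "=" + inherited(XML_NS, "lang"));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        stack.pop();
    }

    // Walk from the innermost open element outward until a value turns up.
    String inherited(String attUri, String attLocalName) {
        for (Entry e : stack) {
            String value = e.atts.getValue(attUri, attLocalName);
            if (value != null) return value;
        }
        return null;
    }

    static List<String> parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        StackHandler handler = new StackHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.seen;
    }
}
```

A real Entry would usually grow extra fields for whatever converted values the application collects from the element's children.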

Of course if you track every data item that comes in through SAX, you're starting down a well trodden path. There are plenty of APIs that do that, optimized for one model or another but likely not for your particular application. Still, it can be good fun and useful to build up SAX infrastructure for your application that way.

8. Use an InputSource to wrap in-memory data

New SAX programmers often end up with some data in memory, perhaps in a string or other data buffer, that needs to be parsed as XML. (Maybe it came from a database or was built by some other program component.) It's easy to use SAX to parse these, since the java.io package provides classes that let you create character streams from character data. You can use CharArrayReader to read from arrays of characters, or StringReader as shown here when the data starts as a string:

Reader reader;
InputSource in;
XMLReader parser;

reader = new StringReader ("<bank name='Gringott&apos;s' box='713'/>");
in = new InputSource (reader);
parser = XMLReaderFactory.createXMLReader ();

parser.parse (in);

You can do similar things with byte arrays, using the ByteArrayInputStream class to create a byte stream, but in that case you've got to be careful about character encoding issues. It's best if those bytes are UTF-8 encoded XML data.
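The byte-array case might look like the sketch below, which assumes the bytes really are UTF-8 and says so through InputSource.setEncoding(). ByteSourceDemo and bankName() are invented for the example, and SAXParserFactory is just one way to get a parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ByteSourceDemo {
    // Returns the name attribute of the first <bank> element, or null.
    static String bankName(byte[] utf8Xml) throws Exception {
        InputSource in = new InputSource(new ByteArrayInputStream(utf8Xml));
        in.setEncoding("UTF-8");   // assumption: the caller guarantees UTF-8

        final String[] name = new String[1];
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (name[0] == null && "bank".equals(qName))
                    name[0] = atts.getValue("name");
            }
        });
        parser.parse(in);
        return name[0];
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "<bank name='Z\u00fcrich' box='713'/>"
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(bankName(data));
    }
}
```

If the bytes came from String.getBytes() with no argument, they are in the platform default encoding, which is exactly the kind of mismatch that causes trouble; naming the charset on both sides avoids it.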

Such input sources can be used as direct parser inputs (as shown here) or, if you're using DTDs and entities defined in them, through an EntityResolver.

9. Manage External Entity Lookups with an EntityResolver

XML uses external entities to support document modularity; they are available if you're using DTDs. When a document references an entity, parsers normally fetch it and parse the result. That's exactly what you need in most cases, but it causes problems when the server hosting that URL goes offline for a while (or maybe it was your client that wanted to be disconnected?), and when the network is unreliable. Your whole application could become unavailable, just because it's trying to get a resource that can't be gotten.

How can you avoid entity access problems? SAX2 gives you two basic controls over entity processing.

First, two SAX2 feature flags control whether external entities are ever fetched. One affects parameter entities (like %module;) which are used inside the DTD. The other affects general entities (like &data;) in the body of the document. Most SAX parsers don't let you turn off this fetching, but if you're using one which does, this may be a fine solution. (The current Ælfred2 release supports this, but I don't know another SAX2 parser that does.) So you may not be able to use this facility.
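Since support varies, it makes sense to ask for those flags defensively. The two feature URIs below are the standard SAX2 identifiers; the EntityFeatureDemo class and its disableExternalEntities() helper are invented for this sketch.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class EntityFeatureDemo {
    static final String GENERAL =
        "http://xml.org/sax/features/external-general-entities";
    static final String PARAMETER =
        "http://xml.org/sax/features/external-parameter-entities";

    // Returns true only if the parser accepted both requests.
    // On false, one of the two flags may still have been changed.
    static boolean disableExternalEntities(XMLReader reader) {
        try {
            reader.setFeature(GENERAL, false);
            reader.setFeature(PARAMETER, false);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // this parser won't let us turn off fetching
            return false;
        }
    }
}
```

When the helper returns false, an EntityResolver is the fallback.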

Second, you can use an EntityResolver to control how entities are resolved. Whenever a SAX parser needs to access an external entity, it will ask the resolveEntity() method on your resolver how to handle that entity. That method sees the entity's fully resolved URI and, if it had one, its public ID. (A new SAX extension is in the works to provide more information, but it's not widely supported yet.) Some interesting things for that method to do include:

  • Map public IDs to local file names. That's what public IDs were designed for, and hashtables were designed for such mappings. Strongly encouraged! You can do the same thing for system IDs. (There are also "catalog" systems to help manage such mappings. You may want to use a resolver that knows how to use one.)

  • Fetch or compute the data, maybe using a database. If you're using a private URI scheme that your JVM doesn't understand, maybe blob:database-name:database-key, you'll probably want to store those in the public IDs and do the URI resolution yourself.

  • Construct an empty input source and return that. This is safe to do for general entities, after the first startElement(), and a bit dangerous for parameter entities, but you may be better off trying to skip some remote entities than trying to access them. (The issue with handling parameter entities this way is that the parser won't know it didn't see their declarations, and so it won't behave correctly.)

A simple entity resolver might look like this for an application that's really paranoid about preventing access to all entities it doesn't control. If you were using it, you'd probably preload the hashtable with entries for all of your application's entities. And you'd probably apply intelligence about what requests are really unsafe or your customers would get unhappy. For example, maybe string prefix matches would be used to grant access to certain files inside the firewall (or its DMZ), and only the ones outside that security boundary would be airbrushed out of the picture.

class MyResolver implements EntityResolver
{
	private Hashtable publics, systems;

	MyResolver (Hashtable pub, Hashtable sys)
		{ publics = pub; systems = sys; }

	public InputSource resolveEntity (String publicId, String systemId)
	throws IOException, SAXException
	{
		InputSource retval = null;

		if (publicId != null) {
			String value = (String) publics.get (publicId);

			if (value != null) {
				// use new system ID and original public ID
				retval = new InputSource (value);
				retval.setPublicId (publicId);
			}
		}
		if (retval == null) {
			String value = (String) systems.get (systemId);

			if (value != null) {
				// use new system ID and original public ID
				retval = new InputSource (value);
				retval.setPublicId (publicId);
			}
		}
		if (retval == null) {
			// we're sooo paranoid here!!
			System.err.println ("RESOLVER: punt " + systemId + " "
				+ (publicId == null ? "" : publicId));
			retval = new InputSource (new StringReader (""));
			retval.setSystemId (systemId);
			retval.setPublicId (publicId);
		}
		// if we returned null, the systemId would
		// be dereferenced using standard URL handling.
		return retval;
	}
}

A good rule of thumb is always to use a resolver for any application that reuses a known set of DTDs. Do it, if for no other reason than to avoid accessing the network when you don't need to. Only mission critical servers would likely want to be as paranoid as shown above.
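For the less paranoid case, a resolver can be as small as the sketch below. It differs from the class above in two deliberate ways, both assumptions made for the sake of a self-contained example: it maps public IDs straight to DTD text held in memory (rather than to local file names), and it returns null for anything it doesn't know, which lets the parser fall back to normal URL handling. InMemoryResolver, ResolverDemo, the public ID, and the example.invalid URL are all invented.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class InMemoryResolver implements EntityResolver {
    private final Map<String, String> dtdTextByPublicId;
    int hits = 0;   // how many lookups were answered locally

    InMemoryResolver(Map<String, String> dtdTextByPublicId) {
        this.dtdTextByPublicId = dtdTextByPublicId;
    }

    public InputSource resolveEntity(String publicId, String systemId) {
        String text = (publicId == null) ? null : dtdTextByPublicId.get(publicId);
        if (text == null)
            return null;   // unknown entity: let the parser dereference the system ID
        hits++;
        InputSource source = new InputSource(new StringReader(text));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }
}

class ResolverDemo {
    static final String PUBLIC_ID = "-//Example//DTD Doc//EN";
    static final String DOC =
        "<!DOCTYPE doc PUBLIC '" + PUBLIC_ID + "' 'http://example.invalid/doc.dtd'>"
        + "<doc>hello</doc>";

    // Parses DOC without touching the network; returns the resolver's hit count.
    static int parseOffline() throws Exception {
        Map<String, String> dtds = new HashMap<>();
        dtds.put(PUBLIC_ID, "<!ELEMENT doc (#PCDATA)>");
        InMemoryResolver resolver = new InMemoryResolver(dtds);

        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(resolver);
        reader.setContentHandler(new DefaultHandler());
        reader.parse(new InputSource(new StringReader(DOC)));
        return resolver.hits;
    }
}
```

Preloading the map at startup, from a catalog or a properties file, keeps the lookup itself trivial.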

10. Use a Pipelined Processing Model

SAX is made for streaming processing, and the best way to stream your processing is to connect a series of processing components into an event pipeline. One component produces events, the next consumes them and produces new (or maybe filtered) events for yet another component to consume. Often, both your CPU and I/O subsystems can be working on different parts of the pipeline at the same time, minimizing elapsed time.

SAX parsers produce events, but they're not the only way to produce a stream of SAX events. One common practice is to have programs call the SAX event methods directly, perhaps while walking over a data structure as part of converting it to XML. SAX2 defines a way to make a SAX parser that walks a DOM tree, rather than XML text, emitting a stream of SAX events. And toolsets like DOM4J and JDOM haven't neglected such data-to-SAX converters, either. Think of that SAX event stream as an efficient in-memory version of the generic transfer syntax which XML provides between different processes.

Your "ultimate consumer" in a SAX event pipeline could write XML text out (use one of the various XMLWriter classes) or turn the events into an application-optimized data structure. It's easy to build a DOM (or DOM4J, or JDOM) model from a modified SAX event stream, too. And since you have control over what happens, you don't have to build the entire generic tree structure before you begin processing it; if you do it that way, you can garbage collect each chunk of data as soon as you're done processing it, rather than waiting for the whole document to materialize in memory.

If you're using XSLT in Java, you may well be familiar with the javax.xml.transform.sax (TRAX) package. XSLT engines such as SAXON or Xalan support it. You may not know that it's easy to feed SAX events as inputs to an XSLT engine as a SAX pipeline stage, using a TransformerHandler, or to collect XSLT engine output as SAX events using a SAXResult. SAX events in, transformation according to XSLT, and then SAX events out again: those TRAX APIs are essentially wrappers around SAX pipeline stages! It can be very worthwhile to unwrap them and use XSLT for some heavier weight transformations in your SAX pipelines.
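As a rough illustration of an XSLT stage inside a SAX pipeline, the sketch below feeds parser events into a TransformerHandler and collects the transformed output as SAX events through a SAXResult. The stylesheet, the element names, and the XsltStageDemo class are invented; it assumes whatever XSLT engine TransformerFactory finds supports the SAX feature, and checks for that.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class XsltStageDemo {
    // Hypothetical stylesheet: turns <list><item/>...</list> into <entries><entry/>...</entries>.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/list'>"
        + "<entries><xsl:apply-templates select='item'/></entries>"
        + "</xsl:template>"
        + "<xsl:template match='item'>"
        + "<entry><xsl:value-of select='.'/></entry>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Returns the element names that come out of the XSLT stage, in order.
    static List<String> transform(String xml) throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        if (!tf.getFeature(SAXTransformerFactory.FEATURE))
            throw new IllegalStateException("this XSLT engine has no SAX support");
        SAXTransformerFactory stf = (SAXTransformerFactory) tf;

        // middle stage: consumes SAX events, applies the stylesheet
        TransformerHandler stage = stf.newTransformerHandler(
            new StreamSource(new StringReader(STYLESHEET)));

        // final stage: receives the transformed document as SAX events
        final List<String> names = new ArrayList<>();
        stage.setResult(new SAXResult(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        }));

        // first stage: an ordinary (namespace-aware) SAX parser
        SAXParserFactory pf = SAXParserFactory.newInstance();
        pf.setNamespaceAware(true);
        XMLReader reader = pf.newSAXParser().getXMLReader();
        reader.setContentHandler(stage);
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }
}
```

Keep in mind that an XSLT engine typically buffers its whole input before producing output, so this stage is heavier than a plain filter.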

I could go on about pipelines, but I'll just mention that SAX2 includes an XMLFilterImpl class, handy for writing some kinds of intermediate pipeline stages, and stop. Pipelines are covered in more detail in that new book that I mentioned. The main thing to remember is that event pipelines are the natural model for components in SAX. You should plan to use them if you're doing anything very substantial.
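For a taste of what an XMLFilterImpl stage looks like, here is a sketch of a filter that silently drops one kind of element, along with everything inside it, before the events reach the next stage. DropElementFilter, FilterDemo, and the "secret" element are invented for the example, and the consumer just rebuilds simplified markup so the effect is visible.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Intermediate stage: passes events through unless inside a dropped element.
class DropElementFilter extends XMLFilterImpl {
    private final String dropped;
    private int depth = 0;   // > 0 while inside a dropped element

    DropElementFilter(XMLReader parent, String dropped) {
        super(parent);
        this.dropped = dropped;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (depth > 0 || dropped.equals(qName)) {
            depth++;
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (depth > 0) {
            depth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth == 0)
            super.characters(ch, start, length);
    }
}

class FilterDemo {
    // Runs parser -> filter -> consumer and returns what the consumer saw.
    static String run(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DropElementFilter filter = new DropElementFilter(parser, "secret");

        final StringBuilder out = new StringBuilder();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                out.append('<').append(qName).append('>');
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                out.append("</").append(qName).append('>');
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        });

        filter.parse(new InputSource(new StringReader(xml)));   // drives the parent parser
        return out.toString();
    }
}
```

Because a filter is itself an XMLReader, stages like this can be stacked: wrap one filter in another and call parse() on the outermost one.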

If you've read this far, you deserve a special bonus tip. SAX has its own site, http://www.saxproject.org. Visit that site for the latest information and updated documentation about SAX.

David Brownell, author of SAX2, is a software engineer. He recently worked for three years at JavaSoft, where he provided Sun's XML and DOM software, SSL and public key technologies, the original version of the JavaServer Pages technology, and worked on the Java Servlet API for Web servers.

O'Reilly & Associates will soon release (January 2002) SAX2.


