Top Ten SAX2 Tips
by David Brownell
5. Provide an ErrorHandler, especially when you're validating
If you've never set up a parser to validate, like
parser.setFeature ( "http://xml.org/sax/features/validation", true);
and then been surprised when you didn't get any error reports,
congratulations! You're in the minority. Most developers have
forgotten (and usually more than once) that by default, validity
errors are ignored by SAX parsers. You need to call
XMLReader.setErrorHandler() with some useful error
handler to make validity errors have any effect at all. That handler
needs to do something interesting with the validity errors reported to
it using error() calls.
It's worth having a good utility class that you can reuse and reconfigure to handle this particular situation. It'll be handy even when you're not validating. Such a class might look like this:
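import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class MyErrorHandler implements ErrorHandler
{
    private void print (String label, SAXParseException e)
    {
        System.err.println ("** " + label + ": " + e.getMessage ());
        System.err.println ("   URI  = " + e.getSystemId ());
        System.err.println ("   line = " + e.getLineNumber ());
    }

    // the two policies: report faults? stop parsing on recoverable errors?
    boolean reportErrors = true;
    boolean abortOnError = false;

    // for recoverable errors, like validity problems
    public void error (SAXParseException e)
    throws SAXException
    {
        if (reportErrors)
            print ("error", e);
        if (abortOnError)
            throw e;
    }

    // similar handling for the other two fault classes
    public void warning (SAXParseException e)
    throws SAXException
    {
        if (reportErrors)
            print ("warning", e);
    }

    public void fatalError (SAXParseException e)
    throws SAXException
    {
        if (reportErrors)
            print ("fatal", e);
        throw e;    // parsing can't usefully continue after a fatal error
    }
    // ... and maybe more interesting configuration support
}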
A SAX ErrorHandler should know two policies for each of its three fault classes: whether to report such faults, and whether such faults should terminate parsing. Various mechanisms can be used to report the fault, such as logging, adding text to a Swing message window, or just printing. At this time, SAX doesn't support a portable mechanism for identifying particular failure modes, so you can't really consider "why did it fail?" in the handler.
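The effect of installing a handler can be seen with a small validating parse. This is only a sketch: the class name ValidationDemo, the sample document, and the choice to collect messages in a list are assumptions made for illustration, and the XMLReader is obtained through SAXParserFactory, which is one of several ways to get one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ValidationDemo {
    // Hypothetical sample: the DTD requires a <title> child, the document omits it.
    static final String INVALID_DOC =
        "<!DOCTYPE book [\n"
        + "  <!ELEMENT book (title)>\n"
        + "  <!ELEMENT title (#PCDATA)>\n"
        + "]>\n"
        + "<book/>";

    /** Parses with validation on and returns the recoverable errors reported. */
    static List<String> validate(String xml) throws Exception {
        final List<String> errors = new ArrayList<>();
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setFeature("http://xml.org/sax/features/validation", true);
        // Without this call the validity errors never reach the application.
        parser.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors.add("line " + e.getLineNumber() + ": " + e.getMessage());
            }
        });
        parser.parse(new InputSource(new StringReader(xml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        for (String error : validate(INVALID_DOC)) {
            System.out.println(error);
        }
    }
}
```

Here the policy is "record and continue"; throwing the exception from error() instead would give the "abort" policy.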
6. Share your ErrorHandler between the XMLReader and your own Handlers
When your application uses the same
ErrorHandler in its own handlers and for the parser, it creates an
integrated stream of fault information. That's useful in its own
right, but the best part is that all the errors (and warnings) can
then be handled according to the same policy and mechanism. You can
easily change how faults are handled by switching or reconfiguring
that ErrorHandler object. In most cases, the SAX fault
classifications are fine, since having more than
fatalError(), error(), and
warning() will rarely be helpful. Here's how you might
set this up for a simple handler:
public class MyHandler implements ContentHandler
{
// doesn't matter if this stays as null, since
// SAXParseException constructors don't care
private Locator locator;
public void setDocumentLocator (Locator l)
{ locator = l; }
// application and SAX errors should use the same handler
private ErrorHandler eh = new DefaultHandler ();
public void setErrorHandler (ErrorHandler e)
{ eh = (e == null) ? new DefaultHandler () : e; }
// simpler is usually better ...
public final void error (String message) throws SAXException
{ eh.error (new SAXParseException (message, locator)); }
public final void warning (String message) throws SAXException
{ eh.warning (new SAXParseException (message, locator)); }
public final void fatalError (String message) throws SAXException
{
SAXParseException e = new SAXParseException (message, locator);
eh.fatalError (e);
// in case eh tries to continue: we can't, and won't
throw e;
}
// the real application code would use error(String) and friends
// to report errors, something like this:
public void endElement (String uri, String localName, String qName)
throws SAXException
{
// ... branch to figure out which element's processing to do ...
if (processData (getCharacters ()) != true) {
error ("bad '" + localName + "' content");
// recover from it (clean up state) and
return;
}
// ... now repackage and save all the object's state
}
// ... lots more code
}
Then you should initialize both the XMLReader and your
content handlers (including any that process DTD content) to use the
same ErrorHandler. The SAX ErrorHandler interface is flexible enough
to use as a general error handling policy interface in much of your
XML code. In fact, you may have noticed that the
javax.xml.parsers.DocumentBuilder class uses one to
simplify error reporting when building a DOM Document.
If you want, your application can subclass
SAXParseException to provide some application-specific
exception information, which might be understood by that error
handler. It might use information about what happened to make more
enlightened decisions about how to handle the problem.
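A minimal sketch of that arrangement follows, with one handler object receiving both the parser's validity errors and an application-level error. All class names here (BadContentException, CollectingErrorHandler, AmountChecker, SharedHandlerDemo), the sample document, and the "amount must be digits" rule are hypothetical, chosen only to show the wiring.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical application fault that remembers which element was bad. */
class BadContentException extends SAXParseException {
    final String element;
    BadContentException(String element, Locator locator) {
        super("bad '" + element + "' content", locator);
        this.element = element;
    }
}

/** One policy object for parser faults and application faults alike. */
class CollectingErrorHandler implements ErrorHandler {
    final List<String> faults = new ArrayList<>();

    @Override public void warning(SAXParseException e) {
        faults.add("warning: " + e.getMessage());
    }
    @Override public void error(SAXParseException e) {
        String source = (e instanceof BadContentException) ? "application" : "parser";
        faults.add(source + " error: " + e.getMessage());
    }
    @Override public void fatalError(SAXParseException e) throws SAXException {
        throw e;
    }
}

/** Content handler that reports a non-numeric <amount> through the shared handler. */
class AmountChecker extends DefaultHandler {
    private final ErrorHandler eh;
    private final StringBuilder text = new StringBuilder();
    private Locator locator;

    AmountChecker(ErrorHandler eh) { this.eh = eh; }

    @Override public void setDocumentLocator(Locator l) { locator = l; }
    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        text.setLength(0);
    }
    @Override public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
    @Override public void endElement(String uri, String local, String qName) throws SAXException {
        if ("amount".equals(qName) && !text.toString().trim().matches("\\d+")) {
            eh.error(new BadContentException(qName, locator));
        }
    }
}

class SharedHandlerDemo {
    // Invalid twice over: <note> is undeclared, and "ten" is not a number.
    static final String SAMPLE =
        "<!DOCTYPE order [<!ELEMENT order (amount)><!ELEMENT amount (#PCDATA)>]>"
        + "<order><amount>ten</amount><note/></order>";

    static List<String> run(String xml) throws Exception {
        CollectingErrorHandler shared = new CollectingErrorHandler();
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setFeature("http://xml.org/sax/features/validation", true);
        parser.setErrorHandler(shared);                      // parser faults
        parser.setContentHandler(new AmountChecker(shared)); // application faults
        parser.parse(new InputSource(new StringReader(xml)));
        return shared.faults;
    }
}
```

Because the exception subclass carries the element name, a smarter handler could apply a different policy per element without the content handler knowing about it.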
7. Track Context with a Stack
Once developers get past the initial milestone of learning how SAX parser callbacks map to the input text, the next step is to figure out how to turn such a stream of callbacks into application data. Certainly SAX is low overhead, and no other API is likely to get less in the way. At the same time, SAX is not exactly going out of its way to package things neatly. It's the very fact that SAX doesn't pick data structures for you that makes it so powerful. That can take getting used to, particularly if you're used to thinking in terms of structures that someone else designed.
A good place to start is to make a ContentHandler
implementation that keeps important information in a stack. For
example, you could define a class that records an element name (with
its namespace, if any) and uses the AttributesImpl class
to snapshot its associated attributes. If you create those entries in
startElement() and stack them, any callback could use
that information before endElement() popped the stack.
Certain attributes, including xml:base,
xml:lang, and xml:space, are in a sense
"inherited", and you might need to walk up that stack to find such a
value while processing other event callbacks.
Such stack entries are also convenient places to collect application-specific information about an element's children. For example, you might be unmarshaling a series of data elements, converting them from strings into more specialized data types as you parse. You'd store those converted values in members of that special stack entry, reporting application level errors when they're detected. Periodically you could transform such entries (or subtrees of entries) into custom data structures that might no longer reflect the way XML text happened to encode that data.
Of course if you track every data item that comes in through SAX, you're starting down a well trodden path. There are plenty of APIs that do that, optimized for one model or another but likely not for your particular application. Still, it can be good fun and useful to build up SAX infrastructure for your application that way.
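The stack technique described in this tip might be sketched as follows. The class name ContextStackHandler, its Entry type, and the "lang:element:text" record format are assumptions for illustration; the sketch also assumes a namespace-aware parser so that xml:lang can be looked up by namespace URI and local name.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class ContextStackHandler extends DefaultHandler {
    /** One entry per open element. */
    static final class Entry {
        final String uri;
        final String localName;
        final Attributes atts;
        Entry(String uri, String localName, Attributes atts) {
            this.uri = uri;
            this.localName = localName;
            // The parser may reuse its Attributes object, so take a snapshot.
            this.atts = new AttributesImpl(atts);
        }
    }

    private final Deque<Entry> stack = new ArrayDeque<>();
    final List<String> seen = new ArrayList<>();

    @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
        stack.push(new Entry(uri, localName, atts));
    }

    @Override public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            seen.add(inheritedXmlAttribute("lang") + ":" + stack.peek().localName + ":" + text);
        }
    }

    @Override public void endElement(String uri, String localName, String qName) {
        stack.pop();
    }

    /** Walks from the innermost open element outward; returns null if no ancestor sets it. */
    String inheritedXmlAttribute(String localName) {
        for (Entry e : stack) { // iteration starts at the most recently pushed entry
            String value = e.atts.getValue(XMLConstants.XML_NS_URI, localName);
            if (value != null) return value;
        }
        return null;
    }

    static List<String> collect(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader parser = factory.newSAXParser().getXMLReader();
        ContextStackHandler handler = new ContextStackHandler();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.seen;
    }
}
```

A real application would likely add fields to Entry for converted child values, as the text suggests, rather than recording strings.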
8. Use an InputSource to wrap in-memory data
New SAX programmers often end up with some
data in memory, perhaps in a string or other data buffer, that needs
to be parsed as XML. (Maybe it came from a database or was built by
some other program component.) It's easy to use SAX to parse these,
since the java.io package provides classes that let you
create character streams from character data. You can use
CharArrayReader to read from arrays of characters, or
StringReader as shown here when the data starts as a
string:
Reader reader;
InputSource in;
XMLReader parser;
// double quotes around the name value, since it contains an apostrophe
reader = new StringReader ("<bank name=\"Gringott's\" box='713'/>");
in = new InputSource (reader);
parser = XMLReaderFactory.createXMLReader ();
parser.parse (in);
You can do similar things with byte arrays, using the
ByteArrayInputStream class to create a byte stream, but
in that case you've got to be careful about character encoding issues.
It's best if those bytes are UTF-8 encoded XML data.
Such input sources can be used as direct parser inputs (as shown here) or, if you're using DTDs and entities defined in them, through an EntityResolver.
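For the byte array case, a sketch along these lines shows where the encoding concern comes in. The class and method names are hypothetical. When the bytes are not UTF-8 and carry no XML declaration naming their encoding, the sketch assumes the application knows the encoding and passes it to InputSource.setEncoding().

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ByteSourceDemo {
    /**
     * Parses in-memory bytes and returns the character content.
     * Pass null as the encoding to let the parser detect it (fine for UTF-8).
     */
    static String parseText(byte[] data, String encoding) throws Exception {
        final StringBuilder text = new StringBuilder();
        InputSource in = new InputSource(new ByteArrayInputStream(data));
        if (encoding != null) {
            in.setEncoding(encoding); // tell the parser what it can't detect itself
        }
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        parser.parse(in);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<name>\u00C6lfred</name>";
        System.out.println(parseText(xml.getBytes(StandardCharsets.UTF_8), null));
        System.out.println(parseText(xml.getBytes(StandardCharsets.ISO_8859_1), "ISO-8859-1"));
    }
}
```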
9. Manage External Entity Lookups with an EntityResolver
XML uses external entities to support document modularity; they are available if you're using DTDs. When a document references an entity, parsers normally fetch it and parse the result. That's exactly what you need in most cases, but it causes problems when the server hosting that URL goes offline for a while (or maybe it was your client that wanted to be disconnected?), and when the network is unreliable. Your whole application could become unavailable, just because it's trying to get a resource that can't be gotten.
How can you avoid entity access problems? SAX2 gives you two basic controls over entity processing.
First, two SAX2 feature flags control whether external entities are
ever fetched. One affects parameter entities (like %module;)
which are used inside the DTD. The other affects general entities
(like &data;) in the body of the document. Most SAX
parsers don't let you turn off this fetching, but if you're using one
which does, this may be a fine solution. (The current Ælfred2
release supports this, but I don't know another SAX2 parser that
does.) So you may not be able to use this facility.
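Since support varies by parser, code that uses these flags should be prepared for a refusal. A hedged sketch follows, using the two standard SAX2 feature URIs; the class and method names are hypothetical.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class EntityFeatureDemo {
    static final String GENERAL =
        "http://xml.org/sax/features/external-general-entities";
    static final String PARAMETER =
        "http://xml.org/sax/features/external-parameter-entities";

    /** Tries to clear a feature flag; returns false if this parser won't allow it. */
    static boolean tryDisable(XMLReader parser, String feature) {
        try {
            parser.setFeature(feature, false);
            return !parser.getFeature(feature);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false; // fall back to an EntityResolver instead
        }
    }

    public static void main(String[] args) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        System.out.println("general entities off:   " + tryDisable(parser, GENERAL));
        System.out.println("parameter entities off: " + tryDisable(parser, PARAMETER));
    }
}
```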
Second, you can use an EntityResolver to control how
entities are resolved. Whenever a SAX parser needs to access an
external entity, it will ask the resolveEntity() method
on your resolver how to handle that entity. That method sees the
entity's fully resolved URI and, if it had one, its public ID. (A new
SAX extension is in the works to provide more information, but it's
not widely supported yet.) Some interesting things for that method to
do include:
Map public IDs to local file names. That's what public IDs were designed for, and hashtables were designed for such mappings. Strongly encouraged! You can do the same thing for system IDs. (There are also "catalog" systems to help manage such mappings. You may want to use a resolver that knows how to use one.)
Fetch or compute the data, maybe using a database. If you're using a private URI scheme that your JVM doesn't understand, maybe blob:database-name:database-key, you'll probably want to store those in the public IDs and do the URI resolution yourself.
Construct an empty input source and return that. This is safe to do for general entities, after the first startElement(), and a bit dangerous for parameter entities, but you may be better off trying to skip some remote entities than trying to access them. (The issue with handling parameter entities this way is that the parser won't know it didn't see their declarations, and so it won't behave correctly.)
A simple entity resolver might look like this for an application that's really paranoid about preventing access to all entities it doesn't control. If you were using it, you'd probably preload the hashtable with entries for all of your application's entities. And you'd probably apply intelligence about what requests are really unsafe or your customers would get unhappy. For example, maybe string prefix matches would be used to grant access to certain files inside the firewall (or its DMZ), and only the ones outside that security boundary would be airbrushed out of the picture.
import java.io.IOException;
import java.io.StringReader;
import java.util.Hashtable;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class MyResolver implements EntityResolver
{
    private Hashtable publics, systems;

    MyResolver (Hashtable pub, Hashtable sys)
    { publics = pub; systems = sys; }

    public InputSource resolveEntity (String publicId, String systemId)
    throws IOException, SAXException
    {
        InputSource retval = null;

        if (publicId != null) {
            String value = (String) publics.get (publicId);
            if (value != null) {
                // use new system ID and original public ID
                retval = new InputSource (value);
                retval.setPublicId (publicId);
            }
        }
        if (retval == null) {
            String value = (String) systems.get (systemId);
            if (value != null) {
                // use new system ID and original public ID
                retval = new InputSource (value);
                retval.setPublicId (publicId);
            }
        }
        if (retval == null) {
            // we're sooo paranoid here!!
            System.err.println ("RESOLVER: punt " + systemId + " "
                + (publicId == null ? "" : publicId));
            retval = new InputSource (new StringReader (""));
            retval.setSystemId (systemId);
            retval.setPublicId (publicId);
        }
        // if we returned null, the systemId would
        // be dereferenced using standard URL handling.
        return retval;
    }
}
A good rule of thumb is always to use a resolver for any application that reuses a known set of DTDs. Do it, if for no other reason than to avoid accessing the network when you don't need to. Only mission critical servers would likely want to be as paranoid as shown above.
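To show a resolver in use without touching the file system or network, here is a sketch of the "fetch or compute the data" variant, where the entity text comes from memory. The names InMemoryResolver and ResolverDemo, the public ID, and the example.invalid URL are all made up for illustration.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Serves entity text from memory, keyed by public ID; unknown entities come back empty. */
class InMemoryResolver implements EntityResolver {
    private final Map<String, String> byPublicId = new HashMap<>();
    int punted = 0; // how many lookups were answered with an empty entity

    void register(String publicId, String text) {
        byPublicId.put(publicId, text);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String text = (publicId == null) ? null : byPublicId.get(publicId);
        if (text == null) {
            punted++;
            text = "";
        }
        InputSource source = new InputSource(new StringReader(text));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }
}

class ResolverDemo {
    // The DTD's system ID points at a host that is never contacted.
    static final String DOC =
        "<!DOCTYPE note PUBLIC '-//Example//DTD Note//EN' 'http://example.invalid/note.dtd'>"
        + "<note>&greeting;</note>";

    static String parseText(EntityResolver resolver) throws Exception {
        final StringBuilder text = new StringBuilder();
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setEntityResolver(resolver);
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        parser.parse(new InputSource(new StringReader(DOC)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        InMemoryResolver resolver = new InMemoryResolver();
        resolver.register("-//Example//DTD Note//EN", "<!ENTITY greeting 'hello'>");
        System.out.println(parseText(resolver));
    }
}
```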
10. Use a Pipelined Processing Model
SAX is made for streaming processing, and the best way to stream your processing is to connect a series of processing components into an event pipeline. One component produces events, the next consumes them and produces new (or maybe filtered) events for yet another component to consume. Often, both your CPU and I/O subsystems can be working on different parts of the pipeline at the same time, minimizing elapsed time.
SAX parsers produce events, but they're not the only way to produce a stream of SAX events. One common practice is to have programs call the SAX event methods directly, perhaps while walking over a data structure as part of converting it to XML. SAX2 defines a way to make a SAX parser that walks a DOM tree, rather than XML text, emitting a stream of SAX events. And toolsets like DOM4J and JDOM haven't neglected such data-to-SAX converters, either. Think of that SAX event stream as an efficient in-memory version of the generic transfer syntax which XML provides between different processes.
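Calling the event methods directly can be as simple as the following sketch, which walks a Map and emits one element per entry. MapEventSource and EventTrace are hypothetical names, and the sketch assumes the map keys are legal XML element names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class MapEventSource {
    /** Emits <rootName><key>value</key>...</rootName> as SAX events, with no parser involved. */
    static void emit(String rootName, Map<String, String> data, ContentHandler out)
            throws SAXException {
        Attributes none = new AttributesImpl();
        out.startDocument();
        out.startElement("", rootName, rootName, none);
        for (Map.Entry<String, String> entry : data.entrySet()) {
            String name = entry.getKey(); // assumed to be a valid element name
            char[] text = entry.getValue().toCharArray();
            out.startElement("", name, name, none);
            out.characters(text, 0, text.length);
            out.endElement("", name, name);
        }
        out.endElement("", rootName, rootName);
        out.endDocument();
    }
}

/** Trivial consumer that records the events it sees in an XML-like trace. */
class EventTrace extends DefaultHandler {
    final StringBuilder log = new StringBuilder();

    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        log.append('<').append(qName).append('>');
    }
    @Override public void characters(char[] ch, int start, int length) {
        log.append(ch, start, length);
    }
    @Override public void endElement(String uri, String local, String qName) {
        log.append("</").append(qName).append('>');
    }
}

class MapEventSourceDemo {
    public static void main(String[] args) throws SAXException {
        Map<String, String> box = new LinkedHashMap<>();
        box.put("owner", "Hagrid");
        box.put("number", "713");
        EventTrace trace = new EventTrace();
        MapEventSource.emit("box", box, trace);
        System.out.println(trace.log);
    }
}
```

Any ContentHandler can stand in for EventTrace, which is what makes such a producer a usable first stage of a pipeline.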
Your "ultimate consumer" in a SAX event pipeline could write XML
text out (use one of the various XMLWriter classes) or
turn the events into an application-optimized data structure. It's
easy to build a DOM (or DOM4J, or JDOM) model from a modified SAX
event stream, too. And since you have control over what happens, you
don't have to build the entire generic tree structure before you begin
processing it; if you do it that way, you can garbage collect each
chunk of data as soon as you're done processing it, rather than
waiting for the whole document to materialize in memory.
If you're using XSLT in Java, you may well be familiar with the
javax.xml.transform.sax (TRAX) package. XSLT engines
such as SAXON or Xalan support it. You may not know that it's easy to
feed SAX events as inputs to an XSLT engine as a SAX pipeline stage,
using a TransformerHandler, or to collect XSLT engine
output as SAX events using a SAXResult. SAX events in,
transformation according to XSLT, and then SAX events out again: those
TRAX APIs are essentially wrappers around SAX pipeline stages! It can
be very worthwhile to unwrap them and use XSLT for some heavier weight
transformations in your SAX pipelines.
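A sketch of an XSLT stage in a SAX pipeline, using the javax.xml.transform.sax classes named above. The stylesheet, the sample element names, and the class name XsltStageDemo are assumptions for illustration; the final consumer here just records element names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class XsltStageDemo {
    // Hypothetical stylesheet: wraps the output in <entries>, turns each <item> into <entry>.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/list'>"
        + "<entries><xsl:apply-templates select='item'/></entries>"
        + "</xsl:template>"
        + "<xsl:template match='item'>"
        + "<entry><xsl:value-of select='.'/></entry>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Parser events go into the XSLT stage; its output arrives as SAX events again. */
    static List<String> transformedElementNames(String xml) throws Exception {
        final List<String> names = new ArrayList<>();

        TransformerFactory tf = TransformerFactory.newInstance();
        if (!tf.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new IllegalStateException("this XSLT engine has no SAX support");
        }
        SAXTransformerFactory stf = (SAXTransformerFactory) tf;

        // The pipeline stage: SAX in, XSLT applied, SAX out.
        TransformerHandler stage =
            stf.newTransformerHandler(new StreamSource(new StringReader(XSLT)));
        stage.setResult(new SAXResult(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                names.add(qName.isEmpty() ? local : qName);
            }
        }));

        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader parser = spf.newSAXParser().getXMLReader();
        parser.setContentHandler(stage);
        parser.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transformedElementNames("<list><item>a</item><item>b</item></list>"));
    }
}
```

Note that an XSLT stage usually has to see the whole input before it produces output, so it is heavier than a pure streaming stage.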
I could go on about pipelines, but I'll just mention that SAX2
includes an XMLFilterImpl class, handy for writing some
kinds of intermediate pipeline stages, and stop. Pipelines are
covered in more detail in that new book
that I mentioned. The main thing to remember is that event pipelines
are the natural model for components in SAX. You should plan to use
them if you're doing anything very substantial.
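As one example of an intermediate stage built on XMLFilterImpl, this sketch drops a particular element and everything inside it, passing all other events through. The filter name, the element name "secret", and the text-collecting consumer are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Pipeline stage that removes every <secret> element along with its contents. */
class DropSecretsFilter extends XMLFilterImpl {
    private int depth = 0; // greater than zero while inside a dropped subtree

    DropSecretsFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0 || "secret".equals(qName)) {
            depth++;
            return;
        }
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (depth > 0) {
            depth--;
            return;
        }
        super.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth == 0) {
            super.characters(ch, start, length);
        }
    }
}

class FilterDemo {
    /** Runs parser -> filter -> text collector and returns the surviving text. */
    static String filteredText(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DropSecretsFilter filter = new DropSecretsFilter(parser);
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        // The filter is itself an XMLReader; parsing through it drives the parent parser.
        filter.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(filteredText("<a>x<secret>y<b>z</b></secret>w</a>"));
    }
}
```

Filters chain naturally: another XMLFilterImpl subclass can take this one as its parent.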
If you've read this far, you deserve a special bonus tip. SAX has its own site, http://www.saxproject.org. Visit the site for the latest information and updated documentation about SAX.
David Brownell, author of SAX2, is a software engineer. He recently worked for three years at JavaSoft, where he provided Sun's XML and DOM software, SSL and public key technologies, and the original version of the JavaServer Pages technology, and worked on the Java Servlet API for Web servers. O'Reilly & Associates will soon release (January 2002) SAX2.