Top Ten SAX2 Tips
If you write XML processing code in Java, or indeed most popular programming languages, you will be familiar with SAX, the Simple API for XML. SAX is the best API available for streaming XML processing, for processing documents as they're being read. SAX is the most flexible API in terms of data structures: since it doesn't force you to use a particular in-memory representation, you can choose the one you like best. SAX has great support for the XML Infoset (particularly in SAX2, the second version of SAX) and better DTD support than other widely available APIs. SAX2 is now part of JDK 1.4 and will soon be available to even more Java developers.
In this article, I'll highlight some points that can make your SAX programming in Java more portable, robust, and expressive. Some of these points are just advice, some address common programming problems, and some address SAX techniques that offer unique power to developers. A new book from O'Reilly, SAX2, addresses these topics and more. It details all the SAX APIs and explains each feature in more detail than this short article provides.
Although it's called the Simple API for XML, SAX is often more complicated than it first appears. SAX has grown to accommodate a lot of the flexibility needed by the tools and applications that process XML, but when you start out with SAX you should first focus on its underlying simplicity.
Think of SAX2 (including its standardized extensions) as basically including: one parser API, two handler interfaces for content, two handler interfaces for DTD declarations (the best support of any current Java parser API), and a bunch of other classes and interfaces. Many applications can ignore most of that and start with just a few classes and interfaces:
XMLReader is the basic parser interface, and you get a parser object using XMLReaderFactory.
DefaultHandler has no-op implementations of the most popular handler methods, which you can just override.
Attributes wraps up the attributes of the elements reported to you.
You can write useful tools with just those APIs, overriding only three methods in the DefaultHandler class:
startElement() when the parser reports the beginning of an element and its attributes,
characters() to handle character data inside such elements, and
endElement() to report the end of the element.
You shouldn't use any other functionality until your application requires it. Some of the tips which follow describe common reasons to use more features. Good error handling is right at the top of the list of such reasons, and if you ever process documents with DTDs, smarter handling of external entity resolution won't be far behind.
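To make that starting point concrete, here's a minimal sketch that uses only XMLReaderFactory, DefaultHandler, and the three callbacks named above. The class name SaxTally and its tally() helper are made up for this illustration. Note that recent JDKs mark XMLReaderFactory as deprecated, though it still works there.

```java
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: counts element boundaries and character data.
class SaxTally extends DefaultHandler {
    int starts, ends, textLength;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        starts++;
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        // may be called several times for one run of text, so just add up
        textLength += length;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        ends++;
    }

    // Parse a string of XML with the system default parser.
    static SaxTally tally(String xml) throws Exception {
        SaxTally handler = new SaxTally();
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        SaxTally t = tally("<a><b>hi</b><b>there</b></a>");
        System.out.println(t.starts + " elements, " + t.textLength + " characters");
    }
}
```

Everything the handler doesn't override stays a no-op, which is the point of starting from DefaultHandler.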
characters() calls
Just because a bunch of text looks to you like it's one long set of characters doesn't mean that's how a SAX parser will report it. You need to explicitly group characters that your application thinks belong together. For example, consider this XML fragment:
<asana>Vrichikasana &mdash; Scorpion</asana>
Certainly you'll see callbacks for the element boundaries, and for
the various characters. But how many callbacks will you see for those
characters? It'd be legal (but annoying) for the parser to report one
character per callback. More typically, parsers would report the
characters before and after the mdash entity reference
using one callback each, and also report the entity reference (plus
its contents, whatever they are). Some parsers that don't report the
entity reference would make only one characters()
callback for the whole thing, assuming the entity is the standard ISO
entity for Unicode character U+2014. There are even more legal ways
to report that simple block of text. Your event handler needs to work
with all of them.
The solution is to buffer up all the characters you receive in the
characters() callback. You could just append to a
String, but that's not particularly efficient. It's easier to use a
StringBuffer, since one StringBuffer.append() signature
is an exact match for the parameters in this callback, and it's easy
to turn those into a String later:
class MyHandler implements ContentHandler {
    private StringBuffer chars = new StringBuffer ();
    // append each chunk as it's reported; one text run may arrive in pieces
    public void characters (char buf [], int offset, int length)
    throws SAXException
    { chars.append (buf, offset, length); }
    // hand back the buffered text and reset the buffer for the next run
    private String getCharacters ()
    {
        String retval = chars.toString ();
        chars.setLength (0);
        return retval;
    }
    // ... lots more in this class!
}
And now the interesting question is: when to collect that set of
buffered characters to do something interesting with it? The answer
depends on what your application is doing, but it'll usually be in
endElement() or startElement(). Sometimes
you'll collect the characters when there's a
processingInstruction(), or, more rarely, when a
comment() is reported. As a rule, avoid treating CDATA
sections or entity expansions as if characters inside them were
somehow special. Such boundaries are primarily for authoring
convenience, and they shouldn't matter except to editor
applications.
One scenario that's easy to handle is what's sometimes called "data
elements" -- which contain text only and no other elements. (Their
DTD content model might be (#PCDATA).) When you know
that's what you're working with, collect the element's data in
endElement(). That transparently ignores things like
comments and PIs that might have been inside the element, as well as
any entity or CDATA section boundaries found there. It's harder to
give general rules for other kinds of content model, which is in part
why many people like to specify the data style of element rather than
allowing "mixed content" or using unrestricted content models like
ANY. When a startElement() call needs to indicate the
end of some text, your code can get complicated.
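Here's a small runnable sketch of the "data element" case, assuming a document whose asana elements contain only text. The class name DataElementCollector and the sample document are invented for illustration. The buffer is emptied in startElement() and collected in endElement(), so entity, CDATA, and comment boundaries inside the element never show up in the result.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: gathers the text of each <asana> data element.
class DataElementCollector extends DefaultHandler {
    private final StringBuilder chars = new StringBuilder();
    final List<String> values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        chars.setLength(0);   // drop any text seen between elements
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        chars.append(buf, offset, length);   // however many calls it takes
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("".equals(uri) && "asana".equals(qName)) {
            values.add(chars.toString());
        }
        chars.setLength(0);
    }

    static List<String> collect(String xml) throws Exception {
        DataElementCollector handler = new DataElementCollector();
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE list [<!ENTITY mdash '&#8212;'>]>"
            + "<list><asana>Vrichikasana &mdash; Scorpion</asana>"
            + "<asana><![CDATA[Bakasana]]><!-- also --> &mdash; Crane</asana></list>";
        for (String value : collect(xml)) {
            System.out.println(value);
        }
    }
}
```

The same handler would need more thought for mixed content, where a startElement() call can also mark the end of a text run.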
Remember that if you're using DTDs, you'll likely get some calls to
ignorableWhitespace() to report characters in "element
content" models. I usually like to just discard all such characters,
since they're known to be semantically meaningless. But sometimes
that's not an option, and the solution is instead to call
characters() with the ignorable whitespace characters.
The parameters are the same; you don't even need to reorder them.
public void ignorableWhitespace (char buf [], int offset, int length)
throws SAXException
{ characters (buf, offset, length); }
If you used only element content models and text-only content
models, it'd be easy to get all the useful text from a valid XML
document. It would be the content of "data elements" that you'd get
when endElement() is called or in attribute values from
startElement(). The rest would be ignorable whitespace,
which you'd ignore.
XMLReaderFactory for Bootstrapping
Don't hardwire your code to use a particular SAX2 parser or to rely on features of a particular parser. Good SAX-based systems build almost everything as layers over the parser rather than using nonstandard features. In fact, the best way to bootstrap a SAX2 parser hides what parser you're using: it's a simple call to a helper class:
XMLReader parser = XMLReaderFactory.createXMLReader ();
That gives you the "system default" parser. Which parser is that?
You can control that. The most reliable way is to specify the parser
name on the command line, using the org.xml.sax.driver
system property and the name of your parser to establish a particular
JVM-wide default. You can do it like
java -Dorg.xml.sax.driver=gnu.xml.aelfred2.XmlReader MyMainClass arg ...
Some current SAX2 distributions (SAX2 r2pre3 at this writing but
not JDK 1.4) include easier ways to control the SAX2 default. One way
is through a system resource that's accessed through your class
loader: the META-INF/services/org.xml.sax.driver
resource. That's sensitive to your class loader configuration; in
some cases that may be a feature. Such recent distributions also
expect redistributions (from parser suppliers) to include a
compiled-in "last gasp" default, which handles the case where none of
the other configuration mechanisms have been set up.
The following table gives the names of some widely used SAX2 parsers. You should avoid hardwiring such names into your source code; instead use the parser configuration mechanisms to keep your code free of parser dependencies. All of these are optionally validating, except the one labeled non-validating, and most do quite well on most XML conformance tests.
| Parser | Class Name |
|---|---|
| Ælfred2 | gnu.xml.aelfred2.SAXDriver (non-validating), or else gnu.xml.aelfred2.XmlReader |
| Crimson | org.apache.crimson.parser.XmlReaderImpl |
| Oracle | oracle.xml.parser.v2.SAXParser |
| Xerces Java | org.apache.xerces.parsers.SAXParser |
If you're still using a SAX1 parser, and setting the
org.xml.sax.parser system property to point to that
parser, the XMLReaderFactory will fall back to that class
if it can't find a native SAX2 parser implementation. You should
probably upgrade to a more current implementation, but meanwhile you
can continue to use your old one. It will be automagically wrapped in
a ParserAdapter by the SAX2 factory.
Namespaces have caused a lot of grief for XML developers. At first the use of namespace URIs as purely abstract identifiers caused the confusion, since they looked like URLs that would be used to fetch something (but nobody knew what). But it didn't stop there. Even today reasonable people (along with the applications and tools that they build) have very different perspectives on what it means to be in a particular namespace. It seems to be a rare month in which significant misunderstandings don't crop up in some area of namespace handling.
There's only one basic thing that programmers can do with any
namespace URI: compare it to another one as a string. But not every
name in an XML document has a namespace URI, and names in namespaces
need to be handled differently from names that aren't in a namespace.
(You can rely on either the qName or the
localName to have a value, but not both. Either
name will be an empty string in some cases.) You might be tempted to
write code that assumes every XML element or attribute name is in a
namespace, but that just doesn't match real world data. One day
you'll get a document that's not quite as clean as you expect, and
your code will break.
Which means that when you're writing SAX2 code to look at element or attribute names, you have to figure out whether there's even a namespace name. When there isn't, the namespace URI is always passed as an empty string. Once you know which kind of name you're working with, you can figure out how to handle the element or attribute in question. Inline code to do name-based dispatching should look something like the following (for elements); notice that it doesn't even know there's such a thing as a namespace prefix:
public void startElement (
    String uri, String localName,
    String qName, Attributes atts
) throws SAXException
{
    // Handle elements not in any namespace
    if ("".equals (uri)) {
        // these only have "qName"
        if ("dolce".equals (qName)) {
            // ... handle "dolce"
        } else if ("vita".equals (qName)) {
            // ... handle "vita"
        // ... and all other supported "no namespace" elements
        } else
            error ("unrecognized element name: " + qName);
    // Then handle each supported namespace separately
    } else if ("http://www.example.com/namespaces/ns1".equals (uri)) {
        // these have a "localName" with no prefix
        if ("free".equals (localName)) {
            // ... handle "free"
        } else if ("open".equals (localName)) {
            // ... handle "open"
        // ... and all other supported NS1 elements
        } else
            error ("unrecognized NS1 element name: " + localName);
    // ... and similarly for all other supported element namespaces
    } else
        error ("unrecognized element namespace: " + uri);
}
Attributes might not need that kind of handling. Applications often "know about" particular attributes, access them by name, and just ignore any unrecognized attributes. If you're accessing attribute values in that way, just make sure you use the right naming convention, either Attributes.getValue(uri,local) or Attributes.getValue(qName), and you should have no problems.
Otherwise you'll be scanning all of an element's attributes. You'll need to check whether each attribute is in a namespace, just like you checked whether its element was in a namespace. If it's not in a namespace, you probably know a bit more about the attribute than in the case of an element that's not in a namespace. It's either going to be associated with that element's type or, if you've enabled reporting of namespace prefixes, it'll be a namespace declaration. (That's required by the Namespaces in XML specification, but DOM and the XML Infoset have chosen instead to put such declarations into a namespace.) Your code might look something like this:
Attributes atts = ...;
int length = atts.getLength ();
for (int i = 0; i < length; i++) {
    String uri = atts.getURI (i);
    if ("".equals (uri)) {
        String qName = atts.getQName (i);
        // ... then dispatch based on qName
        // including error based on unrecognized name
        // "xmlns" and "xmlns:*" declarations would appear here
    } else if ("http://www.example.com/namespaces/ns1".equals (uri)) {
        String localName = atts.getLocalName (i);
        // ... then dispatch based on "localName"
        // including error based on unrecognized name
    // ... and similarly for all other supported attribute namespaces
    } else
        error ("unrecognized attribute namespace: " + uri);
}
If your code uses idioms like those shown above, it'll be handling namespaces correctly. Otherwise, you're likely to run into a document or parser that confuses your code. Don't try to ignore namespaces completely. If your code wants a simpler "pre-namespaces" view of the world, at least make sure the namespace URI is always empty and report errors for all elements and attributes where that's not true.
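One way to convince yourself that such idioms really are prefix-independent is to run two documents that differ only in their prefixes through the same handler. The sketch below does that; the class name ExpandedNames and the "{uri}local" notation are my own conventions for the example, not anything defined by SAX. It assumes the parser's default feature settings (namespace processing on, prefix reporting off).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: records names without ever looking at a prefix.
class ExpandedNames extends DefaultHandler {
    final List<String> seen = new ArrayList<>();

    // "{uri}local" for names in a namespace, the plain qName otherwise
    private static String expand(String uri, String localName, String qName) {
        return "".equals(uri) ? qName : "{" + uri + "}" + localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        seen.add("element " + expand(uri, localName, qName));
        // attribute order isn't significant in XML, so sort before recording
        TreeSet<String> sorted = new TreeSet<>();
        for (int i = 0; i < atts.getLength(); i++) {
            sorted.add(expand(atts.getURI(i), atts.getLocalName(i), atts.getQName(i))
                + "=" + atts.getValue(i));
        }
        for (String attribute : sorted) {
            seen.add("attribute " + attribute);
        }
    }

    static List<String> names(String xml) throws Exception {
        ExpandedNames handler = new ExpandedNames();
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        String ns = "http://www.example.com/namespaces/ns1";
        System.out.println(names("<a:free xmlns:a='" + ns + "' a:k='1' plain='2'/>"));
        System.out.println(names("<b:free xmlns:b='" + ns + "' b:k='1' plain='2'/>"));
    }
}
```

Both calls should report the same names, since the only difference between the two inputs is the prefix bound to the namespace URI.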
If you've never set up a parser to validate, like
parser.setFeature ( "http://xml.org/sax/features/validation", true);
and then been surprised when you didn't get any error reports,
congratulations! You're in the minority. Most developers have
forgotten (and usually more than once) that by default, validity
errors are ignored by SAX parsers. You need to call
XMLReader.setErrorHandler() with some useful error
handler to make validity errors have any effect at all. That handler
needs to do something interesting with the validity errors reported to
it using error() calls.
It's worth having a good utility class that you can reuse and reconfigure to handle this particular situation. It'll be handy even when you're not validating. Such a class might look like
class MyErrorHandler implements ErrorHandler
{
    private void print (String label, SAXParseException e)
    {
        System.err.println ("** " + label + ": " + e.getMessage ());
        System.err.println (" URI = " + e.getSystemId ());
        System.err.println (" line = " + e.getLineNumber ());
    }
    boolean reportErrors = true;
    boolean abortOnErrors = false;
    // for recoverable errors, like validity problems
    public void error (SAXParseException e)
    throws SAXException
    {
        if (reportErrors)
            print ("error", e);
        if (abortOnErrors)
            throw e;
    }
    // ... plus similar for fatalError(), warning()
    // ... and maybe more interesting configuration support
}
A SAX ErrorHandler should know two policies for each of its three fault classes: whether to report such faults, and whether such faults should terminate parsing. Various mechanisms can be used to report the fault, such as logging, adding text to a Swing message window, or just printing. At this time, SAX doesn't support portable mechanisms to identify particular failure modes, so you can't really consider "why did it fail?" in the handler.
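As a quick check that validity errors only become visible once a handler is installed, here's a small sketch that turns on validation and collects whatever error() receives. CollectingErrorHandler and its validate() helper are names invented for this example, and the exact error messages will vary from parser to parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: remembers recoverable errors, aborts on fatal ones.
class CollectingErrorHandler implements ErrorHandler {
    final List<String> errors = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        // policy for this sketch: warnings are ignored
    }

    @Override
    public void error(SAXParseException e) {
        // validity problems arrive here; record them and keep parsing
        errors.add(e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXParseException {
        throw e;   // well-formedness errors can't be recovered from
    }

    // Validate a document that carries its own DTD; return the validity errors.
    static List<String> validate(String xml) throws Exception {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setFeature("http://xml.org/sax/features/validation", true);
        parser.setErrorHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.errors;
    }

    public static void main(String[] args) throws Exception {
        String dtd = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b EMPTY>]>";
        System.out.println(validate(dtd + "<a><b/></a>").size() + " errors");
        System.out.println(validate(dtd + "<a><c/></a>"));
    }
}
```

Swapping the list for a logger, or throwing from error(), changes the policy without touching the parsing code.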
ErrorHandler between the XMLReader and your own Handlers
When your application uses the same
ErrorHandler in its own handlers and for the parser, it creates an
integrated stream of fault information. That's useful in its own
right, but the best part is that all the errors (and warnings) can
then be handled according to the same policy and mechanism. You can
easily change how faults are handled by switching or reconfiguring
that ErrorHandler object. In most cases, the SAX fault
classifications are fine, since having more than
fatalError(), error(), and
warning() will rarely be helpful. Here's how you might
set this up for a simple handler:
public class MyHandler implements ContentHandler
{
    // doesn't matter if this stays as null, since
    // SAXParseException constructors don't care
    private Locator locator;
    public void setDocumentLocator (Locator l)
    { locator = l; }
    // application and SAX errors should use the same handler
    private ErrorHandler eh = new DefaultHandler ();
    public void setErrorHandler (ErrorHandler e)
    { eh = (e == null) ? new DefaultHandler () : e; }
    // simpler is usually better ...
    public final void error (String message) throws SAXException
    { eh.error (new SAXParseException (message, locator)); }
    public final void warning (String message) throws SAXException
    { eh.warning (new SAXParseException (message, locator)); }
    public final void fatalError (String message) throws SAXException
    {
        SAXParseException e = new SAXParseException (message, locator);
        eh.fatalError (e);
        // in case eh tries to continue: we can't, and won't
        throw e;
    }
    // the real application code would use error(String) and friends
    // to report errors, something like this:
    public void endElement (String uri, String localName, String qName)
    throws SAXException
    {
        // ... branch to figure out which element's processing to do ...
        if (!processData (getCharacters ())) {
            error ("bad '" + localName + "' content");
            // recover from it (clean up state) and
            return;
        }
        // ... now repackage and save all the object's state
    }
    // ... lots more code
}
Then you should initialize both the XMLReader and your
content handlers (including any that process DTD content) to use the
same ErrorHandler. The SAX ErrorHandler interface is flexible enough
to use as a general error handling policy interface in much of your
XML code. In fact, you may have noticed that the
javax.xml.parsers.DocumentBuilder class uses one to
simplify error reporting when building a DOM Document.
If you want, your application can subclass
SAXParseException to provide some application-specific
exception information, which might be understood by that error
handler. It might use information about what happened to make more
enlightened decisions about how to handle the problem.
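As a sketch of that last idea, an application-specific exception can carry extra fields that a cooperating error handler checks for. Everything here is hypothetical: the BadValueException class, its elementName field, and the policy of tolerating bad values only in "comment" elements are invented for the example.

```java
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical application exception: remembers which element had bad content.
class BadValueException extends SAXParseException {
    final String elementName;

    BadValueException(String message, Locator locator, String elementName) {
        super(message, locator);   // a null locator is acceptable here
        this.elementName = elementName;
    }
}

// Hypothetical policy: bad values in <comment> elements are tolerated,
// every other recoverable error stops the parse.
class PolicyHandler extends DefaultHandler {
    int tolerated;

    @Override
    public void error(SAXParseException e) throws SAXException {
        if (e instanceof BadValueException
                && "comment".equals(((BadValueException) e).elementName)) {
            tolerated++;
            return;
        }
        throw e;
    }

    public static void main(String[] args) {
        PolicyHandler handler = new PolicyHandler();
        try {
            handler.error(new BadValueException("odd text", null, "comment"));
            handler.error(new BadValueException("not a number", null, "price"));
        } catch (SAXException e) {
            System.out.println("aborted: " + e.getMessage());
        }
        System.out.println("tolerated " + handler.tolerated);
    }
}
```

Because the subclass is still a SAXParseException, handlers that don't know about it simply treat it like any other error.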
Once developers get past the initial milestone of learning how SAX parser callbacks map to the input text, the next step is to figure out how to turn such a stream of callbacks into application data. Certainly SAX is low overhead, and no other API is likely to get less in the way. At the same time, SAX is not exactly going out of its way to package things neatly. It's the very fact that SAX doesn't pick data structures for you that makes it so powerful. That can take getting used to, particularly if you're used to thinking in terms of structures that someone else designed.
A good place to start is to make a ContentHandler
implementation that keeps important information in a stack. For
example, you could define a class that records an element name (with
its namespace, if any) and uses the AttributesImpl class
to snapshot its associated attributes. If you create those entries in
startElement() and stack them, any callback could use
that information before endElement() popped the stack.
Certain attributes, including xml:base,
xml:lang, and xml:space, are in a sense
"inherited", and you might need to walk up that stack to find such a
value while processing other event callbacks.
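Here's one way such a stack might look, as a sketch rather than a recommended design. The LangTracker class, its Entry record, and the report format are all invented for this example; it snapshots attributes with AttributesImpl and walks the stack to find the nearest xml:lang value.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: reports the effective language of each element.
class LangTracker extends DefaultHandler {
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    // One stack entry per open element.
    static final class Entry {
        final String uri, name;
        final Attributes atts;

        Entry(String uri, String name, Attributes atts) {
            this.uri = uri;
            this.name = name;
            // snapshot: the parser may reuse its Attributes object
            this.atts = new AttributesImpl(atts);
        }
    }

    private final Deque<Entry> stack = new ArrayDeque<>();
    final List<String> report = new ArrayList<>();

    // Walk from the innermost open element outward.
    private String inheritedLang() {
        for (Entry e : stack) {
            String lang = e.atts.getValue(XML_NS, "lang");
            if (lang != null) return lang;
        }
        return null;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        stack.push(new Entry(uri, "".equals(uri) ? qName : localName, atts));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        report.add(stack.peek().name + ":" + inheritedLang());
        stack.pop();
    }

    static List<String> track(String xml) throws Exception {
        LangTracker handler = new LangTracker();
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.report;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(track("<doc xml:lang='en'><p><q xml:lang='fr'/><r/></p></doc>"));
    }
}
```

The same Entry class is a natural home for application data gathered from an element's children.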
Such stack entries are also convenient places to collect application-specific information about an element's children. For example, you might be unmarshaling a series of data elements, converting them from strings into more specialized data types as you parse. You'd store those converted values in members of that special stack entry, reporting application level errors when they're detected. Periodically you could transform such entries (or subtrees of entries) into custom data structures that might no longer reflect the way XML text happened to encode that data.
Of course if you track every data item that comes in through SAX, you're starting down a well trodden path. There are plenty of APIs that do that, optimized for one model or another but likely not for your particular application. Still, it can be good fun and useful to build up SAX infrastructure for your application that way.
InputSource to wrap in-memory data
New SAX programmers often end up with some
data in memory, perhaps in a string or other data buffer, that needs
to be parsed as XML. (Maybe it came from a database or was built by
some other program component.) It's easy to use SAX to parse these,
since the java.io package provides classes that let you
create character streams from character data. You can use
CharArrayReader to read from arrays of characters, or
StringReader as shown here when the data starts as a
string:
Reader reader;
InputSource in;
XMLReader parser;
reader = new StringReader ("<bank name='Gringott&apos;s' box='713'/>");
in = new InputSource (reader);
parser = XMLReaderFactory.createXMLReader ();
parser.parse (in);
You can do similar things with byte arrays, using the
ByteArrayInputStream class to create a byte stream, but
in that case you've got to be careful about character encoding issues.
It's best if those bytes are UTF-8 encoded XML data.
Such input sources can be used as direct parser inputs (as shown here) or, if you're using DTDs and entities defined in them, through an EntityResolver.
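For the byte array case, a sketch like the following states the encoding explicitly with InputSource.setEncoding() instead of leaving the parser to detect it. The InMemoryParse class and its bankName() helper are made up for the example, and it assumes the bytes really are UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: parses XML held in a byte array.
class InMemoryParse {
    // Returns the "name" attribute of the first <bank> element, or null.
    static String bankName(byte[] utf8Xml) throws Exception {
        final String[] name = new String[1];

        InputSource in = new InputSource(new ByteArrayInputStream(utf8Xml));
        in.setEncoding("UTF-8");   // assumption: the caller encoded with UTF-8

        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                if (name[0] == null && "".equals(uri) && "bank".equals(qName)) {
                    name[0] = atts.getValue("", "name");
                }
            }
        });
        parser.parse(in);
        return name[0];
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "<bank name='Gringott&apos;s' box='713'/>"
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(bankName(data));
    }
}
```

If the data starts life as a String, wrapping it in a StringReader as shown earlier avoids the encoding question altogether.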
EntityResolver
XML uses external entities to support document modularity; they are available if you're using DTDs. When a document references an entity, parsers normally fetch it and parse the result. That's exactly what you need in most cases, but it causes problems when the server hosting that URL goes offline for a while (or maybe it was your client that wanted to be disconnected?), and when the network is unreliable. Your whole application could become unavailable, just because it's trying to get a resource that can't be gotten.
How can you avoid entity access problems? SAX2 gives you two basic controls over entity processing.
First, two SAX2 feature flags control whether external entities are
ever fetched. One affects parameter entities (like %module;)
which are used inside the DTD. The other affects general entities
(like &data;) in the body of the document. Most SAX
parsers don't let you turn off this fetching, but if you're using one
which does, this may be a fine solution. (The current Ælfred2
release supports this, but I don't know another SAX2 parser that
does.) So you may not be able to use this facility.
Second, you can use an EntityResolver to control how
entities are resolved. Whenever a SAX parser needs to access an
external entity, it will ask the resolveEntity() method
on your resolver how to handle that entity. That method sees the
entity's fully resolved URI and, if it had one, its public ID. (A new
SAX extension is in the works to provide more information, but it's
not widely supported yet.) Some interesting things for that method to
do include:
Map public IDs to local file names. That's what public IDs were designed for, and hashtables were designed for such mappings. Strongly encouraged! You can do the same thing for system IDs. (There are also "catalog" systems to help manage such mappings. You may want to use a resolver that knows how to use one.)
Fetch or compute the data, maybe using a database. If you're using a private URI scheme that your JVM doesn't understand, maybe blob:database-name:database-key, you'll probably want to store those in the public IDs and do the URI resolution yourself.
Construct an empty input source and return that. This is safe to do for general entities, after the first startElement(), and a bit dangerous for parameter entities, but you may be better off trying to skip some remote entities than trying to access them. (The issue with handling parameter entities this way is that the parser won't know it didn't see their declarations, and so it won't behave correctly.)
A simple entity resolver might look like this for an application that's really paranoid about preventing access to all entities it doesn't control. If you were using it, you'd probably preload the hashtable with entries for all of your application's entities. And you'd probably apply intelligence about what requests are really unsafe or your customers would get unhappy. For example, maybe string prefix matches would be used to grant access to certain files inside the firewall (or its DMZ), and only the ones outside that security boundary would be airbrushed out of the picture.
class MyResolver implements EntityResolver
{
    private Hashtable publics, systems;
    MyResolver (Hashtable pub, Hashtable sys)
    { publics = pub; systems = sys; }
    public InputSource resolveEntity (String publicId, String systemId)
    throws IOException, SAXException
    {
        InputSource retval = null;
        if (publicId != null) {
            String value = (String) publics.get (publicId);
            if (value != null) {
                // use new system ID and original public ID
                retval = new InputSource (value);
                retval.setPublicId (publicId);
            }
        }
        if (retval == null) {
            String value = (String) systems.get (systemId);
            if (value != null) {
                // use new system ID and original public ID
                retval = new InputSource (value);
                retval.setPublicId (publicId);
            }
        }
        if (retval == null) {
            // we're sooo paranoid here!!
            System.err.println ("RESOLVER: punt " + systemId + " "
                + (publicId == null ? "" : publicId));
            retval = new InputSource (new StringReader (""));
            retval.setSystemId (systemId);
            retval.setPublicId (publicId);
        }
        // if we returned null, the systemId would
        // be dereferenced using standard URL handling.
        return retval;
    }
}
A good rule of thumb is always to use a resolver for any application that reuses a known set of DTDs. Do it, if for no other reason than to avoid accessing the network when you don't need to. Only mission critical servers would likely want to be as paranoid as shown above.
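To see a resolver keep a parse off the network, here's a self-contained sketch in which the "local copy" of a DTD lives in memory. The InMemoryResolver class, the public ID, and the example.invalid URL are all invented for the illustration; a real application would more likely map to local files or a catalog.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: serves known public IDs from memory and
// substitutes an empty entity for everything else.
class InMemoryResolver implements EntityResolver {
    private final Map<String, String> byPublicId = new HashMap<>();
    int hits;

    void register(String publicId, String content) {
        byPublicId.put(publicId, content);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String content = (publicId == null) ? null : byPublicId.get(publicId);
        if (content == null) {
            content = "";   // paranoid: never fall back to fetching the URL
        } else {
            hits++;
        }
        InputSource in = new InputSource(new StringReader(content));
        in.setPublicId(publicId);
        in.setSystemId(systemId);
        return in;
    }

    // Parse a document using the resolver and return all its character data.
    static String parseText(String xml, InMemoryResolver resolver) throws Exception {
        final StringBuilder text = new StringBuilder();
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setEntityResolver(resolver);
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] buf, int offset, int length) {
                text.append(buf, offset, length);
            }
        });
        parser.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        InMemoryResolver resolver = new InMemoryResolver();
        resolver.register("-//Example//DTD Doc//EN", "<!ENTITY greeting 'hello'>");
        String xml = "<!DOCTYPE doc PUBLIC '-//Example//DTD Doc//EN' "
            + "'http://example.invalid/doc.dtd'><doc>&greeting;</doc>";
        System.out.println(parseText(xml, resolver));
    }
}
```

The document still names a remote system ID, but the parser only ever reads what the resolver hands back.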
SAX is made for streaming processing, and the best way to stream your processing is to connect a series of processing components into an event pipeline. One component produces events, the next consumes them and produces new (or maybe filtered) events for yet another component to consume. Often, both your CPU and I/O subsystems can be working on different parts of the pipeline at the same time, minimizing elapsed time.
SAX parsers produce events, but they're not the only way to produce a stream of SAX events. One common practice is to have programs call the SAX event methods directly, perhaps while walking over a data structure as part of converting it to XML. SAX2 defines a way to make a SAX parser that walks a DOM tree, rather than XML text, emitting a stream of SAX events. And toolsets like DOM4J and JDOM haven't neglected such data-to-SAX converters, either. Think of that SAX event stream as an efficient in-memory version of the generic transfer syntax which XML provides between different processes.
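Calling the event methods directly can be as simple as the following sketch, which walks a list of strings and reports it to any ContentHandler as if a parser had read it. The ListToSax class, the list and item element names, and the index attribute are invented for the example; the small tracing handler exists only so there's something to look at.

```java
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: turns a list into a stream of SAX events.
class ListToSax {
    static void emit(List<String> items, ContentHandler out) throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        out.startDocument();
        out.startElement("", "list", "list", atts);
        for (int i = 0; i < items.size(); i++) {
            atts.clear();
            atts.addAttribute("", "index", "index", "CDATA", String.valueOf(i));
            out.startElement("", "item", "item", atts);
            char[] text = items.get(i).toCharArray();
            out.characters(text, 0, text.length);
            out.endElement("", "item", "item");
        }
        out.endElement("", "list", "list");
        out.endDocument();
    }

    // Feed the events to a tiny consumer that records what it was told.
    static String trace(List<String> items) throws SAXException {
        final StringBuilder log = new StringBuilder();
        emit(items, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                log.append('<').append(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    log.append(' ').append(atts.getQName(i))
                       .append('=').append(atts.getValue(i));
                }
                log.append('>');
            }

            @Override
            public void characters(char[] buf, int offset, int length) {
                log.append(buf, offset, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.append("</").append(qName).append('>');
            }
        });
        return log.toString();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(trace(Arrays.asList("a", "b")));
    }
}
```

The consumer can't tell whether the events came from a parser or from emit(), which is what makes such producers interchangeable.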
Your "ultimate consumer" in a SAX event pipeline could write XML
text out (use one of the various XMLWriter classes) or
turn the events into an application-optimized data structure. It's
easy to build a DOM (or DOM4J, or JDOM) model from a modified SAX
event stream, too. And since you have control over what happens, you
don't have to build the entire generic tree structure before you begin
processing it; if you do it that way, you can garbage collect each
chunk of data as soon as you're done processing it, rather than
waiting for the whole document to materialize in memory.
If you're using XSLT in Java, you may well be familiar with the
javax.xml.transform.sax (TRAX) package. XSLT engines
such as SAXON or Xalan support it. You may not know that it's easy to
feed SAX events as inputs to an XSLT engine as a SAX pipeline stage,
using a TransformerHandler, or to collect XSLT engine
output as SAX events using a SAXResult. SAX events in,
transformation according to XSLT, and then SAX events out again: those
TRAX APIs are essentially wrappers around SAX pipeline stages! It can
be very worthwhile to unwrap them and use XSLT for some heavier weight
transformations in your SAX pipelines.
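Here's a sketch of an XSLT stage sitting in the middle of a pipeline: parser events go into a TransformerHandler, and the transformed events come out through a SAXResult. The XsltStage class and its tiny stylesheet are invented for the example, and the cast assumes the JVM's default TransformerFactory supports SAX, which you'd want to check in production code.

```java
import java.io.StringReader;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: SAX events in, XSLT, SAX events out.
class XsltStage {
    // Assumed stylesheet: wraps the text of /doc/greeting in an <out> element.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/doc'><out><xsl:value-of select='greeting'/></out>"
        + "</xsl:template></xsl:stylesheet>";

    static String transform(String xml) throws Exception {
        final StringBuilder log = new StringBuilder();

        // final consumer: records the events produced by the stylesheet
        DefaultHandler consumer = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                log.append('<').append(qName).append('>');
            }

            @Override
            public void characters(char[] buf, int offset, int length) {
                log.append(buf, offset, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.append("</").append(qName).append('>');
            }
        };

        SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler stage =
            factory.newTransformerHandler(new StreamSource(new StringReader(STYLESHEET)));
        stage.setResult(new SAXResult(consumer));

        // producer: an ordinary parser feeding the XSLT stage
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(stage);
        parser.parse(new InputSource(new StringReader(xml)));
        return log.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<doc><greeting>hi</greeting></doc>"));
    }
}
```

Either end of that pipeline could be replaced, for example with a hand-written event producer or another TransformerHandler.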
I could go on about pipelines, but I'll just mention that SAX2
includes an XMLFilterImpl class, handy for writing some
kinds of intermediate pipeline stages, and stop. Pipelines are
covered in more detail in that new book
that I mentioned. The main thing to remember is that event pipelines
are the natural model for components in SAX. You should plan to use
them if you're doing anything very substantial.
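For a taste of what an intermediate stage built on XMLFilterImpl looks like, here's a sketch of a filter that removes one kind of element, along with everything inside it, before the events reach the next consumer. The DropElementFilter class and the "secret" element name are invented for the example.

```java
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical pipeline stage: drops a named no-namespace element and its content.
class DropElementFilter extends XMLFilterImpl {
    private final String dropped;
    private int depth;   // greater than zero while inside a dropped element

    DropElementFilter(XMLReader parent, String droppedQName) {
        super(parent);
        dropped = droppedQName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0 || ("".equals(uri) && dropped.equals(qName))) {
            depth++;
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth > 0) {
            depth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] buf, int offset, int length) throws SAXException {
        if (depth == 0) super.characters(buf, offset, length);
    }

    // Pipeline: parser -> filter -> a consumer that records what survives.
    static String filter(String xml, String droppedQName) throws Exception {
        final StringBuilder log = new StringBuilder();
        DropElementFilter stage =
            new DropElementFilter(XMLReaderFactory.createXMLReader(), droppedQName);
        stage.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                log.append('<').append(qName).append('>');
            }

            @Override
            public void characters(char[] buf, int offset, int length) {
                log.append(buf, offset, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.append("</").append(qName).append('>');
            }
        });
        stage.parse(new InputSource(new StringReader(xml)));
        return log.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(filter("<a>x<secret>y<i>w</i></secret><b>z</b></a>", "secret"));
    }
}
```

Since a filter is itself an XMLReader, stages like this can be stacked on one another without the final consumer knowing how many there are.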
If you've read this far, you deserve a special bonus tip. SAX has its own site, http://www.saxproject.org. Visit the site for the latest information and updated documentation about SAX.
David Brownell, author of SAX2, is a software engineer. He recently worked for three years at JavaSoft, where he provided Sun's XML and DOM software, SSL and public key technologies, and the original version of the JavaServer Pages technology, and worked on the Java Servlet API for Web servers.
O'Reilly & Associates will soon release (January 2002) SAX2.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.