Top Ten SAX2 Tips
December 05, 2001
If you write XML processing code in Java, or indeed most popular
programming languages, you will be familiar with SAX, the Simple API
for XML. SAX is the best API available for streaming XML processing,
that is, for processing documents as they're being read. SAX is the most
flexible API in terms of data structures: since it doesn't force you
to use a particular in-memory representation, you can choose the one
you like best. SAX has great support for the XML Infoset
(particularly in SAX2, the second version of SAX) and better DTD
support than other widely available APIs. SAX2 is now part of JDK 1.4
and will soon be available to even more Java developers.
In this article, I'll highlight some points that can make your SAX
programming in Java more portable, robust, and expressive. Some of these
points are just advice, some address common programming problems, and some
address SAX techniques that offer unique power to developers. A new book
from O'Reilly, SAX2, addresses these topics and more. It details all the
SAX APIs and explains each feature in more detail than this short article
provides.
1. Keep it Simple
Although SAX is called the Simple API for XML,
things are often more complicated than they first appear. SAX has grown to
accommodate a lot of the flexibility needed by the tools and applications
that process XML, but when you start out with SAX you should first focus on
its underlying simplicity.
Related Reading: SAX2, by David Brownell. January 2002 (est.), 240 pages (est.), $29.95 (est.)
Think of SAX2 (including its standardized extensions) as basically
including: one parser API, two handler interfaces for content, two
handler interfaces for DTD declarations (the best support of any
current Java parser API), and a bunch of other classes and interfaces.
Many applications can ignore most of that and start with just a few
classes and interfaces:
XMLReader is the basic parser interface, and you get a parser object using XMLReaderFactory.
DefaultHandler has no-op implementations of the most popular handler methods, which you can just override.
Attributes wraps up the attributes of the elements reported to you.
You can write useful tools with just those APIs,
overriding only three methods in the DefaultHandler class:
startElement(), called when the parser reports the beginning of an element and its attributes,
characters(), called to deliver character data inside such elements, and
endElement(), called when the parser reports the end of the element.
You shouldn't use any other functionality until your application
requires it. Some of the tips which follow describe common reasons to
use more features. Good error handling is right at the top of the list
of such reasons, and if you ever process documents with DTDs, smarter
handling of external entity resolution won't be far behind.
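To make the "three methods" starting point concrete, here is a minimal sketch in modern Java. The class name ElementCounter and its count() helper are made up for illustration; only XMLReader, XMLReaderFactory, DefaultHandler, Attributes, and InputSource come from SAX2 itself. Newer JDKs mark XMLReaderFactory as deprecated, though it still works.

```java
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical example class: counts elements and text characters.
class ElementCounter extends DefaultHandler {
    int elements = 0;   // incremented in startElement()
    int closed = 0;     // incremented in endElement()
    int textChars = 0;  // total characters, however many callbacks deliver them

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements++;
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        textChars += length;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        closed++;
    }

    // Parses an in-memory document and returns the populated handler.
    static ElementCounter count(String xml) throws Exception {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        ElementCounter handler = new ElementCounter();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        ElementCounter c = count("<list><item>one</item><item>two</item></list>");
        // prints "3 elements, 6 text characters"
        System.out.println(c.elements + " elements, " + c.textChars + " text characters");
    }
}
```

Everything else in DefaultHandler stays a no-op until the application needs it.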
2. Buffer characters() calls
Just because a bunch of text looks to you like
it's one long set of characters doesn't mean that's how a SAX parser
will report it. You need to explicitly group characters that your
application thinks belong together. For example, consider this XML
fragment:
<asana>Vrichikasana &mdash; Scorpion</asana>
Certainly you'll see callbacks for the element boundaries, and for
the various characters. But how many callbacks will you see for those
characters? It'd be legal (but annoying) for the parser to report one
character per callback. More typically, parsers would report the
characters before and after the mdash entity reference
using one callback each, and also report the entity reference (plus
its contents, whatever they are). Some parsers that don't report the
entity reference would make only one characters()
callback for the whole thing, assuming the entity is the standard ISO
entity for Unicode character U+2014. There are even more legal ways
to report that simple block of text. Your event handler needs to work
with all of them.
The solution is to buffer up all the characters you receive in the
characters() callback. You could just append to a
String, but that's not particularly efficient. It's easier to use a
StringBuffer, since one StringBuffer.append() signature
is an exact match for the parameters in this callback, and it's easy
to turn those into a String later:
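import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

class MyHandler implements ContentHandler {
    // accumulates text across any number of characters() calls
    private StringBuffer chars = new StringBuffer ();

    public void characters (char buf [], int offset, int length)
    throws SAXException
        { chars.append (buf, offset, length); }

    // returns the buffered text and empties the buffer
    private String getCharacters ()
    {
        String retval = chars.toString ();
        chars.setLength (0);
        return retval;
    }

    // ... lots more in this class!
}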
And now the interesting question is: when to collect that set of
buffered characters to do something interesting with it? The answer
depends on what your application is doing, but it'll usually be in
endElement() or startElement(). Sometimes
you'll collect the characters when there's a
processingInstruction(), or, more rarely, when a
comment() is reported. As a rule, avoid treating CDATA
sections or entity expansions as if characters inside them were
somehow special. Such boundaries are primarily for authoring
convenience, and they shouldn't matter except to editor
applications.
One scenario that's easy to handle is what's sometimes called "data
elements" -- which contain text only and no other elements. (Their
DTD content model might be (#PCDATA).) When you know
that's what you're working with, collect the element's data in
endElement(). That transparently ignores things like
comments and PIs that might have been inside the element, as well as
any entity or CDATA section boundaries found there. It's harder to
give general rules for other kinds of content model, which is in part
why many people like to specify the data style of element rather than
allowing "mixed content" or using unrestricted content models like
ANY. When a startElement() call needs to indicate the
end of some text, your code can get complicated.
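Here is a hedged sketch of the "data element" case. It assumes a made-up handler named DataElementCollector that only cares about asana elements, and it swaps in StringBuilder for StringBuffer, which offers the same append() signature. The sample document in main() uses a character reference, a CDATA section, and a comment to split the text into several callbacks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical handler: gathers the text of each "asana" data element.
class DataElementCollector extends DefaultHandler {
    private final StringBuilder chars = new StringBuilder();
    final List<String> values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Drop any text seen before this element started.
        chars.setLength(0);
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        // May be called several times for one run of text.
        chars.append(buf, offset, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The buffer now holds the element's whole text, whatever
        // CDATA, entity, or comment boundaries were inside it.
        if ("asana".equals(qName)) {
            values.add(chars.toString());
        }
        chars.setLength(0);
    }

    // Parses an in-memory document and returns the collected values.
    static List<String> collect(String xml) throws Exception {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        DataElementCollector handler = new DataElementCollector();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<poses>"
            + "<asana>Vrichikasana &#8212; <![CDATA[Scorpion]]></asana>"
            + "<asana>Tada<!-- mountain -->sana</asana>"
            + "</poses>";
        for (String value : collect(xml)) {
            System.out.println(value);
        }
    }
}
```

Because the text is only examined in endElement(), the handler never needs to know how many characters() calls the parser chose to make.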
Remember that if you're using DTDs, you'll likely get some calls to
ignorableWhitespace() to report characters in "element
content" models. I usually like to just discard all such characters,
since they're known to be semantically meaningless. But sometimes
that's not an option, and the solution is instead to call
characters() with the ignorable whitespace characters.
The parameters are the same; you don't even need to reorder them.
public void ignorableWhitespace (char buf [], int offset, int length)
throws SAXException
    { characters (buf, offset, length); }
If you used only element content models and text-only content
models, it'd be easy to get all the useful text from a valid XML
document. It would be the content of "data elements" that you'd get
when endElement() is called or in attribute values from
startElement(). The rest would be ignorable whitespace,
which you'd ignore.
3. Use XMLReaderFactory for Bootstrapping
Don't hardwire your code to use a particular
SAX2 parser or to rely on features of a particular parser. Good
SAX-based systems build almost everything as layers over the
parser rather than using nonstandard features. In fact, the best
way to bootstrap a SAX2 parser hides what parser you're using: it's a
simple call to a helper class:
XMLReader parser = XMLReaderFactory.createXMLReader ();
That gives you the "system default" parser. Which parser is that?
You can control that. The most reliable way is to specify the parser
name on the command line, using the org.xml.sax.driver
system property and the name of your parser to establish a particular
JVM-wide default. You can do it like this:
java -Dorg.xml.sax.driver=gnu.xml.aelfred2.XmlReader MyMainClass arg ...
Some current SAX2 distributions (SAX2 r2pre3 at this writing but
not JDK 1.4) include easier ways to control the SAX2 default. One way
is through a system resource that's accessed through your class
loader: the META-INF/services/org.xml.sax.driver
resource. That's sensitive to your class loader configuration; in
some cases that may be a feature. Such recent distributions also
expect redistributions (from parser suppliers) to include a
compiled-in "last gasp" default, which handles the case where none of
the other configuration mechanisms have been set up.
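One way to keep that bootstrapping in a single place is a small helper like the sketch below. The class ParserBootstrap and its createParser() method are hypothetical names. The only SAX2 calls used are XMLReaderFactory.createXMLReader() and the SAXException constructor that wraps another exception. The helper reads the org.xml.sax.driver property only to make a failure message more useful; it never names a parser class itself.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper: the one place in an application that creates parsers.
class ParserBootstrap {
    // Returns the configured default SAX2 parser. If bootstrapping fails,
    // the rethrown exception mentions the driver property that was in effect.
    static XMLReader createParser() throws SAXException {
        String driver = System.getProperty("org.xml.sax.driver"); // null when unset
        try {
            return XMLReaderFactory.createXMLReader();
        } catch (SAXException e) {
            throw new SAXException(
                "cannot create SAX2 parser (org.xml.sax.driver=" + driver + ")", e);
        }
    }

    public static void main(String[] args) throws SAXException {
        XMLReader parser = createParser();
        // The class name depends entirely on how the JVM is configured.
        System.out.println("Using parser class: " + parser.getClass().getName());
    }
}
```

Running it with and without -Dorg.xml.sax.driver=... on the command line is a quick way to check which parser your configuration actually selects.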
The following table gives the names of some widely used SAX2
parsers. You should avoid hardwiring such names into your source
code; instead use the parser configuration mechanisms to keep your
code free of parser dependencies. All of these are optionally
validating, except the one labeled non-validating, and most do quite
well on most XML conformance tests.
If you're still using a SAX1 parser, and setting the
org.xml.sax.parser system property to point to that
parser, the XMLReaderFactory will fall back to that class
if it can't find a native SAX2 parser implementation. You should
probably upgrade to a more current implementation, but meanwhile you
can continue to use your old one. It will be automagically wrapped in
a ParserAdapter by the SAX2 factory.
4. Check for empty Namespace URI Strings
Namespaces have caused a lot of grief for XML
developers. At first the use of namespace URIs as purely abstract
identifiers caused the confusion, since they looked like URLs that
would be used to fetch something (but nobody knew what). But it
didn't stop there. Even today reasonable people (along with the
applications and tools that they build) have very different
perspectives on what it means to be in a particular namespace. It
seems to be a rare month in which significant misunderstandings don't
crop up in some area of namespace handling.
There's only one basic thing that programmers can do with any
namespace URI: compare it to another one as a string. But not every
name in an XML document has a namespace URI, and names in namespaces
need to be handled differently from names that aren't in a namespace.
(You can rely on at least one of qName and
localName having a value, but not on both; either
name will be an empty string in some cases.) You might be tempted to
write code that assumes every XML element or attribute name is in a
namespace, but that just doesn't match real world data. One day
you'll get a document that's not quite as clean as you expect, and
your code will break.
Which means that when you're writing SAX2 code to look at element
or attribute names, you have to figure out whether there's even a
namespace name. When there isn't, the namespace URI is always passed
as an empty string. Once you know which kind of name you're working
with, you can figure out how to handle the element or attribute in
question. Inline code to do name-based dispatching should look
something like the following (for elements); notice that it doesn't
even know there's such a thing as a namespace prefix:
// error() stands for the application's own error reporting
public void startElement (
    String uri, String localName,
    String qName, Attributes atts
) throws SAXException
{
    // Handle elements not in any namespace
    if ("".equals (uri)) {
        // these only have "qName"
        if ("dolce".equals (qName)) {
            // ... handle "dolce"
        } else if ("vita".equals (qName)) {
            // ... handle "vita"
        // ... and all other supported "no namespace" elements
        } else
            error ("unrecognized element name: " + qName);

    // Then handle each supported namespace separately
    } else if ("http://www.example.com/namespaces/ns1".equals (uri)) {
        // these have a "localName" with no prefix
        if ("free".equals (localName)) {
            // ... handle "free"
        } else if ("open".equals (localName)) {
            // ... handle "open"
        // ... and all other supported NS1 elements
        } else
            error ("unrecognized NS1 element name: " + localName);

    // ... and similarly for all other supported element namespaces
    } else
        error ("unrecognized element namespace: " + uri);
}
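To see that dispatching idiom run against a real document, here is a small self-contained sketch. NameClassifier and classify() are made-up names, and instead of handling each element the sketch just records how it would be dispatched. The sample document deliberately binds two different prefixes to the same namespace URI.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical handler: records how each element name would be dispatched.
class NameClassifier extends DefaultHandler {
    static final String NS1 = "http://www.example.com/namespaces/ns1";
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("".equals(uri)) {
            // not in any namespace: only qName is reliable
            seen.add("no namespace: " + qName);
        } else if (NS1.equals(uri)) {
            // in NS1: use localName, never the prefix
            seen.add("ns1: " + localName);
        } else {
            seen.add("unrecognized namespace: " + uri);
        }
    }

    // Parses an in-memory document with namespace processing enabled.
    static List<String> classify(String xml) throws Exception {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setFeature("http://xml.org/sax/features/namespaces", true);
        NameClassifier handler = new NameClassifier();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<dolce xmlns:x='" + NS1 + "' xmlns:y='" + NS1 + "'>"
            + "<x:free/><y:open/><z:odd xmlns:z='urn:example:other'/>"
            + "</dolce>";
        for (String line : classify(xml)) {
            System.out.println(line);
        }
    }
}
```

Both x:free and y:open land in the NS1 branch, which is the point: the comparison is on the URI string, so the author's choice of prefix never matters.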
Attributes might not need that kind of handling. Applications
often "know about" particular attributes, access them by name, and
just ignore any unrecognized attributes. If you're accessing
attribute values in that way, just make sure you use the right naming
convention, either Attributes.getValue(uri,local) or
Attributes.getValue(qName), and you should have no
problems.
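As a rough illustration of the two naming conventions, the sketch below reads one attribute that's not in a namespace by its qName and one namespaced attribute by URI and local name. The pose element, the id and level attributes, and the AttributeLookup class are all invented for this example.

```java
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical handler: looks up two known attributes on "pose" elements.
class AttributeLookup extends DefaultHandler {
    static final String NS1 = "http://www.example.com/namespaces/ns1";
    String id;     // attribute that is not in any namespace
    String level;  // attribute in NS1

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("".equals(uri) && "pose".equals(qName)) {
            id = atts.getValue("id");            // lookup by qName
            level = atts.getValue(NS1, "level"); // lookup by URI and local name
            // Either value is null when the attribute is absent;
            // any other attributes are simply ignored.
        }
    }

    // Parses an in-memory document and returns the populated handler.
    static AttributeLookup read(String xml) throws Exception {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setFeature("http://xml.org/sax/features/namespaces", true);
        AttributeLookup handler = new AttributeLookup();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        AttributeLookup a = read(
            "<pose id='p1' n:level='hard' extra='x' xmlns:n='" + NS1 + "'/>");
        System.out.println("id=" + a.id + " level=" + a.level);
    }
}
```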
Otherwise you'll be scanning all of an element's attributes.
You'll need to check whether each attribute is in a namespace, just
like you checked whether its element was in a namespace. If it's not
in a namespace, you probably know a bit more about the attribute than
in the case of an element that's not in a namespace. It's either
going to be associated with that element's type or, if you've enabled
reporting of namespace prefixes, it'll be a namespace declaration.
(That's required by the Namespaces in XML specification, but
DOM and the XML Infoset have chosen instead to put such declarations
into a namespace.) Your code might look something like this:
Attributes atts = ...;
int length = atts.getLength ();

for (int i = 0; i < length; i++) {
    String uri = atts.getURI (i);

    if ("".equals (uri)) {
        String qName = atts.getQName (i);
        // ... then dispatch based on qName
        // including error based on unrecognized name
        // "xmlns" and "xmlns:*" declarations would appear here
    } else if ("http://www.example.com/namespaces/ns1".equals (uri)) {
        String localName = atts.getLocalName (i);
        // ... then dispatch based on "localName"
        // including error based on unrecognized name

    // ... and similarly for all other supported attribute namespaces
    } else
        error ("unrecognized attribute namespace: " + uri);
}
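The sketch below runs that scanning loop with reporting of namespace prefixes turned on, so declarations show up among the attributes. AttributeScanner and its four lists are hypothetical. As a precaution, anything with a URI the code doesn't recognize goes into a separate list, which also covers a parser configured to report declarations under a namespace URI of their own.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical handler: sorts every attribute it sees into one of four lists.
class AttributeScanner extends DefaultHandler {
    static final String NS1 = "http://www.example.com/namespaces/ns1";
    final List<String> plain = new ArrayList<>();        // no namespace, by qName
    final List<String> declarations = new ArrayList<>(); // xmlns and xmlns:*
    final List<String> ns1 = new ArrayList<>();          // in NS1, by localName
    final List<String> other = new ArrayList<>();        // unrecognized URIs

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        int length = atts.getLength();
        for (int i = 0; i < length; i++) {
            String attUri = atts.getURI(i);
            if ("".equals(attUri)) {
                String attName = atts.getQName(i);
                if ("xmlns".equals(attName) || attName.startsWith("xmlns:")) {
                    declarations.add(attName);
                } else {
                    plain.add(attName);
                }
            } else if (NS1.equals(attUri)) {
                ns1.add(atts.getLocalName(i));
            } else {
                other.add(attUri);
            }
        }
    }

    // Parses with namespace declarations reported as attributes.
    static AttributeScanner scan(String xml) throws Exception {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setFeature("http://xml.org/sax/features/namespaces", true);
        parser.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
        AttributeScanner handler = new AttributeScanner();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        AttributeScanner s = scan("<pose id='p1' n:level='hard' xmlns:n='" + NS1 + "'/>");
        System.out.println("plain=" + s.plain + " declarations=" + s.declarations
            + " ns1=" + s.ns1 + " other=" + s.other);
    }
}
```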
If your code uses idioms like those shown above, it'll be handling
namespaces correctly. Otherwise, you're likely to run into a document
or parser that confuses your code. Don't try to ignore namespaces
completely. If your code wants a simpler "pre-namespaces" view of the
world, at least make sure the namespace URI is always empty and report
errors for all elements and attributes where that's not true.
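A "pre-namespaces" guard along those lines might look like the following sketch. NoNamespaceGuard and accepts() are invented names, and a real handler would do its actual work after the checks. Note that this strict version also rejects attributes such as xml:lang, since they're in the XML namespace.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical handler: insists that no element or attribute is in a namespace.
class NoNamespaceGuard extends DefaultHandler {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (!"".equals(uri)) {
            throw new SAXException(
                "element in unexpected namespace: " + qName + " (" + uri + ")");
        }
        for (int i = 0; i < atts.getLength(); i++) {
            if (!"".equals(atts.getURI(i))) {
                throw new SAXException(
                    "attribute in unexpected namespace: " + atts.getQName(i));
            }
        }
        // ... safe to dispatch on qName alone from here on
    }

    // Returns false if the document uses namespaces or is not well formed.
    static boolean accepts(String xml) throws IOException {
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            parser.setFeature("http://xml.org/sax/features/namespaces", true);
            parser.setContentHandler(new NoNamespaceGuard());
            parser.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(accepts("<a b='1'><c/></a>"));         // prints "true"
        System.out.println(accepts("<a xmlns='urn:example:x'/>")); // prints "false"
    }
}
```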