The XMLPULL API
by Elliotte Rusty Harold
For a slightly more realistic example, consider an outliner program that reads through an XHTML document and prints out the contents of all the heading elements: h1, h2, h3, and so forth.
import org.xmlpull.v1.*;
import java.net.URL;
import java.io.IOException;

public class XHTMLOutliner {

  public static void main(String[] args) {

    if (args.length == 0) {
      System.err.println("Usage: java XHTMLOutliner url");
      return;
    }
    String input = args[0];

    try {
      XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
      XmlPullParser parser = factory.newPullParser();
      URL u = new URL(input);
      // A null encoding asks the parser to detect the encoding itself
      parser.setInput(u.openStream(), null);
      boolean inHeader = false;
      while (true) {
        int event = parser.next();
        if (event == XmlPullParser.START_TAG) {
          if (isHeader(parser.getName())) {
            inHeader = true;
          }
        }
        else if (event == XmlPullParser.END_TAG) {
          if (isHeader(parser.getName())) {
            inHeader = false;
            System.out.println();
          }
        }
        else if (event == XmlPullParser.TEXT) {
          if (inHeader) System.out.print(parser.getText());
        }
        else if (event == XmlPullParser.END_DOCUMENT) break;
      }
    }
    catch (XmlPullParserException e) {
      // Report errors on System.err so they don't mix with the outline
      System.err.println(e);
    }
    catch (IOException e) {
      System.err.println("IOException while parsing " + input);
    }

  }

  /**
   * Determine if this is an XHTML heading element or not
   * @param name the element name
   * @return true if this is h1, h2, h3, h4, h5, or h6; false
   *         otherwise
   */
  private static boolean isHeader(String name) {
    if (name.equals("h1")) return true;
    if (name.equals("h2")) return true;
    if (name.equals("h3")) return true;
    if (name.equals("h4")) return true;
    if (name.equals("h5")) return true;
    if (name.equals("h6")) return true;
    return false;
  }

}
This program has a couple of potential bugs in edge cases. First of all, it will fail if any headers are nested; for instance, if an h1 element contains an h2 element as in
<h1>This <h2>invalid</h2> example</h1>.
Technically this is invalid XHTML, but it is not malformed. You can
turn on validation for documents by passing true to the factory's
setValidating() method before instantiating the parser. While we're at it,
we should probably turn on namespace support too, using the
setNamespaceAware() method:
factory.setValidating(true);
factory.setNamespaceAware(true);
Unfortunately, neither of the currently available XMLPULL parsers can validate, so this doesn't actually work. They do support namespaces, though, surprisingly, namespace support is turned off by default.
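Since validation isn't available, the outliner has to cope with nested headings itself. One possible approach, sketched here, replaces the boolean inHeader flag with a depth counter. The HeadingTracker class and its method names are hypothetical, not part of XMLPULL. It is deliberately parser-independent: the START_TAG, END_TAG, and TEXT branches of the parsing loop would simply forward to startTag(), endTag(), and text(). The demo feeds it the events a pull parser would report for the nested example above.

```java
import java.util.ArrayList;
import java.util.List;

// Demo driver: replays the events for
// <h1>This <h2>invalid</h2> example</h1>
class HeadingTrackerDemo {
    public static void main(String[] args) {
        HeadingTracker tracker = new HeadingTracker();
        tracker.startTag("h1");
        tracker.text("This ");
        tracker.startTag("h2");
        tracker.text("invalid");
        tracker.endTag("h2");
        tracker.text(" example");
        tracker.endTag("h1");
        for (String heading : tracker.getHeadings()) {
            System.out.println(heading);
        }
    }
}

// Hypothetical helper, not part of XMLPULL: counts how many heading
// elements are currently open instead of keeping a single boolean.
class HeadingTracker {
    private int depth = 0;
    private final StringBuilder current = new StringBuilder();
    private final List<String> headings = new ArrayList<>();

    void startTag(String name) {
        if (isHeader(name)) depth++;
    }

    void text(String text) {
        // Collect text only while at least one heading is open
        if (depth > 0) current.append(text);
    }

    void endTag(String name) {
        if (!isHeader(name) || depth == 0) return;
        depth--;
        // The heading is complete only when the outermost one closes
        if (depth == 0) {
            headings.add(current.toString());
            current.setLength(0);
        }
    }

    List<String> getHeadings() {
        return headings;
    }

    // True for h1 through h6
    private static boolean isHeader(String name) {
        return name.length() == 2
            && name.charAt(0) == 'h'
            && name.charAt(1) >= '1'
            && name.charAt(1) <= '6';
    }
}
```

With the boolean flag, the text after the inner end-tag is lost; with the counter, the demo prints the single line "This invalid example". Folding the nested heading into its parent is a design choice made for this sketch. Reporting the invalid nesting as an error would be just as defensible.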
This simple example doesn't demonstrate the full power of the XMLPULL API. Since the client application controls the process, it's easy to write separate methods for different elements. These methods can have detailed knowledge of the internal structure of the type of element they handle. For example, we could write one method that handles headers, one that handles img elements, one that handles tables, one that handles meta tags, and so forth. For example, you might process an HTML document that contains a header and a body like this:
public void processHtml(XmlPullParser parser)
    throws XmlPullParserException, IOException {

  while (true) {
    int event = parser.nextToken();
    if (event == XmlPullParser.START_TAG) {
      if (parser.getName().equals("head")) processHead(parser);
      else if (parser.getName().equals("body")) processBody(parser);
    }
    else if (event == XmlPullParser.END_TAG) { // </html>
      return;
    }
  }

}
Here I'm making a lot of assumptions about exactly which tags show up where and when. This isn't unusual in XML processing. Most applications are designed with particular vocabularies in mind. You wouldn't expect an XHTML outliner to know what to do with a DocBook document, much less an SVG picture, for example. However, it is best to test and verify your expectations about data formats. Normally, this would be done through validation. Pull parsers don't yet support validation, but XMLPULL offers an alternative. If you expect a particular token to be present in the document, you can require it using a type and an optional name and namespace. For example, if I think that the current token is an XHTML <head> start-tag, I'd require it like this:
parser.require(XmlPullParser.START_TAG,
               "http://www.w3.org/1999/xhtml",
               "head");
If my expectation proves wrong, then the require() method
throws XmlPullParserException, a checked exception. You can
pass null for either the namespace or the element name to indicate that
all namespaces and/or names are acceptable.
We can improve the chances of this working by using the
nextTag() method instead of the nextToken()
method. nextTag() skips over comments, entity references,
processing instructions, whitespace-only text nodes, and other non-tag
nodes. However, it throws an XmlPullParserException if it
encounters non-whitespace text. Putting this all together, the
general pattern might be
try {
  parser.nextTag();
  parser.require(XmlPullParser.START_TAG,
                 "http://www.w3.org/1999/xhtml",
                 "head");
  processHead(parser);
}
catch (XmlPullParserException e) {
  // Oops! The head was missing!
}
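The matching rule that require() applies, as described above, can be sketched without a parser at all. This is only an illustration of the rule as I've described it, not the library's code: RequireSketch, requireToken(), and UnexpectedTokenException are hypothetical stand-ins, the constant values are arbitrary, and the current token is passed in explicitly rather than read from the parser.

```java
// Demo driver for the sketch below
class RequireDemo {
    public static void main(String[] args) {
        String xhtml = "http://www.w3.org/1999/xhtml";
        try {
            // Exact match on type, namespace, and name
            RequireSketch.requireToken(RequireSketch.START_TAG, xhtml, "head",
                                       RequireSketch.START_TAG, xhtml, "head");
            // null namespace and name accept any start-tag
            RequireSketch.requireToken(RequireSketch.START_TAG, null, null,
                                       RequireSketch.START_TAG, xhtml, "body");
            // Wrong name: this call throws
            RequireSketch.requireToken(RequireSketch.START_TAG, xhtml, "head",
                                       RequireSketch.START_TAG, xhtml, "body");
        }
        catch (UnexpectedTokenException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}

// Hypothetical stand-in for XmlPullParserException; checked, like the original
class UnexpectedTokenException extends Exception {
    UnexpectedTokenException(String message) {
        super(message);
    }
}

// Hypothetical illustration of the matching rule; not part of XMLPULL
class RequireSketch {
    // Arbitrary stand-in values for this sketch
    static final int START_TAG = 2;
    static final int END_TAG = 3;

    static void requireToken(int expectedType, String expectedNamespace,
                             String expectedName, int actualType,
                             String actualNamespace, String actualName)
            throws UnexpectedTokenException {
        // The type must always match
        boolean typeOk = expectedType == actualType;
        // A null expectation acts as a wildcard
        boolean namespaceOk = expectedNamespace == null
            || expectedNamespace.equals(actualNamespace);
        boolean nameOk = expectedName == null
            || expectedName.equals(actualName);
        if (!(typeOk && namespaceOk && nameOk)) {
            throw new UnexpectedTokenException(
                "expected {" + expectedNamespace + "}" + expectedName
                + " but found {" + actualNamespace + "}" + actualName);
        }
    }
}
```

The first two calls in the demo succeed silently. The third fails on the name and ends up in the catch block, which is the same shape as the try-catch pattern above.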
Summing Up
XMLPULL can be a fast, simple, and memory-thrifty means of loading data from an XML document whose structure is well known in advance. State management is much simpler in XMLPULL than in SAX, so if you find that the SAX logic is just getting way too complex to follow or debug, then XMLPULL might be a good alternative. However, because the existing XMLPULL parsers don't support validation, robustness requires adding a lot of validation code to the program that would not be necessary in the SAX or DOM equivalent. This is probably only worthwhile when the DOM equivalent program would use too much memory. Otherwise, a validating DOM program will be much more robust. The other thing that might indicate choosing XMLPULL over DOM would be a situation in which streaming was important; that is, you want to begin generating output from the input almost immediately without waiting for the entire document to be read.
However, in my opinion XMLPULL is not yet suitable as a general purpose Java API for processing XML. It should not be your first choice for most applications. In particular, XMLPULL has two major flaws:
The API does not model XML correctly.
The API is not object oriented.
These are two very big problems. With respect to XML, XMLPULL does not support namespaces by default and does not read or report well-formedness errors in the internal DTD subset. The namespace flaw can be fixed by setting the appropriate feature, and in theory the internal DTD subset problem can be as well. But the existing parsers don't support this. Furthermore, the defaults are exactly backwards from what they should be for both; and while there might rarely be justification for turning off namespace processing, turning off processing of the internal DTD subset is simply not allowed by the XML specification. A parser that does not read the internal DTD subset is not an XML parser.
The object problems are less fundamentally wrong but still extremely
troubling. XMLPULL has far too few classes. The prevalence of switch
statements and stacks of if-else-if blocks just to test the return type of
the nextToken() method is a classic symptom of failure to
take advantage of polymorphism. Another hint that something is seriously
wrong here is the number of state-dependent methods that only work when
the parser is positioned on a particular kind of token. Still another clue
is the use of int type constants instead of a class
hierarchy. The next(), nextTag(), and
nextToken() methods should all return instances of a common
Token superclass. Many methods in XmlPullParser could be
moved into this class. The whole API smells of procedural code and so
doesn't fit very well into object-oriented Java designs.
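To make the object-oriented alternative concrete, here is a rough sketch of what a Token hierarchy might look like. None of these types exist in XMLPULL; Token, StartTagToken, EndTagToken, TextToken, TokenHandler, and HeadingCollector are all names invented for this illustration. Each token class carries only the accessors that make sense for it, and dispatch happens through polymorphism rather than an if-else-if chain on int constants.

```java
// Demo driver: a hand-built token stream stands in for a parser
class TokenDemo {
    public static void main(String[] args) {
        Token[] tokens = {
            new StartTagToken("h1"), new TextToken("Summing Up"),
            new EndTagToken("h1"),
            new StartTagToken("p"), new TextToken("ignored"),
            new EndTagToken("p")
        };
        HeadingCollector collector = new HeadingCollector();
        for (Token token : tokens) {
            token.dispatch(collector); // no switch on a type code
        }
        System.out.print(collector.result());
    }
}

// Hypothetical common superclass for everything a pull parser returns
abstract class Token {
    abstract void dispatch(TokenHandler handler);
}

final class StartTagToken extends Token {
    private final String name;
    StartTagToken(String name) { this.name = name; }
    String getName() { return name; }
    void dispatch(TokenHandler handler) { handler.startTag(this); }
}

final class EndTagToken extends Token {
    private final String name;
    EndTagToken(String name) { this.name = name; }
    String getName() { return name; }
    void dispatch(TokenHandler handler) { handler.endTag(this); }
}

// Only text tokens have getText(), so it can't be called in the wrong state
final class TextToken extends Token {
    private final String text;
    TextToken(String text) { this.text = text; }
    String getText() { return text; }
    void dispatch(TokenHandler handler) { handler.text(this); }
}

// Clients override only the callbacks they care about
interface TokenHandler {
    default void startTag(StartTagToken tag) {}
    default void endTag(EndTagToken tag) {}
    default void text(TextToken text) {}
}

// Collects the text of h1 elements, one per line
class HeadingCollector implements TokenHandler {
    private final StringBuilder out = new StringBuilder();
    private boolean inHeading = false;

    @Override public void startTag(StartTagToken tag) {
        if (tag.getName().equals("h1")) inHeading = true;
    }

    @Override public void endTag(EndTagToken tag) {
        if (tag.getName().equals("h1")) {
            inHeading = false;
            out.append('\n');
        }
    }

    @Override public void text(TextToken text) {
        if (inHeading) out.append(text.getText());
    }

    String result() { return out.toString(); }
}
```

The cost of a design like this is one object per token plus several extra classes, which is exactly the footprint a J2ME-oriented API is trying to avoid.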
Regrettably, the XMLPULL designers seem very committed to the current API. These problems are not casual bugs. They are deliberate design decisions, based on a desire to reduce the footprint of XMLPULL to the minimum possible for J2ME environments. None of these problems are likely to be fixed in the future. The trade-offs made in the name of size may be acceptable if you're working in J2ME. They are completely unacceptable in a desktop or server environment. Thus XMLPULL seems destined to remain a niche API for developers seeking efficiency at all costs.
Nonetheless, there are some interesting ideas here. Most importantly, the problems I've identified stem from implementation issues, not from anything fundamental to a pull-based model for XML processing. A future pull-API that learned from XMLPULL's mistakes could easily become a real alternative to SAX and DOM.