The XMLPULL API
August 14, 2002
Elliotte Rusty Harold is coauthor of XML in a Nutshell, 2nd Edition.
Most XML APIs are either event-based, like SAX and XNI, or tree-based, like DOM, JDOM, and dom4j. Most programmers find tree-based APIs to be easier to use, but they are less efficient, especially when it comes to memory usage. A typical in-memory tree is several times larger than the document it models. These APIs are normally not practical for documents larger than a few megabytes in size or in memory-constrained environments. In these situations, a streaming API such as SAX or XNI is normally chosen. However, these APIs model the parser rather than the document. They push the content of the document to the client application as soon as they see it, whether the client is ready to receive that data or not. SAX and XNI are fast and efficient, but the patterns they require are unfamiliar and uncomfortable to many developers.
XMLPULL is a new streaming API that can read arbitrarily large documents like SAX. However, as the name indicates, it is based on a pull model rather than a push model. In XMLPULL the client is in control rather than the parser. The application tells the parser when it wants to receive the next data chunk rather than the parser telling the client when the next chunk of data is available.
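The difference in control flow can be sketched without any XML library at all. In the toy example below, every name is hypothetical and none of it is the XMLPULL API: a push-style source drives a callback until the input is exhausted, whereas a pull-style client asks for one item at a time and is free to stop early.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy illustration only: hypothetical names, not part of XMLPULL.
class PushVersusPull {

    /** Push style: the source calls the handler once for every item. */
    interface Handler {
        void item(String token);
    }

    static void push(List<String> tokens, Handler handler) {
        for (String token : tokens) {
            handler.item(token); // the source decides when the client gets data
        }
    }

    /** Pull style: the client asks for items and stops when it has what it needs. */
    static List<String> pullUntil(Iterator<String> source, String stopAt) {
        List<String> seen = new ArrayList<String>();
        while (source.hasNext()) {
            String token = source.next(); // the client decides when to advance
            if (token.equals(stopAt)) break;
            seen.add(token);
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("html", "head", "title", "body", "div");

        final List<String> pushed = new ArrayList<String>();
        push(tokens, new Handler() {
            public void item(String token) { pushed.add(token); }
        });
        System.out.println("pushed: " + pushed);

        // The pull client stops at "body" and never looks at "div".
        System.out.println("pulled: " + pullUntil(tokens.iterator(), "body"));
    }
}
```

In the push version the handler sees every token whether it wants it or not; in the pull version the loop belongs to the client, which is the property the rest of this article relies on.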
Like SAX, XMLPULL is an open source, parser independent pure Java API based on interfaces that can be implemented by multiple parsers. Currently there are two implementations, both free:
The API defines only one class, one interface, and one exception:
- XmlPullParser: an interface that represents the parser
- XmlPullParserFactory: the factory class that instantiates an implementation-dependent implementation of XmlPullParser
- XmlPullParserException: the generic class for everything other than an IOException that might go wrong when parsing an XML document, particularly well-formedness errors and tokens that don't have the expected type
Most XMLPULL programs begin by using the factory class to load a parser:
    XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
    XmlPullParser parser = factory.newPullParser();
If anything goes wrong with this, then an XmlPullParserException is thrown.
Next, the parser is pointed at a particular input stream with a certain encoding. For example,
    URL u = new URL("http://www.cafeconleche.org/");
    InputStream in = u.openStream();
    parser.setInput(in, "ISO-8859-1");
If you don't know the encoding, you can pass null and the parser will try to guess it from the input stream based on the usual clues like the byte order mark and the encoding declaration.
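As a rough illustration of the first of those clues, the sketch below inspects the leading bytes of a document for a byte order mark. It is an assumption about how such guessing could work, not the algorithm of any particular XMLPULL parser, and the class and method names are hypothetical. It only covers UTF-8 and UTF-16; a real parser would also go on to read the encoding declaration when no byte order mark is present.

```java
// Hypothetical helper, not part of XMLPULL.
class BomSniffer {

    /**
     * Returns the encoding implied by a leading byte order mark,
     * or null if the bytes do not start with a UTF-8 or UTF-16 mark.
     */
    static String fromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null; // no mark: fall back to other clues
    }

    public static void main(String[] args) {
        byte[] withMark = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?' };
        byte[] withoutMark = { '<', '?', 'x', 'm', 'l' };
        System.out.println(fromBom(withMark));
        System.out.println(fromBom(withoutMark));
    }
}
```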
Now it's time to actually read the document. You can think of the XmlPullParser as an iterator across all the different tags, text nodes, and other information items in the XML document. You invoke its nextToken() method to advance from one token to the next, and then use various getter methods to extract data from that chunk. Some of the most important of these include:
    getEventType()
    getName()
    getNamespace()
    getPrefix()
    getText()
    getAttributeCount()
    getAttributeName(int index)
    getAttributeNamespace(int index)
    getAttributePrefix(int index)
    getAttributeType(int index)
    getAttributeValue(int index)
    getAttributeValue(String namespace, String name)
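The index-based attribute getters are meant to be used together in a loop. Here is a minimal sketch of that pattern. So that it compiles without the XMLPULL jar, it declares a small stand-in interface (AttributeSource, a hypothetical name) carrying three of the getter signatures listed above; with the real API you would declare the parameter as an XmlPullParser positioned on a start-tag.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AttributeDump {

    // Stand-in with three of the getter signatures listed above, declared
    // here only so the sketch compiles on its own. Not part of XMLPULL.
    interface AttributeSource {
        int getAttributeCount();
        String getAttributeName(int index);
        String getAttributeValue(int index);
    }

    /** Copies the attributes of the current start-tag into an ordered map. */
    static Map<String, String> attributes(AttributeSource source) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        for (int i = 0; i < source.getAttributeCount(); i++) {
            result.put(source.getAttributeName(i), source.getAttributeValue(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Fake start-tag standing in for <meta name="author" content="ERH">
        final String[][] attrs = { { "name", "author" }, { "content", "ERH" } };
        AttributeSource fake = new AttributeSource() {
            public int getAttributeCount() { return attrs.length; }
            public String getAttributeName(int index) { return attrs[index][0]; }
            public String getAttributeValue(int index) { return attrs[index][1]; }
        };
        System.out.println(attributes(fake));
    }
}
```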
Not all of these methods work all the time. For instance, if the XmlPullParser is positioned on an end-tag then you can get the name, namespace, and prefix but not the attributes or the text. If the XmlPullParser is positioned on a text node, then you can get the text but not the name, namespace, prefix, or attributes. Text nodes just don't have these things. To find out what kind of node the parser is currently positioned on, you call the getEventType() method. This returns one of these eleven int constants:
    XmlPullParser.START_DOCUMENT
    XmlPullParser.CDSECT
    XmlPullParser.COMMENT
    XmlPullParser.DOCDECL
    XmlPullParser.START_TAG
    XmlPullParser.END_TAG
    XmlPullParser.ENTITY_REF
    XmlPullParser.IGNORABLE_WHITESPACE
    XmlPullParser.PROCESSING_INSTRUCTION
    XmlPullParser.TEXT
    XmlPullParser.END_DOCUMENT
For example, here's a simple bit of code that iterates through an XML document and prints out the names of the different elements it encounters:
    while (true) {
      int event = parser.next();
      if (event == XmlPullParser.END_DOCUMENT) break;
      if (event == XmlPullParser.START_TAG) {
        System.out.println(parser.getName());
      }
    }
Here's the start of the output when I ran this across a simple well-formed HTML file:
    html
    head
    title
    meta
    meta
    script
    body
    div
    ...
If you're only concerned with tags, text, and documents, you can use the next() method instead of nextToken(). This method silently skips all comments, processing instructions, document-type declarations, and ignorable white space. It merges CDATA sections and entities into their surrounding text. Unresolvable entities cause an XmlPullParserException. Thus, the kinds of events it reports are only START_DOCUMENT, START_TAG, END_TAG, TEXT, and END_DOCUMENT.
For a slightly more realistic example, consider an outliner program that reads through an XHTML document and prints out the contents of all the heading elements: h1, h2, h3, and so forth.
    import org.xmlpull.v1.*;
    import java.net.URL;
    import java.io.IOException;

    public class XHTMLOutliner {

      public static void main(String[] args) {

        if (args.length == 0) {
          System.err.println("Usage: java XHTMLOutliner url");
          return;
        }
        String input = args[0];

        try {
          XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
          XmlPullParser parser = factory.newPullParser();
          URL u = new URL(input);
          // Passing null lets the parser guess the encoding
          parser.setInput(u.openStream(), null);

          boolean inHeader = false;
          while (true) {
            int event = parser.next();
            if (event == XmlPullParser.START_TAG) {
              if (isHeader(parser.getName())) {
                inHeader = true;
              }
            }
            else if (event == XmlPullParser.END_TAG) {
              if (isHeader(parser.getName())) {
                inHeader = false;
                System.out.println();
              }
            }
            else if (event == XmlPullParser.TEXT) {
              if (inHeader) System.out.print(parser.getText());
            }
            else if (event == XmlPullParser.END_DOCUMENT) break;
          }
        }
        catch (XmlPullParserException e) {
          System.out.println(e);
        }
        catch (IOException e) {
          System.out.println("IOException while parsing " + input);
        }

      }

      /**
       * Determine if this is an XHTML heading element or not
       * @param name the tag name
       * @return true if this is h1, h2, h3, h4, h5, or h6; false
       *         otherwise
       */
      private static boolean isHeader(String name) {
        if (name.equals("h1")) return true;
        if (name.equals("h2")) return true;
        if (name.equals("h3")) return true;
        if (name.equals("h4")) return true;
        if (name.equals("h5")) return true;
        if (name.equals("h6")) return true;
        return false;
      }

    }
This program has a couple of potential bugs in edge cases. First of all, it will fail if any headers are nested; for instance, if an h1 element contains an h2 element as in <h1>This <h2>invalid</h2> example</h1>. Technically this is invalid XHTML, but it is not malformed. You can turn on validation for documents by passing true to the factory's setValidating() method before instantiating the parser. While we're at it, we should probably turn on namespace support too, using the setNamespaceAware() method:
    factory.setValidating(true);
    factory.setNamespaceAware(true);
Unfortunately, neither of the currently available XMLPULL parsers can validate, so this doesn't actually work. They do support namespaces, though surprisingly namespace support is turned off by default.
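Since validation can't be relied on to rule out nested headings, one way to make the outliner tolerate them is to count open heading elements instead of keeping a single boolean flag. The following is a minimal sketch of that idea, not part of the original program: HeadingCollector and its methods are hypothetical names, and the class is deliberately independent of the parser so it can be exercised on its own. The event loop would call startTag(), text(), and endTag() from its START_TAG, TEXT, and END_TAG branches.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; feed it events from the pull loop.
class HeadingCollector {

    private int depth = 0; // number of heading elements currently open
    private final StringBuilder current = new StringBuilder();
    private final List<String> headings = new ArrayList<String>();

    /** True for h1 through h6. */
    static boolean isHeader(String name) {
        return name.length() == 2
            && name.charAt(0) == 'h'
            && name.charAt(1) >= '1' && name.charAt(1) <= '6';
    }

    void startTag(String name) {
        if (isHeader(name)) depth++;
    }

    void text(String text) {
        if (depth > 0) current.append(text);
    }

    void endTag(String name) {
        // Only the end-tag that closes the outermost heading finishes a line.
        if (isHeader(name) && depth > 0 && --depth == 0) {
            headings.add(current.toString());
            current.setLength(0);
        }
    }

    List<String> headings() {
        return headings;
    }

    public static void main(String[] args) {
        // Events for <h1>This <h2>invalid</h2> example</h1>
        HeadingCollector c = new HeadingCollector();
        c.startTag("h1");
        c.text("This ");
        c.startTag("h2");
        c.text("invalid");
        c.endTag("h2");
        c.text(" example");
        c.endTag("h1");
        System.out.println(c.headings());
    }
}
```

With the boolean flag, the text after the inner end-tag would be dropped because the flag is cleared too early; the counter keeps collecting until the outermost heading closes.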
This simple example doesn't demonstrate the full power of the XMLPULL API. Since the client application controls the process, it's easy to write separate methods for different elements. These methods can have detailed knowledge of the internal structure of the type of element they handle. For example, we could write one method that handles headers, one that handles img elements, one that handles tables, one that handles meta tags, and so forth. You might process an HTML document that contains a header and a body like this:
    public void processHtml(XmlPullParser parser)
        throws XmlPullParserException, IOException {
      while (true) {
        int event = parser.nextToken();
        if (event == XmlPullParser.START_TAG) {
          if (parser.getName().equals("head")) processHead(parser);
          else if (parser.getName().equals("body")) processBody(parser);
        }
        else if (event == XmlPullParser.END_TAG) { // </html>
          return;
        }
      }
    }
Here I'm making a lot of assumptions about exactly which tags show up where and when. This isn't unusual in XML processing. Most applications are designed with particular vocabularies in mind. You wouldn't expect an XHTML outliner to know what to do with a DocBook document, much less an SVG picture, for example. However, it is best to test and verify your expectations about data formats. Normally, this would be done through validation. Pull parsers don't yet support validation, but XMLPULL offers an alternative. If you expect a particular token to be present in the document, you can require it using a type and an optional name and namespace. For example, if I think that the current token is an XHTML <head> start-tag, I'd require it thusly:
parser.require(XmlPullParser.START_TAG, "http://www.w3.org/1999/xhtml", "head");
If my expectation proves wrong, then the require() method throws XmlPullParserException, a checked exception. You can pass null for either the namespace or the element name to indicate that all namespaces and/or names are acceptable.
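The rule that require() applies can be restated as a small stand-alone predicate. This is a sketch of the behavior described above rather than the library's own code: the class name, the method name, and the stand-in constant values are assumptions made for illustration.

```java
// Hypothetical restatement of the require() check; not the library's code.
class RequireSketch {

    // Stand-in event codes for illustration; real code would use the
    // constants defined on XmlPullParser.
    static final int START_TAG = 2;
    static final int END_TAG = 3;

    /**
     * True if the actual token satisfies the expectation. A null expected
     * namespace or name means "anything is acceptable".
     */
    static boolean matches(int expectedType, String expectedNamespace, String expectedName,
                           int actualType, String actualNamespace, String actualName) {
        if (expectedType != actualType) return false;
        if (expectedNamespace != null && !expectedNamespace.equals(actualNamespace)) return false;
        if (expectedName != null && !expectedName.equals(actualName)) return false;
        return true;
    }

    public static void main(String[] args) {
        String xhtml = "http://www.w3.org/1999/xhtml";
        // Exact expectation met
        System.out.println(matches(START_TAG, xhtml, "head", START_TAG, xhtml, "head"));
        // Wrong element name
        System.out.println(matches(START_TAG, xhtml, "head", START_TAG, xhtml, "body"));
        // Null namespace and name accept any start-tag
        System.out.println(matches(START_TAG, null, null, START_TAG, "", "svg"));
    }
}
```

Where this predicate returns false, the real require() method throws an XmlPullParserException instead.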
We can improve the chances of this working by using the nextTag() method instead of the nextToken() method. nextTag() skips over comments, entity references, processing instructions, whitespace-only text nodes, and other non-tag nodes. It does throw an XmlPullParserException if it encounters unexpected non-whitespace text. Putting this all together, the general pattern might be
    try {
      parser.nextTag();
      parser.require(XmlPullParser.START_TAG, "http://www.w3.org/1999/xhtml", "head");
      processHead(parser);
    }
    catch (XmlPullParserException e) {
      // Oops! The head was missing!
    }
Summing Up
XMLPULL can be a fast, simple, and memory-thrifty means of loading data from an XML document whose structure is well known in advance. State management is much simpler in XMLPULL than in SAX, so if you find that the SAX logic is just getting way too complex to follow or debug, then XMLPULL might be a good alternative. However, because the existing XMLPULL parsers don't support validation, robustness requires adding a lot of validation code to the program that would not be necessary in the SAX or DOM equivalent. This is probably only worthwhile when the DOM equivalent program would use too much memory. Otherwise, a validating DOM program will be much more robust. The other thing that might indicate choosing XMLPULL over DOM would be a situation in which streaming was important; that is, you want to begin generating output from the input almost immediately without waiting for the entire document to be read.
However, in my opinion XMLPULL is not yet suitable as a general purpose Java API for processing XML. It should not be your first choice for most applications. In particular, XMLPULL has two major flaws:
- The API does not model XML correctly.
- The API is not object oriented.
These are two very big problems. With respect to XML, XMLPULL does not support namespaces by default and does not read or report well-formedness errors in the internal DTD subset. The namespace flaw can be fixed by setting the appropriate feature, and in theory the internal DTD subset problem can be as well. But the existing parsers don't support this. Furthermore, the defaults are exactly backwards from what they should be for both; and while there might rarely be justification for turning off namespace processing, turning off processing of the internal DTD subset is simply not allowed by the XML specification. A parser that does not read the internal DTD subset is not an XML parser.
The object problems are less fundamentally wrong but still extremely troubling. XMLPULL has far too few classes. The prevalence of switch statements and stacks of if-else-if blocks just to test the return value of the nextToken() method is a classic symptom of failure to take advantage of polymorphism. Another hint that something is seriously wrong here is the number of state-dependent methods that only work when the parser is positioned on a particular kind of token. Still another clue is the use of int type constants instead of a class hierarchy. The next(), nextTag(), and nextToken() methods should all return instances of a common Token superclass. Many methods in XmlPullParser could be moved into this class. The whole API smells of procedural code and so doesn't fit very well into object-oriented Java designs.
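To make the suggestion concrete, here is one shape such a hierarchy might take. This is purely a design sketch: none of these classes exist in XMLPULL, and the names and the describe() method are invented for illustration. The point is that getName() lives only on the tag classes and getText() only on the text class, so the state-dependent getters disappear, and client code dispatches through an overridden method rather than testing an int code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical design sketch; none of these classes exist in XMLPULL.
class TokenSketch {

    /** Common superclass that a next()-style method could return. */
    static abstract class Token {
        abstract String describe();
    }

    static final class StartTag extends Token {
        private final String name;
        StartTag(String name) { this.name = name; }
        String getName() { return name; } // only tags have names
        String describe() { return "<" + name + ">"; }
    }

    static final class EndTag extends Token {
        private final String name;
        EndTag(String name) { this.name = name; }
        String getName() { return name; }
        String describe() { return "</" + name + ">"; }
    }

    static final class Text extends Token {
        private final String text;
        Text(String text) { this.text = text; }
        String getText() { return text; } // only text nodes have text
        String describe() { return text; }
    }

    /** Client code needs no switch or if-else chain on a type code. */
    static String render(List<Token> tokens) {
        StringBuilder out = new StringBuilder();
        for (Token token : tokens) {
            out.append(token.describe());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.<Token>asList(
            new StartTag("h1"), new Text("Summing Up"), new EndTag("h1"));
        System.out.println(render(tokens));
    }
}
```

The cost is an object allocation per token, which is the kind of footprint concern that comes up in the next paragraph.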
Regrettably, the XMLPULL designers seem very committed to the current API. These problems are not casual bugs. They are deliberate design decisions, based on a desire to reduce the footprint of XMLPULL to the minimum possible for J2ME environments. None of these problems are likely to be fixed in the future. The trade-offs made in the name of size may be acceptable if you're working in J2ME. They are completely unacceptable in a desktop or server environment. Thus XMLPULL seems destined to remain a niche API for developers seeking efficiency at all costs.
Nonetheless, there are some interesting ideas here. Most importantly, the problems I've identified stem from implementation issues, not from anything fundamental to a pull-based model for XML processing. A future pull-API that learned from XMLPULL's mistakes could easily become a real alternative to SAX and DOM.