XML.com: XML From the Inside Out

The XMLPULL API

August 14, 2002

Elliotte Rusty Harold is coauthor of XML in a Nutshell, 2nd Edition.

Most XML APIs are either event-based, like SAX and XNI, or tree-based, like DOM, JDOM, and dom4j. Most programmers find tree-based APIs easier to use, but they are less efficient, especially in memory usage. A typical in-memory tree is several times larger than the document it models, so these APIs are normally not practical for documents larger than a few megabytes or in memory-constrained environments. In these situations, a streaming API such as SAX or XNI is normally chosen. However, these APIs model the parser rather than the document. They push the content of the document to the client application as soon as they see it, whether the client is ready to receive that data or not. SAX and XNI are fast and efficient, but the patterns they require are unfamiliar and uncomfortable to many developers.
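To make the push model concrete, here is a minimal sketch that uses the SAX parser bundled with the JDK. The class name PushModelSketch and the method elementNames are made up for this illustration. Notice that the client never asks for anything; it registers a handler and the parser calls back whenever it encounters a start-tag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of SAX or XMLPULL
class PushModelSketch {

    // Collects element names in document order. The parser decides when
    // startElement() runs; the client can only react to each callback.
    static List<String> elementNames(String xml) throws Exception {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        // prints [doc, title, p]
        System.out.println(elementNames("<doc><title>Hi</title><p/></doc>"));
    }
}
```

Any state the client needs between callbacks, such as which element it is currently inside, has to be stored in fields of the handler. That bookkeeping is much of what makes the pattern feel awkward.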

XMLPULL is a new streaming API that, like SAX, can read arbitrarily large documents. However, as the name indicates, it is based on a pull model rather than a push model. In XMLPULL the client is in control rather than the parser. The application tells the parser when it wants to receive the next chunk of data, rather than the parser telling the client when the next chunk is available.

Like SAX, XMLPULL is an open source, parser-independent, pure Java API based on interfaces that can be implemented by multiple parsers. Currently there are two implementations, both free.

The API defines only one interface, one class, and one exception:

  • XmlPullParser: the interface that represents the parser

  • XmlPullParserFactory: the factory class that instantiates an implementation-dependent implementation of XmlPullParser

  • XmlPullParserException: the generic exception for everything other than an IOException that might go wrong when parsing an XML document, particularly well-formedness errors and tokens that don't have the expected type

Most XMLPULL programs begin by using the factory class to load a parser:


XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
XmlPullParser parser = factory.newPullParser();

If anything goes wrong with this, then an XmlPullParserException is thrown.

Next, the parser is pointed at a particular input stream with a certain encoding. For example,


URL u = new URL("http://www.cafeconleche.org/");
InputStream in = u.openStream();
parser.setInput(in, "ISO-8859-1");

If you don't know the encoding, you can pass null and the parser will try to guess it from the input stream based on the usual clues like the byte order mark and the encoding declaration.
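The setup steps above can be pulled together into one small sketch. It assumes that some XMLPULL implementation is on the classpath; the class name ParserSetupSketch and the method open are made up for this illustration. To stay runnable without a network connection, it reads from a string through the setInput(Reader) overload, which takes characters and therefore needs no encoding argument. The call to setNamespaceAware(true) is optional and is included only because namespace processing is off by default.

```java
import java.io.StringReader;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;

// Hypothetical illustration class; requires an XMLPULL implementation at run time
class ParserSetupSketch {

    // Creates a parser positioned at the very start of the given document text
    static XmlPullParser open(String xml) throws XmlPullParserException {
        XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
        factory.setNamespaceAware(true); // optional; the default is false
        XmlPullParser parser = factory.newPullParser();
        parser.setInput(new StringReader(xml)); // characters, so no encoding is passed
        return parser;
    }

    public static void main(String[] args) throws XmlPullParserException {
        XmlPullParser parser = open("<greeting>Hello</greeting>");
        // Until next() or nextToken() is called, the parser sits on START_DOCUMENT
        System.out.println(parser.getEventType() == XmlPullParser.START_DOCUMENT);
    }
}
```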

Now it's time to actually read the document. You can think of the XmlPullParser as an iterator across all the different tags, text nodes, and other information items in the XML document. You invoke its nextToken() method to advance from one token to the next, and then use various getter methods to extract data from that chunk. Some of the most important of these include:


getEventType()
getName()
getNamespace()
getPrefix()
getText()
getAttributeCount()
getAttributeName(int index)
getAttributeNamespace(int index)
getAttributePrefix(int index)
getAttributeType(int index)
getAttributeValue(int index)
getAttributeValue(String namespace, String name)

Not all of these methods work all the time. For instance, if the XmlPullParser is positioned on an end-tag then you can get the name, namespace, and prefix but not the attributes or the text. If the XmlPullParser is positioned on a text node, then you can get the text but not the name, namespace, prefix, or attributes. Text nodes just don't have these things. To find out what kind of node the parser is currently positioned on, you call the getEventType() method. This returns one of these eleven int constants:


XmlPullParser.START_DOCUMENT
XmlPullParser.CDSECT
XmlPullParser.COMMENT
XmlPullParser.DOCDECL
XmlPullParser.START_TAG
XmlPullParser.END_TAG
XmlPullParser.ENTITY_REF
XmlPullParser.IGNORABLE_WHITESPACE
XmlPullParser.PROCESSING_INSTRUCTION
XmlPullParser.TEXT 
XmlPullParser.END_DOCUMENT
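As a rough sketch of how these constants are used in practice, the following code walks a document with nextToken() and records the type of every token. It calls only the getters that are valid for the current token: getName() on tags and getText() on text. The class name TokenDumpSketch and the method describe are made up for this illustration, and an XMLPULL implementation is assumed to be on the classpath. XmlPullParser.TYPES is the array of event names the interface provides, indexed by the int constants.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;

// Hypothetical illustration class; requires an XMLPULL implementation at run time
class TokenDumpSketch {

    // Returns one line per token: the event name, plus the element name
    // for tags or the character data for text
    static String describe(String xml) throws XmlPullParserException, IOException {
        XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
        parser.setInput(new StringReader(xml));
        StringBuilder out = new StringBuilder();
        int event = parser.getEventType(); // START_DOCUMENT before the first advance
        while (event != XmlPullParser.END_DOCUMENT) {
            out.append(XmlPullParser.TYPES[event]);
            if (event == XmlPullParser.START_TAG || event == XmlPullParser.END_TAG) {
                out.append(' ').append(parser.getName());
            } else if (event == XmlPullParser.TEXT) {
                out.append(' ').append(parser.getText());
            }
            out.append('\n');
            event = parser.nextToken();
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(describe("<a>hi<!-- note --></a>"));
    }
}
```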

For example, here's a simple bit of code that iterates through an XML document and prints out the names of the different elements it encounters:


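// Pull tokens one at a time until the document ends
while (true) {
    int event = parser.nextToken();
    if (event == XmlPullParser.END_DOCUMENT) break;
    // Only start-tags are of interest here; every other token is ignored
    if (event == XmlPullParser.START_TAG) {
        System.out.println(parser.getName());
    }
}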

Here's the start of the output when I ran this across a simple well-formed HTML file:


html
head
title
meta
meta
script
body
div
...
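The same loop can be extended with the attribute getters listed earlier. The sketch below is one possible way to do it: whenever the parser is positioned on a start-tag, it walks the attributes by index. The class name AttributeDumpSketch, the method attributes, and the name/@attribute=value output format are all made up for this illustration, and an XMLPULL implementation is assumed to be on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;

// Hypothetical illustration class; requires an XMLPULL implementation at run time
class AttributeDumpSketch {

    // Lists every attribute in the document as element/@attribute=value
    static List<String> attributes(String xml) throws XmlPullParserException, IOException {
        XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
        parser.setInput(new StringReader(xml));
        List<String> found = new ArrayList<>();
        int event;
        while ((event = parser.nextToken()) != XmlPullParser.END_DOCUMENT) {
            // Attributes are only available while positioned on a start-tag
            if (event != XmlPullParser.START_TAG) continue;
            for (int i = 0; i < parser.getAttributeCount(); i++) {
                found.add(parser.getName() + "/@" + parser.getAttributeName(i)
                        + "=" + parser.getAttributeValue(i));
            }
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><meta name=\"a\" content=\"b\"/><body class=\"x\"/></html>";
        for (String line : attributes(xml)) {
            System.out.println(line);
        }
    }
}
```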

If you're only concerned with tags, text, and documents, you can use the next() method instead of nextToken(). This method silently skips all comments, processing instructions, document-type declarations, and ignorable white space. It merges CDATA sections and entities into their surrounding text. Unresolvable entities cause an XmlPullParserException. Thus, the kinds of events it reports are only START_DOCUMENT, START_TAG, END_TAG, TEXT, and END_DOCUMENT.
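Here is a small sketch of what that means for client code. Because next() skips comments and folds CDATA sections and resolved entity references into TEXT events, gathering the character data of a document needs only one test inside the loop. The class name TextOnlySketch and the method text are made up for this illustration, and an XMLPULL implementation is assumed to be on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;

// Hypothetical illustration class; requires an XMLPULL implementation at run time
class TextOnlySketch {

    // Concatenates all character data in the document, ignoring markup
    static String text(String xml) throws XmlPullParserException, IOException {
        XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
        parser.setInput(new StringReader(xml));
        StringBuilder sb = new StringBuilder();
        int event;
        while ((event = parser.next()) != XmlPullParser.END_DOCUMENT) {
            // next() never reports COMMENT, CDSECT, or ENTITY_REF,
            // so TEXT is the only content event to handle
            if (event == XmlPullParser.TEXT) {
                sb.append(parser.getText());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(text("<p>Tom &amp; <![CDATA[Jerry]]><!-- skipped --></p>"));
    }
}
```

The same document read with nextToken() would need separate branches for the entity reference, the CDATA section, and the comment.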
