The XMLPULL API

August 14, 2002

Elliotte Rusty Harold

Elliotte Rusty Harold is coauthor of XML in a Nutshell, 2nd Edition.

Most XML APIs are either event-based, like SAX and XNI, or tree-based, like DOM, JDOM, and dom4j. Most programmers find tree-based APIs easier to use, but they are less efficient, especially when it comes to memory usage. A typical in-memory tree is several times larger than the document it models. These APIs are normally not practical for documents larger than a few megabytes or in memory-constrained environments. In these situations, a streaming API such as SAX or XNI is normally chosen. However, these APIs model the parser rather than the document. They push the content of the document to the client application as soon as they see it, whether the client is ready to receive that data or not. SAX and XNI are fast and efficient, but the patterns they require are unfamiliar and uncomfortable to many developers.

XMLPULL is a new streaming API that, like SAX, can read arbitrarily large documents. However, as the name indicates, it is based on a pull model rather than a push model. In XMLPULL the client is in control rather than the parser. The application tells the parser when it wants to receive the next chunk of data, rather than the parser telling the client when the next chunk is available.
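
The difference between the two models can be sketched without any XML at all. What follows is a toy illustration, not XMLPULL or SAX code: the Handler interface and the pushAll and pullUntil methods are hypothetical names I've invented for this sketch, and the "tokens" are plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy types for illustration only; nothing here belongs to XMLPULL or SAX.
class PushPullSketch {

    /** Push style: the producer calls back into the client for every token. */
    interface Handler {
        void handle(String token);
    }

    static void pushAll(List<String> tokens, Handler handler) {
        for (String token : tokens) {
            handler.handle(token); // the producer decides when the client sees data
        }
    }

    /** Pull style: the client asks for each token and may stop whenever it likes. */
    static List<String> pullUntil(Iterator<String> tokens, String stopAt) {
        List<String> seen = new ArrayList<>();
        while (tokens.hasNext()) {
            String token = tokens.next(); // the client decides when to advance
            if (token.equals(stopAt)) break;
            seen.add(token);
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("<a>", "text", "</a>", "<b>", "</b>");

        List<String> pushed = new ArrayList<>();
        pushAll(tokens, pushed::add);      // receives every token, wanted or not
        System.out.println(pushed);

        System.out.println(pullUntil(tokens.iterator(), "<b>")); // stops early
    }
}
```

In the pull version the loop lives in the client, so stopping early or handing the iterator to another method requires no extra state.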

Like SAX, XMLPULL is an open source, parser-independent, pure Java API based on interfaces that can be implemented by multiple parsers. Currently there are two implementations, both free:

The API defines only one class, one interface, and one exception:

  • XmlPullParser: the interface that represents the parser

  • XmlPullParserFactory: the factory class that instantiates a parser-specific implementation of XmlPullParser

  • XmlPullParserException: the generic exception class for everything other than an IOException that might go wrong when parsing an XML document, particularly well-formedness errors and tokens that don't have the expected type

Most XMLPULL programs begin by using the factory class to load a parser:


XmlPullParserFactory factory = XmlPullParserFactory.newInstance();

XmlPullParser parser = factory.newPullParser();

If anything goes wrong with this, then an XmlPullParserException is thrown.

Next, the parser is pointed at a particular input stream with a certain encoding. For example,


URL u = new URL("http://www.cafeconleche.org/");

InputStream in = u.openStream();

parser.setInput(in, "ISO-8859-1");

If you don't know the encoding, you can pass null and the parser will try to guess it from the input stream based on the usual clues like the byte order mark and the encoding declaration.
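
One of those clues, the byte order mark, is easy to illustrate. The following is a minimal sketch of BOM sniffing only. The guessFromBom helper is a hypothetical name of my own, not part of XMLPULL, and I'm not claiming any particular parser works this way; a real parser also examines the encoding declaration and recognizes more encodings than the three shown here (UTF-32, for instance, is ignored).

```java
// Hypothetical helper for illustration; not part of the XMLPULL API.
class BomSniffer {

    /**
     * Guesses an encoding name from the first bytes of a stream.
     * Returns null when no byte order mark is recognized, in which case
     * a parser would have to fall back on other clues.
     */
    static String guessFromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null; // no BOM: look at the encoding declaration instead
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?'};
        byte[] plain = {'<', '?', 'x', 'm', 'l'};
        System.out.println(guessFromBom(utf8));
        System.out.println(guessFromBom(plain));
    }
}
```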

Now it's time to actually read the document. You can think of the XmlPullParser as an iterator across all the different tags, text nodes, and other information items in the XML document. You invoke its nextToken() method to advance from one token to the next, and then use various getter methods to extract data from that chunk. Some of the most important of these include:


getEventType()

getName()

getNamespace()

getPrefix()

getText()

getAttributeCount()

getAttributeName(int index)

getAttributeNamespace(int index)

getAttributePrefix(int index)

getAttributeType(int index)

getAttributeValue(int index)

getAttributeValue(String namespace, String name)

Not all of these methods work all the time. For instance, if the XmlPullParser is positioned on an end-tag then you can get the name, namespace, and prefix but not the attributes or the text. If the XmlPullParser is positioned on a text node, then you can get the text but not the name, namespace, prefix, or attributes. Text nodes just don't have these things. To find out what kind of node the parser is currently positioned on, you call the getEventType() method. This returns one of these eleven int constants:


XmlPullParser.START_DOCUMENT

XmlPullParser.CDSECT

XmlPullParser.COMMENT

XmlPullParser.DOCDECL

XmlPullParser.START_TAG

XmlPullParser.END_TAG

XmlPullParser.ENTITY_REF

XmlPullParser.IGNORABLE_WHITESPACE

XmlPullParser.PROCESSING_INSTRUCTION

XmlPullParser.TEXT 

XmlPullParser.END_DOCUMENT

For example, here's a simple bit of code that iterates through an XML document and prints out the names of the different elements it encounters:


while (true) {

    int event = parser.nextToken();

    if (event == XmlPullParser.END_DOCUMENT) break;

    if (event == XmlPullParser.START_TAG) {

        System.out.println(parser.getName());

    }

} 

Here's the start of the output when I ran this across a simple well-formed HTML file:


html

head

title

meta

meta

script

body

div

...

If you're only concerned with tags, text, and documents, you can use the next() method instead of nextToken(). This method silently skips all comments, processing instructions, document-type declarations, and ignorable white space. It merges CDATA sections and entities into their surrounding text. Unresolvable entities cause an XmlPullParserException. Thus, the kinds of events it reports are only START_DOCUMENT, START_TAG, END_TAG, TEXT, and END_DOCUMENT.

For a slightly more realistic example, consider an outliner program that reads through an XHTML document and prints out the contents of all the heading elements: h1, h2, h3, and so forth.




import org.xmlpull.v1.*;

import java.net.URL;

import java.io.IOException;

 

public class XHTMLOutliner {



  public static void main(String[] args) {

		

    if (args.length == 0) {

      System.err.println("Usage: java XHTMLOutliner url");

      return;

    }

    String input = args[0];

		

    try {

      XmlPullParserFactory factory = XmlPullParserFactory.newInstance();

      XmlPullParser parser = factory.newPullParser();



      URL u = new URL(input);

      parser.setInput(u.openStream(), null);

        

      boolean inHeader = false;

      while (true) {

        int event = parser.next();

        if (event == XmlPullParser.START_TAG) {

          if (isHeader(parser.getName())) {

            inHeader = true;

          }

        }

        else if (event == XmlPullParser.END_TAG) {

          if (isHeader(parser.getName())) {

            inHeader = false;

            System.out.println();

          }

        }

        else if (event == XmlPullParser.TEXT) {

          if (inHeader) System.out.print(parser.getText());

        }

        else if (event == XmlPullParser.END_DOCUMENT) break;

      }

    }

    catch (XmlPullParserException e) {

      System.err.println(e);

    }

    catch (IOException e) {

      System.err.println("IOException while parsing " + input);

    }

		

  }



  /**

   * Determine if this is an XHTML heading element or not

   * @param name the tag name

   * @return true if this is h1, h2, h3, h4, h5, or h6; false otherwise

   */

  private static boolean isHeader(String name) {

    if (name.equals("h1")) return true;

    if (name.equals("h2")) return true;

    if (name.equals("h3")) return true;

    if (name.equals("h4")) return true;

    if (name.equals("h5")) return true;

    if (name.equals("h6")) return true;

    return false;

  }



}

This program has a couple of potential bugs in edge cases. First of all, it will fail if any headers are nested; for instance, if an h1 element contains an h2 element as in


<h1>This <h2>invalid</h2> example</h1>. 

Technically this is invalid XHTML, but it is not malformed. You can turn on validation for documents by passing true to the factory's setValidating() method before instantiating the parser. While we're at it, we should probably turn on namespace support too, using the setNamespaceAware() method:


factory.setValidating(true);

factory.setNamespaceAware(true);

Unfortunately, neither of the currently available XMLPULL parsers can validate, so this doesn't actually work. They do support namespaces, though surprisingly namespace support is turned off by default.
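
Since validation isn't available, the nested-heading problem has to be handled in the program itself. One possible approach is to replace the boolean flag with a depth counter. The sketch below is an assumption-laden stand-in rather than XMLPULL code: so that it runs with nothing but the JDK, it uses javax.xml.stream (StAX), a separate pull API bundled with Java, and its method and constant names differ from XMLPULL's. The NestedHeadingOutliner class and its outline method are hypothetical names of my own.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Sketch only: uses the JDK's StAX pull parser as a stand-in for XMLPULL.
class NestedHeadingOutliner {

    static boolean isHeader(String name) {
        return name.length() == 2 && name.charAt(0) == 'h'
            && name.charAt(1) >= '1' && name.charAt(1) <= '6';
    }

    /** Collects heading text, one line per outermost heading element. */
    static String outline(String xml) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // how many heading elements are currently open
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (isHeader(reader.getLocalName())) depth++;
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (isHeader(reader.getLocalName())) {
                        depth--;
                        // end the line only when the outermost heading closes
                        if (depth == 0) out.append('\n');
                    }
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    if (depth > 0) out.append(reader.getText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed input", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(outline("<h1>This <h2>invalid</h2> example</h1>"));
    }
}
```

With a boolean flag, the inner end-tag would switch the flag off and the trailing text of the outer heading would be lost. The counter keeps collecting text until the outermost heading closes.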

This simple example doesn't demonstrate the full power of the XMLPULL API. Since the client application controls the process, it's easy to write separate methods for different elements. These methods can have detailed knowledge of the internal structure of the type of element they handle. For example, we could write one method that handles headers, one that handles img elements, one that handles tables, one that handles meta tags, and so forth. For example, you might process an HTML document that contains a header and a body like this:


public void processHtml(XmlPullParser parser)

    throws XmlPullParserException, IOException {

  while (true) {

    int event = parser.nextToken();

    if (event == XmlPullParser.START_TAG) {

      if (parser.getName().equals("head")) processHead(parser);

      else if (parser.getName().equals("body")) processBody(parser);

    }

    else if (event == XmlPullParser.END_TAG) { // </html>

      return;

    }

  }

}

Here I'm making a lot of assumptions about exactly which tags show up where and when. This isn't unusual in XML processing. Most applications are designed with particular vocabularies in mind. You wouldn't expect an XHTML outliner to know what to do with a DocBook document, much less an SVG picture, for example. However, it is best to test and verify your expectations about data formats. Normally, this would be done through validation. Pull parsers don't yet support validation, but XMLPULL offers an alternative. If you expect a particular token to be present in the document, you can require it using a type and an optional name and namespace. For example, if I think that the current token is an XHTML <head> start-tag, I'd require it thusly:


parser.require(XmlPullParser.START_TAG, 

               "http://www.w3.org/1999/xhtml", 

               "head");

If my expectation proves wrong, then the require() method throws XmlPullParserException, a checked exception. You can pass null for either the namespace or the element name to indicate that all namespaces and/or names are acceptable.
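
The matching rule just described, where null acts as a wildcard, can be written out in a few lines. This is only a sketch of the rule as the text states it, not the source of any XMLPULL parser. The RequireSketch class, its matches method, and the two constants are hypothetical, and the constant values are arbitrary.

```java
// Hypothetical illustration of the require() matching rule; not XMLPULL code.
class RequireSketch {

    // Stand-ins for event type constants; the values are arbitrary.
    static final int START_TAG = 1;
    static final int END_TAG = 2;

    /**
     * Returns true if the actual token satisfies the expectation.
     * A null expected namespace or name accepts any namespace or name.
     */
    static boolean matches(int expectedType, String expectedNamespace, String expectedName,
                           int actualType, String actualNamespace, String actualName) {
        if (expectedType != actualType) return false;
        if (expectedNamespace != null && !expectedNamespace.equals(actualNamespace)) return false;
        if (expectedName != null && !expectedName.equals(actualName)) return false;
        return true;
    }

    public static void main(String[] args) {
        String xhtml = "http://www.w3.org/1999/xhtml";
        // exact match on type, namespace, and name
        System.out.println(matches(START_TAG, xhtml, "head", START_TAG, xhtml, "head"));
        // null name: any start-tag in the XHTML namespace is acceptable
        System.out.println(matches(START_TAG, xhtml, null, START_TAG, xhtml, "body"));
        // wrong token type
        System.out.println(matches(START_TAG, xhtml, "head", END_TAG, xhtml, "head"));
    }
}
```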

We can improve the chances of this working by using the nextTag() method instead of the nextToken() method. nextTag() skips over comments, entity references, processing instructions, whitespace-only text nodes, and other non-tag nodes. It does throw an XmlPullParserException if it encounters unexpected non-whitespace text. Putting this all together, the general pattern might be


try {

  parser.nextTag();

  parser.require(XmlPullParser.START_TAG, 

                 "http://www.w3.org/1999/xhtml", 

                 "head");

  processHead(parser);

}

catch (XmlPullParserException e) {

  // Oops! The head was missing!

}
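
For readers who want to try this pattern end to end without installing an XMLPULL parser, here is a hedged, runnable approximation. It uses javax.xml.stream (StAX), a different pull API that ships with the JDK and happens to offer similarly named nextTag() and require() methods; it is a stand-in, not XMLPULL. The HeadTitleReader class and its readTitle and processHead methods are hypothetical names, and the assumption that the head begins with a title element is mine.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Sketch only: the JDK's StAX reader stands in for an XMLPULL parser.
class HeadTitleReader {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Returns the title text, or null if the expected structure is missing. */
    static String readTitle(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            reader.nextTag(); // expect <html>
            reader.require(XMLStreamConstants.START_ELEMENT, XHTML, "html");
            reader.nextTag(); // expect <head>
            reader.require(XMLStreamConstants.START_ELEMENT, XHTML, "head");
            return processHead(reader);
        } catch (XMLStreamException e) {
            return null; // Oops! Something we required was missing
        }
    }

    /** Knows the internal structure of head: assumes title comes first. */
    static String processHead(XMLStreamReader reader) throws XMLStreamException {
        reader.nextTag(); // expect <title>
        reader.require(XMLStreamConstants.START_ELEMENT, XHTML, "title");
        return reader.getElementText();
    }

    public static void main(String[] args) {
        String doc = "<html xmlns=\"" + XHTML + "\">\n"
                   + "  <head><title>Hi</title></head>\n"
                   + "  <body/>\n"
                   + "</html>";
        System.out.println(readTitle(doc));
    }
}
```

Each require() call doubles as documentation of what the method expects, which is about as close as this style gets to validation.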

Summing Up

XMLPULL can be a fast, simple, and memory-thrifty means of loading data from an XML document whose structure is well known in advance. State management is much simpler in XMLPULL than in SAX, so if you find that the SAX logic is just getting way too complex to follow or debug, then XMLPULL might be a good alternative. However, because the existing XMLPULL parsers don't support validation, robustness requires adding a lot of validation code to the program that would not be necessary in the SAX or DOM equivalent. This is probably only worthwhile when the DOM equivalent program would use too much memory. Otherwise, a validating DOM program will be much more robust. The other thing that might indicate choosing XMLPULL over DOM would be a situation in which streaming was important; that is, you want to begin generating output from the input almost immediately without waiting for the entire document to be read.

However, in my opinion XMLPULL is not yet suitable as a general purpose Java API for processing XML. It should not be your first choice for most applications. In particular, XMLPULL has two major flaws:

  1. The API does not model XML correctly.

  2. The API is not object oriented.

These are two very big problems. With respect to XML, XMLPULL does not support namespaces by default and does not read or report well-formedness errors in the internal DTD subset. The namespace flaw can be fixed by setting the appropriate feature, and in theory the internal DTD subset problem can be as well. But the existing parsers don't support this. Furthermore, the defaults are exactly backwards from what they should be for both; and while there might rarely be justification for turning off namespace processing, turning off processing of the internal DTD subset is simply not allowed by the XML specification. A parser that does not read the internal DTD subset is not an XML parser.

The object problems are less fundamentally wrong but still extremely troubling. XMLPULL has far too few classes. The prevalence of switch statements and stacks of if-else-if blocks just to test the return value of the nextToken() method is a classic symptom of failure to take advantage of polymorphism. Another hint that something is seriously wrong here is the number of state-dependent methods that only work when the parser is positioned on a particular kind of token. Still another clue is the use of int type constants instead of a class hierarchy. The next(), nextTag(), and nextToken() methods should all return instances of a common Token superclass. Many methods in XmlPullParser could be moved into this class. The whole API smells of procedural code and so doesn't fit very well into object-oriented Java designs.
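
To make the suggestion concrete, here is a rough sketch of what such a hierarchy might look like. Everything in it is hypothetical: XMLPULL defines no Token class, and the TokenSketch, StartTag, EndTag, and Text names and the describe() method are inventions for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical design sketch; XMLPULL defines none of these classes.
class TokenSketch {

    /** Common superclass that a token-returning next() method could use. */
    abstract static class Token {
        abstract String describe();
    }

    static final class StartTag extends Token {
        private final String name;
        StartTag(String name) { this.name = name; }
        String getName() { return name; }   // only tags have names
        @Override String describe() { return "start of " + name; }
    }

    static final class EndTag extends Token {
        private final String name;
        EndTag(String name) { this.name = name; }
        String getName() { return name; }
        @Override String describe() { return "end of " + name; }
    }

    static final class Text extends Token {
        private final String text;
        Text(String text) { this.text = text; }
        String getText() { return text; }   // only text nodes have text
        @Override String describe() { return "text \"" + text + "\""; }
    }

    /** No switch or if-else chain: each token knows how to describe itself. */
    static String describeAll(List<Token> tokens) {
        StringBuilder out = new StringBuilder();
        for (Token token : tokens) {
            out.append(token.describe()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new StartTag("h1"), new Text("Hello"), new EndTag("h1"));
        System.out.print(describeAll(tokens));
    }
}
```

Because getName() exists only on the tag classes and getText() only on Text, calling a getter in the wrong state becomes a compile-time error rather than a runtime surprise. The cost is more classes and more allocation, which is presumably the trade-off a footprint-conscious design wants to avoid.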

Regrettably, the XMLPULL designers seem very committed to the current API. These problems are not casual bugs. They are deliberate design decisions, based on a desire to reduce the footprint of XMLPULL to the minimum possible for J2ME environments. None of these problems are likely to be fixed in the future. The trade-offs made in the name of size may be acceptable if you're working in J2ME. They are completely unacceptable in a desktop or server environment. Thus XMLPULL seems destined to remain a niche API for developers seeking efficiency at all costs.

Nonetheless, there are some interesting ideas here. Most importantly, the problems I've identified stem from implementation issues, not from anything fundamental to a pull-based model for XML processing. A future pull-API that learned from XMLPULL's mistakes could easily become a real alternative to SAX and DOM.