The XMLPULL API
Elliotte Rusty Harold is coauthor of XML in a Nutshell, 2nd Edition.
Most XML APIs are either event-based, like SAX and XNI, or tree-based, like DOM, JDOM, and dom4j. Most programmers find tree-based APIs easier to use, but they are less efficient, especially when it comes to memory usage. A typical in-memory tree is several times larger than the document it models. These APIs are normally not practical for documents larger than a few megabytes or in memory-constrained environments. In these situations, a streaming API such as SAX or XNI is normally chosen. However, these APIs model the parser rather than the document. They push the content of the document to the client application as soon as they see it, whether the client is ready to receive that data or not. SAX and XNI are fast and efficient, but the patterns they require are unfamiliar and uncomfortable to many developers.
XMLPULL is a new streaming API that, like SAX, can read arbitrarily large documents. However, as the name indicates, it is based on a pull model rather than a push model. In XMLPULL the client is in control rather than the parser. The application tells the parser when it wants to receive the next chunk of data, rather than the parser telling the client when the next chunk is available.
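To make the push model concrete, here is a minimal sketch that uses the SAX parser bundled with the JDK. The class name PushDemo, the method name elementNames, and the sample document are illustrative assumptions, not part of any API. Notice that the application never asks for the next element; it only registers a callback and the parser decides when to invoke it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: collects element names through SAX callbacks.
public class PushDemo {

    public static List<String> elementNames(String xml) throws Exception {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // The parser calls this whenever it sees a start-tag;
                // the client cannot ask it to wait.
                names.add(qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><head><title>t</title></head><body/></html>";
        System.out.println(elementNames(xml));
    }
}
```

In a pull API the same task becomes an ordinary loop in the client's own code, which is the style the rest of this article uses.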
Like SAX, XMLPULL is an open source, parser-independent, pure Java API based on interfaces that can be implemented by multiple parsers. Currently there are two implementations, both free:
The API defines only one class, one interface, and one exception:
XmlPullParser: an interface that represents the parser
XmlPullParserFactory: the factory class that instantiates an implementation-dependent implementation of XmlPullParser
XmlPullParserException: the generic class for everything other than an IOException that might go wrong when parsing an XML document, particularly well-formedness errors and tokens that don't have the expected type
Most XMLPULL programs begin by using the factory class to load a parser:
XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
XmlPullParser parser = factory.newPullParser();
If anything goes wrong with this, then an XmlPullParserException is thrown.
Next, the parser is pointed at a particular input stream with a certain encoding. For example,
URL u = new URL("http://www.cafeconleche.org/");
InputStream in = u.openStream();
parser.setInput(in, "ISO-8859-1");
If you don't know the encoding, you can pass null and the parser will try to guess it from the input stream based on the usual clues like the byte order mark and the encoding declaration.
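The setup steps above can be put together roughly as follows. This is a sketch that assumes an XMLPULL implementation JAR is on the classpath so that XmlPullParserFactory.newInstance() can locate a parser. The class name SetupDemo and the method name rootName are made up for illustration, and the input is an in-memory byte array rather than a URL so the example does not depend on the network.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;

// Hypothetical demo class: loads a parser, points it at a stream,
// and reports the name of the root element.
public class SetupDemo {

    public static String rootName(InputStream in, String encoding)
            throws XmlPullParserException, IOException {
        XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
        XmlPullParser parser = factory.newPullParser();
        // Pass null for the encoding to let the parser try to detect it.
        parser.setInput(in, encoding);

        // Advance until the first start-tag, or give up at end of document.
        int event = parser.next();
        while (event != XmlPullParser.START_TAG
                && event != XmlPullParser.END_DOCUMENT) {
            event = parser.next();
        }
        return event == XmlPullParser.START_TAG ? parser.getName() : null;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "<doc><item/></doc>".getBytes(StandardCharsets.UTF_8);
        System.out.println(rootName(new ByteArrayInputStream(data), "UTF-8"));
    }
}
```

Both checked exceptions are simply declared here; a real application would probably catch XmlPullParserException separately, since it signals a malformed document rather than an I/O failure.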
Now it's time to actually read the document. You can think of the XmlPullParser as an iterator across all the different tags, text nodes, and other information items in the XML document. You invoke its nextToken() method to advance from one token to the next, and then use various getter methods to extract data from that chunk. Some of the most important of these include:
getEventType()
getName()
getNamespace()
getPrefix()
getText()
getAttributeCount()
getAttributeName(int index)
getAttributeNamespace(int index)
getAttributePrefix(int index)
getAttributeType(int index)
getAttributeValue(int index)
getAttributeValue(String namespace, String name)
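As a rough illustration of the attribute getters listed above, the following sketch reads the attributes of the first start-tag in a document, both by index and by name. It assumes an XMLPULL implementation is on the classpath and that namespace processing is left at the factory default (off), in which case the namespace argument to getAttributeValue(String, String) is null. The class name AttributeDemo and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;

// Hypothetical demo class: inspects attributes on the first start-tag.
public class AttributeDemo {

    // Positions a fresh parser on the first start-tag, if there is one.
    private static XmlPullParser atFirstTag(String xml)
            throws XmlPullParserException, IOException {
        XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
        parser.setInput(new StringReader(xml));
        int event = parser.next();
        while (event != XmlPullParser.START_TAG
                && event != XmlPullParser.END_DOCUMENT) {
            event = parser.next();
        }
        return parser;
    }

    // Lists every attribute of the first start-tag as "name=value".
    public static List<String> attributes(String xml)
            throws XmlPullParserException, IOException {
        XmlPullParser parser = atFirstTag(xml);
        List<String> result = new ArrayList<>();
        if (parser.getEventType() == XmlPullParser.START_TAG) {
            for (int i = 0; i < parser.getAttributeCount(); i++) {
                result.add(parser.getAttributeName(i) + "="
                        + parser.getAttributeValue(i));
            }
        }
        return result;
    }

    // Looks up a single attribute by name; returns null if it is absent.
    public static String attribute(String xml, String name)
            throws XmlPullParserException, IOException {
        XmlPullParser parser = atFirstTag(xml);
        if (parser.getEventType() != XmlPullParser.START_TAG) return null;
        return parser.getAttributeValue(null, name);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a href=\"http://www.cafeconleche.org/\" id=\"home\"/>";
        System.out.println(attributes(xml));
        System.out.println(attribute(xml, "id"));
    }
}
```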
Not all of these methods work all the time. For instance, if the XmlPullParser is positioned on an end-tag then you can get the name, namespace, and prefix but not the attributes or the text. If the XmlPullParser is positioned on a text node, then you can get the text but not the name, namespace, prefix, or attributes. Text nodes just don't have these things. To find out what kind of node the parser is currently positioned on, you call the getEventType() method. This returns one of these eleven int constants:
XmlPullParser.START_DOCUMENT
XmlPullParser.CDSECT
XmlPullParser.COMMENT
XmlPullParser.DOCDECL
XmlPullParser.START_TAG
XmlPullParser.END_TAG
XmlPullParser.ENTITY_REF
XmlPullParser.IGNORABLE_WHITESPACE
XmlPullParser.PROCESSING_INSTRUCTION
XmlPullParser.TEXT
XmlPullParser.END_DOCUMENT
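One way to use these constants is to switch on the event type and call only the getters that make sense for that kind of token. The sketch below does this with nextToken(), so comments and other low-level tokens are reported too. It assumes an XMLPULL implementation is on the classpath. The class name TokenDemo, the method name describe, and the labels it produces are invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;

// Hypothetical demo class: labels each token reported by nextToken().
public class TokenDemo {

    public static List<String> describe(String xml)
            throws XmlPullParserException, IOException {
        XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
        parser.setInput(new StringReader(xml));
        List<String> out = new ArrayList<>();

        int event = parser.nextToken();
        while (event != XmlPullParser.END_DOCUMENT) {
            switch (event) {
                case XmlPullParser.START_TAG:
                    // Names are available on start-tags and end-tags.
                    out.add("start " + parser.getName());
                    break;
                case XmlPullParser.END_TAG:
                    out.add("end " + parser.getName());
                    break;
                case XmlPullParser.TEXT:
                    // Text tokens have content but no name.
                    out.add("text " + parser.getText());
                    break;
                case XmlPullParser.COMMENT:
                    out.add("comment");
                    break;
                default:
                    // CDSECT, DOCDECL, ENTITY_REF, and so on.
                    out.add("other");
                    break;
            }
            event = parser.nextToken();
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        for (String line : describe("<a><!-- note --><b>hi</b></a>")) {
            System.out.println(line);
        }
    }
}
```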
For example, here's a simple bit of code that iterates through an XML document and prints out the names of the different elements it encounters:
while (true) {
    int event = parser.next();
    if (event == XmlPullParser.END_DOCUMENT) break;
    if (event == XmlPullParser.START_TAG) {
        System.out.println(parser.getName());
    }
}
Here's the start of the output when I ran this across a simple well-formed HTML file:
html
head
title
meta
meta
script
body
div
...
If you're only concerned with tags, text, and documents, you can use the next() method instead of nextToken(). This method silently skips all comments, processing instructions, document-type declarations, and ignorable white space. It merges CDATA sections and entities into their surrounding text. Unresolvable entities cause an XmlPullParserException. Thus, the kinds of events it reports are only START_DOCUMENT, START_TAG, END_TAG, TEXT, and END_DOCUMENT.
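The merging behavior of next() can be seen in a small sketch like the one below, which concatenates every TEXT event in a document whose content mixes plain text, a comment, a CDATA section, and a predefined entity reference. It assumes an XMLPULL implementation is on the classpath. The class name MergedTextDemo and the method name textOf are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;

// Hypothetical demo class: gathers the text that next() reports.
public class MergedTextDemo {

    // Concatenates the content of every TEXT event in the document.
    public static String textOf(String xml)
            throws XmlPullParserException, IOException {
        XmlPullParser parser = XmlPullParserFactory.newInstance().newPullParser();
        parser.setInput(new StringReader(xml));
        StringBuilder text = new StringBuilder();

        int event = parser.next();
        while (event != XmlPullParser.END_DOCUMENT) {
            if (event == XmlPullParser.TEXT) {
                // CDATA content and resolved entities arrive as ordinary text;
                // the comment never shows up as an event at all.
                text.append(parser.getText());
            }
            event = parser.next();
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a>x<!-- skipped --><![CDATA[y]]>&amp;z</a>";
        System.out.println(textOf(xml));
    }
}
```

With nextToken() the same document would instead be reported piece by piece, so code that needs the comment or needs to know which characters came from a CDATA section has to use the lower-level method.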
