Pull Parsing in C# and Java
In my first
article in this series, I wrote about porting a SAX application
called RSSReader to the new Microsoft .NET Framework
XmlReader. After publication, I received a message from
Chris Lovett of Microsoft suggesting I revisit the subject. As he
said, while the code I presented works, my approach was not optimal
for the .NET Framework; I was still thinking in terms of SAX
event-driven state machinery. A much easier way to approach this
problem is to take advantage of the fact that XmlReader does not
make you think this way and, thus, to write a recursive-descent RSS
transformer as outlined below.
Based on Chris' suggestions, I've also made some other changes,
including changing the output mechanism to use the
XmlTextWriter, which will take care of generating
well-formed XHTML on the output side.
And following all that, in a reversal of our usual process, I'll port this code back to Java.
Here then, without further ado, is the new RSSReader, optimized for C#. I've given the entire listing here, followed by an explanation.
using System;
using System.IO;
using System.Xml;
using System.Net;

public class RSSReader {

    public static void Main(string [] args) {
        // create an instance of RSSReader
        RSSReader rssreader = new RSSReader();
        try {
            string url = args[0];
            XmlTextWriter writer = new XmlTextWriter(Console.Out);
            writer.Formatting = Formatting.Indented;
            HttpWebRequest wr = (HttpWebRequest)WebRequest.Create(url);
            WebResponse resp = wr.GetResponse();
            Stream stream = resp.GetResponseStream();
            XmlTextReader reader = new XmlTextReader(stream);
            reader.XmlResolver = null; // ignore the DTD
            reader.WhitespaceHandling = WhitespaceHandling.None;
            rssreader.RSSToHtml(reader, writer);
        } catch (XmlException e) {
            Console.WriteLine(e.Message);
        }
    }

    public void RSSToHtml(XmlReader reader, XmlWriter writer) {
        reader.MoveToContent();
        if (reader.Name == "rss") {
            writer.WriteStartElement("html");
            while (reader.Read() &&
                   reader.NodeType != XmlNodeType.EndElement) {
                switch (reader.LocalName) {
                    case "channel":
                        ChannelToHtml(reader, writer);
                        break;
                    case "item":
                        ItemToHtml(reader, writer);
                        break;
                    default: // ignore image and textinput.
                        break;
                }
            }
            writer.WriteEndElement();
        } else {
            // not an RSS document!
        }
    }

    void ChannelToHtml(XmlReader reader, XmlWriter writer) {
        writer.WriteStartElement("head");
        // scan header elements and pick out the title.
        reader.Read();
        while (reader.Name != "item" &&
               reader.NodeType != XmlNodeType.EndElement) {
            if (reader.Name == "title") {
                writer.WriteNode(reader, true); // copy node to output.
            } else {
                reader.Skip();
            }
        }
        writer.WriteEndElement();
        writer.WriteStartElement("body");
        // transform the items.
        while (reader.NodeType != XmlNodeType.EndElement) {
            if (reader.Name == "item") {
                ItemToHtml(reader, writer);
            }
            if (!reader.Read())
                break;
        }
        writer.WriteEndElement();
    }

    void ItemToHtml(XmlReader reader, XmlWriter writer) {
        writer.WriteStartElement("p");
        string title = null, link = null, description = null;
        while (reader.Read() &&
               reader.NodeType != XmlNodeType.EndElement) {
            switch (reader.Name) {
                case "title":
                    title = reader.ReadString();
                    break;
                case "link":
                    link = reader.ReadString();
                    break;
                case "description":
                    description = reader.ReadString();
                    break;
            }
        }
        writer.WriteStartElement("a");
        writer.WriteAttributeString("href", link);
        writer.WriteString(title);
        writer.WriteEndElement();
        writer.WriteStartElement("br");
        writer.WriteEndElement();
        writer.WriteString(description);
        writer.WriteEndElement(); // end the "p" element
    }
}
The Main entry point to the new RSSReader uses the
System.Net classes directly to set up a
WebRequest. You also see the XmlTextWriter
being constructed, with indenting turned on so we get nicely readable
output. Then the XmlReader and XmlWriter
become arguments to a recursive-descent RSS parser; the top-level
method is called RSSToHtml().
The top level RSSToHtml() method first checks that we
really have an RSS file, by checking the root element name.
MoveToContent() is a convenient way of skipping the XML
prolog and going right to the top level element in the document. If
the XML document used namespaces, then we'd also want to match on the
NamespaceURI property; however, this particular XML
document doesn't use namespaces. If we find an
<rss> element, then we read the contents, calling
ChannelToHtml() when we find a
<channel> element and calling
ItemToHtml() when we find an <item>
element. Any other element is skipped. This is all wrapped in the
XmlWriter call to write the root level
<html> output element.
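For readers following along in Java, the same root check can be sketched with StAX, the javax.xml.stream API that ships with the JDK. This is an illustration only: the class name RootCheck and the method isRoot are hypothetical, and StAX is swapped in here because it needs no extra libraries. The loop plays the role of MoveToContent(), and the comparison shows where a namespace URI test would go.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the article's RSSReader.
final class RootCheck {

    // True when the first element of the document has the given namespace
    // URI and local name. Pass "" for an element in no namespace.
    static boolean isRoot(String xml, String namespaceUri, String localName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            // Rough analogue of MoveToContent(): step over comments,
            // processing instructions and the like until a start tag appears.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String uri = reader.getNamespaceURI();
                    if (uri == null) {
                        uri = ""; // "no namespace" may be reported as null
                    }
                    return uri.equals(namespaceUri)
                            && reader.getLocalName().equals(localName);
                }
            }
            return false; // no element found
        } catch (XMLStreamException e) {
            throw new IllegalStateException("not well-formed XML", e);
        }
    }
}
```

Calling RootCheck.isRoot(xml, "", "rss") mirrors the reader.Name == "rss" test above; a feed that did use namespaces would pass its URI as the second argument.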
The ChannelToHtml() method does two things: it writes
out the HTML head element containing a <title>
element, then it writes out the HTML body. Notice here we can simply
use the XmlWriter.WriteNode() method which copies the
<title> element from the input reader to the output,
since an HTML <title> is exactly the same as an RSS
one. The HTML head element terminates when we reach the first child
<item> element or the </channel>
EndElement token. In the HTML body we look for
<item> elements and call
ItemToHtml().
The ItemToHtml() method writes out an HTML
<p> tag, then reads the <title>,
<link> and <description> elements out
of the input. These input tags could arrive in any order, which is
why we have to read them all before we can write the output. Once we
have them we can write the <a> tag, with
an href attribute equal to the <link>
element, and content equal to the <title>, followed by
an empty <br> element and the description.
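The read-everything-then-write pattern is not specific to .NET. Here is a minimal sketch of it using the JDK's StAX reader and writer (javax.xml.stream); ItemSketch and itemToHtml are hypothetical names, and the sketch assumes the children of <item> contain text only. It collects the three fields in whatever order they arrive and emits the paragraph afterwards.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of the article's RSSReader.
final class ItemSketch {

    // Turns one <item> element into a <p> holding a link, a <br> and text.
    static String itemToHtml(String itemXml) {
        try {
            XMLStreamReader in = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(itemXml));
            String title = "", link = "", description = "";
            in.nextTag(); // position on <item>
            // Collect the children in whatever order they arrive.
            while (in.nextTag() == XMLStreamConstants.START_ELEMENT) {
                String name = in.getLocalName();
                String text = in.getElementText(); // stops on the end tag
                if (name.equals("title")) {
                    title = text;
                } else if (name.equals("link")) {
                    link = text;
                } else if (name.equals("description")) {
                    description = text;
                }
            }
            // Only now is everything known that the output needs.
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance()
                    .createXMLStreamWriter(out);
            w.writeStartElement("p");
            w.writeStartElement("a");
            w.writeAttribute("href", link);
            w.writeCharacters(title);
            w.writeEndElement(); // </a>
            w.writeEmptyElement("br");
            w.writeCharacters(description);
            w.writeEndElement(); // </p>
            w.flush();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("unexpected item content", e);
        }
    }
}
```

Feeding it an item whose <description> comes before its <title> still puts the link first in the output, which is the point of buffering the three values.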
All in all, it seems like a much simpler way to deal with converting RSS to HTML. .NET's built-in XML parser is pretty neat.
But pull parsers are not unique to the .NET world. The Java Community Process is currently working on a standard called StAX, the Streaming API for XML. This nascent API is, in turn, based upon several vendors' pull parser implementations, notably Apache's Xerces XNI, BEA's XML Stream API, XML Pull Parser 2, PullDOM (for Python), and, yes, Microsoft's XmlReader.
So how would we implement this same program with yet another pull parser API, the Common API for XML Pull Parsing, or XmlPull? Let's take a look.
package com.xml;

import java.io.*;
import java.net.*;
import java.util.*;
import com.alexandriasc.xml.XMLWriter;
import org.xmlpull.v1.*;

public class RSSReader {

    public static void main(String [] args) {
        // create an instance of RSSReader
        RSSReader rssreader = new RSSReader();
        XMLWriter writer = null;
        try {
            String url = args[0];
            writer = new XMLWriter(new OutputStreamWriter(System.out),false);
            XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
            XmlPullParser parser = factory.newPullParser();
            InputStreamReader stream = new InputStreamReader(
                new URL(url).openStream());
            parser.setInput(stream);
            parser.setFeature(XmlPullParser.FEATURE_PROCESS_DOCDECL,false);
            rssreader.RSSToHtml(parser, writer);
        } catch (Exception e) {
            e.printStackTrace(System.err);
        } finally {
            try {
                // the writer is still null if we failed before creating it
                if (writer != null) {
                    writer.flush();
                }
            } catch (IOException io) {
                io.printStackTrace(System.err);
            }
        }
    }

    public void RSSToHtml(XmlPullParser parser, XMLWriter writer)
            throws IOException, XmlPullParserException {
        // equivalent to XmlReader.MoveToContent(): advance to the first
        // start tag; getName() is only meaningful once we are on a tag
        int event = parser.next();
        while (event != XmlPullParser.START_TAG
                && event != XmlPullParser.END_DOCUMENT) {
            event = parser.next();
        }
        if (event == XmlPullParser.START_TAG
                && parser.getName().equals("rss")) {
            writer.beginElement("html");
            do {
                parser.next();
                if (parser.getEventType() == XmlPullParser.START_TAG
                        && parser.getName().equals("channel")) {
                    ChannelToHtml(parser, writer);
                } else if (parser.getEventType() == XmlPullParser.START_TAG
                        && parser.getName().equals("item")) {
                    ItemToHtml(parser, writer);
                }
            } while (parser.getEventType() != XmlPullParser.END_DOCUMENT);
            writer.endElement();
        } else {
            // not an RSS document!
        }
    }

    void ChannelToHtml(XmlPullParser parser, XMLWriter writer)
            throws IOException, XmlPullParserException {
        writer.beginElement("head");
        // scan header elements and pick out the title.
        while (!(parser.next() == XmlPullParser.END_TAG
                && parser.getName().equals("channel"))) {
            if (parser.getEventType() == XmlPullParser.START_TAG) {
                do {
                    if (parser.getEventType() == XmlPullParser.START_TAG
                            && parser.getName().equals("title")) {
                        while (parser.next() != XmlPullParser.END_TAG) {
                            if (parser.getEventType() == XmlPullParser.TEXT) {
                                writer.writeElement("title",null,parser.getText());
                                break;
                            }
                        }
                        break;
                    }
                } while (parser.next() != XmlPullParser.END_TAG);
                break;
            }
        }
        writer.endElement();
        writer.beginElement("body");
        // transform the items.
        do {
            if (parser.getEventType() == XmlPullParser.START_TAG
                    && parser.getName().equals("item")) {
                ItemToHtml(parser, writer);
            }
            parser.next();
        } while (parser.getEventType() != XmlPullParser.END_DOCUMENT);
        writer.endElement();
    }

    void ItemToHtml(XmlPullParser parser, XMLWriter writer)
            throws IOException, XmlPullParserException {
        writer.beginElement("p");
        String title = null, link = null, description = null;
        while (parser.next() != XmlPullParser.END_DOCUMENT
                && parser.getEventType() != XmlPullParser.END_TAG) {
            if (parser.getEventType() == XmlPullParser.START_TAG
                    && parser.getName().equals("title")) {
                if (parser.next() == XmlPullParser.TEXT)
                    title = parser.readText();
            } else if (parser.getEventType() == XmlPullParser.START_TAG
                    && parser.getName().equals("link")) {
                if (parser.next() == XmlPullParser.TEXT)
                    link = parser.readText();
            } else if (parser.getEventType() == XmlPullParser.START_TAG
                    && parser.getName().equals("description")) {
                if (parser.next() == XmlPullParser.TEXT)
                    description = parser.readText();
            }
        }
        HashMap attributes = new HashMap(1);
        attributes.put("href", link);
        writer.beginElement("a",attributes);
        writer.write(title);
        writer.endElement();
        writer.writeEmptyElement("br");
        writer.write(description);
        writer.endElement(); // end the "p" element
    }
}
Most of our port was the reverse of our previous
ports; for example, changing Console.Out to
System.out, making method names start with lowercase
letters, adding explicit throws clauses. The real meat of
this port is in two areas.
First, we're using XmlPullParser as a rough equivalent
of XmlTextReader. One difference is that while we are
able to instantiate an XmlTextReader directly in C#
(remember, Microsoft is a one-stop shop), we have to use the Java
XmlPullParserFactory to get a concrete implementation of
the XmlPullParser interface. This should be a familiar
exercise for anyone who's used JAXP or, for that matter, JDBC.
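The factory idiom looks much the same in StAX, the javax.xml.stream API bundled with the JDK. The sketch below is illustrative only (FactorySketch and elementNames are hypothetical names): the factory hands back whatever concrete XMLStreamReader is configured, and the code pulls events from it without ever naming the implementation class.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the article's RSSReader.
final class FactorySketch {

    // Lists the local names of all elements in document order.
    static List<String> elementNames(String xml) {
        try {
            // The factory selects the concrete parser; we see the interface.
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            List<String> names = new ArrayList<>();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            return names;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("not well-formed XML", e);
        }
    }
}
```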
Once we have the parser, most of the method name equivalencies are
obvious. Remember that in C# the == operator works just
fine for strings, but in Java you must use the
.equals() method; otherwise you'll be comparing object
references rather than their values, not at all what we want to
do. Also, as of this writing you can't use a String as the expression
in a switch...case statement in Java, so we've turned those
into an if...else structure.
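A small, self-contained illustration of both points follows. The class NameDispatch and its handlerFor method are hypothetical; the handler names simply echo the methods in the listing.

```java
// Hypothetical helper, not part of the article's RSSReader.
final class NameDispatch {

    // Picks a handler name for an element, using equals() in an
    // if...else chain as the Java listing does.
    static String handlerFor(String elementName) {
        if (elementName.equals("channel")) {
            return "ChannelToHtml";
        } else if (elementName.equals("item")) {
            return "ItemToHtml";
        } else {
            return "ignored";
        }
    }

    public static void main(String[] args) {
        // new String(...) forces a distinct object with the same characters,
        // much like a name freshly built by a parser.
        String name = new String("item");
        System.out.println(name == "item");      // false: different objects
        System.out.println(name.equals("item")); // true: same characters
        System.out.println(handlerFor(name));    // ItemToHtml
    }
}
```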
Another difference between the .NET XmlReader and the
Java XmlPullParser has to do with the way in which events
are pulled out of the XML document. In the former, the
ReadString() method will return all the text for the
current element; while in the latter, next() must
explicitly be called to position the parser at the text node before
calling getText() or readText() to read the
text.
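The same two styles can be seen side by side in StAX (javax.xml.stream in the JDK), which is used here only because it needs no extra libraries. In this sketch, with the hypothetical names TextRead, oneCall and stepByStep, getElementText() stands in for the one-call style of ReadString(), while the second method steps to each text event by hand before calling getText(). It assumes a text-only element.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the article's RSSReader.
final class TextRead {

    private static XMLStreamReader open(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        reader.nextTag(); // position on the element's start tag
        return reader;
    }

    // One call: returns the element's text and leaves the reader on the
    // matching end tag.
    static String oneCall(String xml) {
        try {
            return open(xml).getElementText();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("unexpected content", e);
        }
    }

    // Explicit stepping: advance to each text event, then ask for its text.
    static String stepByStep(String xml) {
        try {
            XMLStreamReader reader = open(xml);
            StringBuilder text = new StringBuilder();
            int event;
            while ((event = reader.next()) != XMLStreamConstants.END_ELEMENT) {
                if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    text.append(reader.getText());
                }
            }
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("unexpected content", e);
        }
    }
}
```

The explicit version appends in a loop because a parser is free to deliver one run of text as several events.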
This may be a minor difference, but it tends to make our port a
little more difficult. To better handle this requirement, I've
changed several while loops into do...while
loops. This, unfortunately, makes it less than a simple port; the
logic has changed, but not considerably.
Second, there is no XmlTextWriter in Java, so we're
using Alexandria Software Consulting's XmlHelper
package, which contains a class called XMLWriter. Besides
the naming of methods, XMLWriter operates almost
identically to .NET's XmlWriter, except for two
details.
First, XMLWriter has the notion of a collection of
attributes, whereas XmlWriter requires you to write each
attribute individually. In Java, we call beginElement(),
passing the name and the Map of attributes, whereas in
C#, we called WriteStartElement() followed by
WriteAttributeString().
Second, XMLWriter has a
writeEmptyElement() method, where XmlWriter
requires you to call WriteStartElement() followed by
WriteEndElement(). However, .NET automatically collapses
an empty element into a short end element (in this case, <br
/>). .NET's way gives you the flexibility of determining
whether the element is empty at runtime. If, however, you need to
force an end tag, you can call WriteFullEndElement()
instead of WriteEndElement().
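To see how a given Java writer treats the two styles, a quick experiment helps. The sketch below uses the JDK's StAX XMLStreamWriter rather than XMLWriter, since it is the writer that ships with the JDK and it also has a writeEmptyElement() method; EmptyElementSketch and render are hypothetical names. I make no claim here about how the start/end pair is serialized, so print both results with your own runtime before relying on either form.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of the article's RSSReader.
final class EmptyElementSketch {

    // Writes <p> containing a single br element, in one of two styles.
    static String render(boolean useEmptyElementCall) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance()
                    .createXMLStreamWriter(out);
            w.writeStartElement("p");
            if (useEmptyElementCall) {
                w.writeEmptyElement("br"); // declared empty up front
            } else {
                w.writeStartElement("br"); // start/end pair, no content
                w.writeEndElement();
            }
            w.writeEndElement(); // </p>
            w.flush();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write element", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render(true));
        System.out.println(render(false));
    }
}
```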
A pull parser makes it much easier to process XML, especially when you are processing XML with a well-defined grammar like RSS. This code is much easier to understand and maintain since there's no complex state machine to build or maintain. In fact, this code is completely stateless; the pull parser keeps track of all the state for us. So in that sense a pull parser is a higher level way of processing XML than SAX.
Although my original code quite intentionally didn't do any error handling, error handling in a push model state machine adds even more complexity to an already complex model. The new RSSReader has clear placeholders for error handling code in the cases when the input doesn't comply with the expected RSS DTD.
Performance can be an important consideration in an XML parser.
First, notice the call to Skip() (in the C# version) when we
find elements we're not interested in. This lets the XML parser
skip over entire subtrees of XML without having to call us back on
every element, even ones we know we're not interested in. In this
case we skip over the <image> elements and all their
children. Second, in C# we could replace all the element name
string comparisons with atomized pointer comparisons if we
used the XmlReader's NameTable to
pre-atomize those strings.
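On the Java side, the StAX XMLStreamReader interface in the JDK does not define a Skip() method, so a depth counter has to stand in. The following is a sketch under that assumption, with hypothetical names (SkipSketch, skipElement, channelTitle). Note that the application still pulls every event inside the subtree, so it tidies the calling code rather than promising the speed-up described above.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the article's RSSReader.
final class SkipSketch {

    // Expects the reader on a start tag; consumes events up to and
    // including the matching end tag.
    private static void skipElement(XMLStreamReader reader)
            throws XMLStreamException {
        int depth = 1;
        while (depth > 0) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    // Returns the channel's own <title>, ignoring titles nested in other
    // children such as <image>. Returns null when there is none.
    static String channelTitle(String channelXml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(channelXml));
            reader.nextTag(); // position on <channel>
            while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
                if (reader.getLocalName().equals("title")) {
                    return reader.getElementText();
                }
                skipElement(reader); // not interested in this subtree
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("unexpected channel content", e);
        }
    }
}
```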
Finally, using an XML writer makes our output generation more robust. For example, it will correctly convert special characters -- <, &, etc. -- into their respective entity references. Because it maintains its own state internally, it never forgets which element to close after a convoluted series of while loops. And it will always produce XML output in the consistent and readable format of your choice.
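The escaping is easy to check for yourself. This sketch uses the JDK's StAX XMLStreamWriter as the writer (an assumption made so the example runs without extra libraries; EscapeSketch and link are hypothetical names) and passes it text and an attribute value containing markup characters.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of the article's RSSReader.
final class EscapeSketch {

    // Writes <a href="...">text</a>, leaving all escaping to the writer.
    static String link(String href, String text) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance()
                    .createXMLStreamWriter(out);
            w.writeStartElement("a");
            w.writeAttribute("href", href); // raw value, not pre-escaped
            w.writeCharacters(text);        // raw text, not pre-escaped
            w.writeEndElement();
            w.flush();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write link", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(link("http://example.com/?a=1&b=2", "Q&A <live>"));
    }
}
```

The caller never builds a tag by string concatenation, so there is no place to forget an entity reference.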
And now for the inevitable comparison between .NET's
XmlReader/XmlWriter and the equivalent functionality in
Java. As usual, I'll say that in .NET, Microsoft has provided it all
for you and, thus, it is undeniably simpler to learn and
use. The C# version of our RSSReader is about 20% shorter than the
Java version, which is great unless you work in one of those shops
which still measures productivity in KLOCs. And the readability of the
code itself is much greater in C#, although that probably can be
chalked up at least in part to my own lack of skill in that conversion
from while to do...while.
But the real bottom line remains that doing it the .NET way means that Microsoft provides all the standards-compliant tools that 90% of developers are likely to need, while the Java way still means putting together a solution from various pieces that you can scrounge from various sources. Some of those pieces come from the Java Community Process and thus represent peer-reviewed, formally approved APIs, but some come from a quick search of the Web, and in the end only you are qualified to judge their worthiness.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.