XML.com: XML From the Inside Out

Pull Parsing in C# and Java
by Niel Bornstein

Java Pull Parsers

But pull parsers are not unique to the .NET world. The Java Community Process is currently working on a standard called StAX, the Streaming API for XML. This nascent API is, in turn, based upon several vendors' pull parser implementations, notably Apache's Xerces XNI, BEA's XML Stream API, XML Pull Parser 2, PullDOM (for Python), and, yes, Microsoft's XmlReader.

So how would we implement this same program with yet another pull parser API, the Common API for XML Pull Parsing, or XMLPULL? Let's take a look.

package com.xml;

import java.io.*;
import java.net.*;
import java.util.*;

import com.alexandriasc.xml.XMLWriter;
import org.xmlpull.v1.*;

public class RSSReader {

  public static void main(String [] args) {
    // create an instance of RSSReader
    RSSReader rssreader = new RSSReader();

    XMLWriter writer = null;
    try {
      String url = args[0];
      writer = new XMLWriter(new OutputStreamWriter(System.out),false);
      XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
      XmlPullParser parser = factory.newPullParser();
      InputStreamReader stream = new InputStreamReader(
        new URL(url).openStream());
      parser.setInput(stream);
      parser.setFeature(XmlPullParser.FEATURE_PROCESS_DOCDECL,false);
      rssreader.RSSToHtml(parser, writer);
    } catch (Exception e) {
      e.printStackTrace(System.err);
    } finally {
      // writer is still null if we failed before creating it
      if (writer != null) {
        try {
          writer.flush();
        } catch (IOException io) {
          io.printStackTrace(System.err);
        }
      }
    }
  }

  public void RSSToHtml(XmlPullParser parser, XMLWriter writer)
  throws IOException, XmlPullParserException {
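    // equivalent to XmlReader.MoveToContent(): advance to the first start tag
    int eventType = parser.next();
    while (eventType != XmlPullParser.START_TAG
      && eventType != XmlPullParser.END_DOCUMENT) {
      eventType = parser.next();
    }
    if (eventType == XmlPullParser.START_TAG
      && parser.getName().equals("rss")) {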
      writer.beginElement("html");
      do {
        parser.next();
        if (parser.getEventType() == XmlPullParser.START_TAG
          && parser.getName().equals("channel")) {
          ChannelToHtml(parser, writer);
        } else if (parser.getEventType() == XmlPullParser.START_TAG
          && parser.getName().equals("item")) {
          ItemToHtml(parser, writer);
        }
      } while (parser.getEventType() != XmlPullParser.END_DOCUMENT);
      writer.endElement();
    } else {
      // not an RSS document!
    }
  }

  void ChannelToHtml(XmlPullParser parser, XMLWriter writer)
  throws IOException, XmlPullParserException {
    writer.beginElement("head");
    // scan header elements and pick out the title.
    while (!(parser.next() == XmlPullParser.END_TAG
      && parser.getName().equals("channel"))) {
      if (parser.getEventType() == XmlPullParser.START_TAG) {
        do {
          if (parser.getEventType() == XmlPullParser.START_TAG
            && parser.getName().equals("title")) {
            while (parser.next() != XmlPullParser.END_TAG) {
              if (parser.getEventType() == XmlPullParser.TEXT) {
                writer.writeElement("title",null,parser.getText());
                break;
              }
            }
            break;
          }
        } while (parser.next() != XmlPullParser.END_TAG);
        break;
      }
    }
    writer.endElement();

    writer.beginElement("body");
    // transform the items.
    do {
      if (parser.getEventType() == XmlPullParser.START_TAG 
        && parser.getName().equals("item")) {
        ItemToHtml(parser, writer);
      }
      parser.next();
    } while (parser.getEventType() != XmlPullParser.END_DOCUMENT);
    writer.endElement();
  }

  void ItemToHtml(XmlPullParser parser, XMLWriter writer)
  throws IOException, XmlPullParserException {
    writer.beginElement("p");

    String title = null, link = null, description = null;
    while (parser.next() != XmlPullParser.END_DOCUMENT
      && parser.getEventType() != XmlPullParser.END_TAG) {
      if (parser.getEventType() == XmlPullParser.START_TAG
        && parser.getName().equals("title")) {
        // nextText() returns the element's text and stops on its end tag
        title = parser.nextText();
      } else if (parser.getEventType() == XmlPullParser.START_TAG
        && parser.getName().equals("link")) {
        link = parser.nextText();
      } else if (parser.getEventType() == XmlPullParser.START_TAG
        && parser.getName().equals("description")) {
        description = parser.nextText();
      }
    }
    HashMap attributes = new HashMap(1);
    attributes.put("href", link);
    writer.beginElement("a",attributes);
    writer.write(title);
    writer.endElement();

    writer.writeEmptyElement("br");

    writer.write(description);

    writer.endElement(); // end the "p" element
  }
}

Most of our port was the reverse of our previous ports; for example, changing Console.Out to System.out, calling library methods whose names start with lowercase letters, and adding explicit throws clauses. The real meat of this port is in two areas.

The Parser

First, we're using XmlPullParser as a rough equivalent of XmlTextReader. One difference is that while we are able to instantiate an XmlTextReader directly in C# (remember, Microsoft is a one-stop shop), we have to use the Java XmlPullParserFactory to get a concrete implementation of the XmlPullParser interface. This should be a familiar exercise for anyone who's used JAXP or, for that matter, JDBC.
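
The factory idiom described above can be sketched with classes that ship in the JDK. This is only an illustration, and it assumes StAX (javax.xml.stream) as a stand-in for XMLPULL: XMLInputFactory plays the role of XmlPullParserFactory, and XMLStreamReader that of XmlPullParser. The class and method names FactorySketch and rootName are made up for this example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; not part of the article's RSSReader.
class FactorySketch {

  // Returns the name of the root element of the given XML text.
  static String rootName(String xml) {
    try {
      // The factory chooses a concrete parser implementation for us;
      // application code only ever sees the XMLStreamReader interface.
      XMLInputFactory factory = XMLInputFactory.newInstance();
      XMLStreamReader reader =
        factory.createXMLStreamReader(new StringReader(xml));
      try {
        // Pull events until the first start tag, which is the root element.
        while (reader.hasNext()) {
          if (reader.next() == XMLStreamConstants.START_ELEMENT) {
            return reader.getLocalName();
          }
        }
        return null;
      } finally {
        reader.close();
      }
    } catch (XMLStreamException e) {
      // Malformed input: rethrow unchecked to keep the sketch short.
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(rootName("<rss version=\"0.91\"><channel/></rss>"));
  }
}
```

Because the calling code depends only on the interface, a different parser implementation can be dropped in without touching it, which is the same trade-off JAXP and JDBC make.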

Once we have the parser, most of the method name equivalencies are obvious. Remember that in C# the == operator works just fine for strings, but in Java you must use the .equals() method; otherwise you'll be comparing object references rather than their values, which is not at all what we want to do. Also, Java at the time of writing can't use a String as the expression in a switch...case statement (string switches only arrived later, in Java 7), so we've turned those into an if...else structure.
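
A minimal, self-contained illustration of the string comparison point follows. The class NameDispatch and its handlerFor method are invented for this sketch; they only mimic the if...else dispatch on element names used in the listing.

```java
// Hypothetical example class; it mirrors the if...else dispatch in RSSReader.
class NameDispatch {

  // Maps an RSS element name to the method the listing would call for it.
  static String handlerFor(String name) {
    // Calling equals() on the literal also copes with a null name.
    if ("channel".equals(name)) {
      return "ChannelToHtml";
    } else if ("item".equals(name)) {
      return "ItemToHtml";
    } else {
      return "ignored";
    }
  }

  public static void main(String[] args) {
    // Built at run time, so it is a different object from the literal "item".
    String name = new String("item");
    System.out.println(name == "item");       // false: compares references
    System.out.println(name.equals("item"));  // true: compares values
    System.out.println(handlerFor(name));     // ItemToHtml
  }
}
```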

Another difference between the .NET XmlReader and the Java XmlPullParser has to do with the way in which events are pulled out of the XML document. In the former, the ReadString() method will return all the text for the current element, while in the latter, next() must be called to position the parser at the text node before getText() can read it; the API's nextText() convenience method performs both steps when called on a start tag.
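
To make the event sequence concrete, here is a small sketch in which the text of an element arrives as its own event after the start tag. It assumes StAX (javax.xml.stream, bundled with the JDK) rather than XMLPULL, so the names differ: START_ELEMENT and CHARACTERS correspond roughly to START_TAG and TEXT. TextEvents and firstTitle are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class showing start tag and text as separate events.
class TextEvents {

  // Returns the text of the first <title> element, or null if there is none.
  static String firstTitle(String xml) {
    try {
      XMLStreamReader r = XMLInputFactory.newInstance()
        .createXMLStreamReader(new StringReader(xml));
      while (r.hasNext()) {
        if (r.next() == XMLStreamConstants.START_ELEMENT
          && "title".equals(r.getLocalName())) {
          // We are on the start tag. The text is delivered by the
          // following event(s), so we must advance before reading it.
          StringBuilder text = new StringBuilder();
          while (r.next() == XMLStreamConstants.CHARACTERS) {
            text.append(r.getText());
          }
          return text.toString();
        }
      }
      return null;
    } catch (XMLStreamException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(firstTitle(
      "<rss><channel><title>XML.com</title></channel></rss>"));
  }
}
```

The inner loop appends rather than assigns because a parser is free to deliver one run of text as several consecutive text events.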

This may be a minor difference, but it tends to make our port a little more difficult. To better handle this requirement, I've changed several while loops into do...while loops. This, unfortunately, makes it less than a simple port; the logic has changed, but not considerably.

The Writer

Second, there is no XmlTextWriter in Java, so we're using Alexandria Software Consulting's XmlHelper package, which contains a class called XMLWriter. Besides the naming of methods, XMLWriter operates almost identically to .NET's XmlWriter, except for two details.

First, XMLWriter has the notion of a collection of attributes, whereas XmlWriter requires you to write each attribute individually. In Java, we call beginElement(), passing the name and the Map of attributes, whereas in C#, we called WriteStartElement() followed by WriteAttributeString().

Second, XMLWriter has a writeEmptyElement() method, whereas XmlWriter requires you to call WriteStartElement() followed by WriteEndElement(). However, .NET automatically collapses an element with no content into the short empty-element form (in this case, <br />). .NET's way gives you the flexibility of determining whether the element is empty at runtime. If, however, you need to force a separate end tag, you can call WriteFullEndElement() instead of WriteEndElement().
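
Both details can be seen side by side in the following sketch, which assumes the JDK's StAX XMLStreamWriter rather than either of the writers discussed above. Like .NET's XmlWriter it takes one call per attribute, and like XMLWriter it has a dedicated call for empty elements. ItemWriterSketch and itemToHtml are hypothetical names, and the exact serialization of the empty element may vary between implementations.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical example class; it writes the same markup as ItemToHtml.
class ItemWriterSketch {

  static String itemToHtml(String title, String link, String description) {
    StringWriter out = new StringWriter();
    try {
      XMLStreamWriter w =
        XMLOutputFactory.newInstance().createXMLStreamWriter(out);
      w.writeStartElement("p");
      w.writeStartElement("a");
      w.writeAttribute("href", link);  // one call per attribute
      w.writeCharacters(title);
      w.writeEndElement();             // closes <a>
      w.writeEmptyElement("br");       // dedicated empty-element call
      w.writeCharacters(description);
      w.writeEndElement();             // closes <p>
      w.flush();
      w.close();
    } catch (XMLStreamException e) {
      throw new IllegalStateException(e);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(itemToHtml(
      "Pull Parsing", "http://example.com/", "A short description."));
  }
}
```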

Conclusion

A pull parser makes it much easier to process XML, especially when you are processing XML with a well-defined grammar like RSS. This code is much easier to understand and maintain since there's no complex state machine to build. In fact, this code is completely stateless; the pull parser keeps track of all the state for us. So in that sense a pull parser is a higher-level way of processing XML than SAX.

Although my original code quite intentionally didn't do any error handling, error handling in a push model state machine adds even more complexity to an already complex model. The new RSSReader has clear placeholders for error handling code in the cases when the input doesn't comply with the expected RSS DTD.

Performance can be an important consideration in an XML parser, and there are two points worth noting. First, notice the call to Skip() (in the C# version) when we find elements we're not interested in. In this case the XML parser can skip over entire subtrees of XML without having to call us back on every element, even ones we know we're not interested in; here we skip over the <image> elements and all their children. Second, in C# we could replace all the element name string comparisons with pointer comparisons on atomized strings if we used the XmlReader's NameTable to pre-atomize those names.
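
The Java listing has no direct counterpart to Skip(), but the idea is easy to approximate by counting nesting depth. The sketch below assumes StAX (javax.xml.stream) instead of XMLPULL, and the names SkipSketch, skipElement and channelTitle are invented for it. Note that this helper still pulls every event inside the subtree; it merely keeps the calling code from having to look at them, so it does not give the parser the chance to optimize that a built-in Skip() does.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; a hand-rolled stand-in for XmlReader.Skip().
class SkipSketch {

  // Assumes the reader is positioned on a start tag. Consumes events up to
  // and including the matching end tag.
  static void skipElement(XMLStreamReader r) throws XMLStreamException {
    int depth = 1;
    while (depth > 0) {
      int event = r.next();
      if (event == XMLStreamConstants.START_ELEMENT) {
        depth++;
      } else if (event == XMLStreamConstants.END_ELEMENT) {
        depth--;
      }
    }
  }

  // Returns the first <title> that is not inside an <image> element.
  static String channelTitle(String xml) {
    try {
      XMLStreamReader r = XMLInputFactory.newInstance()
        .createXMLStreamReader(new StringReader(xml));
      while (r.hasNext()) {
        if (r.next() == XMLStreamConstants.START_ELEMENT) {
          String name = r.getLocalName();
          if ("image".equals(name)) {
            skipElement(r);  // jump past <image> and all of its children
          } else if ("title".equals(name)) {
            return r.getElementText();
          }
        }
      }
      return null;
    } catch (XMLStreamException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(channelTitle(
      "<rss><channel><image><title>Logo</title></image>"
      + "<title>XML.com</title></channel></rss>"));
  }
}
```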

Finally, using an XML writer makes our output generation more robust. For example, it will correctly convert special characters -- <, &, etc. -- into their respective entity references. Because it maintains its own state internally, it never forgets which element to close after a convoluted series of while loops. And it will always produce XML output in the consistent and readable format of your choice.
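
As a quick check of the escaping claim, the following sketch writes text containing markup characters through an XML writer. It assumes the JDK's StAX XMLStreamWriter as the writer (neither XmlTextWriter nor Alexandria's XMLWriter), and EscapeSketch and paragraph are hypothetical names.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical example class; shows a writer escaping character data.
class EscapeSketch {

  // Wraps the given text in a <p> element, letting the writer escape it.
  static String paragraph(String text) {
    StringWriter out = new StringWriter();
    try {
      XMLStreamWriter w =
        XMLOutputFactory.newInstance().createXMLStreamWriter(out);
      w.writeStartElement("p");
      w.writeCharacters(text);  // '<' and '&' are escaped by the writer
      w.writeEndElement();      // the writer knows which element to close
      w.flush();
      w.close();
    } catch (XMLStreamException e) {
      throw new IllegalStateException(e);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(paragraph("AT&T <rocks>"));
  }
}
```

Building the same output by string concatenation would require remembering to escape every value by hand, which is exactly the kind of bookkeeping the writer takes over.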

And now for the inevitable comparison between .NET's XmlReader/XmlWriter and the equivalent functionality in Java. As usual, I'll say that in .NET, Microsoft has provided it all for you and, thus, it is undeniably simpler to learn and use. The C# version of our RSSReader is about 20% shorter than the Java version, which is great unless you work in one of those shops that still measure productivity in KLOCs. And the readability of the code itself is much greater in C#, although that can probably be chalked up at least in part to my own lack of skill in that conversion from while to do...while.

But the real bottom line remains that doing it the .NET way means that Microsoft provides all the standards-compliant tools that 90% of developers are likely to need, while the Java way still means putting together a solution from various pieces that you can scrounge from various sources. Some of those pieces come from the Java Community Process and thus represent peer-reviewed, formally approved APIs, but some come from a quick search of the Web, and in the end only you are qualified to judge their worthiness.


