XML.com: XML From the Inside Out

Pull Parsing in C# and Java

May 22, 2002

In my first article in this series, I wrote about porting a SAX application called RSSReader to the new Microsoft .NET Framework XmlReader. After publication, I received a message from Chris Lovett of Microsoft suggesting I revisit the subject. As he said, while the code I presented works, my approach was not optimal for the .NET Framework; I was still thinking in terms of SAX event-driven state machinery. A much easier way to approach this problem is to take advantage of the fact that XmlReader does not force you to think this way, and to write a recursive descent RSS transformer instead, as outlined below.

Based on Chris' suggestions, I've also made some other changes, including changing the output mechanism to use the XmlTextWriter, which takes care of generating well-formed XHTML on the output side.

And following all that, in a reversal of our usual process, I'll port this code back to Java.

Here then, without further ado, is the new RSSReader, optimized for C#. I've given the entire listing here, followed by an explanation.

using System;
using System.IO;
using System.Xml;
using System.Net;

public class RSSReader {
  public static void Main(string [] args) {
    // create an instance of RSSReader
    RSSReader rssreader = new RSSReader();

    try {
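      if (args.Length < 1) {
        // no feed URL was given on the command line
        Console.Error.WriteLine("usage: RSSReader <url>");
        return;
      }
      string url = args[0];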
      XmlTextWriter writer = new XmlTextWriter(Console.Out);
      writer.Formatting = Formatting.Indented;
      HttpWebRequest wr = (HttpWebRequest)WebRequest.Create(url);
      WebResponse resp = wr.GetResponse();
      Stream stream = resp.GetResponseStream();
      XmlTextReader reader = new XmlTextReader(stream);
      reader.XmlResolver = null; // ignore the DTD
      reader.WhitespaceHandling = WhitespaceHandling.None;
      rssreader.RSSToHtml(reader, writer);
      writer.Flush(); // push any buffered output to the console
    } catch (XmlException e) {
      Console.WriteLine(e.Message);
    }
  }

  public void RSSToHtml(XmlReader reader, XmlWriter writer) {
    reader.MoveToContent();
    if (reader.Name == "rss") {
      writer.WriteStartElement("html");
      while (reader.Read() &&
        reader.NodeType != XmlNodeType.EndElement) {
        switch (reader.LocalName) {
        case "channel":
          ChannelToHtml(reader, writer);
          break;
        case "item":
          ItemToHtml(reader, writer);
          break;
        default: // ignore image and textinput.
          break;
        }
      }
      writer.WriteEndElement();
    } else {
      // not an RSS document!
    }
  }

  void ChannelToHtml(XmlReader reader, XmlWriter writer) {
    writer.WriteStartElement("head");
    // scan header elements and pick out the title.
    reader.Read();
    while (reader.Name != "item" &&
      reader.NodeType != XmlNodeType.EndElement) {
      if (reader.Name == "title") {
        writer.WriteNode(reader, true); // copy node to output.
      } else {
        reader.Skip();
      }
    }
    writer.WriteEndElement();

    writer.WriteStartElement("body");
    // transform the items.
    while (reader.NodeType != XmlNodeType.EndElement) {
      if (reader.Name == "item") {
        ItemToHtml(reader, writer);
      }
      if (!reader.Read())
        break;
    }
    writer.WriteEndElement();
  }

  void ItemToHtml(XmlReader reader, XmlWriter writer) {
    writer.WriteStartElement("p");

    string title = null, link = null, description = null;
    while (reader.Read() &&
      reader.NodeType != XmlNodeType.EndElement) {
      switch (reader.Name) {
      case "title":
        title = reader.ReadString();
        break;
      case "link":
        link = reader.ReadString();
        break;
      case "description":
        description = reader.ReadString();
        break;
      }
    }
    writer.WriteStartElement("a");
    writer.WriteAttributeString("href", link);
    writer.WriteString(title);
    writer.WriteEndElement();

    writer.WriteStartElement("br");
    writer.WriteEndElement();

    writer.WriteString(description);

    writer.WriteEndElement(); // end the "p" element
  }
}

Explaining the Code

The Main entry point to the new RSSReader uses the System.Net classes directly to set up a WebRequest. You can also see the XmlTextWriter being constructed, with indenting turned on so that we get nicely readable output. The XmlReader and XmlWriter then become arguments to a recursive descent RSS parser; the top-level method is called RSSToHtml().
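To see what the XmlTextWriter contributes on its own, here is a minimal sketch that writes to an in-memory StringWriter instead of Console.Out. The class name WriterSketch, the example.com URL, and the link text are made up for illustration; only the XmlTextWriter calls are the same ones used in the listing.

```csharp
using System;
using System.IO;
using System.Xml;

public class WriterSketch {
  // Builds a small XHTML fragment in memory (hypothetical sample data).
  public static string Build() {
    StringWriter sw = new StringWriter();
    XmlTextWriter writer = new XmlTextWriter(sw);
    writer.Formatting = Formatting.Indented;

    writer.WriteStartElement("p");
    writer.WriteStartElement("a");
    // the writer escapes '&' in attribute values and in text
    writer.WriteAttributeString("href", "http://example.com/?a=1&b=2");
    writer.WriteString("Fish & Chips");
    writer.WriteEndElement(); // closes <a>
    writer.WriteEndElement(); // closes <p>

    writer.Flush();
    return sw.ToString();
  }

  public static void Main() {
    Console.WriteLine(Build());
  }
}
```

The point of the sketch is that the calling code never concatenates angle brackets or escapes ampersands itself; start and end tags are balanced by the writer.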

The top-level RSSToHtml() method first checks that we really have an RSS file by checking the root element name. MoveToContent() is a convenient way of skipping the XML prolog and going right to the top-level element in the document. If the XML document used namespaces, then we'd also want to match on the NamespaceURI property; however, this particular XML document doesn't use namespaces. If we find an <rss> element, then we read the contents, calling ChannelToHtml() when we find a <channel> element and calling ItemToHtml() when we find an <item> element. Any other element is ignored. This is all wrapped in the XmlWriter calls that write the root-level <html> output element.
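The root check, including the namespace test mentioned above, can be sketched in isolation. This is an illustration under assumptions, not part of the listing: the class RootCheckSketch, the helper IsRss(), and the two sample documents are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;

public class RootCheckSketch {
  // Returns true when the document element is <rss> in no namespace.
  public static bool IsRss(string xml) {
    XmlTextReader reader = new XmlTextReader(new StringReader(xml));
    reader.XmlResolver = null; // as in the listing: do not fetch a DTD
    // MoveToContent() skips the XML declaration, comments, processing
    // instructions, and whitespace, and returns the node type it stops on.
    XmlNodeType type = reader.MoveToContent();
    return type == XmlNodeType.Element
        && reader.LocalName == "rss"
        && reader.NamespaceURI == "";
  }

  public static void Main() {
    // prints True
    Console.WriteLine(IsRss(
      "<?xml version=\"1.0\"?><!-- feed --><rss version=\"0.91\"/>"));
    // prints False: the root is rdf:RDF, not rss
    Console.WriteLine(IsRss(
      "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"/>"));
  }
}
```

Comparing LocalName together with NamespaceURI, rather than Name alone, keeps the check correct if a feed ever declares a prefix or a default namespace on its root element.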

The ChannelToHtml() method does two things: it writes out the HTML head element containing a <title> element, then it writes out the HTML body. Notice that here we can simply use the XmlWriter.WriteNode() method, which copies the <title> element from the input reader to the output, since an HTML <title> is exactly the same as an RSS one. The HTML head element terminates when we reach the first child <item> element or the </channel> EndElement token. In the HTML body we look for <item> elements and call ItemToHtml().
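The interplay of WriteNode() and Skip() is easier to follow on a tiny input. The following is a sketch that assumes the channel fragment is already in a string; the class CopyTitleSketch and the method HeadFor() are hypothetical names, and the loop condition is a simplified form of the one in ChannelToHtml().

```csharp
using System;
using System.IO;
using System.Xml;

public class CopyTitleSketch {
  // Produces the <head> element for a <channel> fragment.
  public static string HeadFor(string channelXml) {
    XmlTextReader reader = new XmlTextReader(new StringReader(channelXml));
    reader.WhitespaceHandling = WhitespaceHandling.None;
    StringWriter sw = new StringWriter();
    XmlTextWriter writer = new XmlTextWriter(sw);

    reader.MoveToContent(); // now on <channel>
    reader.Read();          // now on its first child
    writer.WriteStartElement("head");
    while (reader.NodeType == XmlNodeType.Element &&
      reader.Name != "item") {
      if (reader.Name == "title") {
        // copies <title>...</title> and leaves the reader on the next sibling
        writer.WriteNode(reader, true);
      } else {
        // moves past the whole element, children included
        reader.Skip();
      }
    }
    writer.WriteEndElement();
    writer.Flush();
    return sw.ToString();
  }

  public static void Main() {
    // prints <head><title>XML.com</title></head>
    Console.WriteLine(HeadFor(
      "<channel><title>XML.com</title><link>http://www.xml.com/</link>" +
      "<item><title>x</title></item></channel>"));
  }
}
```

Both WriteNode() and Skip() advance the reader themselves, which is why the loop body contains no call to Read().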

The ItemToHtml() method writes out an HTML <p> tag, then reads the <title>, <link> and <description> elements out of the input. These input tags could arrive in any order, which is why we have to read them all before we can write the output. Once we have them we can write the <a> tag, with an href attribute equal to the <link> element and content equal to the <title>, followed by an empty <br> element and the description.
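The listing relies on an <item> holding only those three children. As a hedged variation rather than the code above, the sketch below collects the same three fields in any order and steps over children it does not recognise. The class ItemSketch, the method ReadItem(), and the extra <pubDate> element in the sample are assumptions made for illustration; ReadElementString() is used in place of ReadString() because it also consumes the end tag.

```csharp
using System;
using System.IO;
using System.Xml;

public class ItemSketch {
  // Expects the reader on an <item> start tag; returns { title, link,
  // description } and leaves the reader on the matching </item>.
  public static string[] ReadItem(XmlReader reader) {
    string title = null, link = null, description = null;
    if (!reader.IsEmptyElement) {
      reader.Read(); // move to the first child
      while (!reader.EOF &&
        reader.NodeType != XmlNodeType.EndElement) {
        if (reader.NodeType != XmlNodeType.Element) {
          reader.Read(); // stray text or comment: step over it
          continue;
        }
        switch (reader.Name) {
        case "title":
          title = reader.ReadElementString();
          break;
        case "link":
          link = reader.ReadElementString();
          break;
        case "description":
          description = reader.ReadElementString();
          break;
        default: // unknown child: skip it, children included
          reader.Skip();
          break;
        }
      }
    }
    return new string[] { title, link, description };
  }

  public static void Main() {
    XmlTextReader reader = new XmlTextReader(new StringReader(
      "<item><description>d</description><pubDate>today</pubDate>" +
      "<link>http://example.com/</link><title>t</title></item>"));
    reader.WhitespaceHandling = WhitespaceHandling.None;
    reader.MoveToContent();
    string[] fields = ReadItem(reader);
    // prints t | http://example.com/ | d
    Console.WriteLine(fields[0] + " | " + fields[1] + " | " + fields[2]);
  }
}
```

Because every branch of the switch advances the reader past a complete element, the loop needs no Read() of its own for element nodes, and an unexpected child cannot end the loop early.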

All in all, it seems like a much simpler way to deal with converting RSS to HTML. .NET's built-in XML parser is pretty neat.
