Pull Parsing in C# and Java
In my first article in this series, I wrote about porting a SAX application
called RSSReader to the new Microsoft .NET Framework
XmlReader. After publication, I received a message from
Chris Lovett of Microsoft suggesting I revisit the subject. As he
said, while the code I presented works, my approach was not optimal
for the .NET Framework; I was still thinking in terms of SAX
event-driven state machinery. A much easier way to approach this problem
is to take advantage of the fact that XmlReader does not
force you to think this way, and to write a recursive descent RSS
transformer as outlined below.
Based on Chris' suggestions, I've also made some other changes,
including changing the output mechanism to use the
XmlTextWriter, which takes care of generating
well-formed XHTML on the output side.
And following all that, in a reversal of our usual process, I'll port this code back to Java.
Here then, without further ado, is the new RSSReader, optimized for C#. I've given the entire listing here, followed by an explanation.
using System;
using System.IO;
using System.Net;
using System.Xml;

public class RSSReader {

    public static void Main(string[] args) {
        if (args.Length < 1) {
            Console.Error.WriteLine("usage: RSSReader <url>");
            return;
        }
        // create an instance of RSSReader
        RSSReader rssreader = new RSSReader();
        try {
            string url = args[0];
            XmlTextWriter writer = new XmlTextWriter(Console.Out);
            writer.Formatting = Formatting.Indented;
            HttpWebRequest wr = (HttpWebRequest)WebRequest.Create(url);
            WebResponse resp = wr.GetResponse();
            Stream stream = resp.GetResponseStream();
            XmlTextReader reader = new XmlTextReader(stream);
            reader.XmlResolver = null; // ignore the DTD
            reader.WhitespaceHandling = WhitespaceHandling.None;
            rssreader.RSSToHtml(reader, writer);
            writer.Flush();  // make sure all output reaches the console
            reader.Close();  // also closes the response stream
        } catch (XmlException e) {
            Console.WriteLine(e.Message);
        }
    }

    public void RSSToHtml(XmlReader reader, XmlWriter writer) {
        reader.MoveToContent();
        if (reader.Name == "rss") {
            writer.WriteStartElement("html");
            while (reader.Read() &&
                   reader.NodeType != XmlNodeType.EndElement) {
                switch (reader.LocalName) {
                    case "channel":
                        ChannelToHtml(reader, writer);
                        break;
                    case "item":
                        ItemToHtml(reader, writer);
                        break;
                    default: // ignore image and textinput.
                        break;
                }
            }
            writer.WriteEndElement();
        } else {
            // not an RSS document!
        }
    }

    void ChannelToHtml(XmlReader reader, XmlWriter writer) {
        writer.WriteStartElement("head");
        // scan header elements and pick out the title.
        reader.Read();
        while (reader.Name != "item" &&
               reader.NodeType != XmlNodeType.EndElement) {
            if (reader.Name == "title") {
                writer.WriteNode(reader, true); // copy node to output.
            } else {
                reader.Skip();
            }
        }
        writer.WriteEndElement();
        writer.WriteStartElement("body");
        // transform the items.
        while (reader.NodeType != XmlNodeType.EndElement) {
            if (reader.Name == "item") {
                ItemToHtml(reader, writer);
            }
            if (!reader.Read())
                break;
        }
        writer.WriteEndElement();
    }

    void ItemToHtml(XmlReader reader, XmlWriter writer) {
        writer.WriteStartElement("p");
        string title = null, link = null, description = null;
        // Note: this loop assumes <item> contains only text-only
        // title, link and description children; the end tag of any
        // other child element would end the loop early.
        while (reader.Read() &&
               reader.NodeType != XmlNodeType.EndElement) {
            switch (reader.Name) {
                case "title":
                    title = reader.ReadString();
                    break;
                case "link":
                    link = reader.ReadString();
                    break;
                case "description":
                    description = reader.ReadString();
                    break;
            }
        }
        writer.WriteStartElement("a");
        writer.WriteAttributeString("href", link);
        writer.WriteString(title);
        writer.WriteEndElement();
        writer.WriteStartElement("br");
        writer.WriteEndElement();
        writer.WriteString(description);
        writer.WriteEndElement(); // end the "p" element
    }
}
Explaining the Code
The Main entry point to the new RSSReader uses the
System.Net classes directly to set up a
WebRequest. You can also see the XmlTextWriter
being constructed, with indenting turned on so we get nicely readable
output. Then the XmlReader and XmlWriter
become arguments to a recursive descent RSS parser; the top level
method is called RSSToHtml().
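To see what the indenting option changes, here is a minimal sketch that writes the same tiny document twice, once with the default formatting and once with Formatting.Indented. The class name IndentDemo and the sample content are invented for illustration; a StringWriter stands in for Console.Out so the result can be inspected.

```csharp
using System;
using System.IO;
using System.Xml;

public static class IndentDemo {
    // Writes a small document and returns it as a string.
    // IndentDemo and the sample content are illustrative only.
    public static string Build(bool indent) {
        StringWriter sw = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(sw);
        if (indent) {
            writer.Formatting = Formatting.Indented;
        }
        writer.WriteStartElement("html");
        writer.WriteStartElement("body");
        // The writer escapes markup characters in text for us.
        writer.WriteElementString("p", "a < b");
        writer.WriteEndElement(); // </body>
        writer.WriteEndElement(); // </html>
        writer.Flush();
        return sw.ToString();
    }

    public static void Main() {
        Console.WriteLine(Build(false)); // everything on one line
        Console.WriteLine(Build(true));  // one element per line, nested by depth
    }
}
```

In both cases the writer tracks which elements are open and escapes the text, which is why the RSSReader code never has to build markup strings by hand.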
The top level RSSToHtml() method first checks that we
really have an RSS file, by checking the root element name.
MoveToContent() is a convenient way of skipping the XML
prolog and going right to the top level element in the document. If
the XML document used namespaces, then we'd also want to match on the
NamespaceURI property; however, this particular XML
document doesn't use namespaces. If we find an
<rss> element, then we read the contents, calling
ChannelToHtml() when we find a
<channel> element and calling
ItemToHtml() when we find an <item>
element. Any other element is skipped. This is all wrapped in the
XmlWriter call to write the root level
<html> output element.
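The root check can be tried in isolation. The following sketch, with an invented class name RootCheckDemo and made-up sample documents, calls MoveToContent() and reports the local name and namespace URI of the element it lands on. The second sample uses the RDF namespace only to show what NamespaceURI returns when a document does use namespaces.

```csharp
using System;
using System.IO;
using System.Xml;

public static class RootCheckDemo {
    // Returns "localName|namespaceURI" for the document element.
    // RootCheckDemo and the sample documents are illustrative only.
    public static string DescribeRoot(string xml) {
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        // Skips the XML declaration, comments and whitespace and
        // stops on the first content node, here the root element.
        reader.MoveToContent();
        return reader.LocalName + "|" + reader.NamespaceURI;
    }

    public static void Main() {
        string plain =
            "<?xml version='1.0'?><!-- sample feed -->" +
            "<rss version='0.91'><channel/></rss>";
        string namespaced =
            "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'/>";

        Console.WriteLine(DescribeRoot(plain));      // rss|
        Console.WriteLine(DescribeRoot(namespaced)); // RDF|http://www.w3.org/1999/02/22-rdf-syntax-ns#
    }
}
```

For a namespaced format, the test in RSSToHtml() would compare both LocalName and NamespaceURI rather than Name alone.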
The ChannelToHtml() method does two things: it writes
out the HTML head element containing a <title>
element, then it writes out the HTML body. Notice here we can simply
use the XmlWriter.WriteNode() method which copies the
<title> element from the input reader to the output,
since an HTML <title> is exactly the same as an RSS
one. The HTML head element terminates when we reach the first child
<item> element or the </channel>
EndElement token. In the HTML body we look for
<item> elements and call
ItemToHtml().
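The interplay between WriteNode() and Skip() is easier to follow on a small input. This sketch repeats the head-building loop on a hand-written <channel> fragment; the class name CopyTitleDemo and the sample data are assumptions made for the example. Both calls leave the reader on the next sibling, so the loop needs no extra Read().

```csharp
using System;
using System.IO;
using System.Xml;

public static class CopyTitleDemo {
    // Builds an HTML <head> from the children of a <channel> element.
    // CopyTitleDemo and the sample input are illustrative only.
    public static string BuildHead(string channelXml) {
        XmlTextReader reader = new XmlTextReader(new StringReader(channelXml));
        reader.WhitespaceHandling = WhitespaceHandling.None;
        StringWriter sw = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(sw);

        reader.MoveToContent(); // positioned on <channel>
        reader.Read();          // move to its first child
        writer.WriteStartElement("head");
        while (!reader.EOF &&
               reader.NodeType != XmlNodeType.EndElement) {
            if (reader.Name == "title") {
                // Copies the whole element and advances past it.
                writer.WriteNode(reader, true);
            } else {
                // Discards the element and advances past it.
                reader.Skip();
            }
        }
        writer.WriteEndElement();
        writer.Flush();
        return sw.ToString();
    }

    public static void Main() {
        string channel =
            "<channel><link>http://example.org/</link>" +
            "<title>Example News</title><language>en</language></channel>";
        // prints <head><title>Example News</title></head>
        Console.WriteLine(BuildHead(channel));
    }
}
```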
The ItemToHtml() method writes out an HTML
<p> tag, then reads the <title>,
<link> and <description> elements out
of the input. These input tags could arrive in any order, which is
why we have to read them all before we can write the output. Once we
have them we can write the <a> tag, with an
href attribute equal to the <link>
element, and content equal to the <title>, followed by
an empty <br> element and the description.
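To illustrate why the values are collected before anything is written, here is a small sketch that runs the same read-then-write logic over two hand-made <item> fragments whose children appear in different orders. The class name ItemOrderDemo and the sample items are invented; like the listing, the sketch assumes an item holds only text-only title, link and description children.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ItemOrderDemo {
    // Converts one <item> fragment to an HTML paragraph.
    // ItemOrderDemo and the sample items are illustrative only.
    public static string Convert(string itemXml) {
        XmlTextReader reader = new XmlTextReader(new StringReader(itemXml));
        reader.WhitespaceHandling = WhitespaceHandling.None;
        reader.MoveToContent(); // positioned on <item>

        // First pass: collect the values, whatever order they come in.
        string title = null, link = null, description = null;
        while (reader.Read() &&
               reader.NodeType != XmlNodeType.EndElement) {
            switch (reader.Name) {
                case "title":       title = reader.ReadString(); break;
                case "link":        link = reader.ReadString(); break;
                case "description": description = reader.ReadString(); break;
            }
        }

        // Second pass: write them in the order the HTML needs.
        StringWriter sw = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(sw);
        writer.WriteStartElement("p");
        writer.WriteStartElement("a");
        writer.WriteAttributeString("href", link);
        writer.WriteString(title);
        writer.WriteEndElement(); // </a>
        writer.WriteStartElement("br");
        writer.WriteEndElement(); // empty br element
        writer.WriteString(description);
        writer.WriteEndElement(); // </p>
        writer.Flush();
        return sw.ToString();
    }

    public static void Main() {
        string a = "<item><title>First post</title>" +
                   "<link>http://example.org/1</link>" +
                   "<description>Details here.</description></item>";
        string b = "<item><description>Details here.</description>" +
                   "<link>http://example.org/1</link>" +
                   "<title>First post</title></item>";
        Console.WriteLine(Convert(a));
        Console.WriteLine(Convert(a) == Convert(b)); // True
    }
}
```

Because ReadString() leaves the reader on the child's end tag, the next Read() in the loop lands on the following sibling, which is what lets the loop walk the children one by one.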
All in all, it seems like a much simpler way to deal with converting RSS to HTML. .NET's built-in XML parser is pretty neat.