
Inside the RSS Validator
In previous columns, I have introduced RSS and explored options for consuming it. Now we turn to the production side. Last month I stirred up a small controversy by suggesting that RSS consumers should go out of their way to consume as many feeds as possible, even ones which are not well-formed. This month I hope it will be somewhat less controversial to say that RSS producers should go out of their way to produce feeds that conform to specifications as well as possible.
Rule Zero is that all RSS feeds must be well-formed XML. Not all RSS consumers use the advanced techniques we discussed last month. Many can only parse RSS feeds that are well-formed XML. There are many tools for producing XML; you should use one of them as opposed to, say, using string concatenation and a non-XML-aware templating system and hoping for the best.
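As a minimal sketch of that advice, here is one way to build a tiny RSS 2.0 channel with Python's standard xml.etree.ElementTree module rather than string concatenation. The helper name build_feed and the sample title, link, and description are made up for illustration; the point is only that the serializer, not your template, takes care of escaping.

```python
import xml.etree.ElementTree as ET


def build_feed(title, link, description):
    """Build a tiny RSS 2.0 document and return it as a string.

    build_feed is a hypothetical helper, not part of any RSS tool.
    """
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    # Text is assigned as plain strings; the serializer escapes
    # characters such as & and < when the tree is written out.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    return ET.tostring(rss, encoding="unicode")


feed = build_feed("Cats & Dogs", "http://example.com/", "News about pets")
print(feed)
```

The ampersand in the title comes out as an entity reference in the serialized feed, and parsing the feed back yields the original text.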
Beyond well-formedness, there are a number of domain-specific rules and
best practices for RSS feeds. These are fairly well encapsulated in the
free online RSS
validator. Point the validator at your RSS feed and follow its
instructions if it finds any errors or warnings. It will catch common XML
errors such as unescaped ampersands and high-bit characters;
domain-specific errors such as missing required elements; and more subtle
errors such as improper language codes in the
<language> element.
Lather, rinse, repeat until the validator clears your feed for takeoff. Check back every now and then to make sure your feed has not quietly become invalid, which may indicate a bug in your production software.
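Between visits to the validator, a quick local check can at least confirm Rule Zero. The following is a rough sketch using xml.sax.parseString from the standard library; the function name is_well_formed and the two sample feeds are assumptions for illustration. It tests well-formedness only and knows nothing about the RSS-specific rules the validator enforces.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def is_well_formed(data):
    """Return (True, None) if data parses as XML, else (False, message)."""
    try:
        # A bare ContentHandler ignores every event; we only care
        # whether the parser raises an error.
        xml.sax.parseString(data, ContentHandler())
    except xml.sax.SAXParseException as err:
        return False, err.getMessage()
    return True, None


good = b"<rss version='2.0'><channel><title>A &amp; B</title></channel></rss>"
bad = b"<rss version='2.0'><channel><title>A & B</title></channel></rss>"

print(is_well_formed(good)[0])
print(is_well_formed(bad)[0])
```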
How the validator works internally is actually fairly interesting -- much more interesting than the arcane rules of RSS validity -- and that's where I'd like to focus. The validator is written in Python, and it is available under a liberal open source license, so you can download the complete source code and follow along.
The RSS validator relies on Python's built-in SAX interface,
xml.sax.handler. To use it, you subclass
ContentHandler and provide methods for
startElementNS (for start tags), endElementNS
(for end tags), and characters (for everything in between).
Of course, for anything but the most trivial applications, these will end
up being dispatch methods to the real code stored elsewhere, which you've
separated based on some criteria (namespace, element name, phase of the
moon).
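To make the three callbacks concrete, here is a small sketch of a ContentHandler subclass that simply records the events it receives. The class name EventRecorder, the function record_events, and the sample document are invented for this illustration and are not part of the validator. It is written for Python 3 and turns on namespace processing so that the NS variants of the callbacks are the ones that fire.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, feature_namespaces
from xml.sax.xmlreader import InputSource


class EventRecorder(ContentHandler):
    """Record SAX events as tuples so we can inspect their order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElementNS(self, name, qname, attrs):
        # In namespace mode, name is a (namespace URI, local name) pair.
        uri, localname = name
        self.events.append(("start", uri, localname))

    def endElementNS(self, name, qname):
        uri, localname = name
        self.events.append(("end", uri, localname))

    def characters(self, content):
        # The parser may deliver an element's text in several chunks.
        if content.strip():
            self.events.append(("text", content))


def record_events(data):
    """Parse a byte string and return the list of recorded events."""
    parser = make_parser()
    parser.setFeature(feature_namespaces, True)
    recorder = EventRecorder()
    parser.setContentHandler(recorder)
    source = InputSource()
    source.setByteStream(io.BytesIO(data))
    parser.parse(source)
    return recorder.events


events = record_events(
    b"<rss version='2.0'><channel><title>Hi</title></channel></rss>"
)
for event in events:
    print(event)
```

Elements that are not in any namespace arrive with None as the namespace URI, which is why the validator can tell core RSS elements apart from namespaced extensions.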
As the SAX parser processes the input document, the RSS validator maintains a stack of handler objects. Each handler object knows just enough to validate a specific element, and it knows which other handler objects can validate the element's children. Each handler object is set up with contextual information about which element it's handling, what its parent element is, and what attributes, if any, were present in its start tag. The handler object introspects over its own methods to find one that can handle the current element and calls it. This method can perform the validation directly, or it can return one or more handler objects to perform additional validation. This will become clearer with some code.
Here is the first step, a subclass of
xml.sax.handler.ContentHandler, which initializes the handler
stack and then passes all startElementNS requests to the top
handler in the stack.
from xml.sax.handler import ContentHandler

class SAXDispatcher(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        # prime the handler stack with the root handler object
        self.handler_stack = [[root(self)]]

    def startElementNS(self, name, qname, attrs):
        # in namespace mode, name arrives as a (namespace URI, local name)
        # pair; from here on, qname holds the namespace URI
        qname, name = name
        for handler in self.handler_stack[-1]:
            # call all the handlers for the current element
            handler.startElementNS(name, qname, attrs)
The second step is a base class for all handler objects. It's really a second-level dispatch; it introspects over its own methods to find a method which matches the current element's name, do_element. If found, it calls the method, which returns one or more handler objects. Each of these handlers is set up with contextual information and pushed onto the stack.
class validatorBase(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.value = ""
        self.attrs = None
        self.children = []

    def startElementNS(self, name, qname, attrs):
        from validators import eater
        if qname:
            handlers = self.unknown_starttag(name, qname, attrs)
        else:
            try:
                # look for specific method for this element (by local-name)
                handlers = getattr(self, "do_" + name)()
            except AttributeError:
                # no specific method for this element, use default handler
                handlers = [eater()]
        # small hack: if method returns 1 handler, make it a list of 1
        try:
            iter(handlers)
        except TypeError:
            handlers = [handlers]
        # set up contextual information for each handler object
        for aHandler in handlers:
            aHandler.parent = self
            aHandler.value = ""
            aHandler.name = name
            aHandler.attrs = attrs
            aHandler.prevalidate()
        self.children.append(name)
        # push handlers onto the stack
        self.push(handlers)
Two other methods are present in validatorBase: the
characters method, which just buffers the raw text data
within the current element, and the endElementNS method,
which gets called when we get to the element's end tag and which calls a
validate method (defined in the descendant handler
objects).
    def characters(self, string):
        # buffer the text data for this element
        self.value = self.value + string

    def endElementNS(self, name, qname):
        # we've buffered all the text data for this element, so validate it
        self.validate()
Now we can start defining a hierarchy of handler objects to validate
different parts of the RSS feed. Each handler needs a
validate method to validate the element's data and a
do_ method for each possible child element. For example, the
root handler does no validation, but it knows about the rss
element, which is the top-level element of most RSS feeds. (This code
example is simplified; in reality we also need to handle an
rdf element, which is the top-level element of RSS 0.9 and
1.0 feeds.)
class root(validatorBase):
    def do_rss(self):
        from rss import rss
        return rss()
The rss handler knows that every rss element
needs a channel child element and a version
attribute. It also has a do_channel method which dispatches
the validation for the child channel element.
class rss(validatorBase):
    def validate(self):
        if "channel" not in self.children:
            self.log(MissingChannel({"element":self.name, "attr":"channel"}))
        if (None, 'version') not in self.attrs.getNames():
            self.log(MissingAttribute({"element":self.name, "attr":"version"}))

    def do_channel(self):
        from channel import channel
        return channel()
The channel handler knows that every channel needs a
title, link, and description (and a few other rules), and it has
do_ methods for each possible child element of
channel: title, link,
description, item, items,
textInput and textinput (due to subtle
differences in various RSS versions -- seven specs, no waiting),
category, cloud, rating,
ttl, docs, generator,
pubDate, lastBuildDate,
managingEditor, webMaster,
language, copyright, skipHours,
skipDays, and blink. (There is no
blink tag in RSS, but there was some confusion about this, so
the validator presents a specific error message for it.)
class channel(validatorBase):
    def validate(self):
        if "title" not in self.children:
            self.log(MissingTitle({"parent":self.name, "element":"title"}))
        if "link" not in self.children:
            self.log(MissingLink({"parent":self.name, "element":"link"}))
        if "description" not in self.children:
            self.log(MissingDescription({"parent":self.name,"element":"description"}))
        # several rules omitted here
        ...

    def do_title(self):
        return nonhtml(), noduplicates()

    def do_link(self):
        return rfc2396(), noduplicates()

    def do_description(self):
        return nonhtml(), noduplicates()

    ...
    # lots of other do_ methods omitted
As you can see, several of the do_ methods return a list
of individual handlers. A channel link must be an
RFC-2396-compliant URI, and there can be only one link
element per channel. Each of these rules is encoded in its own handler
object:
import re

class rfc2396(validatorBase):
    rfc2396_re = re.compile("[a-zA-Z][0-9a-zA-Z+\\-\\.]*:(//)?" +
                            "[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%,#]+$")

    def validate(self, errorClass=InvalidLink):
        if (not self.value) or (not self.rfc2396_re.match(self.value)):
            self.log(errorClass({"element":self.name, "value":self.value}))

class noduplicates(validatorBase):
    def prevalidate(self):
        # runs before the parent records this child, so a hit means a repeat
        if self.name in self.parent.children:
            self.log(DuplicateElement({"parent":self.parent.name, "element":self.name}))
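To get a feel for what the URI pattern accepts, the regular expression can be tried on its own, outside the handler framework. This sketch copies the pattern from the rfc2396 handler (written here as raw strings, which compile to the same expression); the helper name looks_like_uri and the sample values are my own, chosen for illustration.

```python
import re

# Same pattern as in the rfc2396 handler, written as raw strings.
rfc2396_re = re.compile(
    r"[a-zA-Z][0-9a-zA-Z+\-\.]*:(//)?"
    r"[0-9a-zA-Z;/?:@&=+$\.\-_!~*'()%,#]+$"
)


def looks_like_uri(value):
    """Mirror the handler's test: non-empty and matching the pattern."""
    return bool(value) and rfc2396_re.match(value) is not None


samples = [
    "http://example.com/feed.xml",  # scheme, "//", then allowed characters
    "mailto:editor@example.com",    # scheme without "//"
    "example.com/feed.xml",         # no scheme, so no colon where required
    "http://example.com/a b",       # a space is not an allowed character
    "",                             # empty values are rejected up front
]
for sample in samples:
    print(repr(sample), looks_like_uri(sample))
```

Note that this is a character-level check: it insists on a scheme followed by a colon and a run of permitted characters, but it does not break the URI into components.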
When the channel.do_link method returns the list of
rfc2396 and noduplicates handler objects, the
secondary dispatcher in validatorBase.startElementNS pushes
them onto the stack, where the main dispatcher in
SAXDispatcher.startElementNS pops them off and calls each of
them in turn. The rfc2396 instance and the
noduplicates instance are each set up with contextual
information for the current link element; each performs
its own validation and logs its own errors.
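The whole round trip is easier to follow in a version small enough to run. What follows is a deliberately reduced sketch for Python 3, not the validator's actual code: the class names (Base, NonEmpty, NoDuplicates and so on), the start_child method, and the plain-string error messages are all stand-ins of my own. It also differs from the real thing in two ways: it uses the non-namespace SAX callbacks to stay short, and the dispatcher itself pushes one stack entry per open element instead of letting each handler push.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class Base:
    """Stand-in for validatorBase; unknown elements get a do-nothing Base."""

    def __init__(self):
        self.value = ""
        self.children = []

    def start_child(self, dispatcher, name, attrs):
        # Second-level dispatch: look for a do_<name> method on this object.
        method = getattr(self, "do_" + name, None)
        handlers = method() if method else [Base()]
        if isinstance(handlers, Base):
            handlers = [handlers]  # a single handler becomes a list of one
        for handler in handlers:
            handler.parent, handler.name = self, name
            handler.attrs, handler.dispatcher = attrs, dispatcher
            handler.prevalidate()
        self.children.append(name)
        return list(handlers)

    def prevalidate(self):
        pass

    def validate(self):
        pass

    def log(self, message):
        self.dispatcher.errors.append(message)


class NoDuplicates(Base):
    def prevalidate(self):
        # Runs before the parent records this child, so a hit is a repeat.
        if self.name in self.parent.children:
            self.log("duplicate <%s> in <%s>" % (self.name, self.parent.name))


class NonEmpty(Base):
    def validate(self):
        if not self.value.strip():
            self.log("<%s> is empty" % self.name)


class Channel(Base):
    def validate(self):
        if "title" not in self.children:
            self.log("<channel> has no <title>")

    def do_title(self):
        return NonEmpty(), NoDuplicates()


class Rss(Base):
    def do_channel(self):
        return Channel()


class Root(Base):
    def do_rss(self):
        return Rss()


class Dispatcher(ContentHandler):
    def __init__(self):
        super().__init__()
        self.errors = []
        self.stack = [[Root()]]  # prime the stack with the root handler

    def startElement(self, name, attrs):
        frame = []
        for handler in self.stack[-1]:
            frame.extend(handler.start_child(self, name, attrs))
        self.stack.append(frame)  # one stack entry per open element

    def characters(self, content):
        for handler in self.stack[-1]:
            handler.value += content

    def endElement(self, name):
        for handler in self.stack.pop():
            handler.validate()


def check(data):
    dispatcher = Dispatcher()
    xml.sax.parseString(data, dispatcher)
    return dispatcher.errors


print(check(b"<rss><channel><title>One</title><title/></channel></rss>"))
print(check(b"<rss><channel><link>x</link></channel></rss>"))
```

The first call reports both the duplicate title and the empty one, each logged by a different handler attached to the same element; the second reports the missing title from the channel handler's own validate method.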
This all may seem like a lot of indirection -- and it is -- but it has several advantages:
- It's easy to add new functionality. Adding support for a new element
requires writing a new handler object that inherits from
validatorBase, then adding a single do_ method in the parent element's handler.
- It's easy to debug. None of the individual handler objects interact with each other. There are no side effects.
- It encourages code reuse. Several different elements in various
levels of an RSS document have similar validation logic. For instance,
docs, link, and the elements within the optional blogChannel module all need to be RFC-2396-compliant URIs.
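The first and third points can be illustrated without any XML parsing at all. In this sketch, HandlerBase is a cut-down stand-in for validatorBase, and the Uri rule is deliberately crude (it only looks for a colon); both are assumptions made to keep the example short. What matters is the shape: one rule class, reused by two elements, each wired in with a single do_ method.

```python
class HandlerBase:
    """Stand-in for validatorBase, reduced to what this sketch needs."""

    def __init__(self):
        self.value = ""
        self.errors = []

    def handlers_for(self, name):
        # Same lookup idea as the article: find do_<name> on this object.
        method = getattr(self, "do_" + name, None)
        return [method()] if method else []


class Uri(HandlerBase):
    """One rule, written once: the value must contain a scheme separator."""

    def validate(self):
        if ":" not in self.value:
            self.errors.append("not a URI: %r" % self.value)


class Channel(HandlerBase):
    # Supporting another element costs one extra method,
    # and both elements reuse the same Uri rule.
    def do_link(self):
        return Uri()

    def do_docs(self):
        return Uri()


def check(parent, name, text):
    """Validate the text of one child element and return its errors."""
    errors = []
    for handler in parent.handlers_for(name):
        handler.value = text
        handler.validate()
        errors.extend(handler.errors)
    return errors


print(check(Channel(), "link", "http://example.com/"))
print(check(Channel(), "docs", "no scheme here"))
```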
With this framework in place, and an entire hierarchy of handler objects each doing their own little piece of validation, the main function to parse an RSS feed is mostly boilerplate:
def validate(aString):
    # boilerplate
    from xml.sax import make_parser, handler
    from xml.sax.xmlreader import InputSource
    from base import SAXDispatcher
    from exceptions import UnicodeError
    from cStringIO import StringIO
    source = InputSource()
    source.setByteStream(StringIO(aString))
    # create an instance of our top-level SAX dispatcher
    validator = SAXDispatcher()
    # boilerplate
    parser = make_parser()
    parser.setFeature(handler.feature_namespaces, 1)
    # set up our validator as the handler for all SAX events,
    # and start parsing
    parser.setContentHandler(validator)
    parser.setErrorHandler(validator)
    parser.setEntityResolver(validator)
    parser.parse(source)
    return validator
During the course of parsing, our SAXDispatcher instance
accumulates errors and warnings through a centralized logging interface
(not shown). Each error is stored as its own object, and we can access
the list of errors and display them however we choose. The interactive web-based
validator displays them in an HTML page; the (currently beta) SOAP interface uses
the errors to construct a SOAP response. The downloadable command-line version
just prints them to the screen.
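Since the logging interface is not shown, the following is only a guess at the general idea, not the validator's real classes. LoggedEvent, its template attribute, and the two rendering functions are hypothetical; MissingTitle borrows its name and parameter keys from the channel handler above, but its message text is invented. The sketch shows why storing each error as an object pays off: the same list can be rendered as plain text or as HTML.

```python
from html import escape


class LoggedEvent:
    """Hypothetical error object: a message template plus parameters."""

    template = "unknown problem"

    def __init__(self, params):
        self.params = params

    def message(self):
        return self.template % self.params


class MissingTitle(LoggedEvent):
    # The wording of this message is an assumption for illustration.
    template = "<%(parent)s> is missing required element <%(element)s>"


def as_text(events):
    """Render events one per line, as a command-line tool might."""
    return "\n".join(event.message() for event in events)


def as_html(events):
    """Render events as an HTML list, escaping the angle brackets."""
    items = "".join("<li>%s</li>" % escape(e.message()) for e in events)
    return "<ul>%s</ul>" % items


events = [MissingTitle({"parent": "channel", "element": "title"})]
print(as_text(events))
print(as_html(events))
```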
Next month: something other than RSS.