XML.com: XML From the Inside Out

Inside the RSS Validator

February 26, 2003

In previous columns, I have introduced RSS and explored options for consuming it. Now we turn to the production side. Last month I stirred up a small controversy by suggesting that RSS consumers should go out of their way to consume as many feeds as possible, even ones which are not well-formed. This month I hope it will be somewhat less controversial to say that RSS producers should go out of their way to produce feeds that conform to specifications as well as possible.

Rule Zero is that all RSS feeds must be well-formed XML. Not all RSS consumers use the advanced techniques we discussed last month. Many can only parse RSS feeds that are well-formed XML. There are many tools for producing XML; you should use one of them as opposed to, say, using string concatenation and a non-XML-aware templating system and hoping for the best.
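
As a rough illustration of what using an XML tool buys you, here is a minimal sketch that builds a feed with Python's standard xml.etree.ElementTree module. It is written for current Python 3; the function name build_feed and the sample values are made up for this example. The point is only that the serializer, not your code, takes care of escaping.

```python
import xml.etree.ElementTree as ET

def build_feed(title, link, description):
    """Build a minimal RSS 2.0 document and return it as a string.

    build_feed is a hypothetical helper for this sketch.
    """
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    # Text is assigned as plain Python strings; the serializer escapes
    # &, < and > when it writes the document out.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    return ET.tostring(rss, encoding="unicode")

feed_xml = build_feed("Tom & Jerry's Weblog",
                      "http://example.com/?a=1&b=2",
                      "News <and> views")
print(feed_xml)
```

The ampersands in the title and link come out as &amp;amp; in the serialized feed, which is exactly the step that string concatenation tends to forget.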

Beyond well-formedness, there are a number of domain-specific rules and best practices for RSS feeds. These are fairly well encapsulated in the free online RSS validator. Point the validator at your RSS feed and follow its instructions if it finds any errors or warnings. It will catch common XML errors such as unescaped ampersands and high-bit characters; domain-specific errors such as missing required elements; and more subtle errors such as improper language codes in the <language> element.
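
The well-formedness part of that check can also be approximated locally before you point the online validator at anything. The sketch below (current Python 3, with a made-up function name, well_formedness_error) simply runs the feed through the standard SAX parser and reports the first fatal error. It says nothing about the RSS-specific rules, which is what the validator itself is for.

```python
import xml.sax

def well_formedness_error(feed_bytes):
    """Return None if feed_bytes parses as XML, else a short message.

    A hypothetical helper: it checks XML well-formedness only.
    """
    try:
        # a do-nothing ContentHandler; we only care whether parsing succeeds
        xml.sax.parseString(feed_bytes, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as err:
        return "line %d: %s" % (err.getLineNumber(), err.getMessage())
    return None

good = b"<rss version='2.0'><channel><title>A &amp; B</title></channel></rss>"
bad = b"<rss version='2.0'><channel><title>A & B</title></channel></rss>"

print(well_formedness_error(good))
print(well_formedness_error(bad))
```

The first call prints None; the second prints the parser's complaint about the unescaped ampersand.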

Lather, rinse, repeat till the validator clears your feed for takeoff. Check back every now and then to make sure your feed hasn't quietly gone invalid, which may indicate bugs in your production software.

How the validator works internally is actually fairly interesting -- much more interesting than the arcane rules of RSS validity -- and that's where I'd like to focus. The validator is written in Python, and it is available under a liberal open source license, so you can download the complete source code and follow along.

The RSS validator relies on Python's built-in SAX interface, xml.sax.handler. To use it, you subclass ContentHandler and provide methods for startElementNS (for start tags), endElementNS (for end tags), and characters (for everything in between). Of course, for anything but the most trivial applications, these will end up being dispatch methods to the real code stored elsewhere, which you've separated based on some criteria (namespace, element name, phase of the moon).
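
To make the three callbacks concrete, here is a minimal sketch of a ContentHandler subclass that does nothing but record the events it receives. It is written for current Python 3, and the names EventRecorder and record_events are invented for this example; the parser setup uses the standard xml.sax calls with namespace processing switched on.

```python
from io import BytesIO
from xml.sax import make_parser, handler
from xml.sax.xmlreader import InputSource

class EventRecorder(handler.ContentHandler):
    """Record each SAX event so the call sequence is visible."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.events = []

    def startElementNS(self, name, qname, attrs):
        # in namespace mode, name is a (namespace URI, local name) tuple
        uri, localname = name
        self.events.append(("start", uri, localname))

    def characters(self, content):
        # skip whitespace-only chunks to keep the log short
        if content.strip():
            self.events.append(("text", content))

    def endElementNS(self, name, qname):
        self.events.append(("end", name[1]))

def record_events(xml_bytes):
    recorder = EventRecorder()
    parser = make_parser()
    parser.setFeature(handler.feature_namespaces, 1)
    parser.setContentHandler(recorder)
    source = InputSource()
    source.setByteStream(BytesIO(xml_bytes))
    parser.parse(source)
    return recorder.events

events = record_events(
    b"<rss version='2.0'><channel><title>Hi</title></channel></rss>")
for event in events:
    print(event)
```

Elements that are not in any namespace, such as rss here, arrive with None as the namespace half of the name tuple.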

As the SAX parser processes the input document, the RSS validator maintains a stack of handler objects. Each handler object knows just enough to validate a specific element, and it knows which other handler objects can validate the element's children. Each handler object is set up with contextual information about which element it's handling, what its parent element is, and what attributes, if any, were present in its start tag. The handler object introspects over its own methods to find one that can handle the current element and calls it. This method can perform the validation directly, or it can return one or more handler objects to perform additional validation. This will become clearer with some code.

Here is the first step, a subclass of xml.sax.handler.ContentHandler, which initializes the handler stack and then passes all startElementNS requests to the top handler in the stack.


from xml.sax.handler import ContentHandler

class SAXDispatcher(ContentHandler):

  def __init__(self):
    ContentHandler.__init__(self)
    # prime the handler stack with the root handler object
    self.handler_stack = [[root(self)]]

  def startElementNS(self, name, qname, attrs):
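    # with namespace processing on, SAX passes name as a
    # (namespace URI, local name) tuple; the URI ends up in qname here
    qname, name = name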
    for handler in self.handler_stack[-1]:
      # call all the handlers for the current element
      handler.startElementNS(name, qname, attrs)

The second step is a base class for all handler objects. It's really a second-level dispatch; it introspects over its own methods to find one named after the current element: do_ plus the element's local name, such as do_channel for a channel element. If found, it calls the method, which returns one or more handler objects. Each of these handlers is set up with contextual information and pushed onto the stack.

class validatorBase(ContentHandler):
  def __init__(self):
    ContentHandler.__init__(self)
    self.value = ""
    self.attrs = None
    self.children = []

  def startElementNS(self, name, qname, attrs):
    from validators import eater
    if qname:
      # the element is in a namespace; use the fallback (not shown)
      handlers = self.unknown_starttag(name, qname, attrs)
    else:
      try:
        # look for specific method for this element (by local-name)
        handlers = getattr(self, "do_" + name)()
      except AttributeError:
        # no specific method for this element, use default handler
        handlers = [eater()]
    # small hack: if method returns 1 handler, make it a list of 1
    try: iter(handlers)
    except TypeError: handlers = [handlers]
    # set up contextual information for each handler object
    for aHandler in handlers:
      aHandler.parent = self
      aHandler.value = ""
      aHandler.name = name
      aHandler.attrs = attrs
      aHandler.prevalidate()
    self.children.append(name)
    # push handlers onto the stack
    self.push(handlers)

Two other methods are present in validatorBase: the characters method, which just buffers the raw text data within the current element, and the endElementNS method, which gets called when we get to the element's end tag and which calls a validate method (defined in the descendant handler objects).

  def characters(self, string):
    # buffer the text data for this element
    self.value = self.value + string

  def endElementNS(self, name, qname):
    # we've buffered all the text data for this element, so validate it
    self.validate()
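
Since the listings above leave out a few pieces (push, log, and the default prevalidate and validate methods), here is a self-contained miniature of the same idea that can be run as is. It is a sketch under assumptions, not the validator's actual code: it targets current Python 3, the names MiniBase, MiniDispatcher, start_child, and check are invented, the error messages are plain strings, and popping the stack in endElementNS is my guess at the part not shown.

```python
from io import BytesIO
from xml.sax import make_parser, handler
from xml.sax.xmlreader import InputSource

class MiniBase:
    """Stand-in for validatorBase: second-level dispatch via do_ methods."""

    def __init__(self):
        self.value = ""
        self.children = []

    def start_child(self, name, attrs):
        # look up do_<name>; fall back to a handler that accepts anything
        method = getattr(self, "do_" + name, None)
        handlers = method() if method else [MiniBase()]
        if not isinstance(handlers, (list, tuple)):
            handlers = [handlers]
        for h in handlers:
            # contextual information for each new handler
            h.parent, h.name, h.attrs = self, name, attrs
            h.dispatcher = self.dispatcher
            h.prevalidate()
        self.children.append(name)
        return list(handlers)

    def prevalidate(self):
        pass

    def validate(self):
        pass

    def log(self, message):
        self.dispatcher.errors.append(message)

class MiniDispatcher(handler.ContentHandler):
    """Stand-in for SAXDispatcher: owns the stack and the error list."""

    def __init__(self, root):
        handler.ContentHandler.__init__(self)
        self.errors = []
        root.dispatcher = self
        self.handler_stack = [[root]]

    def startElementNS(self, name, qname, attrs):
        uri, localname = name
        frame = []
        for h in self.handler_stack[-1]:
            frame.extend(h.start_child(localname, attrs))
        self.handler_stack.append(frame)

    def characters(self, content):
        for h in self.handler_stack[-1]:
            h.value += content

    def endElementNS(self, name, qname):
        # the element is complete: validate it and drop its handlers
        for h in self.handler_stack.pop():
            h.validate()

class Root(MiniBase):
    def do_rss(self):
        return Rss()

class Rss(MiniBase):
    def validate(self):
        if "channel" not in self.children:
            self.log("rss element is missing a channel")
        if (None, "version") not in self.attrs.getNames():
            self.log("rss element is missing a version attribute")

def check(xml_bytes):
    dispatcher = MiniDispatcher(Root())
    parser = make_parser()
    parser.setFeature(handler.feature_namespaces, 1)
    parser.setContentHandler(dispatcher)
    source = InputSource()
    source.setByteStream(BytesIO(xml_bytes))
    parser.parse(source)
    return dispatcher.errors

print(check(b"<rss version='2.0'><channel/></rss>"))
print(check(b"<rss><title>no channel here</title></rss>"))
```

The first document produces an empty error list; the second logs both complaints from Rss.validate when the parser reaches the closing rss tag.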

Now we can start defining a hierarchy of handler objects to validate different parts of the RSS feed. Each handler needs a validate method to validate the element's data and a do_ method for each possible child element. For example, the root handler does no validation, but it knows about the rss element, which is the top-level element of most RSS feeds. (This code example is simplified; in reality we also need to handle an rdf element, which is the top-level element of RSS 0.9 and 1.0 feeds.)

class root(validatorBase):
  def do_rss(self):
    from rss import rss
    return rss()

The rss handler knows that every rss element needs a channel child element and a version attribute. It also has a do_channel method which dispatches the validation for the child channel element.

class rss(validatorBase):
  def validate(self):
    if "channel" not in self.children:
      self.log(MissingChannel({"element":self.name, "attr":"channel"}))
    if (None, 'version') not in self.attrs.getNames():
      self.log(MissingAttribute({"element":self.name, "attr":"version"}))

  def do_channel(self):
    from channel import channel
    return channel()

The channel handler knows that every channel needs a title, link, and description (and a few other rules), and it has do_ methods for each possible child element of channel: title, link, description, item, items, textInput and textinput (due to subtle differences in various RSS versions -- seven specs, no waiting), category, cloud, rating, ttl, docs, generator, pubDate, lastBuildDate, managingEditor, webMaster, language, copyright, skipHours, skipDays, and blink. (There is no blink tag in RSS, but there was some confusion about this, so the validator presents a specific error message for it.)

class channel(validatorBase):
  def validate(self):
    if "title" not in self.children:
      self.log(MissingTitle({"parent":self.name, "element":"title"}))
    if "link" not in self.children:
      self.log(MissingLink({"parent":self.name, "element":"link"}))
    if "description" not in self.children:
      self.log(MissingDescription({"parent":self.name,"element":"description"}))
    # several rules omitted here
...
  def do_title(self):
    return nonhtml(), noduplicates()

  def do_link(self):
    return rfc2396(), noduplicates()

  def do_description(self):
    return nonhtml(), noduplicates()
...
  # lots of other do_ methods omitted

As you can see, several of the do_ methods return more than one handler. A channel link must be an RFC-2396-compliant URI, and there can be only one link element per channel. Each of these rules is encoded in its own handler object:

import re

class rfc2396(validatorBase):
  rfc2396_re = re.compile("[a-zA-Z][0-9a-zA-Z+\\-\\.]*:(//)?" +
    "[0-9a-zA-Z;/?:@&=+$\\.\\-_!~*'()%,#]+$")
  def validate(self, errorClass=InvalidLink):
    if (not self.value) or (not self.rfc2396_re.match(self.value)):
      self.log(errorClass({"element":self.name, "value":self.value}))

class noduplicates(validatorBase):
  def prevalidate(self):
    if self.name in self.parent.children:
      self.log(DuplicateElement({"parent":self.parent.name, "element":self.name}))
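
To see what the rfc2396 pattern accepts and rejects, it can be tried on its own, outside the handler framework. The sketch below (current Python 3) copies the pattern into raw strings; the helper name looks_like_uri and the sample values are made up for this example.

```python
import re

# same pattern as the rfc2396 handler, written as raw strings
rfc2396_re = re.compile(r"[a-zA-Z][0-9a-zA-Z+\-\.]*:(//)?"
                        r"[0-9a-zA-Z;/?:@&=+$\.\-_!~*'()%,#]+$")

def looks_like_uri(value):
    """Mirror the handler's test: non-empty and matching the pattern."""
    return bool(value) and rfc2396_re.match(value) is not None

samples = [
    "http://example.com/feed",    # scheme, //, then allowed characters
    "mailto:editor@example.com",  # the // part is optional
    "example.com",                # no scheme, so no colon to match
    "http://example.com/a b",     # a space is not in the allowed set
    " http://example.com/",       # leading whitespace defeats match()
    "",                           # empty values fail before the regex runs
]
results = {s: looks_like_uri(s) for s in samples}
for sample, ok in results.items():
    print(repr(sample), ok)
```

Only the first two samples pass. Note that the handler validates the raw buffered text of the element, so stray whitespace around an otherwise valid URI is enough to fail it.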

When the channel.do_link method returns the rfc2396 and noduplicates handler objects, the secondary dispatcher in validatorBase.startElementNS pushes them onto the stack, where the main dispatcher in SAXDispatcher.startElementNS pops them off and calls each of them in turn. Both the rfc2396 instance and the noduplicates instance are set up with contextual information for the current link element; each performs its own validation and logs its own errors.

This all may seem like a lot of indirection -- and it is -- but it has several advantages:

  1. It's easy to add new functionality. Adding support for a new element requires writing a new handler object that inherits from validatorBase, then adding a single do_ method in the parent element's handler.
  2. It's easy to debug. None of the individual handler objects interact with each other. There are no side effects.
  3. It encourages code reuse. Several different elements in various levels of an RSS document have similar validation logic. For instance, docs, link, and the elements within the optional blogChannel module all need to be RFC-2396-compliant URIs.

With this framework in place, and an entire hierarchy of handler objects each doing their own little piece of validation, the main function to parse an RSS feed is mostly boilerplate:

def validate(aString):
  # boilerplate
  from xml.sax import make_parser, handler
  from xml.sax.xmlreader import InputSource
  from base import SAXDispatcher
  from exceptions import UnicodeError
  from cStringIO import StringIO
  source = InputSource()
  source.setByteStream(StringIO(aString))

  # create an instance of our top-level SAX dispatcher
  validator = SAXDispatcher()

  # boilerplate
  parser = make_parser()
  parser.setFeature(handler.feature_namespaces, 1)

  # set up our validator as the handler for all SAX events,
  # and start parsing
  parser.setContentHandler(validator)
  parser.setErrorHandler(validator)
  parser.setEntityResolver(validator)
  parser.parse(source)

  return validator

During the course of parsing, our SAXDispatcher instance accumulates errors and warnings through a centralized logging interface (not shown). Each error is stored as its own object, and we can access the list of errors and display them however we choose. The interactive web-based validator displays them in an HTML page; the (currently beta) SOAP interface uses the errors to construct a SOAP response. The downloadable command-line version just prints them to the screen.
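
The logging interface is not shown in this column, so the following is only a guess at its general shape, written for current Python 3. It assumes each error class carries a message template filled in from the parameter dictionary that the handlers pass to it. The class LoggedEvent, the template wording, and the functions as_text and as_html are all hypothetical; only the error class names and the dictionary keys come from the listings above.

```python
from html import escape

class LoggedEvent:
    """Hypothetical base class for anything the validator reports."""
    text = "%(element)s: unspecified problem"

    def __init__(self, params):
        self.params = params

    def message(self):
        # fill the template in from the handler-supplied dictionary
        return self.text % self.params

class MissingAttribute(LoggedEvent):
    text = "Missing %(attr)s attribute on %(element)s element"

class DuplicateElement(LoggedEvent):
    text = "%(parent)s contains more than one %(element)s"

def as_text(events):
    """Command-line style: one message per line."""
    return "\n".join(e.message() for e in events)

def as_html(events):
    """Web style: the same objects rendered as an HTML list."""
    items = "".join("<li>%s</li>" % escape(e.message()) for e in events)
    return "<ul>%s</ul>" % items

logged = [
    MissingAttribute({"element": "rss", "attr": "version"}),
    DuplicateElement({"parent": "channel", "element": "link"}),
]
print(as_text(logged))
print(as_html(logged))
```

Keeping each error as an object rather than a preformatted string is what lets the web page, the SOAP interface, and the command-line tool each present the same results in their own way.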

Next month: something other than RSS.