Parsing RSS At All Costs

by Mark Pilgrim
January 22, 2003


The Problem

As I said in last month's article, RSS is an XML-based format for syndicating news and news-like sites. XML was chosen, among other reasons, to make it easier to parse with off-the-shelf XML tools. Unfortunately, in the past few years, as RSS has gained popularity, the quality of RSS feeds has dropped. There are now dozens of versions of hundreds of tools producing RSS feeds. Many have bugs. Few build RSS feeds using XML libraries; most treat the feed as text, piecing it together with string concatenation, maybe (or maybe not) applying a few manually coded escaping rules, and hoping for the best.
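To make the difference concrete, here is a minimal sketch in modern Python 3. It is not the code of any particular publishing tool; it simply builds the same item twice, once by string concatenation and once with the standard library's xml.etree.ElementTree. The title text is an illustrative value.

```python
import xml.etree.ElementTree as ET

title = "Layoffs in BT & H"

# String concatenation: the ampersand lands in the feed unescaped,
# so the result is not well-formed XML.
by_hand = "<item><title>" + title + "</title></item>"

# An XML library escapes reserved characters while serializing.
item = ET.Element("item")
ET.SubElement(item, "title").text = title
by_library = ET.tostring(item, encoding="unicode")

print(by_hand)
print(by_library)
```

The second form costs a few more lines, but the escaping rules live in the library rather than in hand-written templates.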

On average, at any given time, about 10% of all RSS feeds are not well-formed XML. Some errors are systemic, due to bugs in publishing software. It took Movable Type a year to properly escape ampersands and entities, and most users are still using old versions or new versions with old buggy templates. Other errors are transient, due to rough edges in authored content that the publishing tools are unable or unwilling to fix on the fly. As I write this, the Scripting News site's RSS has an illegal high-bit character, a curly apostrophe. Probably just a cut-and-paste error -- I've done the same thing myself many times -- but I don't know of any publishing tool that corrects it on the fly, and that one bad character is enough to trip up any XML parser.

I just tested the 59 RSS feeds I subscribe to in my news aggregator; 5 were not well-formed XML. 2 of these were due to unescaped ampersands; 2 were illegal high-bit characters; and then there's The Register (RSS), which publishes a feed with such a wide variety of problems that it's typically well-formed only two days each month. (I actually tracked it for a month once to test this. 28 days off; 2 days on.) I also just tested the 100 most recently updated RSS feeds listed on blo.gs (a weblog tracking site); 14 were not well-formed XML.
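A check like the one described above can be sketched in a few lines of modern Python 3. This is only an assumption about how such a test might look, not the script used for the numbers in this article; the function name well_formedness_error and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(data):
    """Return None if data parses as XML, else the parser's error message."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        return str(err)
    return None


# Illustrative samples covering the error classes mentioned in the text.
samples = {
    "clean": b"<rss><channel><title>News &amp; notes</title></channel></rss>",
    "unescaped ampersand": b"<rss><channel><title>News & notes</title></channel></rss>",
    "undefined entity": b"<rss><channel><title>News &mdash; notes</title></channel></rss>",
    "high-bit byte": b"<rss><channel><title>Today\x92s news</title></channel></rss>",
}

for label, data in samples.items():
    print(label, "->", well_formedness_error(data) or "well-formed")
```

Fetching the feeds themselves is left out; the point is only that a conforming parser rejects each of these documents outright.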

Clearly, we need a backup plan.

The Heretical Solution

There is a social solution to this problem: register at Syndic8.com to be a "fixer", and volunteer your time contacting the authors of individual sites to get them to fix their feeds. There is also a technical solution to this problem: don't use an XML parser.

I know, I know, this is heresy. The point of XML is that content producers are supposed to put up with the pain of XML formatting rules so that content consumers can do cool things with off-the-shelf tools. Well, guess what? It's not happening. Judging by the sad state of affairs in the RSS world, content producers are either ignorant of the error of their ways, or too lazy to fix the errors, or too busy, or locked into inflexible tools whose vendors are too busy... Whatever the reasons, content consumers are rarely in a position to solve the problem. So we must work around it. We need a parse-at-all-costs RSS parser.

I know, I know, this is how HTML got to be "tag soup": browsers that never complained. Now the same thing is happening in the RSS world because the same social dynamics apply. End users who can't even spell "XML" certainly don't care about silly little formatting rules; they just want to follow their favorite sites in their news aggregator. When 10% of the world's RSS feeds are not well-formed -- including some high-profile feeds that thousands of people want to read -- the ability to parse ill-formed feeds becomes a competitive advantage. (And if you think the same thing won't happen when RDF and the Semantic Web go mainstream, you're deluding yourself. The same social dynamics apply. Boy, is that going to be messy.)

So most desktop news aggregators are now incorporating parse-at-all-costs RSS parsers which they use when XML parsing fails. However, since no one likes tag soup, they are also implementing subtle visual clues, such as smiley and frown icons, to indicate feed quality. Click on the frown face, and the end user can learn that this RSS feed is not well-formed XML. But the program still displays the content of the feed, as best it can, using a parse-at-all-costs parser. Those who care about quality and are motivated to do something about it can contact the publisher. But everyone else can follow their favorite sites, even if the feeds are broken.
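The strategy described above (strict parser first, lenient fallback second, plus a quality flag for the user interface) might be sketched like this in modern Python 3. The names read_titles and TITLE are hypothetical, the fallback here is a deliberately crude regular expression, and no real aggregator is claimed to work this way.

```python
import re
import xml.etree.ElementTree as ET

# Crude fallback pattern: grab whatever sits between <title> and </title>.
TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def read_titles(data):
    """Return (titles, well_formed) for a feed given as text."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # The feed is not well-formed: scrape what we can and say so.
        return [t.strip() for t in TITLE.findall(data)], False
    return [(el.text or "").strip() for el in root.iter("title")], True


titles, well_formed = read_titles(
    "<rss><channel><title>BT & H</title></channel></rss>"
)
# A smiley or frown stands in for the aggregator's quality icon.
print(":-)" if well_formed else ":-(", titles)
```

The boolean is what lets the program show the content and still tell interested users that the feed is broken.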

The Heretical Code

So how do you build a parse-at-all-costs RSS parser? With regular expressions, of course. Regular expressions are the messy solution to all of life's messy problems. Want to parse invalid HTML and XML? Regular expressions. Want to parse invalid RDF? Regular expressions. And may God have mercy on your soul.

Actually, Python has a secret weapon against poor markup: a little-known standard library called sgmllib. I've written extensively about sgmllib elsewhere for HTML processing, but it's also useful for processing invalid XML.

sgmllib is based on regular expressions under the covers, but you don't need to deal with them directly. It works much like a SAX parser for XML documents. In fact, you can think of it as a SAX parser that doesn't care about details like unescaped ampersands or undefined entities. The sgmllib.SGMLParser class iterates through a document, and you can subclass it to provide element-specific processing. For example, here is an invalid XML document (due to both the undefined entity "&mdash;" and the unescaped ampersand):

<rss>
  <channel>
    <title>My weblog &mdash; tech news & other stuff</title>
  </channel>
</rss>

Here is how sgmllib.SGMLParser would handle it:

  1. Call start_rss([]). The empty list indicates no attributes for this tag. If I wanted to do something special when I encountered the beginning rss tag, I would define the start_rss method in my sgmllib.SGMLParser descendant. (If start_rss hasn't been defined, SGMLParser will fall back to calling unknown_starttag('rss', []) instead. This also applies to all subsequent examples.)
  2. start_channel([])
  3. start_title([])
  4. handle_data('My weblog ')
  5. handle_entityref('mdash')
  6. handle_data(' tech news ')
  7. handle_data('&')
  8. handle_data(' other stuff')
  9. end_title()
  10. end_channel()
  11. end_rss()

Note that both steps 5 and 7 will choke any compliant XML parser, but sgmllib just says, "Unknown entity? Here, you deal with it. Unescaped ampersand? Must be plain text."
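The event sequence above can be reproduced with a small regular-expression tokenizer. Note the assumptions: sgmllib was removed from the standard library in Python 3, so this sketch does not use it; TOKEN and trace_events are hypothetical names, and the tokenizer is a simplified stand-in that mimics the callbacks, not sgmllib's actual implementation.

```python
import re

# One token per match: a tag, an entity reference, or a run of text.
# A '<' or '&' that starts neither is treated as one character of text.
TOKEN = re.compile(
    r"<(?P<close>/?)(?P<tag>[A-Za-z][\w:.-]*)[^<>]*>"
    r"|&(?P<entity>[A-Za-z][\w.-]*);"
    r"|(?P<text>[^<&]+|[<&])"
)


def trace_events(document):
    """Return the SAX-like events a lenient parser would fire."""
    events = []
    for match in TOKEN.finditer(document):
        if match.group("tag"):
            kind = "end" if match.group("close") else "start"
            events.append((kind, match.group("tag")))
        elif match.group("entity"):
            events.append(("entityref", match.group("entity")))
        elif match.group("text").strip():
            # Skip whitespace-only text between tags.
            events.append(("data", match.group("text")))
    return events


sample = ("<rss><channel><title>My weblog &mdash; tech news & other stuff"
          "</title></channel></rss>")
for event in trace_events(sample):
    print(event)
```

The undefined entity arrives as its own event and the bare ampersand arrives as ordinary text, which is exactly the leniency a strict XML parser refuses to offer.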

Given this new-found freedom, we can use sgmllib to build a parse-at-all-costs RSS parser. We'll start by subclassing sgmllib.SGMLParser and defining our own methods to keep track of RSS data as we find it. We'll need start_item and end_item methods in order to keep track of whether we're within an RSS item. We'll use a currentTag variable to keep track of the most recent start tag; a currentValue variable which buffers all the text data we find until we hit the end tag (as shown in steps 4-8 of the example above, the text data may be split across several method calls); and a list of dictionaries to hold all of our parsed data.

import sgmllib

class ParseAtAllCostsParser(sgmllib.SGMLParser):
    def reset(self):
        self.items = []
        self.currentTag = None
        self.currentValue = ''
        self.initem = 0
        sgmllib.SGMLParser.reset(self)

    def start_item(self, attrs):
        # set a flag that we're within an RSS item now
        self.items.append({})
        self.initem = 1

    def end_item(self):
        # OK, we're out of the RSS item
        self.initem = 0

Now add the unknown_starttag and unknown_endtag methods, which handle the start and end of each element inside an item:

    def unknown_starttag(self, tag, attrs):
        self.currentTag = tag

    def unknown_endtag(self, tag):
        # if we're within an RSS item, save the data we've buffered
        if self.initem:
            # decode entities and strip whitespace
            self.currentValue = decodeEntities(self.currentValue.strip())
            self.items[-1][self.currentTag] = self.currentValue
        self.currentValue = ''

As you can see, once we find the end tag, we take all the buffered text data from within this element (self.currentValue), decode the XML entities manually (since sgmllib will not do this for us), strip whitespace, and stash it in our self.items list. So this requires several things: a decodeEntities function and the appropriate handler methods for buffering the text data in the first place.

Decoding XML entities is easy; there are only five of them:

def decodeEntities(data):
    # in case our document *was* encoded correctly, we'll
    # need to decode the XML entities manually; sgmllib
    # will not do it for us
    data = data.replace('&lt;', '<')
    data = data.replace('&gt;', '>')
    data = data.replace('&quot;', '"')
    data = data.replace('&apos;', "'")
    # decode &amp; last, so that '&amp;lt;' becomes '&lt;' and not '<'
    data = data.replace('&amp;', '&')
    return data
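The order of the replacements matters: the ampersand has to be decoded last. The following sketch (Python 3; the function names and the sample string are made up for illustration) runs the same five replacements in both orders to show what goes wrong otherwise.

```python
def decode_entities(data):
    # Same replacements as decodeEntities above, with &amp; handled last.
    for entity, char in (("&lt;", "<"), ("&gt;", ">"), ("&quot;", '"'),
                         ("&apos;", "'"), ("&amp;", "&")):
        data = data.replace(entity, char)
    return data


def decode_entities_amp_first(data):
    # Deliberately wrong order, for comparison only.
    for entity, char in (("&amp;", "&"), ("&lt;", "<"), ("&gt;", ">"),
                         ("&quot;", '"'), ("&apos;", "'")):
        data = data.replace(entity, char)
    return data


escaped = "Use &amp;lt;b&amp;gt; for bold"
print(decode_entities(escaped))            # Use &lt;b&gt; for bold
print(decode_entities_amp_first(escaped))  # Use <b> for bold
```

Decoding the ampersand first turns text that was escaped twice into live markup, which is one decoding step too many.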

Handling the text data that sgmllib.SGMLParser throws down on us (including any entities within the text) is equally easy:

    def handle_data(self, data):
        # buffer all text data
        self.currentValue += data

    def handle_entityref(self, data):
        # buffer all entities
        self.currentValue += '&' + data
    handle_charref = handle_entityref

The final result is that we can feed an invalid RSS document into this parser, and it will parse out any and all item-level elements, well-formed or not.

if __name__ == '__main__':
    p = ParseAtAllCostsParser()
    p.feed(file('invalid.xml').read())
    for rssitem in p.items:
        print 'title:', rssitem.get('title')
        print 'description:', rssitem.get('description')
        print 'link:', rssitem.get('link')
        print

Running this script on this non-well-formed RSS document will produce these results:

title: Layoffs in BT & H
description: BT & H has laid off more people as the recession only gets worse. Note the ampersands in both title and description.
link: http://example.com/news/3

title: Mozilla Project Hurt by Apple's Decision to use KH
description: It's generally best to read Slashdot at a +3 comments threshhold. Note undefined entities in the link (due to unescaped ampersands).
link: http://developers.slashdot.org/article.pl?sid=03/01/14/1514205&tid=154&threshold=3
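For readers on Python 3, where sgmllib is no longer part of the standard library, the same idea can be sketched without it. Everything here is an assumption made for illustration: TOKEN, ENTITIES, and parse_items are hypothetical names, the sample feed is invented, and the regular-expression tokenizer is a simplified stand-in rather than a port of the class above.

```python
import re

# One token per match: a tag, an entity or character reference, or text.
# A '<' or '&' that starts neither a tag nor a reference is kept as text.
TOKEN = re.compile(
    r"<(?P<close>/?)(?P<tag>[A-Za-z][\w:.-]*)[^<>]*>"
    r"|&(?P<entity>#?\w+);"
    r"|(?P<text>[^<&]+|[<&])"
)

# The five predefined XML entities.
ENTITIES = {"lt": "<", "gt": ">", "quot": '"', "apos": "'", "amp": "&"}


def parse_items(document):
    """Return one dictionary per <item>, keyed by child element name."""
    items, in_item, current_tag, buffered = [], False, None, []
    for match in TOKEN.finditer(document):
        tag = match.group("tag")
        if tag and not match.group("close"):
            # Start tag: open a new item if needed, then start buffering.
            if tag.lower() == "item":
                items.append({})
                in_item = True
            current_tag, buffered = tag.lower(), []
        elif tag:
            # End tag: save the buffered text of an element inside an item.
            if tag.lower() == "item":
                in_item = False
            elif in_item and tag.lower() == current_tag:
                items[-1][current_tag] = "".join(buffered).strip()
            buffered = []
        elif match.group("entity"):
            # Decode predefined entities; pass unknown ones through as-is.
            name = match.group("entity")
            buffered.append(ENTITIES.get(name, "&" + name + ";"))
        else:
            buffered.append(match.group("text"))
    return items


feed = """<rss><channel><title>Sample feed</title>
<item>
  <title>Layoffs in BT & H</title>
  <link>http://example.com/news/3</link>
</item>
<item>
  <title>Q&amp;A &mdash; live</title>
  <link>http://example.com/article.pl?sid=1&tid=154&threshold=3</link>
</item>
</channel></rss>"""

for item in parse_items(feed):
    print("title:", item.get("title"))
    print("link:", item.get("link"))
    print()
```

Because the tokenizer only treats an ampersand as a reference when a name and a semicolon follow it, unescaped ampersands in titles and query strings pass through as plain text.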

This simple script will not handle many of the advanced features of XML, including namespaces. That may not be a problem; after all, it's just a fallback, right? Hopefully we're trying to use a real XML parser first and only falling back on this messy regular expressions-based sgmllib parser when that fails. However, in flagrant abuse of all things pure and sacred, I have managed to extend this script into a full-fledged parse-at-all-costs RSS parser that supports all the advanced features of RSS, including namespaces. It even handles exotic variations of RSS 0.90 and 1.0, where everything is explicitly placed in a namespace (even the basic title, link, and description tags). I don't recommend it, but it works for me.
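One way a lenient parser could cope with namespaced core elements is to fold prefixed tag names back to their plain names when the prefix is bound to an RSS namespace. The sketch below is only an illustration of that idea in Python 3, not the extended parser mentioned above; normalize_tag, RSS_NAMESPACES, and the prefix table are hypothetical, and collecting the xmlns declarations from the document is left out.

```python
# Namespace URIs used by RSS 1.0 and RSS 0.90 for their core elements.
RSS_NAMESPACES = {
    "http://purl.org/rss/1.0/",
    "http://my.netscape.com/rdf/simple/0.9/",
}


def normalize_tag(tag, prefixes):
    """Map a prefixed tag to its plain name when it belongs to core RSS.

    prefixes maps declared prefixes to namespace URIs, for example as
    gathered from xmlns:* attributes on the root element.
    """
    prefix, _, local = tag.rpartition(":")
    if prefix and prefixes.get(prefix) not in RSS_NAMESPACES:
        # Some other vocabulary (or an undeclared prefix): keep it distinct.
        return tag
    return local


# Hypothetical prefix table for a feed that prefixes everything.
prefixes = {
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
for name in ("title", "rss:title", "dc:title"):
    print(name, "->", normalize_tag(name, prefixes))
```

With this in place, rss:title and title end up under the same key, while dc:title stays separate.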

In next month's column I'll examine some other RSS validity issues. Valid RSS is more than just well-formed XML. Just because there's no DTD or schema doesn't mean it can't be validated in other ways. We'll discuss the inner workings of one such RSS validator. And then we'll move on to something non-RSS-related. I promise.

  • Haphazard parsing tools far worse than broken RSS feeds
    2003-02-28 18:13:26 Dudley Carr [Reply]

    I completely agree with Mark that there is no need to sit around and despair about the current state of RSS feeds. I think that the notification of faulty RSS feeds is also a decent idea, but that's only part of the solution.


    What I do have a serious problem with is people hacking together custom solutions to parse ill-formatted RSS.


    As everyone knows, once a piece of software becomes widely used, it is extremely difficult to displace it with something better. On the other hand, an RSS feed is just a piece of data that changes on a daily basis and can easily be fixed if broken.


    Solution: Everyone should use proper XML tools to parse RSS. However, until the current state of RSS feeds improves, use a tool such as HTMLTidy or something similar but more geared towards RSS. Then, when that magical day comes when most feeds are in much better condition, you can just turn off the tool for correcting crappy RSS.


    The benefits are several-fold:
    1) A common tool for correcting bad RSS such as HTMLTidy did for HTML.
    2) Eliminate the arms race for the best Regex RSS parser.
    3) Normal people can keep on working under the guise of working with proper XML.
    4) Partly discourage people from producing crappy RSS just b/c they’ve given up on people parsing RSS using proper XML parsers.


  • Parse-At-Many-Costs with PHP
    2003-02-09 17:05:33 Lex Friedman [Reply]

    I've written an aggregator that attempts to parse at all costs, even if it's very, very bad RSS.


    http://thefriedmans.net/lexfeed/


    I encourage you to try it out, find what breaks it, and let me know. Since I check feeds at work, at home, at friends' homes, and at public machines, I want to keep track of my subscriptions everywhere, easily. LexFeed lets me do just that -- and is quite forgiving of lousy RSS. I think.

  • XML must be well formed
    2003-01-30 11:09:20 Victor Lindesay [Reply]

    Sorry Mark, I have to go with Dare on this one.


    That XML is well formed is a fundamental principle and a contract to which all XML software applications must adhere.


    An XML document (well formed by definition) is a thing of value, easily manipulated and used. An attempt at an XML document (not well formed) is just text with funny brackets and is well, basically useless.


    Sure we have all coded without escaping reserved characters in text nodes - it's easy to overlook and that's part of learning about this new technology. But handling reserved chars in XML processing is basic good practice and if you can't do it in 'production' code - then back to school.


    RSS applications and the unreliability of RSS data are the product of where we are with web services now. These are problems that perhaps licensing and service level agreements will solve.


    And anyway, if the client (and client software) is disappointed that a particular RSS feed is bad, then how do you think the provider of the RSS feels, knowing that his data is unusable? The client can always move on to another feed that works.


    Thanks for your article, Mark. As usual excellently presented and coded!

    • Content monopolies
      2003-01-30 11:39:36 Mark Pilgrim [Reply]

      Re: "The client can always move on to another feed that works."


      That's just it: they can't. Each publisher has a natural monopoly on their own content. If I want to read The Register in my news aggregator, there's only one legitimate source.

  • Tag soup and the TagSoup parser
    2003-01-24 09:19:57 John Cowan [Reply]

    Allow me to toot my own horn here a bit: I have developed a SAX2 parser in Java that is designed for on-the-fly parsing of tag soup; it's called TagSoup, naturally (http://www.ccil.org/~cowan/XML/tagsoup).
    As delivered, its tables are designed for ugly HTML, but it wouldn't be hard to create tables for sucky RSS as well. Anyone want to cooperate with me on this?

  • Solution.RSS
    2003-01-23 17:10:39 Dean Goodmanson [Reply]

    ...and I thought the RDF/RSS tif's were rough!


    So until the world agrees to purify their RSS feeds in XML, why can't the community acknowledge the feed as simply RSS, not XML? Don't label it .XML, but .RSS:


    http://www.diveintomark.org/xml/rss.xml


    to..


    http://www.diveintomark.org/xml/rss.rss


    But then there's that whole silly


    <?xml version="1.0" encoding="utf-8" ?>


    issue...


    <?xml version="1.0" encoding="utf-8" compliance="Willy-Nilly" ?>


    ;-)


    Consumer level applications should be able to return as much info as possible. (e.g. Browsers and RSS readers.) I applaud your article.


    Industrial level applications should require strict adherence to standards. (e.g. Software compilers, b2b, ...)


    ..and where the lines cross: make it configurable!


    Not sure how the above fits into that MS guy's argument, but it seems a bit apples and oranges to me.


    Keep the sucrose a flowin...

    • Solution.RSS
      2003-01-23 17:16:45 Dean Goodmanson [Reply]

      Consumer, Industry
      Human, Machine


      Hmm... http://www.xml.com/pub/a/2003/01/15/creative.html




  • One-Off Parse-at-All-Cost
    2003-01-23 08:56:22 Don Park [Reply]

    [ this is a copy of my blog post at http://www.docuverse.com/blog/donpark ]


    Mark Pilgrim raises the inevitable question about ill-formed RSS and how to deal with it. Mark offers parse-at-all-cost as a solution. I think this problem can be solved completely if:


    1. RSS feed proxy services with 'tidy' (parse-at-all-cost) and occasional validation services become commonplace, allowing either the feed producer or the consumer to deal with ill-formed RSS.

    2. Encourage development and use of RSS/XML writer libraries instead of writing out tags and contents directly.

  • RDF parsers
    2003-01-23 02:58:29 Danny Ayers [Reply]

    Hi Mark, nice piece.


    An alternative to using a 'flat' XML parser with RSS 1.0 is to use an RDF parser such as ARP (Java) or the one which comes with cwm (Python). Not a great deal of use for feeds from the wild like those you discuss (if it's invalid XML/RSS then it'll fall at the first hurdle), but for known sources this should be a preferred option.

  • auto email service?
    2003-01-22 21:00:01 Matthew Haughey [Reply]

    I have used arsdigita's free uptime service for 6 years now (now at http://uptime.openacs.org/uptime/), and since my MetaFilter feed relies on a community of users inputting all sorts of junk, the feed is frequently broken by a curly quote, emdash, and/or a stray umlaut. My email is listed on the feed, so I get occasional messages from users complaining that it is broken.


    While I look for a perfect search and replace high ascii-to-entity encoder, it'd be great if maybe part of the RSS validator allowed me to sign up for RSS monitoring, checking maybe a couple times a day for errors and emailing me if it didn't pass muster with the validator.

    • Re: auto email service?
      2003-01-22 21:09:33 Mark Pilgrim [Reply]

      Good idea. You are free to start such a service. The RSS validator is open source.


      http://feeds.archive.org/validator/about.html#opensource

      • Re: auto email service?
        2003-01-22 22:21:25 Matthew Haughey [Reply]

        Is there an api to the existing service? I'd be open to building a quick site to check urls being watched, but ideally I'd like to send a URL in a soap or xml-rpc packet and just wait for a response back to trigger emails if an error code is reported or do nothing if it works fine.

  • This Article is Quite Uplifting
    2003-01-22 18:43:41 Aaron Swartz [Reply]

    I've been saying for years that XML is a failure; it's great to see XML.com saying it.


    Perhaps now we can move past this inelegant, timesucking, breakage-causing, useless rathole we call XML and onto solving real problems.

    • This Article is Quite Uplifting
      2003-01-23 08:52:22 Alan Kotok [Reply]

      Aaron, et al:


      About two years ago, I started a little daily news wire for our Web site and began capturing the content in an XML database, using the News Industry Text Format. Later, we extracted the content into a weekly newsletter, and began syndicating the content with RSS 1.0. We now have the daily news wire, two editions of the newsletter, two versions of the syndicated content, and a daily edition formatted for handheld devices.


      All of this publishing activity is made possible by using XML (which, by the way, we test for well-formedness each day) that we do on a shoestring. Without XML and XSLT stylesheets we could not even dream of providing this service to our community. If that's not 'solving real problems' to use your words, I don't know what is.


      Alan Kotok
      Editor, E-Business*Standards*Today
      http://www.disa.org/dailywire/


  • This Article is Quite Depressing
    2003-01-22 17:45:45 Dare Obasanjo [Reply]

    It is unfortunate to see XML.com running an article that endorses bringing the haphazard world of Tag Soup from HTML into the world of XML. The primary benefits of XML are its widespread, CONSISTENT usage which allows for the availability of several off-the-shelf tools and reduces vendor lock-in.


    Encouraging consumers of XML to support ill-formed XML reduces the power of XML and induces fragmentation. If we arbitrarily pick bits and pieces of a standard to support then we cheapen the technology and reduce it to worthlessness.


    I'd hate to see XML on the Web reduced to HTML during the browser wars with people simply checking if "it works well with Mark Pilgrim's program" or creating ill-formed markup simply to satisfy broken tools.


    • This Article is Quite Depressing
      2003-02-24 07:44:51 Frank Wilhoit [Reply]

      What we need to face up to is the fact that the XML community has broken in half without quite realizing it, and the two halves are talking past each other in perfect incomprehension and increasing frustration.


      One camp says that XML is about formal syntax; the other says it is about informal semantics. Both are right, because XML can do both; they are talking about disjoint applications. It is not necessary for one side to "win", but for both sides to realize that XML is a sufficiently protean technology to do things that its originators did not foresee.


    • Agreed. So what's your solution?
      2003-01-22 18:36:10 Mark Pilgrim [Reply]

      The tone of the article is based on the demonstrated realities of the RSS world, which I agree is depressing. Are you proposing a solution (other than the two I proposed)? Or are you just idly wishing that life was easier for developers?

      • Agreed. So what's your solution?
        2003-01-30 09:00:19 Jon Wickström [Reply]

        In this case with RSS, I believe the providers of the feed should care enough for it to check that it is not broken. And if the document is broken, how much should you fix it? There might be bits and pieces missing or completely wrong. If the document is silently fixed, how are you to know what you are missing?


        If the document has two root elements, which one would you choose? Both? Should open tags be closed? Maybe the content of the document still is broken?


        On the other hand, it would be very convenient in an RSS client when an invalid document is encountered to have a pop-up asking "Fix broken document? Yes/No". But I think the key point is to inform the user that the document is broken!
        And if the RSS feed is fed into something else, a notation that the document has been modified must be included...


        Should this be seen in a bigger context? Should all XML documents be fixed by the parser? Only for well-formedness, or also if not valid?


        From a programmer's standpoint it is very nice to know that you can (and should?) throw away a broken document because parsing it otherwise probably would propagate errors.

      • Agreed. So what's your solution?
        2003-01-22 20:27:27 Dare Obasanjo [Reply]

        It depends on what you consider to be the problem. From my perspective, the problem is websites that provide non-standards compliant XML in their RSS feeds while from yours it is consuming this XML even if it does not comply with the W3C XML 1.0 recommendation.


        The solutions from my point of view would rely on pressuring sites and tools that produce invalid RSS feeds to correct them and creating tools like the RSS validator produced by yourself and Sam Ruby (which is an excellent contribution to the community).


        The temporary benefit of being able to read ill-formed RSS feeds is outweighed by the harm caused to XML and the Web by fostering the idea that it is OK to produce and consume XML that does not conform to W3C standards. XML has been successful thus far because of the fairly strict adherence to standards by vendors, producers and consumers of XML documents. It is unfortunate that your article is attempting to undermine this even though your intentions are good.

        • Benefits and harms are not evenly distributed
          2003-01-22 21:24:36 Mark Pilgrim [Reply]

          re: "The temporary benefit of being able to read ill-formed RSS feeds is outweighed by the harm caused to XML and the Web"


          The problem is that the benefit is accrued by the software vendor, and is direct and immediate, but the harm is caused to everyone equally, and is long-term and abstract. Direct and immediate wins every time.

          • Benefits and harms are not evenly distributed
            2003-01-23 08:03:10 bryan rasmussen [Reply]

            "Direct and immediate wins every time" reminds me of Hardin's arguments vis-a-vis the commons, which have since come under some controversy.
            It is in the main a philosophical argument, but as such I can not see how it is a sensible one.


            You say the direct and immediate wins every time, implying that newsreaders will have to parse everything that proclaims itself RSS whether it is or not because of business pressures to do so. But if a public newsreader did not parse the RSS instead returning a broken message to the clients of said feed then would this not create direct and immediate pressures on feed authors and sites to produce valid xml, and would this not spur product sales for RSS producers that produced valid RSS?


            Part of the reason for xml (which after all is a simpler set of rules than most other languages) that is not well-formed with RSS is of course that RSS (2.0 and pre 1.0) allows escaped html inside of the description element, a practice I believe much more likely to cause broken feeds. As I've harped on before this hampers the transportability of feeds across media, to for example a non-html email newsletter format, various phone media, or even specific browsers.


            It seems to me that a vendor that produced both a RSS producer and consumer that could be relied on to produce only well-formed feeds could derive direct and immediate benefits against other vendors, because of reuse of xml in other media.

            • Benefits and harms are not evenly distributed
              2003-01-23 11:40:03 Aaron Swartz [Reply]

              You write: "But if a public newsreader did not parse the RSS instead returning a broken message to the clients of said feed then would this not create direct and immediate pressures on feed authors and sites to produce valid xml"


              No, it would not. The person who puts out the feed rarely reads it.

            • End-user perspective
              2003-01-23 08:37:02 Mark Pilgrim [Reply]

              > "implying that newsreaders will have to parse everything that proclaims itself RSS whether it is or not because of business pressures to do so."


              Exactly.


              > "But if a public newsreader did not parse the RSS instead returning a broken message to the clients of said feed then would this not create direct and immediate pressures on feed authors and sites to produce valid xml"


              No. You are punishing the wrong people. You are still operating under the mistaken impression that XML, in and of itself, is important. It is not. It is a means to an end. End users don't care. And they shouldn't have to care.


              Look, I was in this position: I tried several first-generation news aggregators that only used real XML parsers. Feeds would go unreadable for days at a time, and by the time they came back I had missed dozens of articles. I tried to switch to another aggregator that could allow me to follow the sites I wanted to follow, but none satisfied me, so I ended up writing the parse-at-all-costs RSS parser and building a homegrown aggregator around it for my own use.


              And I'm *technically inclined*. I *care* about XML. Imagine the reaction of an end user who isn't, and doesn't. They bought (downloaded/whatever) a program that purports to help them read all the news and follow all the sites that they care about. They like this idea. Then they find out that sometimes it doesn't work, sometimes sites that worked yesterday don't work today, and some sites don't work at all, because of something called "XML". They don't know from XML, they've never seen XML, they don't care about XML, but this stupid POS program is complaining and saying there's nothing it can do about this "XML" problem and suggesting, in its infinite wisdom, that the end user should take it upon themselves to work around this problem by sending an email to the site owner and waiting an indeterminate length of time before they can read the news they care about, if ever.


              You're kidding, right?


              Then the user hears about another aggregator, a direct competitor, which claims to be able to let them follow *all* the sites they care about. It doesn't complain; it doesn't whine; it doesn't suggest that they work around the developer's laziness by firing off emails to random people they've never met. It just works.


              Which would *you* choose?

              • End-user perspective
                2003-01-24 03:55:28 bryan rasmussen [Reply]

                >You're kidding, right?


                No, but that is because I'm not really viewing an aggregator as a tool in itself, I don't think aggregators have much of a business future. I think they're destined to become part of other products.
                >Then the user hears about another aggregator, a direct competitor, which claims to be able to let them follow *all* the sites they care about. It doesn't complain; it doesn't whine; it doesn't suggest that they work around the developer's laziness by firing off emails to random people they've never met. It just works.



                Again, I don't believe in aggregators as stand-alone tools, I believe that they will become part of more wide-ranging products.
                If such a product has to do with handling XML of widely different formats then it cannot devote development resources to handling stuff that thinks it's XML but really isn't.
                A product can provide add-ins to convert legacy formats, but I don't think badly formed RSS will qualify for such attentions.


                If such a product is the object then the well-formedness of the XML becomes integral to the product, development will have to provide ways to error report problems with individual XML instances, such as those originating from a feed.
                This is not developer laziness, but developer ambition.
                Error reporting to a user has always seemed to me to be an exercise in the art of communication. If a non-technical user receives the error message
                "XML error at

                hello world

                " then they might well be expected to say "This program sucks". If, on the other hand, they receive information like "Newsfeed at http://www.myinfo.com/newsfeed7 is not conforming to the technical standards for newsfeeds, if you would like to learn more click More Info", then I would expect the user to think something like "Frigging amateurs at www.myinfo.com". Despite not automatically fixing www.myinfo.com for the user, the program may still command market share if it does enough other things with various other XML technologies. This may cause you to think again that I'm kidding, but I'm not. I think a lot of these problems stem from the technical community's belief that the end user is an idiot. The end user may not understand XML or any other standard, but I have faith enough in the intelligence of people to understand a claim that such and such a thing does not conform to a standard.
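                The kind of message described above could be produced along these lines. This is only a rough sketch, not taken from any real aggregator: the function name `feed_problem_message` and the exact wording are assumptions made for illustration, and Python's standard `xml.etree.ElementTree` stands in for whatever XML parser a product would actually use.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def feed_problem_message(url, text):
    """Return None if text is well-formed XML, else a plain-language notice.

    Hypothetical helper: the wording is illustrative, not from a real product.
    """
    try:
        ET.fromstring(text)  # strict check: raises ParseError on ill-formed XML
        return None
    except ET.ParseError:
        # Name the site rather than dumping the raw parser error on the user.
        site = urlparse(url).netloc or url
        return (
            f"The newsfeed at {url} does not conform to the technical "
            f"standards for newsfeeds. The problem is on {site}'s side. "
            "Click More Info to learn more."
        )
```

                The raw parser error could still sit behind the More Info link for anyone who wants the details.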


                But I guess we can't agree on that matter.





              • Breaking Industry Standards A Competitive Advantage?
                2003-01-23 10:08:12 Dare Obasanjo [Reply]

                I've heard your arguments before from other people and don't agree with them. Thankfully, those of us who work on core XML technologies at Microsoft don't have this attitude towards XML and related standards simply because we want to gain "competitive advantage". If we did, many of the gains that XML brings to our users due to its reusability and ability to foster interoperability would be lost.


                Your article highlights a mini tragedy of the commons. If XML applications that process RSS documents begin to lean towards processing ill-formed XML, then when RSS files are reused, as many XML formats are wont to be (e.g. some mention using RSS for weblog archives, others have suggested using it as a general push technology), this sloppiness and lack of standards adherence will creep into these avenues as well.


                All in all it's interesting to read a column called Dive Into XML on a website called XML.com which encourages poisoning the XML in the name of "competitive advantage".

                • Robustness Principle
                  2003-01-23 10:40:11 Mark Pilgrim [Reply]

                  This has nothing to do with the tragedy of the commons (boy, there's an overused phrase). It has everything to do with the Robustness Principle that Postel nailed years ago in RFC 793: "TCP implementations will follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others." The same applies here: validators and programs that produce RSS should be as conservative as possible; end user tools that consume RSS should be as liberal as possible. They serve different masters.


                  I'm tired of arguing with you, Dare. Despite your misrepresentation, we can all see for ourselves that my article clearly demonstrates an actual problem, describes a workaround for consuming tools, and pushes for not one but two long-term social solutions (the centralized advocacy effort at Syndic8, and the decentralized solution of making non-well-formedness visible to the end user).


                  Meanwhile, it's ironic that you hold up Microsoft as the epitome of XML standards compliance. What short memories we have! Have a quick look back in the XML mailing list archives to see all the confusion their ultra-liberal MSXML parser caused with people who mistook it for an actual validating XML parser. ("Whatdya mean my XML's not well-formed? It looks fine when I open it in IE!") That was not the place to parse at all costs; this is.

                  • You Prove My Point
                    2003-01-23 11:11:58 Dare Obasanjo [Reply]

                    Actually a number of our customers regularly praise the standards compliance of MSXML.


                    Unfortunately, we also have customers who mistakenly assume that viewing XML in Internet Explorer causes it to be processed by the validating XML parser, when in fact it is processed by the parser that only checks well-formedness. This design decision was before my time but was most likely motivated by good intentions similar to yours about reducing user pain and ensuring that even invalid but well-formed XML was viewable in the browser. No one thought about what would happen downstream when people assumed that


                    viewable in IE == well-formed & validated XML


                    instead of just


                    viewable in IE == well-formed XML


                    Your attempted slur actually helps bolster my point as to why your article should not be encouraging supposedly "user-friendly" but standards-nonconformant behavior.

                    • xml dev posts?
                      2003-01-24 03:35:19 bryan rasmussen [Reply]

                      This reminds me of a post on xml-dev where some guy named Tim Bray talked about using MSXML to prove to people that their XML was not well-formed. It was an off-hand remark, but he said something along the lines that people usually grasped that their XML was not well-formed when he had them open it in IE and it told them there was a problem.


                      Of course I don't know if this Tim Bray character might be someone to listen to. Probably not, but still, just saying.

        • Agreed. So what's your solution?
          2003-01-22 20:47:33 Max Daymon [Reply]

          Build in functionality to report back to feeds providing garbage data. Make it easy to report to sites that their feeds are causing a problem.


          The path of silently dealing with garbage data leads to excessive amounts of development time being spent on a problem which should take virtually no time. Further, it reflects poorly on the aggregator when it does run into a feed it can't deal with. Instead of blaming the feed, users now blame the tool for not handling it.


          If I can't reasonably rely on RSS being well formed and complying with an industry standard specification, I'm more inclined to simply remove the functionality than to enter an endless back and forth battle of regular expressions and garbage data.


          Put a fence at the top of the cliff, not an ambulance force at the bottom. Tools which generate problems will eventually fall from favor. All things considered, 10% failure for such a technology seems promising. There was a time when it was hardly possible to find ANY well formed web pages.


          • Agreed. So what's your solution?
            2004-03-04 09:56:00 Richard Prosser [Reply]

            As an end user, I want a news aggregator that works for whatever feeds I refer to, thus I am very grateful for Mark's efforts.


            I understand the "well formed" arguments however, and the difficulties inherent in providing feedback. I suggest that we shame the poorly-behaving sites by publishing their URLs for all the world to see, then issuing a press release.


            How about naffrss.org?


          • Auto-reporting
            2003-01-22 21:07:36 Mark Pilgrim [Reply]

            Many feeds have no contact information, so this cannot be easily automated. Regardless, I believe efforts are underway to do exactly this (when possible) in the next release of Aggie. Users who care about such things can take the time to contact the content provider.


            However, this does not negate the fact that, as an end-user product, the #1 responsibility of the software is to the end user. The end user wishes to read news, and has downloaded, installed, and possibly paid for a program to help them read news. If the program refuses to display news for reasons that the end user considers arcane and trivial, the user will find another program that does not throw such technical hissy fits.

            • Do both
              2003-01-22 22:42:19 Chris Adams [Reply]

              Why not do both? If the XML validator fails, display an unobtrusive quality indicator like iCab (the smiling face in the throbber changes to a frown for malformed HTML), automatically send some sort of request to a tracking site and fall back to the error-prone all-costs parser.
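              A minimal sketch of that "do both" flow, under some loud assumptions: `parse_feed` is a hypothetical name, the regex fallback is a crude stand-in for a real parse-at-all-costs parser, only un-namespaced `<title>` elements are collected, and the call to a tracking site is left out entirely.

```python
import re
import xml.etree.ElementTree as ET


def parse_feed(text):
    """Return (titles, well_formed) for an RSS document given as a string.

    Hypothetical helper for illustration; assumes un-namespaced <title> tags.
    """
    try:
        # Strict path: a real XML parser, which rejects ill-formed input.
        root = ET.fromstring(text)
        titles = [t.text or "" for t in root.iter("title")]
        return titles, True
    except ET.ParseError:
        # Fallback path: a crude regex scan that ignores well-formedness.
        # Entities are NOT decoded here; a real liberal parser would do more.
        titles = re.findall(r"<title>(.*?)</title>", text, re.S)
        return titles, False


def quality_indicator(well_formed):
    """Unobtrusive status flag, in the spirit of iCab's smiling/frowning face."""
    return ":)" if well_formed else ":("
```

              The user gets their titles either way; the flag is what the interface, or a report to a tracking site, would act on.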


              The tracking site would be extremely valuable if it could track the buggy software instead of just individual sites. Feeding a crawler with, say, the weblogs.com feed would probably give a pretty accurate indicator of the relative quality of the RSS implementations. While the users may not care, the authors might be more motivated about getting unlisted from the hall of shame.