microdom: an XML DOM Designed For HTML

October 15, 2003

As the change from HTML to XML-based standards takes place, it is natural to start using XML tools and APIs for processing and generating HTML documents. Unfortunately, the change to newer standards is not without problems and requires additional work on the part of the software. Although XHTML is an XML vocabulary, millions of HTML web pages still exist that are not well-formed. When it comes to generating HTML, a small but still significant number of browsers have serious problems rendering HTML that is well-formed XML. This article introduces microdom, an XML DOM implementation written in Python that was designed to deal with HTML's legacy issues both when parsing and when generating documents.

microdom was originally designed as the underlying library for a Python web templating framework called Woven. Starting out as a generic implementation of a subset of the W3C XML DOM API for manipulating XML trees, microdom quickly grew to support HTML-specific features. In an ideal world, any XML library could be used for creating web pages, but in practice this doesn't work well. However, it is possible to create a library that supports and follows the XML standards, generating well-formed XML, while still being able to deal with the issues created by HTML. microdom is one such library; in the following example, we see how it can be used to generate a well-formed XML snippet:

  from twisted.web import microdom
  d = microdom.Element("div")
  d.appendChild(microdom.Text("This is an example."))
  d.appendChild(microdom.Element("br"))
  print d.toprettyxml()
  
  # output is:
  #  <div>This is an example.<br />
  #  </div>

What are the issues microdom has to deal with? The first is output. HTML was created years before XML, and as a result it has different rules for formatting and structure. For example, unlike XML, a <br> tag need not be closed by a matching </br> or be in the form <br/>. A single <br> with no matching closing tag is perfectly valid HTML. In fact, older browsers will not recognize the correct XML form <br/>. As a result, HTML documents that are well-formed XML may not render correctly in these browsers. For example, here is the output of minidom, the DOM implementation in Python's standard library, when creating a simple form:

  from xml.dom.minidom import Element
  d = Element("form")
  d.appendChild(Element("br"))
  d.appendChild(Element("textarea"))
  print d.toxml()

  # output is:
  #   <form><br/><textarea/></form>

In Netscape 4, the <br/> is not recognized as a <br> tag. Lynx parses the HTML in such a way that the <textarea> tag is considered to have never closed; the rest of the page is displayed inside the never-ending text area. Obviously, this is not acceptable: well-formedness should not come at the expense of correct rendering. With microdom, the output is still well-formed XML, but it is formatted in a way that allows older browsers to parse and render it correctly. Adding a space between the br and the /> to form <br /> lets the tag be recognized correctly, and the empty <textarea> is written as separate opening and closing tags:

  from twisted.web import microdom
  d = microdom.Element("form")
  d.appendChild(microdom.Element("br"))
  d.appendChild(microdom.Element("textarea"))
  print d.toxml()

  # output is:
  #   <form><br /><textarea></textarea></form>

microdom solves the issue of HTML output without compromising XML compatibility. Internally this is implemented by keeping a list of tags that should only be present in pairs and changing the output appropriately.
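
The idea of a tag list that steers the serializer can be sketched with the standard library alone. The following is only an illustration under stated assumptions, not microdom's actual code: the function name to_browser_safe_xml and the contents of PAIRED_TAGS are invented for this example, and it is written for Python 3 rather than the Python 2 used in the rest of this article.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape, quoteattr

# Assumed, incomplete list of tags that must always be written as a pair.
PAIRED_TAGS = {"a", "div", "p", "script", "span", "textarea", "title"}


def to_browser_safe_xml(node):
    """Serialize a minidom node as XML that legacy browsers can also parse."""
    if node.nodeType == node.TEXT_NODE:
        return escape(node.data)
    if node.nodeType != node.ELEMENT_NODE:
        # Comments, processing instructions, etc.: defer to minidom.
        return node.toxml()
    attrs = "".join(
        " %s=%s" % (name, quoteattr(value))
        for name, value in node.attributes.items()
    )
    if node.childNodes:
        body = "".join(to_browser_safe_xml(child) for child in node.childNodes)
        return "<%s%s>%s</%s>" % (node.tagName, attrs, body, node.tagName)
    if node.tagName in PAIRED_TAGS:
        # Never collapse these: <textarea/> confuses some browsers.
        return "<%s%s></%s>" % (node.tagName, attrs, node.tagName)
    # Other empty elements get a space before the slash, e.g. <br />.
    return "<%s%s />" % (node.tagName, attrs)


doc = minidom.parseString(
    '<form action="/send"><br/><textarea name="msg"/></form>'
)
print(to_browser_safe_xml(doc.documentElement))
```

The result is still well-formed XML, so any XML parser can read it back; only the spelling of empty elements changes.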

Having seen how microdom supports HTML output, the next issue to consider is input, that is, parsing HTML. Older HTML documents are typically not well-formed XML. In addition to certain tags not requiring a closing tag, HTML is also case-insensitive: <div> is treated the same as <DIV>. Because of these departures from the XML standards, the parsers used by most DOM implementations will choke on valid HTML. microdom, by default, parses in case-insensitive mode, allowing it to parse HTML that has non-matching cases on tags. In the following example, Python's DOM implementation throws an exception, but microdom happily parses the HTML fragment:

  >>> import xml.dom.minidom
  >>> from twisted.web import microdom
  >>> xml.dom.minidom.parseString("<div>hello</DIV>")
  Traceback (most recent call last):
  ...
  xml.parsers.expat.ExpatError: mismatched tag: line 1, column 12

  >>> microdom.parseString("<div>hello</DIV>")
  <twisted.web.microdom.Document object at 0x82f6794>
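
One common way to get case-insensitive matching is to fold tag names to lower case as they are read. The sketch below does not show how microdom does it; it uses Python 3's standard html.parser module, which reports tag names in lower case, and contrasts it with the strict XML parser. The names EventRecorder and strict_parse_fails are made up for this example.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError


class EventRecorder(HTMLParser):
    """Record parse events; HTMLParser reports tag names in lower case."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def strict_parse_fails(source):
    """Return True if the strict XML parser rejects the source."""
    try:
        minidom.parseString(source)
    except ExpatError:
        return True
    return False


recorder = EventRecorder()
recorder.feed("<div>hello</DIV>")
recorder.close()
print(strict_parse_fails("<div>hello</DIV>"))
print(recorder.events)
```

Because both the opening and the closing tag arrive as "div", code that builds a tree from these events never sees the mismatch that stops the XML parser.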

When it comes to real HTML in the wild, case-insensitivity is the tip of the iceberg. Non-closed tags require changes to the XML parser to support them, but even that isn't enough. Many, perhaps most, of the HTML documents found on the Web are horribly broken, even by HTML's less-than-stringent standards. The range of issues browsers have to deal with in order to render web pages is astoundingly large. A partial list of common problems includes:

Tags that are never closed, e.g. <h1><b>title</h1>
Tags whose nesting order is wrong, e.g. <b><i>hello</b></i>
Attributes that aren't surrounded by quotes, e.g. <div class=foo>
<script>s containing JavaScript with raw < and > characters.

To parse existing HTML documents, a parser must solve many of these issues (the problems that can be ignored are those related to display issues and, to some extent, validity). Faced with this unholy mess, it's tempting to give up. Nevertheless, the ability to parse broken HTML is worth attempting, as it has quite a few uses. A web spider that recursively traverses the Web, downloading pages as it goes, is one such use. This spider requires some sort of parsing; even more so if the spider wishes to extract some information from the page.
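
As a rough illustration of the spider case, the sketch below pulls links out of a page that shows several of the problems listed above. It relies on Python 3's standard html.parser and urllib.parse modules rather than microdom; the LinkExtractor class, the extract_links function, and the sample page are all assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Unclosed <b>, upper-case tags, an unquoted attribute, bad nesting.
page = (
    '<H1><b>Index</H1><A HREF=page2.html>next</a> '
    '<b><a href="/about">about</b></a>'
)
print(extract_links(page, "http://www.example.com/docs/index.html"))
```

A spider that only needs links can get by with events like these; extracting structured information from the page is where a repaired DOM tree becomes useful.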

Another possible use is a web mail program, an email client that uses a web interface. Many email messages these days include an HTML attachment or even an HTML body, and the program would want to display it. Embedding the HTML message as is, however, might involve embedding broken HTML in the page. Web applications certainly ought to output valid, well-formed HTML, so this is an interesting dilemma. On the one hand, displaying the attachment inline will output broken HTML. On the other hand, not displaying the messages deprives the application of useful functionality. If it were possible to parse the broken HTML attachment into a DOM tree, it would be possible to manipulate it and then output a well-formed version of the contents. This would allow the web mail program to display the message while still outputting well-formed HTML.

In order to meet this need, microdom's parser has a lenient mode that makes a best effort to parse invalid and broken HTML. Not all problems are dealt with correctly, but microdom is able to parse many HTML documents that standard parsers won't even look at. In the following example, microdom is used to parse a broken HTML document. The document exhibits a number of the issues mentioned above, including mismatched tags and non-quoted attributes. The DOM tree is then printed out, showing the resulting structure. The output is, of course, well-formed.

  >>> s = '''<html>hello<b><a href=
  ... http://www.example.com>Example!</b></a></HTML'''
  >>> d = microdom.parseString(s, beExtremelyLenient=1)
  >>> d
  <twisted.web.microdom.Document object at 0x82e8c1c>
  >>> print d.toprettyxml()
  <?xml version="1.0"?>

  <html>hello<b><a href="http://www.example.com">Example!</a></b>
  </html>
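
To give a feel for what a best-effort parser has to do, here is a small sketch of one repair strategy: keep a stack of open elements, close any still-open inner elements when an outer closing tag arrives, and ignore closing tags that match nothing. This is not microdom's algorithm, and it handles far fewer cases. It is written for Python 3 using the standard html.parser and xml.dom.minidom modules; LenientBuilder, repair, and the VOID_TAGS list are names invented for the example.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# Assumed, incomplete list of HTML tags that never take a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LenientBuilder(HTMLParser):
    """Build a minidom tree from tag soup, repairing nesting as it goes."""

    def __init__(self):
        super().__init__()
        self.doc = minidom.Document()
        # A fragment can hold text and several top-level elements.
        self.root = self.doc.createDocumentFragment()
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        element = self.doc.createElement(tag)
        for name, value in attrs:
            # Valueless attributes such as "checked" arrive as None.
            element.setAttribute(name, name if value is None else value)
        self.stack[-1].appendChild(element)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Find the nearest open element with this name (index 0 is the root).
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].tagName == tag:
                # Closing it also closes everything opened inside it.
                del self.stack[index:]
                return
        # No open element matches: a stray closing tag, silently ignored.

    def handle_data(self, data):
        self.stack[-1].appendChild(self.doc.createTextNode(data))


def repair(html):
    """Return a well-formed serialization of a broken HTML fragment."""
    builder = LenientBuilder()
    builder.feed(html)
    builder.close()
    return "".join(child.toxml() for child in builder.root.childNodes)


print(repair("<html>hello<b><a href=http://www.example.com>"
             "Example!</b></a></HTML>"))
```

Elements that are still open at the end of the input need no special handling, because they are already part of the tree. Serialization here is left to minidom, so empty elements come out in the short <br/> form.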

microdom is designed to work with HTML both when outputting and when parsing documents. Hopefully it will be a useful addition to the HTML processing toolbox. It can be obtained as part of the Twisted networking framework. Twisted uses microdom for Woven, a model-view templating toolkit for Twisted's web server. This article was based on the version of microdom that is included with Twisted 1.0.7.
