
Wrestling HTML

September 8, 2004

Uche Ogbuji

Lately I've seen HTML parsing problems everywhere. One project needed a web crawler with specialized features provided through Python code that processed arbitrary HTML. There have also been several threads on mailing lists I frequent (including XML-SIG) featuring discussions of mechanisms for dealing with broken HTML by converting it to decent XHTML. This article focuses on Python APIs for converting good or bad HTML to XML.

Based on glowing testimonials from others with HTML-parsing tasks, I looked first at BeautifulSoup; but based on the description of the project goals and an examination of the API, BeautifulSoup is clearly suited to extracting bits of data from HTML rather than converting it into XML. I did, however, work up Listing 1 as a simple test case of bad HTML, based on an example in the BeautifulSoup documentation.

Listing 1: An Example of Bad HTML

<body>
Go <a class="that" href="here.html"><i>here</i></a>
or <i>go <b><a href="index.html">Home</a>
<!--noncetag>spam</noncetag><!--eggs-->
</html>

Notice the broken comment in the file. I added it because I've seen HTML parsers tripped up by strange use of comments. <!--noncetag>spam</noncetag><!--eggs--> is a bad comment because, as parsed, its body contains a double hyphen ("--"), which is forbidden in SGML and XML comments.
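The rule being violated can be captured in a couple of lines. The following function is my own illustration, not part of any of the libraries discussed here; it checks whether a comment body is legal in XML.

```python
def is_legal_xml_comment(body):
    # XML (and SGML) forbid "--" inside a comment body, and the body
    # may not end with "-" (which would produce "--->").
    return '--' not in body and not body.endswith('-')

# As an HTML parser sees it, the comment in Listing 1 has the body
# "noncetag>spam</noncetag><!--eggs", which contains "--".
print(is_legal_xml_comment('noncetag>spam</noncetag><!--eggs'))  # False
print(is_legal_xml_comment('eggs'))                              # True
```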

uTidyLib

uTidyLib is a Python wrapper for the HTML Tidy Library Project (libtidy), an embeddable variation on Dave Raggett's HTML Tidy command-line program. Libtidy is written in C, and uTidyLib is a minimal, straightforward wrapper around it. I downloaded uTidylib-0.2.zip and installed it. It requires libtidy, so I downloaded and installed the source package tidy_src.tgz dated 11 August 2004. uTidyLib also uses ctypes, "a Python package to create and manipulate C data types in Python, and to call functions in dynamic link libraries/shared dlls," so I downloaded and installed ctypes-0.9.0.tar.gz. In all there were a lot of parts to find and set up, but the instructions were straightforward and I had no installation problems. To be sure it all worked, I ran the example from the uTidyLib home page.

Listing 2 is the first program I worked up for taking an input file name of bad HTML and converting the contents to XHTML.

Listing 2: uTidyLib Program to Convert HTML to XHTML

import tidy
import sys

def tidy2xhtml(instream, outstream):
    # Tidy options: produce indented XHTML with an XML declaration
    options = dict(output_xhtml=1,
                   add_xml_decl=1,
                   indent=1)
    tidied = tidy.parseString(instream.read(), **options)
    tidied.write(outstream)

doc = open(sys.argv[1])
tidy2xhtml(doc, sys.stdout)

I had to read the entire input file into a string because uTidyLib provides no interface for reading HTML source from a file-like object. The other available function, tidy.parse, takes a file name instead, which could be inconvenient for large source files. The options dictionary holds options for the underlying Tidy, which are listed in the HTML Tidy Quick Reference. With the dictionary constructor idiom, option names must be acceptable as Python identifiers, which in particular means converting hyphens to underscores: the Tidy option fix-bad-comments is specified as fix_bad_comments.
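If you prefer to keep the options in Tidy's own hyphenated spelling (as the Quick Reference lists them), a small hypothetical helper, not part of uTidyLib, can do the translation for you:

```python
def tidy_options(hyphenated):
    # uTidyLib wants underscores where Tidy's documentation has hyphens
    return dict((name.replace('-', '_'), value)
                for name, value in hyphenated.items())

options = tidy_options({'output-xhtml': 1, 'add-xml-decl': 1,
                        'fix-bad-comments': 1})
print(sorted(options))  # ['add_xml_decl', 'fix_bad_comments', 'output_xhtml']
```

The result could then be passed along as tidy.parseString(source, **options).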

The result of running Listing 2 against Listing 1 is as follows:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="generator" content=
    "HTML Tidy for Linux/x86 (vers 1st August 2004), see www.w3.org" />
    <title></title>
  </head>
  <body>
    Go <a class="that" href="here.html"><i>here</i></a> or <i>go
    <b><a href="index.html">Home</a>
    <!--noncetag>spam</noncetag><!==eggs--></b></i>
  </body>
</html>

Notice how the bad comment is corrected by replacing the "--" with "==". uTidyLib also fills out all the half-specified elements, whether the omission is valid SGML tag minimization (it's perfectly legal HTML not to close p tags, for instance) or not (there is a closing html tag but no opening one).
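To see why such repairs matter, note that a strict XML parser rejects the original markup outright. The following quick probe is my own check, using the standard library's ElementTree purely as a convenient well-formedness tester, not something from the article's toolchain:

```python
import xml.etree.ElementTree as ET

# Listing 1 verbatim: half-specified elements and a broken comment
listing1 = '''<body>
Go <a class="that" href="here.html"><i>here</i></a>
or <i>go <b><a href="index.html">Home</a>
<!--noncetag>spam</noncetag><!--eggs-->
</html>'''

try:
    ET.fromstring(listing1)
    well_formed = True
except ET.ParseError as err:
    well_formed = False
    print('not well-formed:', err)
```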

I tried uTidyLib on a variety of files, usually with very nice XHTML results. I also tried with a variety of encodings, since the web crawler project I mentioned involved crawling international versions of sites. I ran into trouble as soon as I tried pages in Japanese. As an example, I use the Japanese document Hello world HTML, which is actually perfectly valid HTML that just happens to be encoded in the popular Shift-JIS encoding (there is a mix of English and Japanese in the document). Figure 1 is a bit of English/Japanese mix from the Table of Contents.

Figure 1: Sample of English and Japanese Text from Valid HTML Document


This bullet item gets turned into the following XML by uTidyLib:

<li>
  <a href="hwht01.htm" accesskey="1">Section 1</a> :
HTML &Scaron;&icirc;&lsquo;b&sbquo;&Igrave;&Scaron;&icirc;&lsquo;b
</li>

This would be rendered in the browser as in Figure 2:

Figure 2: Sample of English and Japanese Text from Valid HTML Doc after Mangling by uTidyLib

Clearly not what the original document author intended.

It turns out that Tidy cannot detect the source document's encoding, even when it's properly and clearly declared (the document has LANG="ja" in the html element and <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=Shift_JIS">); Tidy just assumes ISO-8859-1. It also turns out that Tidy outputs US-ASCII by default, presumably to accommodate outdated browsers that can't deal with UTF-8 and UTF-16. The inability to detect encodings, on the other hand, is an unfortunate and severe limitation. I trawled the options and couldn't find anything to turn encoding detection on, but I did find options to tell Tidy what input and output encodings to use.
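The mangling is easy to reproduce. In this demonstration of mine, the Shift-JIS bytes for 基礎 ("basics", from the sample document's table of contents) are decoded as windows-1252, which is what the &Scaron; and &lsquo; entities in Tidy's output imply it actually used, yielding exactly the accented-Latin soup of Figure 2:

```python
# The Shift-JIS bytes for Japanese text, misread as a Western
# single-byte encoding
japanese = '\u57fa\u790e'  # 基礎, "basics"
raw = japanese.encode('shift_jis')
mangled = raw.decode('windows-1252')
print(mangled)  # the same characters as &Scaron;&icirc;&lsquo;b
```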

I updated the program in Listing 2 to specify an encoding for the source document by hand (e.g. "Shift-JIS") and to always produce UTF-8 output. In doing so I ran into another odd limitation in Tidy: it refuses encoding names unless they are all lowercase with the dashes removed. For example, it rejected "Shift-JIS" and "UTF-8", throwing an exception: "tidy.error.OptionArgError: missing or malformed argument for option: input-encoding". By trial and error I figured out that "shiftjis" and "utf8" were required, and that no likely spelling of "ISO-8859-1" worked at all; I had to use the alternate name "latin1" instead. The updated code is in Listing 3.
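A small hypothetical helper (my own, not part of uTidyLib) can hide the quirk by normalizing common encoding names into spellings Tidy accepts, based on the trial and error just described:

```python
def tidy_encoding_name(name):
    # Lowercase and strip the punctuation Tidy chokes on
    normalized = name.lower().replace('-', '').replace('_', '')
    # Tidy knows ISO-8859-1 only under its alias "latin1"
    aliases = {'iso88591': 'latin1'}
    return aliases.get(normalized, normalized)

print(tidy_encoding_name('Shift-JIS'))   # shiftjis
print(tidy_encoding_name('UTF-8'))       # utf8
print(tidy_encoding_name('ISO-8859-1'))  # latin1
```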

Listing 3: uTidyLib Program to Convert HTML to XHTML Using Specified Encodings

import tidy
import sys

def tidy2xhtml(instream, outstream, encoding):
    # As Listing 2, but with input and output encodings specified
    options = dict(output_xhtml=1,
                   add_xml_decl=1,
                   indent=1,
                   output_encoding='utf8',
                   input_encoding=encoding)
    tidied = tidy.parseString(instream.read(), **options)
    tidied.write(outstream)

doc = open(sys.argv[1])
try:
    encoding = sys.argv[2]
except IndexError:
    encoding = 'latin1'

tidy2xhtml(doc, sys.stdout, encoding)

This allows me to specify the encoding as the second command-line argument if I know it.

libxml2's HTML Parser

I'm always surprised to see what useful bits are buried in libxml2 and available through the Python binding (see my article on this topic). One of them is an HTML reader that can handle bad HTML and create a tree object that is not at all XHTML, but is at least a well-formed rendition of the source document, which is usually good enough. The following snippet illustrates this tool.

>>> import libxml2
>>> # Again, the full string seems to be required
>>> source = open('listing1.html').read()
>>> hdoc = libxml2.htmlParseDoc(source, None)
HTML parser error : Opening and ending tag mismatch: html and b
</html>
       ^
HTML parser error : Opening and ending tag mismatch: html and i
</html>
       ^

Despite these warnings, hdoc is a usable node at this point. It doesn't give you DOM, but rather libxml2's specialized tree API, which, as I mentioned in an earlier article, I find unevenly documented and hard to navigate. The libxml2 page talks about "DOM," but I think they use the term generically, not meaning the W3C specification and certainly not the Python standard-library DOM conventions.

>>> print hdoc
/usr/lib/python2.3/site-packages/libxml2.py:3597: \
FutureWarning: %u/%o/%x/%X of negative int will \
return a signed string in Python 2.4 and up
  return "<xmlDoc (%s) object at 0x%x>" % (self.name, id (self))
<xmlDoc (None) object at 0xf7032bcc>

The warning, which I got with my Python 2.3 installation, will only appear the first time you convert a node to string (e.g. implicitly, using print) and seems harmless. I assume the libxml2 crew will address any potential problems before Python 2.4 is finalized. You can see the document libxml2 interpreted from the bad HTML by re-serialization.

>>> print hdoc.serialize()
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>
Go <a class="that" href="here.html"><i>here</i></a>
or <i>go <b><a href="index.html">Home</a>
<!--noncetag>spam</noncetag><!--eggs-->
</b></i></p></body></html>

>>> hdoc.freeDoc()

Clearly it didn't clean up the document as thoroughly as uTidyLib: it didn't fix the broken comment, and the generated document type declaration is untenable for an XML document. Still, the result is useful. Don't forget the freeDoc() call, since libxml2/Python requires manual memory management.

>>> uri = 'http://www.tg.rim.or.jp/~hexane/ach/hwht/'
>>> hdoc = libxml2.htmlParseFile(uri, None)

The result from re-serialization seemed to maintain the Shift-JIS content, but I got a very strange JavaScript error message when I wrote it to a file and tried to view it in Firefox. Clearly dealing with HTML files in various encodings is a difficult task that complicates any efforts to cleanly process the HTML.

Wrap Up

I've heard some other Python tools discussed for converting HTML to usable XML (or XML tree objects).

If you just need to extract information from broken HTML, there are some other options:

  • The aforementioned BeautifulSoup.
  • The HTML Scraper recipe on the Python Cookbook needs a lot of tweaking, based on my experience.
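As a taste of the extraction approach, even the standard library's forgiving HTML parser can pull data out of markup as broken as Listing 1. This sketch is mine, not from any of the tools above; it is written against the html.parser module name used in later Python versions (in 2004 the same idea would use the HTMLParser module):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Collect href attributes from <a> start tags; no well-formedness
    # is required of the input.
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(value for name, value in attrs
                              if name == 'href')

extractor = LinkExtractor()
extractor.feed('Go <a class="that" href="here.html"><i>here</i></a> '
               'or <i>go <b><a href="index.html">Home</a>')
print(extractor.links)  # ['here.html', 'index.html']
```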

Thanks to the participants in the mailing-list threads (including on XML-SIG) discussing Python parsers for broken HTML. If you have other suggestions I haven't covered, please post them as comments to this article.

News and Notes

XML-SIG members and others have, as usual, been busy this last month. Mike Hostetler announced XMLBuilder 1.1. "You create an XMLBuilder object, send it some dictionary data, and it will generate the XML for you." See the announcement.

Mark Pilgrim announced the publication of his book Dive Into Python, available in its entirety online (though you should buy the physical book if you like it). Chapter 9: XML Processing is especially of interest. See the announcement.

    


I released Scimitar 0.6.0, an update of my ISO Schematron implementation that compiles a Schematron schema into a Python validator script. It adds support for keys, fixes diagnostic messages, and a few other things. See the announcement.

Fredrik Lundh did some time and space benchmarks of Python libraries for parsing and representing XML. They cover minidom, elementtree, PyRXPu (the only XML-compliant variant of PyRXP), and pxdom. He does not specify his methodology beyond noting that he parsed a "3.5 MB source file"; more detail on his test methods, including the harness code and how measurements were taken, would be nice. He plans to add xml.objectify, libxml2/Python, and cDomlette.

Jarno Virtanen has posted some quick code for performing an XSL transformation in Jython. I've added this to my reference page on XSLT processing APIs for Python.