Lately I've seen HTML parsing problems everywhere. One project needed a web crawler with specialized features provided through Python code that processed arbitrary HTML. There have also been several threads on mailing lists I frequent (including XML-SIG) featuring discussions of mechanisms for dealing with broken HTML by converting it to decent XHTML. This article focuses on Python APIs for converting good or bad HTML to XML.
Based on glowing testimonials from others with HTML-parsing tasks, I looked first at BeautifulSoup. But based on the project's stated goals and an examination of the API, BeautifulSoup is clearly suited to extracting bits of data from HTML rather than converting it into XML. I did, however, work up Listing 1 as a simple test case of bad XHTML, based on an example in the BeautifulSoup documentation.
Listing 1: An Example of Bad HTML
<body> Go <a class="that" href="here.html"><i>here</i></a> or <i>go <b><a href="index.html">Home</a> <!--noncetag>spam</noncetag><!--eggs--> </html>
Notice the broken comment in the file. I added it because I've seen HTML parsers tripped up by strange use of comments. <!--noncetag>spam</noncetag><!--eggs--> is a bad comment because it contains two dashes in its body.
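To see how a comment like this trips up parsers, here is a small demonstration of my own, using the html.parser module from current Python versions rather than any of the tools reviewed in this article. Everything up to the first "-->" gets swallowed as a single comment, including the stray "<!--eggs":

```python
from html.parser import HTMLParser

class CommentSpy(HTMLParser):
    """Record every comment the parser reports."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)

spy = CommentSpy()
spy.feed('<!--noncetag>spam</noncetag><!--eggs-->')
# The entire run of markup is reported as one big comment:
print(spy.comments)  # ['noncetag>spam</noncetag><!--eggs']
```

Other parsers make different guesses about where such a comment ends, which is exactly why malformed comments are a reliable source of interoperability trouble.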
Next I tried uTidyLib, a Python wrapper for the HTML Tidy Library Project (libtidy), an embeddable variation on Dave Raggett's HTML Tidy command-line program. Libtidy is in C, and uTidyLib is a minimalist and straightforward wrapping. I downloaded and installed uTidylib-0.2.zip. It requires libtidy, for which I downloaded and installed the source code package dated 11 August 2004. uTidyLib also uses ctypes, "a Python package to create and manipulate C data types in Python, and to call functions in dynamic link libraries/shared dlls," so I downloaded and installed ctypes-0.9.0.tar.gz. In all there were a lot of parts to find and set up, but the instructions were straightforward and I had no installation problems. I used the example from the uTidyLib home page to be sure it all worked. Listing 2 is the first program I worked up for taking an input file name of bad HTML and converting the contents to XHTML.
Listing 2: uTidyLib Program to Convert HTML to XHTML
import tidy
import sys

def tidy2xhtml(instream, outstream):
    options = dict(output_xhtml=1, add_xml_decl=1, indent=1)
    tidied = tidy.parseString(instream.read(), **options)
    tidied.write(outstream)
    return

doc = open(sys.argv[1])
tidy2xhtml(doc, sys.stdout)
I had to read in the entire input file to read the contents as a
string because uTidyLib provides no interface for getting HTML
source from a file-like object.
tidy.parse is the
other available function, but it takes a file name. This could be
inconvenient in the case of large source files. The options
dictionary represents options for the underlying Tidy, which are
listed in the HTML Tidy Quick Reference. Because of the dictionary constructor idiom, options have to be provided in a form acceptable as Python identifiers, in particular by converting hyphens to underscores, so the Tidy option fix-bad-comments would be specified as fix_bad_comments.
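For instance, a trivial helper of my own devising (not part of uTidyLib) can do the renaming mechanically:

```python
def tidy_option_name(name):
    """Turn a Tidy option name such as 'fix-bad-comments' into a
    legal Python identifier for use as a keyword argument."""
    return name.replace('-', '_')

# Build an options dictionary from the documented Tidy spellings
options = {tidy_option_name(name): 1
           for name in ('output-xhtml', 'add-xml-decl', 'fix-bad-comments')}
print(options)
# The result could then be passed along as tidy.parseString(source, **options)
```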
The result of running Listing 2 against Listing 1 is as follows:
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 1st August 2004), see www.w3.org" />
<title></title>
</head>
<body>
Go <a class="that" href="here.html"><i>here</i></a> or <i>go
<b><a href="index.html">Home</a>
<!--noncetag>spam</noncetag><!==eggs--></b></i>
</body>
</html>
Notice how the bad comment is corrected by replacing the "--" with "==". uTidyLib also fills out all the half-specified elements, whether they are valid SGML tag minimization (it's perfectly legal HTML to omit closing p tags, for instance) or not (there is a closing html tag but not an opening one).
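One limitation mentioned above is that uTidyLib provides no interface for reading from a file-like object: tidy.parseString wants the whole document in memory and tidy.parse wants a file name. A possible workaround for large inputs, sketched here with a helper of my own invention, is to spool the stream to a temporary file in chunks and hand the resulting name to tidy.parse; the tidy.parse call itself is shown commented out.

```python
import shutil
import tempfile

def spool_to_tempfile(instream):
    """Copy a binary file-like object into a named temporary file in
    chunks, and return the temp file's name."""
    with tempfile.NamedTemporaryFile(suffix='.html', delete=False) as tmp:
        shutil.copyfileobj(instream, tmp)  # streams; never slurps it all
        return tmp.name

# Hypothetical use against a large page:
# name = spool_to_tempfile(some_large_stream)
# tidied = tidy.parse(name, output_xhtml=1)
```

This trades memory for disk, which is usually the right trade for a crawler chewing through arbitrary pages.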
I tried uTidyLib on a variety of files, usually with very nice XHTML results. I also tried with a variety of encodings, since the web crawler project I mentioned involved crawling international versions of sites. I ran into trouble as soon as I tried pages in Japanese. As an example, I use the Japanese document Hello world HTML, which is actually perfectly valid HTML that just happens to be encoded in the popular Shift-JIS encoding (there is a mix of English and Japanese in the document). Figure 1 is a bit of English/Japanese mix from the Table of Contents.
Figure 1: Sample of English and Japanese Text from Valid HTML Document
This bullet item gets turned into the following XML by uTidyLib:
<li> <a href="hwht01.htm" accesskey="1">Section 1</a> : HTML Šî‘b‚ÌŠî‘b </li>
This would be rendered in the browser as in Figure 2:
Figure 2: Sample of English and Japanese Text from Valid HTML Doc after Mangling by uTidyLib
Clearly this is not what the original document author intended. It turns
out that Tidy cannot really detect the source document's encoding,
even when it's properly and clearly stated (the document declares
LANG="ja" on its HTML start tag and includes
<META HTTP-EQUIV="Content-Type" CONTENT="text/html;
charset=Shift_JIS">). Tidy just assumes ISO-8859-1. It
also turns out that Tidy outputs US-ASCII encoding by default. I
suppose the US-ASCII default for generated XHTML is to accommodate
outdated browsers that can't deal with UTF-8 and UTF-16. The
inability to detect encodings, on the other hand, is unfortunate and
a severe limitation. I trawled the options, and I couldn't find
anything to turn encoding detection on, but I did find options to
tell Tidy what input and output encodings to use.
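Incidentally, the damage is pure encoding confusion, and can even be undone after the fact. Here is a sketch of my own (no Tidy involved): re-encoding the mojibake recovers the original Shift-JIS bytes, which can then be decoded properly. I use Windows-1252 rather than strict ISO-8859-1 for the byte mapping because the garbled text contains cp1252 punctuation characters such as "'" and the low quote.

```python
# The mojibake uTidyLib produced for part of the TOC entry:
mangled = u'Šî‘b‚ÌŠî‘b'
# cp1252 maps each garbled character back to its original byte value,
# and Shift-JIS then decodes those bytes into the intended Japanese text.
recovered = mangled.encode('cp1252').decode('shift_jis')
print(recovered)
```

Each Japanese character became two bytes in Shift-JIS, and each of those bytes was misread as one Latin character, which is why the mojibake is twice as long as the text it replaced.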
I updated the program in Listing 2 to specify an encoding for the source document by hand (e.g. "Shift-JIS") and to always produce UTF-8 output. In doing so I ran into another odd limitation in Tidy. It seems to refuse encoding names unless they are in all lowercase, with all dashes eliminated. For example, it refused "Shift-JIS" or "UTF-8", throwing an exception: "tidy.error.OptionArgError: missing or malformed argument for option: input-encoding". By trial and error I figured out that "shiftjis" and "utf8" were required for things to work and that I could not use any likely spelling of "ISO-8859-1" at all, but had to use the alternate name "latin1" instead. The updated code is in Listing 3.
Listing 3: uTidyLib Program to Convert HTML to XHTML Using Specified Encodings
import tidy
import sys

def tidy2xhtml(instream, outstream):
    options = dict(output_xhtml=1, add_xml_decl=1, indent=1,
                   output_encoding='utf8', input_encoding=encoding)
    tidied = tidy.parseString(instream.read(), **options)
    tidied.write(outstream)
    return

doc = open(sys.argv[1])
try:
    encoding = sys.argv[2]
except IndexError:
    encoding = 'latin1'
tidy2xhtml(doc, sys.stdout)
This allows me to specify the encoding as the second command-line argument if I know it.
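A small helper of my own devising can hide Tidy's quirky encoding-name spellings from callers. The mappings below reflect only the spellings I verified by trial and error, so treat it as a sketch rather than a complete table:

```python
def tidy_encoding_name(name):
    """Normalize a common encoding name into the lowercase, dash-free
    spelling Tidy accepts; ISO-8859-1 needs the alias 'latin1'."""
    normalized = name.lower().replace('-', '').replace('_', '')
    if normalized == 'iso88591':
        return 'latin1'
    return normalized

print(tidy_encoding_name('Shift-JIS'))   # shiftjis
print(tidy_encoding_name('UTF-8'))       # utf8
print(tidy_encoding_name('ISO-8859-1'))  # latin1
```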
libxml2's HTML Parser
I'm always surprised to see what useful bits are buried in libxml2 and available through the Python binding (see my article on this topic). One of them is an HTML reader that can handle bad HTML and create a tree object that is not at all XHTML, but is at least a well-formed rendition of the source document, which is usually good enough. The following snippet illustrates this tool.
>>> import libxml2
>>> #Again seems to require the full string
>>> source = open('listing1.html').read()
>>> hdoc = libxml2.htmlParseDoc(source, None)
HTML parser error : Opening and ending tag mismatch: html and b
</html>
       ^
HTML parser error : Opening and ending tag mismatch: html and i
</html>
       ^
Despite these warnings,
hdoc is a usable node at this point. It doesn't give you
DOM, but rather libxml2's specialized tree API, which, as I
mentioned in an earlier article, I find unevenly documented and hard
to navigate. The libxml2 page talks about "DOM," but I think they use
the term generically, not meaning the W3C specification and certainly
not the Python standard-library DOM conventions.
>>> print hdoc
/usr/lib/python2.3/site-packages/libxml2.py:3597: \
  FutureWarning: %u/%o/%x/%X of negative int will \
  return a signed string in Python 2.4 and up
  return "<xmlDoc (%s) object at 0x%x>" % (self.name, id(self))
<xmlDoc (None) object at 0xf7032bcc>
The warning, which I got with my Python 2.3 installation, will
only appear the first time you convert a node to string
(e.g. implicitly, using print).
>>> print hdoc.serialize()
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
    "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>
Go <a class="that" href="here.html"><i>here</i></a> or <i>go
<b><a href="index.html">Home</a>
<!--noncetag>spam</noncetag><!--eggs-->
</b></i></p></body></html>

>>> hdoc.freeDoc()
Clearly it didn't complete the document as effectively as
uTidyLib: it didn't fix the broken comment, and the generated
document-type declaration is untenable for an XML document; but
the result is useful nevertheless. Don't forget the
freeDoc() call, since libxml2/Python requires
manual memory management.
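XML, unlike HTML as commonly practiced, flatly forbids "--" inside a comment, so the unrepaired comment alone makes the serialized result ill-formed. A quick check of my own with the standard library's expat binding (independent of libxml2) confirms it:

```python
import xml.parsers.expat

# A minimal well-formed wrapper around the comment libxml2 left untouched
snippet = '<p><!--noncetag>spam</noncetag><!--eggs--></p>'

parser = xml.parsers.expat.ParserCreate()
well_formed = True
try:
    parser.Parse(snippet, True)  # True: this is the final chunk
except xml.parsers.expat.ExpatError:
    well_formed = False

print(well_formed)  # False: expat rejects the "--" inside the comment
```

So if you need the output to round-trip through strict XML tools, you would still have to clean up such comments yourself after libxml2's HTML parser has done its work.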
libxml2 can also parse HTML directly from a file or URL:

>>> uri = 'http://www.tg.rim.or.jp/~hexane/ach/hwht/'
>>> hdoc = libxml2.htmlParseFile(uri, None)
I've heard some other Python tools discussed for converting HTML to usable XML (or XML tree objects):
- ElementTidy uses Tidy to create XHTML in the form of an ElementTree object (see my article on the topic).
- The twisted.web.microdom module in Twisted has a beExtremelyLenient=True option that creates a tree from even broken HTML.
If you just need to extract information from broken HTML, there are some other options.
- The aforementioned BeautifulSoup.
- The HTML Scraper recipe on the Python Cookbook needs a lot of tweaking, based on my experience.
Thanks to the participants on the XML-SIG thread discussing Python parsers for broken HTML. If you have other suggestions I haven't covered, please post them as comments to this article.
News and Notes
XML-SIG members and others have, as usual, been busy this last month. Mike Hostetler announced XMLBuilder 1.1. "You create an XMLBuilder object, send it some dictionary data, and it will generate the XML for you." See the announcement.
Mark Pilgrim announced the publication of his book Dive Into Python, available in its entirety online (though you should buy the physical book if you like it). Chapter 9: XML Processing is especially of interest. See the announcement.
Also in Python and XML
I released Scimitar 0.6.0, an update of my ISO Schematron implementation that compiles a Schematron schema into a Python validator script. It adds support for keys, fixes diagnostic messages, and a few other things. See the announcement.
Fredrik Lundh did some time and space benchmarks of Python libraries for parsing and representing XML. They cover minidom, elementtree, PyRXPu (the only XML-compliant variant of PyRXP), and pxdom. He does not specify his methodology except to say that he parsed a "3.5 MB source file"; more clarity on his test methods, including harness code and how measurements were taken, would be nice. He plans to add xml.objectify, libxml/Python, and cDomlette.
Jarno Virtanen has posted some quick code for performing an XSL transformation in Jython. I've added this to my reference page on XSLT processing APIs for Python.