What Does XML Smell Like?
This article introduces a set of heuristic rules for sniffing the content of a file in order to determine whether it is an XML document or an HTML document. An implementation is provided using the xmlReader interface of libxml2. This implementation is used in Prince, a formatter for creating PDF files from web documents.
Problem
Say a user agent wants to load a web document and display it, format it, process it, or whatever. It might be an XML document, containing XHTML, SVG, MathML, or a nutritious mix of these vocabularies. Or it might be an HTML document, ideally valid HTML4, but more likely an unappetizing bowl of tag soup. The problem is, how does the user agent know whether to parse the document as XML or HTML?
If the document is being retrieved over the Web, then there is no problem, as the HTTP response will come with a Content-Type header that gives the MIME type of the document. This may be text/html for HTML, application/xml for XML or application/xhtml+xml for XHTML. The user agent can check the MIME type before trying to parse the document, and all is well.
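The MIME type check can be sketched in a few lines. This is only an illustration, not Prince's code: the function name parser_for_content_type is hypothetical, and the sketch recognizes just the three MIME types named above, returning None for anything else.

```python
def parser_for_content_type(header_value):
    """Return "xml", "html", or None for a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = header_value.split(";", 1)[0].strip().lower()
    if mime == "text/html":
        return "html"
    if mime in ("application/xml", "application/xhtml+xml"):
        return "xml"
    return None  # unknown type: the caller has to decide what to do
```

A real user agent would probably accept more types than these three; that list is left open here.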
However, if the document is being loaded from a local file, there is no obvious way to determine if it is XML or HTML. The user agent might try checking the file extension, but what if it is .html? It is common for XHTML files to be given an extension of .html or .htm, as .xhtml is rather long and .xht is rather obscure. This means that a file with an extension of .html may actually be an XML document and require an XML parser.
If the user agent parses an HTML document with an XML parser, the user will be rewarded with a blast of error messages from all the apparently unclosed tags, like <br>. If the user agent parses an XML document with an HTML parser, the results are not much better. While the document will probably load, the user may not get what they expect, as style sheets and scripts may behave differently, embedded SVG or MathML content will be garbled, and external entities and inclusions will not be resolved.
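The first failure mode is easy to reproduce with any XML parser. The sketch below uses the expat bindings from Python's standard library rather than libxml2, and the helper name xml_error is made up for the example. It feeds the parser a fragment with an unclosed <br> and then the same fragment with <br/>.

```python
import xml.parsers.expat


def xml_error(data):
    """Return expat's error message for data, or None if it is well-formed."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        return str(err)
    return None


tag_soup = b"<html><body><p>one<br>two</p></body></html>"
xhtml = b"<html><body><p>one<br/>two</p></body></html>"

print(xml_error(tag_soup))  # an error message: </p> arrives while <br> is open
print(xml_error(xhtml))     # None: the fragment is well-formed XML
```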
Web user agents like Prince need a way to determine whether a .html file should be parsed as XML or HTML. In the absence of telepathy, there is no perfect algorithm to determine the intent of the author, so we will need to formulate some heuristics that can sniff the content of the document and see if it smells like XML or HTML.
Heuristics
Most of the differences between HTML and XHTML documents occur at the beginning, so we can restrict our heuristics to looking at everything up to and including the first start tag in the document. We will parse the beginning of the document using an XML parser and apply the following rules:
- If the document has an XML declaration, then it must be an XML document. (Technically, given that HTML is derived from SGML, it is possible for an HTML document to include arbitrary processing instructions, even including <?xml?> at the start of the document if the author is sufficiently pathological. In the real world, however, a file beginning with an XML declaration is making a clear statement that it is an XML file and should be treated as such.)
- If the document has an <?xml-stylesheet?> processing instruction in the prolog, then it must be an XML document. (Again, while an HTML document could potentially contain such a processing instruction, there is absolutely no reason for it to do so other than to deceive people into treating the document as XML.)
- In fact, given that hardly any HTML documents on the Web contain processing instructions, we could simply treat any document containing a processing instruction before the first start tag as an XML document. (If anyone from Google is reading this, how about verifying this claim by running grep over your local copy of the web?)
- If the document has a DOCTYPE declaration with a public identifier containing "XHTML," such as -//W3C//DTD XHTML 1.0 Transitional//EN, then it is definitely an XML document. (Either that, or the author has blindly cut and pasted the DOCTYPE declaration from somewhere else without knowing what it means and is about to find out the hard way.)
- On the other hand, a DOCTYPE declaration with a public identifier containing "HTML," such as -//W3C//DTD HTML 4.01 Transitional//EN, means that it must be an HTML document, not XML.
- If the DOCTYPE declaration has a system identifier but no public identifier, then it must be an XML document, as XML removed the requirement to have a public identifier in the DOCTYPE declaration.
- If the document has an empty DOCTYPE declaration of <!DOCTYPE html>, then it must be an HTML document using the new DOCTYPE idiom introduced by the WHAT-WG to identify HTML5 documents.
- If the DOCTYPE declaration has no public or system identifier and defines an internal subset, then the document must be XML, as HTML documents rarely define an internal subset in this way.
- If we reach the first start tag in the document and none of the heuristic rules have matched yet, then we need to look at the attributes on the root element. Any xmlns, xmlns:*, or xml:* attributes, such as xml:lang or xml:base, mean that the document must be XML.
- If we encounter a syntax error trying to parse the beginning of the document with an XML parser, then we will assume that the document is actually HTML. This will catch the common HTML idiom of putting junk at the top of the file and letting user agents benevolently ignore it. This heuristic will misidentify documents that are intended to be XML but have syntax errors before the first start tag. In practice, this is not a problem, as most XML syntax errors occur in the body of the document and not in the prolog.
- If, after all that, we still don't know, then we had better assume that the document is HTML. This is appropriate fallback behavior for a file with a .html extension, and seems like a good tradeoff, as users who don't care about XML will not see XML-related error messages in the (likely) case that the document turns out not to be well-formed.
As soon as any one of the rules matches, it will determine whether the document smells like XML or like HTML, and we can stop sniffing and report a result. There are other more subtle heuristics that could be added, but these should be sufficient to correctly classify most documents.
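To make the rules concrete, here is a minimal sketch of the whole procedure. It is not the libxml2 xmlReader implementation mentioned in the introduction: it assumes Python's standard expat bindings as the XML parser, and the names sniff and _Verdict are invented for the example. Each handler raises an exception as soon as a rule matches, which is one simple way to stop parsing early. The sketch also folds the XML declaration and <?xml-stylesheet?> rules into the broader "any processing instruction before the first start tag" rule.

```python
import xml.parsers.expat


class _Verdict(Exception):
    """Raised from a handler to stop parsing as soon as a rule matches."""

    def __init__(self, kind):
        super().__init__(kind)
        self.kind = kind


def sniff(data):
    """Return "xml" or "html" for a document given as bytes."""

    def on_xml_decl(version, encoding, standalone):
        raise _Verdict("xml")        # XML declaration

    def on_pi(target, content):
        raise _Verdict("xml")        # any PI before the first start tag

    def on_doctype(name, system_id, public_id, has_internal_subset):
        if public_id is not None:
            # Test "XHTML" first, because it also contains "HTML".
            if "XHTML" in public_id:
                raise _Verdict("xml")
            if "HTML" in public_id:
                raise _Verdict("html")
        elif system_id is not None:
            raise _Verdict("xml")    # system identifier only
        elif has_internal_subset:
            raise _Verdict("xml")    # internal subset, no identifiers
        elif name == "html":
            raise _Verdict("html")   # <!DOCTYPE html>
        # Otherwise no rule matched; keep going to the root element.

    def on_start_element(name, attrs):
        for attr in attrs:
            if attr == "xmlns" or attr.startswith(("xmlns:", "xml:")):
                raise _Verdict("xml")
        raise _Verdict("html")       # first start tag, nothing matched

    parser = xml.parsers.expat.ParserCreate()  # no namespace processing
    parser.XmlDeclHandler = on_xml_decl
    parser.ProcessingInstructionHandler = on_pi
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start_element
    try:
        parser.Parse(data, True)
    except _Verdict as verdict:
        return verdict.kind
    except xml.parsers.expat.ExpatError:
        return "html"                # syntax error before the first start tag
    return "html"
```

For example, sniff(b'<html xmlns="http://www.w3.org/1999/xhtml">') returns "xml", while sniff(b'<!DOCTYPE html><html>') and sniff(b'junk <html>') both return "html". Because the start-tag handler always raises, the parser never reaches the body of the document, so errors there cannot affect the result.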