What Does XML Smell Like?

February 28, 2007

This article introduces a set of heuristic rules for sniffing the content of a file in order to determine whether it is an XML document or an HTML document. An implementation is provided using the xmlReader interface of libxml2. This implementation is used in Prince, a formatter for creating PDF files from web documents.

Problem

Say a user agent wants to load a web document and display it, format it, process it, or whatever. It might be an XML document, containing XHTML, SVG, MathML, or a nutritious mix of these vocabularies. Or it might be an HTML document, ideally valid HTML4, but more likely an unappetizing bowl of tag soup. The problem is, how does the user agent know whether to parse the document as XML or HTML?

If the document is being retrieved over the Web, then there is no problem, as the HTTP response will come with a Content-Type header that gives the MIME type of the document. This may be text/html for HTML, application/xml for XML or application/xhtml+xml for XHTML. The user agent can check the MIME type before trying to parse the document, and all is well.
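
As a minimal sketch of that check, the Python function below maps a Content-Type header value to a parser choice. The function name and its return values are hypothetical, and treating every type ending in +xml as XML is an assumption; only the three MIME types named above come from this article.

```python
def parser_for_content_type(header_value):
    """Return "xml", "html", or None for a Content-Type header value.

    Hypothetical helper for illustration; not part of Prince or libxml2.
    """
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = header_value.split(";", 1)[0].strip().lower()
    if mime == "text/html":
        return "html"
    # Assumption: any "+xml" type (e.g. application/xhtml+xml) is parsed as XML.
    if mime == "application/xml" or mime.endswith("+xml"):
        return "xml"
    return None  # some other media type; the caller decides what to do


if __name__ == "__main__":
    print(parser_for_content_type("application/xhtml+xml; charset=utf-8"))
```

Returning None rather than guessing keeps this function honest: it only answers when the header actually names an HTML or XML type.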

However, if the document is being loaded from a local file, there is no obvious way to determine if it is XML or HTML. The user agent might try checking the file extension, but what if it is .html? It is common for XHTML files to be given an extension of .html or .htm, as .xhtml is rather long and .xht is rather obscure. This means that a file with an extension of .html may actually be an XML document and require an XML parser.

If the user agent parses an HTML document with an XML parser, the user will be rewarded with a blast of error messages from all the apparently unclosed tags, like <br>. If the user agent parses an XML document with an HTML parser, the results are not much better. While the document will probably load, the user may not get what they expect, as style sheets and scripts may behave differently, embedded SVG or MathML content will be garbled, and external entities and inclusions will not be resolved.

Web user agents like Prince need a way to determine whether a .html file should be parsed as XML or HTML. In the absence of telepathy, there is no perfect algorithm to determine the intent of the author, so we will need to formulate some heuristics that can sniff the content of the document and see if it smells like XML or HTML.

Heuristics

Most of the differences between HTML and XHTML documents occur at the beginning, so we can restrict our heuristics to looking at everything up to and including the first start tag in the document. We will parse the beginning of the document using an XML parser and apply the following rules:

If the document has an XML declaration, then it must be an XML document. (Technically, given that HTML is derived from SGML, it is possible for an HTML document to include arbitrary processing instructions, even including <?xml?> at the start of the document if the author is sufficiently pathological. In the real world, however, a file beginning with an XML declaration is making a clear statement that it is an XML file and should be treated as such.)
If the document has an <?xml-stylesheet?> processing instruction in the prolog, then it must be an XML document. (Again, while an HTML document could potentially contain such a processing instruction, there is absolutely no reason for it to do so other than to deceive people into treating the document as XML.)
In fact, given that hardly any HTML documents on the Web contain processing instructions, we could simply treat any document containing a processing instruction before the first start tag as an XML document. (If anyone from Google is reading this, how about verifying this claim by running grep over your local copy of the Web?)
If the document has a DOCTYPE declaration with a public identifier containing "XHTML", such as -//W3C//DTD XHTML 1.0 Transitional//EN, then it is definitely an XML document. (Either that, or the author has blindly cut and pasted the DOCTYPE declaration from somewhere else without knowing what it means and is about to find out the hard way.)
On the other hand, a DOCTYPE declaration with a public identifier containing "HTML", such as -//W3C//DTD HTML 4.01 Transitional//EN, means that it must be an HTML document, not XML. (Because the string "XHTML" itself contains "HTML", this rule applies only when the previous rule has not already matched.)
If the DOCTYPE declaration has a system identifier but no public identifier, then it must be an XML document, as XML removed the requirement to have a public identifier in the DOCTYPE declaration.
If the document has an empty DOCTYPE declaration of <!DOCTYPE html>, with no public or system identifier and no internal subset, then it must be an HTML document using the new DOCTYPE idiom introduced by the WHATWG to identify HTML5 documents.
If the DOCTYPE declaration has no public or system identifier and defines an internal subset, then the document must be XML, as HTML documents rarely define an internal subset in this way.
If we reach the first start tag in the document and none of the heuristic rules have matched yet, then we need to look at the attributes on the root element. Any xmlns, xmlns:*, or xml:* attributes, such as xml:lang or xml:base, mean that the document must be XML.
If we encounter a syntax error trying to parse the beginning of the document with an XML parser, then we will assume that the document is actually HTML. This will catch the common HTML idiom of putting junk at the top of the file and letting user agents benevolently ignore it. This heuristic will misidentify documents that are intended to be XML but have syntax errors before the first start tag. In practice, this is not a problem, as most XML syntax errors occur in the body of the document and not in the prolog.
If, after all that, we still don't know, then we had better assume that the document is HTML. This is appropriate fallback behavior for a file with a .html extension, and seems like a good tradeoff, as users who don't care about XML will not see XML-related error messages in the (likely) case that the document turns out not to be well-formed.

As soon as any one of the rules matches, it will determine whether the document smells like XML or like HTML, and we can stop sniffing and report a result. There are other more subtle heuristics that could be added, but these should be sufficient to correctly classify most documents.
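
The rules above can be sketched in a few dozen lines of Python. This is an illustration under stated assumptions, not the Prince implementation: it uses the expat bindings from the Python standard library as a stand-in for the libxml2 xmlReader, the names Verdict and sniff are hypothetical, and it applies the broad form of the processing instruction rule (any processing instruction before the first start tag means XML). It also accepts the empty DOCTYPE in any letter case, which is an assumption beyond what the rules state.

```python
import xml.parsers.expat as expat


class Verdict(Exception):
    """Raised from a handler to stop parsing as soon as a rule matches."""

    def __init__(self, kind):
        super().__init__(kind)
        self.kind = kind


def sniff(data):
    """Classify a document (bytes or str) as "xml" or "html"."""
    # Namespace processing stays off, so xmlns attributes remain visible.
    parser = expat.ParserCreate()

    def on_xml_decl(version, encoding, standalone):
        raise Verdict("xml")

    def on_pi(target, text):
        # Broad rule: any PI before the first start tag means XML.
        raise Verdict("xml")

    def on_doctype(name, system_id, public_id, has_internal_subset):
        if public_id:
            if "XHTML" in public_id:  # must be tested before "HTML"
                raise Verdict("xml")
            if "HTML" in public_id:
                raise Verdict("html")
        elif system_id:
            raise Verdict("xml")
        elif has_internal_subset:
            raise Verdict("xml")
        elif name.lower() == "html":  # the empty <!DOCTYPE html>
            raise Verdict("html")
        # No rule matched; keep reading up to the first start tag.

    def on_start_element(name, attrs):
        for attr in attrs:
            if attr == "xmlns" or attr.startswith(("xmlns:", "xml:")):
                raise Verdict("xml")
        raise Verdict("html")  # fallback: nothing smelled like XML

    parser.XmlDeclHandler = on_xml_decl
    parser.ProcessingInstructionHandler = on_pi
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start_element

    try:
        parser.Parse(data, True)
    except Verdict as verdict:
        return verdict.kind
    except expat.ExpatError:
        return "html"  # syntax error before the first start tag
    return "html"


if __name__ == "__main__":
    print(sniff(b'<html xmlns="http://www.w3.org/1999/xhtml">'))
```

Raising an exception from a handler is what stops the parser at the first matching rule. Note that an HTML 4 DOCTYPE with a public identifier but no system identifier is not well-formed XML, so in this sketch it is caught by the syntax error rule rather than by the public identifier rule; the result is the same.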

Examples

From the last heuristic, it is clear that this document smells like HTML:

<html>
<head>
<title>What am I?</title>
...

This seems reasonable, as there is nothing to indicate that this document is XML. It is possible that later in the document there might be some XML content with namespaces that will fail to impress the HTML parser, but this falls under the topic of parsing XML islands in HTML documents; the jury is still out on the legality of this.

We can indicate that this document is actually XHTML by adding an XML declaration, or an XHTML DOCTYPE declaration, or by adding an xmlns attribute to the root element. These are all sensible things to do if the document is really intended to be XHTML, and they make it obvious to human readers as well as to programs.

Note that strictly conforming XHTML1 documents should be easy for our heuristics to recognize, as they must have a DOCTYPE declaration with a public identifier that references one of the three XHTML1 DTDs and an xmlns attribute on the root element. Document authors are also encouraged to add an XML declaration. However, a user agent also needs to handle XHTML documents that reference other DTDs, such as the XHTML + MathML DTD, and may lack an explicit xmlns attribute or specify an incorrect namespace URI. The heuristics above can correctly handle these documents.
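
To see why a substring test on the public identifier covers both cases, consider the short Python sketch below. The constant and function names are hypothetical, and the identifiers are quoted from the W3C DTDs as I understand them: the three XHTML 1.0 public identifiers and the one for XHTML 1.1 plus MathML 2.0.

```python
# The three XHTML 1.0 public identifiers.
XHTML1_PUBLIC_IDS = (
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
)
# A DTD outside that set, for XHTML with embedded MathML.
XHTML_MATHML_PUBLIC_ID = "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN"


def smell_of_public_id(public_id):
    """Return "xml", "html", or None (no verdict) for a public identifier."""
    if "XHTML" in public_id:  # tested first, since "XHTML" contains "HTML"
        return "xml"
    if "HTML" in public_id:
        return "html"
    return None  # e.g. an SVG DTD; other rules must decide


if __name__ == "__main__":
    for public_id in XHTML1_PUBLIC_IDS + (XHTML_MATHML_PUBLIC_ID,):
        print(public_id, "->", smell_of_public_id(public_id))
```

Matching on the substring rather than on a fixed list of identifiers is what lets documents that reference other XHTML DTDs be classified without any extra rules.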

Implementation

In Prince, these document sniffing heuristic rules are implemented as a C function that uses the xmlReader interface from libxml2 to parse the document up until the first start tag or one of the heuristics matches. A copiously commented version of the code, as well as some sample documents to test it on, is available for download in the "Code" section below; it compiles to a small program that sniffs files and classifies them as being XML or HTML.

One caveat with this implementation is that, while we only explicitly parse up to the first start tag in the document, behind the scenes the libxml2 xmlReader appears to be parsing further ahead for efficiency, as it assumes we ultimately intend to parse the entire document. This means that it is possible for the xmlReader interface to reach a syntax error that occurs shortly after the first start tag, in which case our heuristic will conclude that the document must be HTML and stop. This is not really a problem, but in some cases can result in slightly more confusing error messages for XML documents that contain syntax errors near the top of the file.

You could also implement the heuristics using the XmlReader interface from .NET (on which the libxml2 interface is based) or any of the XML pull-parser libraries available for Java. Another option is to implement the heuristics using a SAX parser instead, ensuring that it doesn't try to be clever and eagerly parse ahead of where it should be. Just make sure that you remember to stop the SAX parser when a heuristic matches or when you reach the first start tag in the document.
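
Stopping a SAX parser usually means raising an exception from a handler. The Python sketch below shows that idiom with the standard xml.sax module; the names Verdict, SniffHandler, and sniff_with_sax are hypothetical. It is deliberately partial: the SAX ContentHandler does not report the XML declaration or the DOCTYPE declaration, so only the processing instruction rule, the root element attribute rule, the syntax error rule, and the fallback are covered here.

```python
import xml.sax


class Verdict(Exception):
    """Raised from a handler to abort parsing once a rule matches."""

    def __init__(self, kind):
        super().__init__(kind)
        self.kind = kind


class SniffHandler(xml.sax.ContentHandler):
    def processingInstruction(self, target, data):
        # Not called for the XML declaration, only for real PIs.
        raise Verdict("xml")

    def startElement(self, name, attrs):
        # Namespace processing is off by default, so xmlns shows up here.
        for attr in attrs.getNames():
            if attr == "xmlns" or attr.startswith(("xmlns:", "xml:")):
                raise Verdict("xml")
        # First start tag reached with no match: stop and fall back to HTML.
        raise Verdict("html")


def sniff_with_sax(data):
    """Classify the bytes of a document as "xml" or "html" (partial rules)."""
    try:
        xml.sax.parseString(data, SniffHandler())
    except Verdict as verdict:
        return verdict.kind
    except xml.sax.SAXParseException:
        return "html"  # syntax error before the first start tag
    return "html"


if __name__ == "__main__":
    print(sniff_with_sax(b"<html><br></html>"))
```

Because startElement always raises, the parser never gets past the first start tag, so errors later in the document (such as the unclosed <br> above) cannot influence the verdict.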

Code

Here's the source code for sniffxml, a program that sniffs files to determine if they are XML or HTML.

Notes

On the Web, content sniffing is considered harmful. When user agents ignore the metadata in the HTTP response and try to guess the type of the document, it can lead to confusing behavior, lack of interoperability, and even security problems. The heuristics described in this article should only be applied to local files where no other type information is available. When a document is retrieved over HTTP, the user agent should always respect the Content-Type header.
It is considered to be poor practice to determine the semantics of XML or SGML documents based on their DOCTYPE declarations. The DOCTYPE is a purely syntactical construct that does not specify the meaning of the document, so a user agent should not choose how to handle a document by looking at the public identifier specified in the DOCTYPE. (The split between quirks mode and standards mode in browsers is one processing model that breaks this rule, and it exists purely to compensate for a lack of interoperability and standards compliance in older browsers.)

The heuristics described in this article do examine the DOCTYPE declaration. However, they only do this in order to determine whether a document is most likely to be XML or HTML. Once this has been determined, the document can be parsed as normal and the DOCTYPE will not affect the semantics of the document.
For an in-depth look at the issues affecting XML and HTML on the Web, see "Sending XHTML as text/html Considered Harmful," by Ian Hickson.
It would be interesting to know if the question of what XML smells like triggers any illuminating associations in the minds of people who have been working with XML for years. Please leave a comment if you think XML smells like apple pie or the first breath of spring. Or more realistically, if the question evokes a response of "worse than week-old prawns on a hot summer day," well, that's good too.