XML.com: XML From the Inside Out

What Does XML Smell Like?
by Michael Day

Examples

From the last heuristic, it is clear that this document smells like HTML:

<html>
<head>
<title>What am I?</title>
...

This seems reasonable, as there is nothing to indicate that this document is XML. It is possible that later in the document there might be some XML content with namespaces that will fail to impress the HTML parser, but this falls under the topic of parsing XML islands in HTML documents; the jury is still out on the legality of this.

We can indicate that this document is actually XHTML by adding an XML declaration, or an XHTML DOCTYPE declaration, or by adding an xmlns attribute to the root element. These are all sensible things to do if the document is really intended to be XHTML, and they make it obvious to human readers as well as to programs.

Note that strictly conforming XHTML1 documents should be easy for our heuristics to recognize, as they must have a DOCTYPE declaration with a public identifier that references one of the three XHTML1 DTDs and an xmlns attribute on the root element. Document authors are also encouraged to add an XML declaration. However, a user agent also needs to handle XHTML documents that reference other DTDs, such as the XHTML + MathML DTD, and that may lack an explicit xmlns attribute or specify an incorrect namespace URI. The heuristics above can correctly handle these documents.
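As a rough illustration, the rules used in these examples can be sketched in Python with the standard library's expat bindings. This is not the Prince implementation, and the exact rule set is an assumption pieced together from this page: an XML declaration, a DOCTYPE whose public identifier contains "XHTML", or an xmlns attribute on the root element means XML; a root element named html with none of those, or a syntax error before the first start tag, means HTML. Treating any other root element as XML is also an assumption, and the names sniff and Verdict are hypothetical.

```python
from xml.parsers import expat


class Verdict(Exception):
    """Raised from a handler to stop parsing as soon as a rule matches."""

    def __init__(self, kind):
        super().__init__(kind)
        self.kind = kind


def sniff(data: bytes) -> str:
    """Classify a document as "xml" or "html" (assumed rule set, see text)."""
    parser = expat.ParserCreate()  # no namespace processing: xmlns stays a plain attribute

    def on_xml_decl(version, encoding, standalone):
        # An XML declaration is taken as decisive.
        raise Verdict("xml")

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Assumption: any public identifier mentioning XHTML counts,
        # which also covers DTDs such as XHTML + MathML.
        if public_id and "XHTML" in public_id:
            raise Verdict("xml")

    def on_start_element(name, attributes):
        # Only the presence of xmlns is checked, not its value.
        if "xmlns" in attributes:
            raise Verdict("xml")
        if name.lower() == "html":
            raise Verdict("html")
        # Assumption: some other XML vocabulary.
        raise Verdict("xml")

    parser.XmlDeclHandler = on_xml_decl
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start_element

    try:
        parser.Parse(data, True)
    except Verdict as verdict:
        return verdict.kind
    except expat.ExpatError:
        # A syntax error before any rule matched: treat as HTML.
        return "html"
    return "html"
```

Under these assumptions, the fragment shown at the top of this section is classified as "html", and adding any one of the three indicators flips the result to "xml".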

Implementation

In Prince, these document-sniffing heuristics are implemented as a C function that uses the xmlReader interface from libxml2 to parse the document until it either reaches the first start tag or one of the heuristics matches. A copiously commented version of the code, as well as some sample documents to test it on, is available for download in the "Code" section below; it compiles to a small program that sniffs files and classifies them as being XML or HTML.

One caveat with this implementation is that, while we only explicitly parse up to the first start tag in the document, behind the scenes the libxml2 xmlReader appears to be parsing further ahead for efficiency, as it assumes we ultimately intend to parse the entire document. This means that it is possible for the xmlReader interface to reach a syntax error that occurs shortly after the first start tag, in which case our heuristic will conclude that the document must be HTML and stop. This is not really a problem, but in some cases can result in slightly more confusing error messages for XML documents that contain syntax errors near the top of the file.

You could also implement the heuristics using the XmlTextReader interface from .NET (on which the libxml2 interface is based) or any of the XML pull-parser libraries available for Java. Another option is to implement the heuristics using a SAX parser instead, ensuring that it doesn't try to be clever and eagerly parse ahead of where it should be. Just make sure that you remember to stop the SAX parser when a heuristic matches or when you reach the first start tag in the document.
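To illustrate the point about stopping a SAX parser, here is a minimal sketch using Python's xml.sax module. SAX has no built-in "stop" call, so one common approach is to raise an exception from the handler and catch it outside the parse. The sketch only looks at the first start tag; the names FirstTagSeen, RootSniffer, and first_start_tag are hypothetical, and it makes no attempt to cover the XML declaration or DOCTYPE checks.

```python
import xml.sax


class FirstTagSeen(Exception):
    """Carries the first start tag out of the parser and ends the parse."""

    def __init__(self, name, attributes):
        super().__init__(name)
        self.name = name
        self.attributes = attributes


class RootSniffer(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        # Stop at the very first start tag; nothing after it is inspected.
        raise FirstTagSeen(name, dict(attrs.items()))


def first_start_tag(data: bytes):
    """Return (name, attributes) for the first start tag, or None on a syntax error."""
    try:
        xml.sax.parseString(data, RootSniffer())
    except FirstTagSeen as stop:
        return stop.name, stop.attributes
    except xml.sax.SAXParseException:
        # The parser hit a syntax error before any start tag was reported.
        return None
    return None
```

With namespace processing left at its default (off), an xmlns attribute on the root element shows up in the returned dictionary like any other attribute, so the caller can apply the root-element checks to the result.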

Code

Here's the source code for sniffxml, a program that sniffs files to determine if they are XML or HTML.

Notes

  1. On the web, content sniffing is considered harmful. When user agents ignore the metadata in the HTTP response and try to guess the type of the document, it can lead to confusing behavior, lack of interoperability, and even security problems. The heuristics described in this article should only be applied to local files where no other type information is available. When a document is retrieved over HTTP, the user agent should always respect the Content-Type header.

  2. It is considered to be poor practice to determine the semantics of XML or SGML documents based on their DOCTYPE declarations. The DOCTYPE is a purely syntactical construct that does not specify the meaning of the document, so a user agent should not choose how to handle a document by looking at the public identifier specified in the DOCTYPE. (The split between quirks mode and standards mode in browsers is one processing model that breaks this rule, and it exists purely to compensate for a lack of interoperability and standards compliance in older browsers.)

    The heuristics described in this article do examine the DOCTYPE declaration. However, they only do this in order to determine whether a document is most likely to be XML or HTML. Once this has been determined, the document can be parsed as normal and the DOCTYPE will not affect the semantics of the document.

  3. For an in-depth look at the issues affecting XML and HTML on the Web, see "Sending XHTML as text/html Considered Harmful," by Ian Hickson.

  4. It would be interesting to know if the question of what XML smells like triggers any illuminating associations in the minds of people who have been working with XML for years. Please leave a comment if you think XML smells like apple pie or the first breath of spring. Or more realistically, if the question evokes a response of "worse than week-old prawns on a hot summer day," well, that's good too.
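As a small illustration of note 1, the following Python sketch lets the Content-Type header decide whenever one is present and falls back to sniffing only when there is no header at all. The function names, the list of media types, and the sniff_local callback are assumptions made for this example rather than part of sniffxml.

```python
def kind_from_content_type(header: str):
    """Map a Content-Type header value to "xml", "html", or None if it is neither."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = header.split(";", 1)[0].strip().lower()
    if mime == "text/html":
        return "html"
    if mime in ("text/xml", "application/xml") or mime.endswith("+xml"):
        return "xml"
    return None


def pick_kind(content_type, data: bytes, sniff_local):
    """Use the header when one is given; sniff only when there is none.

    sniff_local is any callable taking the document bytes and returning
    "xml" or "html" (hypothetical, supplied by the caller).
    """
    if content_type is not None:
        # Respect the server's metadata, even if the content looks different.
        return kind_from_content_type(content_type)
    return sniff_local(data)
```

Note that a document served as text/html is treated as HTML here even if it begins with an XML declaration, which is the behavior note 1 asks for.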


