Choosing an XML Parser

August 22, 2000

XML.com receives hundreds of questions a month. With such an avalanche of queries, you'd think our priority would be to answer as many as possible. This month, however, we tackle just one, the question of questions -- the question that nearly everyone asks at some point:

Which XML parser should I use?

Let's start with one of the most basic facts about using parsers: Unless you happen to be developing XML-processing software, you can pretty much forget about this question. It's like asking, "I want to buy a car and I want to make sure the wheels will stay on. What brand of lug nuts should I use?"

The question has meaning, to be sure. But it's not really a question to concern yourself with if you're just interested in browsing XML, editing it, or creating style sheets. The developer of the browsing application will almost certainly have made the decision for you, and you can probably override it only with difficulty (and perhaps intellectual pain). For instance, Microsoft's Internet Explorer 5.x browsers use a parser built into the MSXML.DLL file, and the Mozilla browser is built on a parser, written in C, called "expat."

Nevertheless, in some cases you do need to select a parser.

Checking documents online

If you're editing XML documents by hand rather than in a GUI-based editor, you need to check them for "correctness" before putting them to use. There are two degrees of "correctness" in the XML world: well-formedness and validity. Parsers may be similarly classified as non-validating and validating. Non-validating parsers ensure that a document meets the general rules of XML, such as that there's only one root element or that tags are properly balanced. Validating parsers perform more rigorous checks, such as making sure the document conforms to the rules laid out by its document type definition (DTD). Validating parsers also can use information from the DTD to provide extra capabilities, such as entity substitution and attribute defaulting.

(Note: Any parser that validates will also check for well-formedness. A parser which is nominally non-validating may or may not make use of a DTD if one is present.)
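As a rough illustration of the well-formedness half of that distinction, here is a minimal sketch in Python. It assumes the standard library's xml.etree.ElementTree module, which sits on top of expat and does not validate against a DTD. The function name check_well_formed and the sample documents are made up for this example.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return (True, None) if text parses, else (False, error message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # The parser stops at the first well-formedness error it finds.
        return False, str(err)
    return True, None


# Hypothetical sample documents: one good, two breaking the rules named above.
samples = {
    "balanced": "<memo><to>Ann</to></memo>",
    "unbalanced tags": "<memo><to>Ann</memo></to>",
    "two root elements": "<memo/><memo/>",
}

for label, text in samples.items():
    ok, message = check_well_formed(text)
    print(label, "->", "well-formed" if ok else "not well-formed: " + message)
```

A check like this says nothing about validity. A document can pass it and still break every rule in its DTD.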

For this sort of application, I generally use one or more of the various Web-based syntax checkers. (Here, I'll call these "parsers" even though they're actually Web-based interfaces which sit on top of parsers.) Good ones are:

The XML Well-Formedness Checker and Validator can, as its name suggests, perform either validation or simple well-formedness checks, controllable with a checkbox on the page. (Default is well-formedness only.) It's based on Richard Tobin's RXP parser, part of the LT XML package from the Language Technology Group at the University of Edinburgh, which is also available as a standalone package if you prefer to run it locally rather than online. For the online version, you enter the URL of the document and just hit the "check it" button.
The Brown University Scholarly Technology Group's (STG) parser. Strictly a validator, this one also lets you enter a URL pointing to the document to be validated. You can also copy and paste XML into a text area, or select (via a Browse button) a file on your local system; this function makes it my syntax-checker of choice when developing new documents.
The RUWF? ("Are You Well Formed?") parser, available on XML.com. This well-formedness checker is based on the Perl XML::Parser module. Enter a URL and hit the RUWF? button to parse your document.

Aside from the well-formedness-vs.-validity and user-interface differences, these syntax-checkers differ in smaller ways. For example, both the Tobin and STG parsers can optionally be made namespace-aware if your application requires it. You may find that you prefer one tool's error reporting format to another's. And so on.

By the way, in theory you need to submit a given document to only one parser to ensure its "correctness." After all, the XML Recommendation is what it is, right? No wiggle room for interpreting a given chunk of code as correct or not, right? In practice, though, I've occasionally run into discrepancies, and for this reason I'll usually run a document through more than one parser just to be sure of no surprises when it's actually delivered to an application. (To their credit, the parser authors have always been very receptive to bug reports -- or, as the occasion warrants, to pointing out that it's my interpretation of the spec that's at fault!)

Checking it locally

But maybe for one reason or another you really do need to select a standalone parser. What criteria do you use?

First, there's the same validity vs. well-formedness consideration as with the online checkers. And even among non-validating parsers, you may need some optional features that are required only of a validating parser. Do you want the parser to supply an attribute's default value if the document author hasn't done so? Do you need the parser to be namespace-aware? In such cases, you can eliminate whole sub-categories of non-validating parsers from consideration.
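To make those two feature questions concrete, here is a small sketch using Python's xml.parsers.expat module, a binding to the same expat mentioned earlier. The helper start_tags, the two sample documents, and the urn:example:memo namespace are invented for illustration. The sketch only shows how one non-validating parser behaves and says nothing about others.

```python
import xml.parsers.expat

# Hypothetical documents: one uses a namespace prefix, the other declares
# an attribute default in its internal DTD subset.
NS_DOC = '<a:doc xmlns:a="urn:example:memo"><a:item/></a:doc>'
DTD_DOC = '<!DOCTYPE doc [<!ATTLIST doc status CDATA "draft">]><doc/>'


def start_tags(text, namespace_aware):
    """Parse text and return a list of (name, attributes) per start tag."""
    if namespace_aware:
        # With a separator, expat reports names as "namespace-URI local-name".
        parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    else:
        parser = xml.parsers.expat.ParserCreate()
    seen = []
    parser.StartElementHandler = lambda name, attrs: seen.append((name, dict(attrs)))
    parser.Parse(text, True)
    return seen


print(start_tags(NS_DOC, namespace_aware=False))  # prefixed names, as written
print(start_tags(NS_DOC, namespace_aware=True))   # names expanded with the URI
print(start_tags(DTD_DOC, namespace_aware=False)) # attribute default supplied
```

If the third line of output did not include the status attribute, that parser would be ignoring the DTD, and an application relying on defaults would need a different one.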

Otherwise, the principal things to consider are speed, size, and language binding (along with other platform-related issues).

Speed: If you're going to be parsing documents of only a few hundred elements, this is probably the least important concern. It looms larger, of course, as the documents go up in size (and as you need more validating-type features). Even so, I think you need to keep your head on straight about speed -- if you're serving XML documents over the Web, even a few seconds' difference in parsing speed is going to be the least of your problems.

Size: This is closely correlated with speed. The faster a parser is, the more likely it is that its code is tight and that its size (and, of course, its feature set) is small.

Platform: The biggie. Let's say you're planning to serve XML documents via a Perl-based CGI application. In the grand tradition of Perlians throughout history, you will of course use an existing parser -- say, XML::Parser -- rather than writing your own. (In this case, you'll find that XML::Parser is built on the same expat, written in C, that's at the heart of the Mozilla browser.) Or maybe you're using the Oracle 8 database management system to read in and emit XML from its relational tables -- why bother even looking at some parser other than the one that comes with Oracle, and risking potential incompatibilities? If the application you're working on is end-to-end Microsoft-specific, there's no practical advantage (all other things being equal) to considering a parser other than the one built into MSXML.DLL. And so on.
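If you want to check the speed question for your own documents rather than take anyone's word for it, a rough timing harness is easy to put together. The sketch below is in Python and assumes the standard library's xml.etree.ElementTree as the parser under test. The helper names build_document and time_parse, and the generated test document, are hypothetical. Swap in your own parser and real documents before drawing any conclusions.

```python
import time
import xml.etree.ElementTree as ET


def build_document(element_count):
    """Build a flat test document with the given number of child elements."""
    items = "".join(f"<item id='{i}'>value {i}</item>" for i in range(element_count))
    return f"<items>{items}</items>"


def time_parse(text, repeats=20):
    """Return the best wall-clock time, in seconds, over several parses."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        ET.fromstring(text)
        best = min(best, time.perf_counter() - start)
    return best


# Compare a document of a few hundred elements with a much larger one.
for count in (300, 30_000):
    seconds = time_parse(build_document(count))
    print(f"{count:>6} elements: {seconds * 1000:.2f} ms")
```

The numbers will vary with the machine and the shape of the document, so treat them as a sanity check, not a benchmark.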

Excellent sources of information about the characteristics of different parsers, and where to find them, include:

Ken Sall's guide on the Web Developer's Virtual Library (WDVL) site.
The XMLsoftware.com site, at http://www.xmlsoftware.com/parsers/.

One final thing to bear in mind when you embark on a search for the "best parser," whatever that means for you: You'll need to limit your search very quickly or go crazy. Back in 1998, within a few months of the XML 1.0 Recommendation's release, one observer reported on XML-DEV that he'd found over 200 parsers (after hitting 200, he gave up counting).