XML.com: XML From the Inside Out


Will XML replace HTML?

December 13, 2000

This month, we tackle two related, neurosis-inducing questions common to Web developers just dipping their toes into XML.

Q: Will XML ever replace HTML?

A: Two answers, one philosophic and one pragmatic.

The philosophic answer is that XML isn't really meant, except incidentally, as a replacement for HTML. While the XML 1.0 Recommendation was under development, it was sometimes referred to as "SGML for the Web," and some residue of that perception remains. XML is superbly adaptable to the Web, true. But given the number of XML-based markup languages possible (and the thousands already extant, for that matter), no imaginable Web browser could possibly figure out how to render all the corresponding documents. For example, what's a conventional browser to make of an <employee emdID="emp73519"> tag, or an <invoice_num>, let alone an <aperçu>?

That leads us to the more pragmatic answer: XML is already replacing HTML...sort of.

You probably know that the World Wide Web Consortium (W3C) is responsible for the HTML standard, currently at version 4.01. What you may not know is that the W3C plans to release no further enhancements to HTML; instead, it has approved a Recommendation for version 1.0 of what it has dubbed Extensible HTML, or XHTML. Henceforth, all modifications to the Web's lingua franca will be made to the XHTML standard, not to HTML itself.

So what's XHTML? At root, it's simply an XML-ized form of HTML. Start tags are balanced with end tags, for instance, and all elements must nest correctly within one another. Gone are the "bad old days" of overlapping structures like

<b>This is bold <i>and this is italic</b>
but the italicized element doesn't nest properly within
the bold.</i>
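
To see why an XML parser objects to markup like this, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name is_well_formed and the sample strings are illustrative assumptions, not anything taken from the XHTML Recommendation; the fragments are wrapped in a <p> element only so that each has a single root.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

# The <i> element starts inside <b> but ends outside it: overlapping, not nested.
overlapping = "<p><b>This is bold <i>and this is italic</b> but misnested.</i></p>"

# The same content with <i> closed before <b>.
nested = "<p><b>This is bold <i>and this is italic</i></b> and properly nested.</p>"

print(is_well_formed(overlapping))  # False
print(is_well_formed(nested))       # True
```

An HTML browser will usually shrug off the first string; an XML parser stops at the mismatched </b>.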

There are other significant differences between XHTML and HTML, some of them highlighted in the answer to the next question.

XHTML 1.0 comes in three "flavors" (the W3C's word, in a rare moment of whimsy):

  • XHTML Transitional: This is a good choice for existing HTML documents which you want to convert to XHTML. Strictly speaking, XML enforces a separation between content or structure and the manner in which it's displayed, so an HTML tag such as <body bgcolor="#F0F0F0"> has no place in an XML document. A truly XML-based form of HTML would require presentation characteristics (like the bgcolor attribute) to be represented in a stylesheet apart from the document itself. XHTML Transitional relaxes this requirement, which makes it more likely that legacy browsers (which may or may not support stylesheets) will continue to work as expected.
  • XHTML Strict: Here, the kid gloves come off. All presentation-related markup is banned; if you want a particular element to be displayed in a particular way, you must use a stylesheet.
  • XHTML Frameset: If you need to use frames in constructing your Web pages, use this version of XHTML 1.0 for the frameset itself. (Contents of the individual frames will be marked up in one of the other two flavors.)
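
In practice, a document announces which flavor it uses through its DOCTYPE declaration. The sketch below collects the three declarations given in the XHTML 1.0 Recommendation into a small Python lookup; the dictionary and the doctype_for helper are hypothetical names chosen for illustration, not part of any standard tool.

```python
# DOCTYPE declarations for the three XHTML 1.0 flavors, as published
# in the W3C XHTML 1.0 Recommendation.
XHTML_DOCTYPES = {
    "strict": (
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
        '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    ),
    "transitional": (
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
        '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
    ),
    "frameset": (
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" '
        '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">'
    ),
}

def doctype_for(flavor: str) -> str:
    """Look up the DOCTYPE line for a flavor name (case-insensitive)."""
    return XHTML_DOCTYPES[flavor.lower()]

print(doctype_for("Transitional"))
```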

Q: Is it possible to change an HTML-based web page into XML?

A: Again, two answers. One assumes that you simply want to use XHTML (see previous answer); the other, that you want to convert your HTML into some truly semantically meaningful, application-specific markup language such as MathML, the Chemical Markup Language, or one of your own devising.

First, let's take a look at converting HTML to XHTML. The W3C's XHTML Recommendation provides a convenient list of the differences between the two languages. A couple of the obvious differences were mentioned above, but there are also some that will surprise HTML developers. For instance, XML element names are case sensitive; in a hypothetical XML-ized form of HTML the <img> tag would represent a different element type than <IMG>, <Img>, etc. So the XHTML Recommendation's authors tossed a coin, as it were, and opted for all-lowercase element names. Also, empty elements (those represented in HTML as <img>, <link>, <hr>, <br> and so on) must use the special XML empty-element tag form, with a slash (/) before the closing >. The <br> tag becomes <br/>, <hr> becomes <hr/>, and so on.

(Note: Browsers that don't understand XML have a habit of choking on this empty-element form. By a happy accident, though, they don't choke on it if you precede the slash with a space: <br /> instead of <br/>, <hr /> instead of <hr/>, etc. This is one of the few cases I can think of in which we can be grateful for the lazy markup encouraged by browser vendors. And it's the way to go if you're writing your own backwards-compatible XHTML.)
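
The two rules just described (lowercase element names, and the space-then-slash form for empty elements) can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the function name xhtmlize_tags and the short EMPTY_ELEMENTS list are mine, attribute names and values are left untouched, and a regular expression like this will not cope with every oddity in real-world HTML.

```python
import re

# Elements that HTML writes without an end tag (a partial, illustrative list).
EMPTY_ELEMENTS = {"img", "link", "hr", "br", "meta", "input"}

# Groups: optional "/" of an end tag, element name, attributes, optional trailing "/".
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*?)(/?)>")

def xhtmlize_tags(html: str) -> str:
    """Lowercase element names and write empty elements as <name />."""
    def fix(match):
        closing, name, rest, _slash = match.groups()
        name = name.lower()
        if closing:
            return f"</{name}>"
        rest = rest.rstrip()
        if name in EMPTY_ELEMENTS:
            # Space before the slash keeps older browsers happy.
            return f"<{name}{rest} />"
        return f"<{name}{rest}>"
    return TAG.sub(fix, html)

print(xhtmlize_tags("<P>Line one<BR>line two<HR></P>"))
# <p>Line one<br />line two<hr /></p>
```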

By far the simplest means of converting your existing HTML documents into their XHTML form is to use Dave Raggett's free HTML Tidy utility, available at the W3C site. Tidy runs on a wide variety of platforms and accepts an almost dizzying array of command-line parameters which direct its processing. A number of vendors and developers have also integrated Tidy into their own products. (On Windows-based machines, a popular such tool is Chami.com's free HTML-Kit.)
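
If you'd rather drive Tidy from a script than from the command line, something like the following Python sketch might do. It assumes a tidy executable is on your PATH and that it accepts the -quiet, -asxhtml, and -output options, as current HTML Tidy releases do (older builds may differ). The function names and file names are hypothetical.

```python
import subprocess

def tidy_command(source_path: str, output_path: str) -> list:
    """Build the argument list asking Tidy to write XHTML to output_path."""
    return ["tidy", "-quiet", "-asxhtml", "-output", output_path, source_path]

def convert_to_xhtml(source_path: str, output_path: str) -> int:
    """Run Tidy and return its exit status.

    Tidy reports 0 for a clean run, 1 if it issued warnings,
    and 2 if it hit errors it could not repair.
    """
    result = subprocess.run(tidy_command(source_path, output_path))
    return result.returncode

# Example (requires Tidy to be installed):
# status = convert_to_xhtml("old-page.html", "new-page.xhtml")
```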

But then there's the more vexing question: What if you want to convert your HTML documents not to XHTML, but to some true XML application?

This question vexes because HTML element names (or XHTML ones, for that matter) carry no inherent meaning about the content they mark up, whereas meaningful element names are the hallmark of XML applications as we normally think of them. Let's say you've got an (X)HTML fragment which looks like this:

    <td>Charles Darwin</td>
    <td>Origin of Species</td>
    <td>Joseph Heller</td>

Converted to a customized XML application, this might be represented something like

    <author>Charles Darwin</author>
    <title>Origin of Species</title>
    <author>Joseph Heller</author>

That's pretty straightforward, right? But would you really want to convert -- using, say, a standard search-and-replace operation -- every single <table> tag in the original documents to <books>, <tr> to <book>, the first occurrence of each <td> within a <tr> to <author>, and the second occurrence of each <td> within a <tr> to <title>? Not very likely! (At the very least, the first table row containing three or more <td> elements would break your little scheme outright.)
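
To make the fragility concrete, here is a sketch of that positional mapping in Python, using the standard xml.etree.ElementTree module. It assumes the table is already well-formed XHTML; the function name table_to_books is hypothetical. Note that the only defense it has against an unexpected row is to give up.

```python
import xml.etree.ElementTree as ET

def table_to_books(xhtml_table: str) -> ET.Element:
    """Map <table>/<tr>/<td> to <books>/<book>/<author>,<title> by position."""
    table = ET.fromstring(xhtml_table)
    books = ET.Element("books")
    for row in table.iter("tr"):
        cells = row.findall("td")
        if len(cells) != 2:
            # A row with any other number of cells breaks the scheme.
            raise ValueError(f"expected 2 cells per row, found {len(cells)}")
        book = ET.SubElement(books, "book")
        ET.SubElement(book, "author").text = cells[0].text
        ET.SubElement(book, "title").text = cells[1].text
    return books

source = (
    "<table><tr><td>Charles Darwin</td>"
    "<td>Origin of Species</td></tr></table>"
)
print(ET.tostring(table_to_books(source), encoding="unicode"))
```

Nothing in the input says that the first cell is an author; the code simply takes it on faith.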

Now if your existing HTML is very carefully marked up, especially using class attributes on every instance of every element type (for use with CSS stylesheets, for instance), this might work. Consider

<table class="books">
  <tr class="book">
    <td class="author">Charles Darwin</td>
    <td class="title">Origin of Species</td>
  </tr>
  <tr class="book">
    <td class="author">Joseph Heller</td>
    <td class="title">Catch-22</td>
  </tr>
</table>

See? Then you could do your search-and-replace operation easily...well, to correct the start tags, anyway. (And assuming, of course, that you'd been sufficiently neurotic to include all these class attributes in the first place.)
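
A parser-based approach sidesteps the end-tag problem, because renaming an element changes its start and end tags together. The following Python sketch assumes well-formed XHTML input in which every class value is a single word that is also a legal XML name; the function name classes_to_elements is hypothetical.

```python
import xml.etree.ElementTree as ET

def classes_to_elements(xhtml: str) -> str:
    """Rename each element that has a class attribute to that class value."""
    root = ET.fromstring(xhtml)
    for element in root.iter():
        semantic_name = element.attrib.pop("class", None)
        if semantic_name:
            # Renaming the element fixes both its start tag and its end tag.
            element.tag = semantic_name
    return ET.tostring(root, encoding="unicode")

source = (
    '<table class="books"><tr class="book">'
    '<td class="author">Joseph Heller</td>'
    '<td class="title">Catch-22</td>'
    "</tr></table>"
)
print(classes_to_elements(source))
# <books><book><author>Joseph Heller</author><title>Catch-22</title></book></books>
```

Elements without a class attribute keep their original names, so any markup you weren't neurotic about still needs attention by hand.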

Almost certainly, though, you'd need instead to undertake a very painstaking (and perhaps painful) analysis of your document structure and a mapping of that structure into semantically meaningful markup, followed by a difficult manual conversion effort. This is not an assignment most of us would want to see on our list of job objectives for the coming year. But yes, it's possible to change HTML into XML like this.

For further information about XHTML, read XHTML: The Clean Code Solution.