
XML Questions Answered

July 26, 2000

John E. Simpson



XML.com receives dozens of questions each week about XML, submitted via the FAQ submission form at http://faq.oreillynet.com/xml. In this column we'll be addressing some of the most interesting (read: useful, provocative, outright bizarre) of those questions.

HTML to XML?

Q: How do I convert my existing HTML documents into XML?

A: The most straightforward solution is to use Dave Raggett's popular program, HTML Tidy (or plain old "Tidy"), available from the W3C web site at http://www.w3.org/People/Raggett/tidy/.

Tidy is a command-line utility that runs on a wide variety of operating systems; command-line switches (options) control its processing. At a minimum, it cleans up your HTML by ensuring that elements are properly nested and so on; it also warns you if your HTML uses non-standard code that's likely to cause cross-browser compatibility problems. One of the most useful command-line options is -asxml ("as XML," see?), which does what you seem to be asking. It balances elements properly, as usual, but it also adds some extra information to the document. For instance, it tacks on an XML declaration, <?xml version="1.0"?>, and a <!DOCTYPE...> declaration, which together mark this unambiguously as an XML document. To the root html element it also adds a namespace-declaring attribute that identifies all elements in the document as belonging to the specific XML vocabulary known as XHTML. It even forces all element names to lowercase, since the XHTML standard requires it.
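As a rough illustration of the result, the Python sketch below parses a small document of the kind that -asxml produces and checks for the additions just described. The sample markup is hand-written for this example, not captured Tidy output, and the particular DOCTYPE shown is an assumption; the function name describe_root is likewise made up here. Only Python's standard xml.etree.ElementTree module is used.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the style of "tidy -asxml" output.
# Illustration only: this is NOT captured Tidy output, and the
# DOCTYPE chosen here is an assumption.
SAMPLE_XHTML = """<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Invoice</title></head>
<body><p>Thank you for your order.</p></body>
</html>"""


def describe_root(xml_text):
    """Parse xml_text and return (namespace, local name) of its root element.

    ET.fromstring raises ET.ParseError if the text is not well-formed XML,
    which is exactly what happens with unbalanced HTML.
    """
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced tags as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace, local_name = root.tag[1:].split("}", 1)
    else:
        namespace, local_name = "", root.tag
    return namespace, local_name


namespace, local_name = describe_root(SAMPLE_XHTML)
print(namespace, local_name)
```

Feeding the same function ordinary unbalanced HTML, such as "<p><b>text</p>", raises a ParseError instead, which is a quick way to tell whether a file still needs a pass through Tidy.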

If you're asking about converting HTML to a less generic form of XML than XHTML, your task may turn out to be quite complex. For example, if you've been using HTML to mark up customer invoices, not only the customer's name but also their number, item(s) ordered, quantity, and price are probably all wrapped up inside <p> and </p> tags. How do you know which "kind of paragraph" contains a given kind of information, so you can turn one instance of the p element into a custname element, another into custnumber, another into price, and so on? If you've been using CSS for styling your HTML, you may have supplied the different p elements with attributes such as class="custname", class="custnumber", and class="price"; if that's the case, you may be able to generate meaningful XML using an XSLT stylesheet. There may also be customized software to do the sort of conversion you want. Otherwise you're probably looking down the barrel of an ugly gun.
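To make the class-attribute idea concrete, here is a minimal sketch of the mapping. It is written in Python rather than XSLT, so treat it as a stand-in for the stylesheet approach, not as the stylesheet itself. The invoice fragment, the invoice root element, and the list of class names are all hypothetical, and the input is assumed to be XHTML already (for instance, after a pass through Tidy).

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical invoice fragment, assumed to be well-formed XHTML already.
INVOICE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<p class="custname">Example Customer</p>
<p class="custnumber">1815</p>
<p class="price">19.95</p>
<p>Thank you for your business.</p>
</body>
</html>"""

# Class values we are willing to promote to element names (an assumption).
KNOWN_CLASSES = {"custname", "custnumber", "price"}


def invoice_from_xhtml(xhtml_text):
    """Build an <invoice> element from p elements carrying a known class."""
    source = ET.fromstring(xhtml_text)
    invoice = ET.Element("invoice")
    for p in source.iter("{%s}p" % XHTML_NS):
        css_class = p.get("class")
        if css_class in KNOWN_CLASSES:
            field = ET.SubElement(invoice, css_class)
            # itertext() gathers the text even if the p holds inline markup.
            field.text = "".join(p.itertext()).strip()
    return invoice


invoice = invoice_from_xhtml(INVOICE_XHTML)
print(ET.tostring(invoice, encoding="unicode"))
```

Note that the final paragraph, which has no class attribute, is simply dropped: without some hint in the markup, the program has no way to know what kind of information it holds. That is the "ugly gun" in miniature.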

"Markup"? Say what?

Q: What exactly do you mean by "markup"? Can you please explain using a simple example?

A: Markup is all the text in an XML, HTML, or SGML document other than what you might normally think of as the document's content. Pieces of markup punctuate the content, as it were -- add meaning or structure to what the document otherwise says. For example, consider an XML document whose content includes nominally just a single sentence: "Bolivia exports tin to Curaçao."

In marked-up form, this document might look like the following:

<?xml version="1.0"?>
<!DOCTYPE para>
<!-- What does Bolivia export, and whither? -->
<para xml:lang="en">Bolivia exports tin to Cura&#231;ao.</para>

All text beyond that of our original simple sentence (or, in the case of "&#231;", in place of part of it) is markup. Here, the XML declaration states that the document conforms to the XML 1.0 Recommendation; the document type declaration states that the actual content will be contained wholly within the scope of an element named "para"; the comment provides a human-readable note; the start tag marks the starting point of the document content; the xml:lang attribute declares that, unless otherwise stated, the language of this document is English; the character reference &#231; provides a "portable" way to refer to a special character, ç, which might not otherwise be understood the same way by all XML processing software, in all environments; and the end tag marks the conclusion of the document content.
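One way to see the split between markup and content is to hand the sample document to an XML parser and look at what comes out the other side. The following is a small sketch using Python's standard xml.etree.ElementTree module; the variable names are arbitrary, and the choice of Python is ours, not something the question implies.

```python
import xml.etree.ElementTree as ET

# The sample document from above, reproduced as a Python string.
DOCUMENT = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE para>\n'
    '<!-- What does Bolivia export, and whither? -->\n'
    '<para xml:lang="en">Bolivia exports tin to Cura&#231;ao.</para>'
)

# The reserved xml: prefix is always bound to this namespace name.
XML_NS = "http://www.w3.org/XML/1998/namespace"

para = ET.fromstring(DOCUMENT)

content = para.text                       # character reference already resolved
language = para.get("{%s}lang" % XML_NS)  # value of the xml:lang attribute

print(content)
print(language)
```

The parser consumes the declarations, the comment, and the tags; what it hands back as the element's text is the original sentence, with &#231; already turned into ç, while the language code "en" arrives separately as an attribute value.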

The Dark Side

Q: Are there any weaknesses of XML?

A: Oh, yes. There are any number of reasons why you might not choose XML over some other data representation format. (Interestingly, some of these "weaknesses" are a result of intentional design decisions. In other words, as certain software vendors are wont to claim, "That's not a bug. That's a feature.") Here are a handful of them:

  • XML markup can be incredibly verbose, depending on the vocabulary in question. For instance, what HTML refers to as the p element might show up in an XML counterpart as para or paragraph. (Things can get even worse if the element and attribute names include namespace prefixes.) This can make the markup much more accessible to human readers; unfortunately, it also can make the actual content much harder to read and even (if the markup is done "by hand") much harder to mark up in the first place. Then there's the bandwidth question: How much of your wire are you willing to dedicate to carrying markup, as opposed to true content?

  • XML is platform-neutral. On the face of it, this is laudable. It also limits how much performance can be wrung out of a true-blue XML application (since you can't take advantage of platform-specific tricks such as compressed or other binary formats). Furthermore, fully supporting every Unicode encoding probably adds all kinds of cruft to an application, much of which may be used seldom, if ever, in a given installation of XML processing software.

  • All the pieces aren't yet in place to do whatever you want with XML--certainly not in a fully standards-compliant form, anyhow. We've got XSLT for transforming the structure of XML documents, but XSL itself--the formatting component, what most people think of when they hear the word "stylesheet"--still hasn't been finalized. We've got XPath for telling us how to get around within an XML document, but we're still waiting for a final XLink to tell us how to get to a document in the first place. DTDs enable us to do some rudimentary sorts of validity checking, but it will take the still-unfinished XML Schema to bring XML onto anything like a par with the built-in type checking familiar to database developers. The list goes on... and every day, it seems, a new version of a new standard is announced. (If you're looking for software to support it all, well, you've got a long wait ahead of you!)

And last but certainly not least:

  • Many people's expectations are too high, not just for XML but for any heavily-marketed technology du jour. If your boss has read that XML (or whatever) will cure world hunger, it will do you no good to know otherwise. And, needless to say, the world's hungry will be just as underfed as ever.