XML Questions Answered
July 26, 2000
XML.com receives dozens of questions each week about XML, submitted via the FAQ submission form at http://faq.oreillynet.com/xml. In this column we'll be addressing some of the most interesting (read: useful, provocative, outright bizarre) of those questions.
HTML to XML?
Q: How do I convert my existing HTML documents into XML?
A: The most straightforward solution is to use Dave Raggett's popular program, HTML Tidy (or plain old "Tidy"), available from the W3C web site at http://www.w3.org/People/Raggett/tidy/.
Tidy is a command-line utility which runs on a wide variety of operating systems; it accepts various command-line switches (parameters) to control its processing. At a minimum, it simply cleans up your HTML by ensuring that elements are properly nested and so on; it also warns you if your HTML uses non-standard code that's likely to cause cross-browser compatibility problems. One of the most useful command-line options is -asxml ("as XML," see?), which does what you seem to be asking. It will properly balance elements, per usual, but it also adds some extra information to the document. For example, it tacks on an XML declaration, <?xml version="1.0"?>, and a <!DOCTYPE...> statement, which unambiguously mark this as an XML document. To the html element it also adds a namespace-declaring attribute that identifies all elements in the document as conforming to the specific XML vocabulary known as XHTML. It even forces all element names to lowercase, since the XHTML standard requires it.
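One way to see what this kind of cleanup buys you is to hand the "before" and "after" text to an XML parser. The following is only a minimal sketch in Python: it does not run Tidy, and both documents are hand-written stand-ins (assumptions for illustration) rather than actual Tidy output. The helper name is_well_formed is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-ins for illustration; NOT actual Tidy input or output.
HTML_BEFORE = "<HTML><BODY><P>First paragraph<P>Second paragraph</BODY></HTML>"

XHTML_AFTER = """<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Example</title></head>
<body><p>First paragraph</p><p>Second paragraph</p></body>
</html>"""


def is_well_formed(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The unbalanced <P> elements make the original HTML unusable as XML.
print(is_well_formed(HTML_BEFORE))

# The balanced, lowercase version parses cleanly.
print(is_well_formed(XHTML_AFTER))

# The namespace-declaring attribute shows up as part of each element's name.
root = ET.fromstring(XHTML_AFTER)
print(root.tag)
```

The parser reports the root element's name together with the XHTML namespace URI, which is how XML software tells this html element apart from one in some other vocabulary.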
If you're asking about converting HTML to a less generic form of XML than XHTML, your task may turn out to be quite complex. For example, if you've been using HTML to mark up invoices, not only the customer's name but also their number, item(s) ordered, quantity, and price are probably all wrapped up inside <p> tags. How do you know which "kind of paragraph" contains a given kind of information, so that you can turn one instance of the p element into a custnumber, another into price, and so on? If you've been using CSS for styling your HTML, you may have supplied the different p elements with class="custname" (etc.) attributes and so on; if that's the case, you may be able to generate meaningful XML using an XSLT stylesheet. There may also be customized software to do the sort of conversion you want. Otherwise you're probably staring down the barrel of an ugly gun.
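To make the class-attribute idea concrete, here is a rough sketch of the mapping. Note that it is written in Python rather than XSLT, purely as a stand-in for what a stylesheet would do. The sample invoice, the invoice wrapper element, and the item and quantity class names are assumptions made up for illustration; only custname, custnumber, and price come from the discussion above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed (XHTML-style) invoice fragment.
XHTML_INVOICE = """<div>
  <p class="custname">Jane Doe</p>
  <p class="custnumber">10042</p>
  <p class="item">Widget</p>
  <p class="quantity">3</p>
  <p class="price">19.95</p>
  <p>Thank you for your order.</p>
</div>"""

# Class values we are willing to turn into element names (assumed vocabulary).
KNOWN_FIELDS = {"custname", "custnumber", "item", "quantity", "price"}


def invoice_from_paragraphs(xhtml_text):
    """Build an <invoice> element from p elements that carry a known class."""
    source = ET.fromstring(xhtml_text)
    invoice = ET.Element("invoice")
    for p in source.iter("p"):
        kind = p.get("class")
        if kind not in KNOWN_FIELDS:
            # No usable class attribute: nothing tells us what this
            # paragraph means, so it is left out of the result.
            continue
        field = ET.SubElement(invoice, kind)
        field.text = (p.text or "").strip()
    return invoice


result = invoice_from_paragraphs(XHTML_INVOICE)
print(ET.tostring(result, encoding="unicode"))
```

The paragraph without a class attribute is silently dropped, which is exactly the problem described above: without some hint in the source, there is no way to know what "kind of paragraph" it is.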
"Markup"? Say what?
Q: What exactly do you mean by "markup"? Can you please explain using a simple example?
A: Markup is all the text in an XML, HTML, or SGML document other than what you might normally think of as the document's content. Pieces of markup punctuate the content, as it were -- add meaning or structure to what the document otherwise says. For example, consider an XML document whose content includes nominally just a single sentence: "Bolivia exports tin to Curaçao."
In marked-up form, this document might look like the following:
<?xml version="1.0"?>
<!DOCTYPE para>
<!-- What does Bolivia export, and whither? -->
<para xml:lang="en">Bolivia exports tin to Cura&#231;ao.</para>
All text beyond that of our original simple sentence (or in place of, in the case of the "&#231;") is markup. Here, the markup declares that the document conforms to the XML 1.0 Recommendation, and that its actual content will be contained wholly within the scope of an element named "para"; provides a human-readable comment; marks the starting point of the document content; declares that, unless otherwise stated, the language of the document is English; provides a "portable" reference to a special character, ç, which might not otherwise be understood the same way by all XML processing software, in all environments; and marks the conclusion of the document content.
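A parser makes the split between markup and content easy to see. The sketch below, in Python, feeds it a version of the sample document in which the ç is written as the numeric character reference &#231; (an assumption about how the "portable" reference would be spelled), then asks for the content and the language separately.

```python
import xml.etree.ElementTree as ET

# The sample document, with the special character written as a
# numeric character reference (&#231;) instead of a literal character.
DOCUMENT = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE para>\n'
    '<!-- What does Bolivia export, and whither? -->\n'
    '<para xml:lang="en">Bolivia exports tin to Cura&#231;ao.</para>'
)

# The xml: prefix is bound to this fixed namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

para = ET.fromstring(DOCUMENT)

# The content: the parser has already resolved the character reference.
content = para.text
print(content)

# One piece of information carried by the markup, not by the content.
lang = para.get("{%s}lang" % XML_NS)
print(lang)
```

Everything else in the document (the XML declaration, the DOCTYPE, the comment, the tags themselves) is consumed by the parser and never shows up in the content string.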
The Dark Side
Q: Are there any weaknesses of XML?
A: Oh, yes. There are any number of reasons why you might not choose XML over some other data representation format. (Interestingly, some of these "weaknesses" are a result of intentional design decisions. In other words, as certain software vendors are wont to claim, "That's not a bug. That's a feature.") Here are a handful of them:
- XML markup can be incredibly verbose, depending on the vocabulary in question. For instance, what HTML refers to as the p element might show up in an XML counterpart as paragraph. (Things can get even worse if the element and attribute names include namespace prefixes.) This can make the markup much more accessible to human readers; unfortunately, it also can make the actual content much harder to read and even (if the markup is done "by hand") much harder to mark up in the first place. Then there's the bandwidth question: How much of your wire are you willing to dedicate to carrying markup, as opposed to true content?
- XML is platform-neutral. On the face of it, this is laudable. It also diminishes how much performance can be wrung out of a true-blue XML application (since you can't take advantage of platform-specific tricks like compression and other binary formats). Furthermore, fully supporting any Unicode encoding probably adds to an application all kinds of cruft that may be used seldom, if ever, in a given installation of XML processing software.
- All the pieces aren't yet in place to do whatever you want with XML--certainly not in a fully standards-compliant form, anyhow. We've got XSLT for transforming the structure of XML documents, but XSL itself--the formatting component, what most people think of when they hear the word "stylesheet"--still hasn't been finalized. We've got XPath for telling us how to get around within an XML document, but we're still waiting for a final XLink to tell us how to get to a document in the first place. DTDs enable us to do some rudimentary sorts of validity checking; but it will take the still-unfinished XML Schema to bring XML onto anything like a par with the built-in type checking familiar to database developers. The list goes on... and every day, it seems, a new version of a new standard is announced. (If you're looking for software to support it all, well, you've got a long wait ahead of you!)
And last but certainly not least:
- Many people's expectations are too high, not just for XML but for any heavily-marketed technology du jour. If your boss has read that XML (or whatever) will cure world hunger, it will do you no good to know otherwise. And, needless to say, the world's hungry will be just as underfed as ever.