XML Questions Answered
XML.com receives dozens of questions each week about XML, submitted via the FAQ submission form at http://faq.oreillynet.com/xml. In this column we'll be addressing some of the most interesting (read: useful, provocative, outright bizarre) of those questions.
Q: How do I convert my existing HTML documents into XML?
A: The most straightforward solution is to use Dave Raggett's popular program, HTML Tidy (or plain old "Tidy"), available from the W3C web site at http://www.w3.org/People/Raggett/tidy/.
Tidy is a command-line utility which runs on a wide variety of
operating systems; it uses various command-line switches
(parameters) to control its processing. At a minimum, it simply
cleans up your HTML by ensuring that elements are properly nested
and so on; it also warns you if your HTML uses non-standard code
that's likely to cause cross-browser compatibility problems. One of
the most useful command-line options is -asxml
("as XML," see?), which does what you seem to be asking.
It will properly balance elements, per usual, but it also adds some
extra information to the document. For instance, it tacks on an XML
declaration, <?xml version="1.0"?>, and a
<!DOCTYPE...> statement, which unambiguously mark this
as an XML document. To the root html element it also
adds a namespace-declaring attribute that identifies all elements
in the document as conforming to the specific XML vocabulary known
as XHTML. It even forces all element names to lowercase, since the
XHTML standard requires it.
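A minimal sketch of that invocation, driven from Python purely for illustration. It assumes a tidy executable is installed and on the PATH. The helper names are hypothetical; only the -asxml switch comes from the description above.

```python
import shutil
import subprocess


def build_tidy_command(html_path):
    """Return the argument list for running Tidy with the -asxml switch."""
    return ["tidy", "-asxml", html_path]


def html_to_xhtml(html_path, xhtml_path):
    """Run Tidy on html_path and save its standard output to xhtml_path.

    Returns Tidy's exit status, or None if no tidy executable is found.
    """
    if shutil.which("tidy") is None:
        return None
    # Tidy writes the converted document to standard output and its
    # diagnostics to standard error, so only the former lands in the file.
    with open(xhtml_path, "wb") as out:
        result = subprocess.run(build_tidy_command(html_path), stdout=out)
    return result.returncode
```

Calling html_to_xhtml("page.html", "page.xhtml") (placeholder file names) would then leave the converted document in page.xhtml. Tidy may signal warnings through a non-zero exit status, so the sketch hands the status back to the caller rather than treating it as a failure.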
If you're asking about converting HTML to a less generic form of
XML than XHTML, your task may turn out to be quite complex. For
example, if you've been using HTML to mark up customer invoices,
not only the customer's name but also their number, item(s)
ordered, quantity, and price are probably all wrapped up inside
<p> and </p> tags. How do you
know which "kind of paragraph" contains a given kind of
information, so you can turn one instance of the p
element into a custname element, another into
custnumber, another into price, and so on? If
you've been using CSS for styling your HTML, you may have supplied
the different p elements with
class="custname" (etc.) attributes; if that's the
case, you may be able to generate meaningful XML using an XSLT
stylesheet. There may also be customized software to do the sort of
conversion you want. Otherwise you're probably looking down the
barrel of an ugly gun.
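When class attributes like these are present, the renaming itself is mechanical. The text above suggests XSLT for the job; the sketch below swaps in Python's standard xml.etree.ElementTree module to show the same idea. The sample invoice, its class values, and the invoice root element are invented for illustration, and the input is assumed to be well-formed XHTML already (for example, Tidy output).

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical XHTML input: every field is a p element, and only the
# class attribute says what kind of information it holds.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Invoice</title></head>
  <body>
    <p class="custname">Jane Doe</p>
    <p class="custnumber">10042</p>
    <p class="price">19.95</p>
    <p>Thank you for your order.</p>
  </body>
</html>"""


def paragraphs_to_fields(xhtml_text, root_name="invoice"):
    """Build a new tree whose elements are named after each p's class."""
    source = ET.fromstring(xhtml_text)
    target = ET.Element(root_name)
    for p in source.iter(f"{{{XHTML_NS}}}p"):
        field = p.get("class")
        if field is None:
            # No class attribute: nothing tells us what this paragraph means.
            continue
        # Assumes the class value is a single word that is also a legal
        # XML element name.
        child = ET.SubElement(target, field)
        # itertext() also collects text nested inside inline children.
        child.text = "".join(p.itertext()).strip()
    return target
```

The unlabeled "Thank you" paragraph is simply dropped, which is the crux of the problem described above: without some hint such as class, there is nothing to key the conversion on.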
Q: What exactly do you mean by "markup"? Can you please explain using a simple example?
A: Markup is all the text in an XML, HTML, or SGML document other than what you might normally think of as the document's content. Pieces of markup punctuate the content, as it were -- add meaning or structure to what the document otherwise says. For example, consider an XML document whose content includes nominally just a single sentence: "Bolivia exports tin to Curaçao."
In marked-up form, this document might look like the following:
<?xml version="1.0"?>
<!DOCTYPE para>
<!-- What does Bolivia export, and whither? -->
<para xml:lang="en">Bolivia exports tin to Cura&#231;ao.</para>
All text beyond that of our original simple sentence (or in place of it, in the case of "&#231;") is markup. Here, the markup declares that the document conforms to the XML 1.0 Recommendation and that its actual content will be contained wholly within the scope of an element named "para"; provides a human-readable comment; marks the starting point of the document content; declares that, unless otherwise stated, the language of this document is English; provides a "portable" reference to a special character, ç, which might not otherwise be understood the same way by all XML processing software, in all environments; and marks the conclusion of the document content.
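One way to see the split between markup and content is to hand the document to a parser and look at what comes back. The sketch below uses Python's standard xml.etree.ElementTree module. It assumes the special character is written as the numeric character reference &#231;, and the describe helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# The marked-up document, with the c-cedilla written as the numeric
# character reference &#231; (an assumption about the original markup).
DOCUMENT = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE para>\n'
    '<!-- What does Bolivia export, and whither? -->\n'
    '<para xml:lang="en">Bolivia exports tin to Cura&#231;ao.</para>'
)

# ElementTree spells out the reserved xml: prefix as this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def describe(document):
    """Parse the document and report what the markup says about it."""
    root = ET.fromstring(document)
    return {
        "element": root.tag,          # name of the root element
        "lang": root.get(XML_LANG),   # value of the xml:lang attribute
        "content": root.text,         # character data, references resolved
    }
```

The declaration, the DOCTYPE, the comment, and the tags do not appear in the "content" value; the parser hands back only the sentence, with the character reference resolved to ç.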
Q: Are there any weaknesses of XML?
A: Oh, yes. There are any number of reasons why you might not choose XML over some other data representation format. (Interestingly, some of these "weaknesses" are a result of intentional design decisions. In other words, as certain software vendors are wont to claim, "That's not a bug. That's a feature.") Here are a handful of them:
XML markup can be incredibly verbose, depending on the
vocabulary in question. For instance, what HTML refers to as the
p element might show up in an XML counterpart as
para or paragraph. (Things can get even
worse if the element and attribute names include namespace
prefixes.) This can make the markup much more accessible to human
readers; unfortunately, it also can make the actual content much
harder to read and even (if the markup is done "by hand") much
harder to mark up in the first place. Then there's the bandwidth
question: How much of your wire are you willing to dedicate to
carrying markup, as opposed to true content?
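To put a rough number on the bandwidth question, the sketch below computes what share of a single element's characters goes to its start and end tags for the three element names mentioned above. It counts characters rather than encoded bytes and ignores attributes and namespace prefixes, so treat it as a back-of-the-envelope illustration only; markup_overhead is a hypothetical helper name.

```python
def markup_overhead(element_name, content):
    """Return the fraction of an element's characters spent on its tags."""
    start_tag = f"<{element_name}>"
    end_tag = f"</{element_name}>"
    total = len(start_tag) + len(content) + len(end_tag)
    return (len(start_tag) + len(end_tag)) / total


SENTENCE = "Bolivia exports tin to Curaçao."

# Compare the terse HTML name with its more readable XML counterparts.
for name in ("p", "para", "paragraph"):
    print(f"{name:<10} {markup_overhead(name, SENTENCE):.0%}")
```

The longer the element name and the shorter the content, the larger the share of the wire that carries markup.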
XML is platform-neutral. On the face of it, this is laudable. It also limits how much performance can be wrung out of a true-blue XML application, since you can't take advantage of platform-specific tricks such as compressed or other binary formats. Furthermore, fully supporting any Unicode encoding probably adds all kinds of cruft to an application, and that cruft may be used seldom, if ever, in a given installation of XML processing software.
All the pieces aren't yet in place to do whatever you want with XML--certainly not in a fully standards-compliant form, anyhow. We've got XSLT for transforming the structure of XML documents, but XSL itself--the formatting component, what most people think of when they hear the word "stylesheet"--still hasn't been finalized. We've got XPath for telling us how to get around within an XML document, but we're still waiting for a final XLink to tell us how to get to a document in the first place. DTDs enable us to do some rudimentary sorts of validity checking; but it will take the still-unfinished XML Schema to bring XML onto anything like a par with the built-in type checking familiar to database developers. The list goes on... and every day, it seems, a new version of a new standard is announced. (If you're looking for software to support it all, well, you've got a long wait ahead of you!)
And last but certainly not least:
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.