XML.com 
 Published on XML.com http://www.xml.com/pub/a/2000/07/26/qanda/index.html

XML Questions Answered
By John E. Simpson
July 26, 2000


XML.com receives dozens of questions each week about XML, submitted via the FAQ submission form at http://faq.oreillynet.com/xml. In this column we'll be addressing some of the most interesting (read: useful, provocative, outright bizarre) of those questions.

HTML to XML?

Q: How do I convert my existing HTML documents into XML?

A: The most straightforward solution is to use Dave Raggett's popular program, HTML Tidy (or plain old "Tidy"), available from the W3C web site at http://www.w3.org/People/Raggett/tidy/.

Tidy is a command-line utility which runs on a wide variety of operating systems; it uses various command-line switches (parameters) to control its processing. At a minimum, it simply cleans up your HTML by ensuring that elements are properly nested and so on; it also warns you if your HTML uses non-standard code that's likely to cause cross-browser compatibility problems. One of the most useful command-line options is -asxml ("as XML," see?), which does what you seem to be asking. It will properly balance elements, as usual, but it also adds some extra information to the document. For instance, it tacks on an XML declaration, <?xml version="1.0"?>, and a <!DOCTYPE...> declaration, which together unambiguously mark this as an XML document. To the root html element it also adds a namespace-declaring attribute that identifies all elements in the document as conforming to the specific XML vocabulary known as XHTML. It even forces all element names to lowercase, since the XHTML standard requires it.
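
If you have a whole directory of HTML files to convert, you will probably want to drive Tidy from a script rather than type the command each time. The following is only a sketch, written in Python, and it assumes that a tidy executable is installed and on your PATH. The function names and file names are hypothetical; the only Tidy switch used is -asxml, described above.

```python
import shutil
import subprocess


def tidy_as_xml_command(source_path):
    """Build the Tidy command line that writes XHTML to standard output."""
    return ["tidy", "-asxml", source_path]


def convert(source_path, dest_path):
    """Run Tidy on source_path and save its XHTML output to dest_path.

    Assumes a 'tidy' executable is available on PATH.
    """
    if shutil.which("tidy") is None:
        raise FileNotFoundError("tidy executable not found on PATH")
    with open(dest_path, "wb") as out:
        # Tidy writes the converted document to stdout and its warnings
        # to stderr, so only stdout is captured in the output file.
        result = subprocess.run(tidy_as_xml_command(source_path), stdout=out)
    # Return the exit status so the caller can decide how to treat it.
    return result.returncode
```

Calling convert("invoice.html", "invoice.xhtml") would then leave the XHTML version in invoice.xhtml. Check the returned status and Tidy's messages before trusting the result.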

If you're asking about converting HTML to a less generic form of XML than XHTML, your task may turn out to be quite complex. For example, if you've been using HTML to mark up customer invoices, not only the customer's name but also their number, item(s) ordered, quantity, and price are probably all wrapped up inside <p> and </p> tags. How do you know which "kind of paragraph" contains a given kind of information, so you can turn one instance of the p element into a custname element, another into custnumber, another into price, and so on? If you've been using CSS for styling your HTML, you may have supplied the different p elements with attributes such as class="custname"; if that's the case, you may be able to generate meaningful XML using an XSLT stylesheet. There may also be customized software to do the sort of conversion you want. Otherwise you're probably looking down the barrel of an ugly gun.
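
To make the class-attribute idea concrete, here is a minimal sketch of the same mapping done in Python rather than in XSLT. It assumes the HTML has already been converted to well-formed XHTML (by Tidy, say). The sample invoice, the invoice root element, and the set of class names are all made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical XHTML input: each meaningful paragraph carries a class.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p class="custname">Jane Doe</p>
    <p class="custnumber">10042</p>
    <p>Thank you for your order.</p>
    <p class="price">19.95</p>
  </body>
</html>"""

# Class names that should become element names in the output.
KNOWN_CLASSES = {"custname", "custnumber", "price"}


def paragraphs_to_invoice(xhtml_text):
    """Turn each classed p element into an element named after its class."""
    source = ET.fromstring(xhtml_text)
    invoice = ET.Element("invoice")
    for p in source.iter("{%s}p" % XHTML_NS):
        cls = p.get("class")
        if cls in KNOWN_CLASSES:
            field = ET.SubElement(invoice, cls)
            # Collect all text inside the paragraph, including nested markup.
            field.text = "".join(p.itertext()).strip()
    return invoice


invoice = paragraphs_to_invoice(SAMPLE)
print(ET.tostring(invoice, encoding="unicode"))
```

Paragraphs without a recognized class (the thank-you line here) are simply dropped. That is exactly the information you cannot recover when the original HTML gave you no hint about what each paragraph means.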

"Markup"? Say what?

Q: What exactly do you mean by "markup"? Can you please explain using a simple example?

A: Markup is all the text in an XML, HTML, or SGML document other than what you might normally think of as the document's content. Pieces of markup punctuate the content, as it were -- add meaning or structure to what the document otherwise says. For example, consider an XML document whose content includes nominally just a single sentence: "Bolivia exports tin to Curaçao."

In marked-up form, this document might look like the following:

<?xml version="1.0"?>
<!DOCTYPE para>
<!-- What does Bolivia export, and whither? -->
<para xml:lang="en">Bolivia exports tin to Cura&#231;ao.</para>

All text beyond that of our original simple sentence (or in place of it, in the case of "&#231;") is markup. Here, the markup declares that the document conforms to the XML 1.0 Recommendation (the XML declaration), and that its actual content will be contained wholly within the scope of an element named "para" (the <!DOCTYPE...> declaration); provides a human-readable comment; marks the starting point of the document content (the <para> start tag); declares that, unless otherwise stated, the language of this document is English (the xml:lang attribute); provides a "portable" reference to a special character, ç, which might not otherwise be understood the same way by all XML processing software, in all environments (the character reference &#231;); and marks the conclusion of the document content (the </para> end tag).
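
One way to see the split between markup and content is to hand the document to an XML parser and look at what comes back. The sketch below uses Python's standard xml.etree.ElementTree module, which is just one parser among many; the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

# The sample document from above, exactly as marked up.
DOCUMENT = b"""<?xml version="1.0"?>
<!DOCTYPE para>
<!-- What does Bolivia export, and whither? -->
<para xml:lang="en">Bolivia exports tin to Cura&#231;ao.</para>
"""

# ElementTree reports the reserved xml: prefix by its namespace name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = ET.fromstring(DOCUMENT)
content = root.text

print(root.tag)            # element name, taken from the start tag
print(root.get(XML_LANG))  # value of the xml:lang attribute
print(content)             # the content, with the character reference resolved
```

The parser consumes the declaration, the comment, the tags, and the attribute as markup. What it returns as the element's text is the original sentence, with the character reference already replaced by the ç it stands for.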

The Dark Side

Q: Are there any weaknesses of XML?

A: Oh, yes. There are any number of reasons why you might not choose XML over some other data representation format. (Interestingly, some of these "weaknesses" are a result of intentional design decisions. In other words, as certain software vendors are wont to claim, "That's not a bug. That's a feature.") Here are a handful of them:

And last but certainly not least:

XML.com Copyright © 1998-2006 O'Reilly Media, Inc.