Basic Training

March 27, 2002

In this month's column, we belatedly celebrate XML's fourth anniversary by way of a deceptively simple question.

Q: What is XML?

Simply, please -- I am Chinese and my English skills are not good.

A: For Chinese-language information about XML, visit the Chinese XML Now! site, sponsored by Academia Sinica. For English-language information about Chinese-language XML documents (for example, encoding and ideographic character representation), I recommend starting with the Chinese XML FAQ. (Similar sites exist for many other non-English languages.) You might also want to check the W3C's list of translations of XML-related W3C publications. These include both Simplified and Traditional Chinese translations of the XML 1.0 Recommendation itself.

I must confess, your question got me thinking: Is it even possible to explain XML in simple English -- especially in the limited space of an XML Q&A column? Let's give it a try....

XML (an acronym for Extensible Markup Language) is a set of rules, published by the W3C (World Wide Web Consortium), for building new languages. The languages in question are not written or spoken primarily for human consumption; they're intended to simplify -- and simultaneously enrich -- information sharing among software and humans. These languages, and the documents in which they're "written," all share some common characteristics.

Plain text

If you can use a computer keyboard -- even a simple typewriter, or for that matter a pencil and paper -- to type a letter, a report, even a simple word or phrase, then you can create an XML document. Note that the keyboard or typewriter does not need to be a Western-style one, with the letters Q, W, E, R, T, and Y for instance along the top left row, in that order. The "letters" can be Cyrillic or Traditional Chinese or what-have-you.

For reasons which I'll explain in a moment, the only characters you must be able to represent are the less-than and greater-than "angle brackets," < and >; the forward slash, /; the equals sign, =; the single quotation mark or apostrophe, '; the double quotation mark, "; the ampersand, &; and the semi-colon, ;. Some optional features of XML and its companion standards will be accessible only if you also have such punctuation marks as an exclamation point, !; a dollar sign, $; and an @ sign.

Why is plain text so important? For two reasons. First, the whole body of human knowledge, experience, and understanding can be expressed in terms of plain text; if something cannot be so expressed, it might as well not exist. (This is true of pictures and other non-verbal forms of communication as well as words. An image of a circle can be described, for example, as having a radius of some value and a center placed at some X,Y coordinate, with a fill color of red or blue. Even an animation or musical score is in principle expressible as plain text, although it's likely to be quite complex.)

Second, plain text is open: no special software is required to make it meaningful to human readers. Of course, the definition of "meaningful" might change, depending on the nature of the specific text. The text might be filled with jargon, or it might consist of numbers and mathematical symbols, or it might not be in a human language which you can understand. The point is that if you know and understand both the individual characters and the general context in which they appear, you can "read" an XML document. You don't need to have a copy of WordImperfect 2002 (let alone Version X of it), and you don't need to own or even have access to a PC, or a Mac, or a Web server. All you need is a printed copy of the document -- printed on an expensive printer or hand-written or anywhere in between -- or a copy of it displayed on a computer monitor or projection screen.

Given all the above, the following might be the foundation for a simple XML document:

Benedict Arnold didn't cross the Delaware; he crossed his country.

All the characters in that text came straight from my keyboard, and if you can read English you can get some "meaning" out of the sentence as a whole. If you know a bit about American history (who was Benedict Arnold? what is "the Delaware"?) you'll derive more meaning from it. And if you also know something about American slang (the dual meanings of the verb "cross"), you'll get even more meaning from it. The point is that the meaning of the text, literal or metaphorical, is supplied by a human reader; it is not inherent in the plain text itself. But it is the plain text which permits a human reader to supply that meaning, without need of any special whiz-bang software.

Markup

Aside from the fact that it can be expressed as plain text, meaningful information is also structured.

In college, I had a basic course in semantics (the study of how meaning is represented in language). Our professor presented us with three nonsense words:

garvin jamling trixles

She challenged us to make any kind of sense at all of these words. Pretty impossible. We could figure out, sort of, that "trixles" might be a plural noun (or maybe it was a present-tense, third-person-singular verb?), and "jamling" -- with that "-ing" -- seemed to be a verb form (a gerund or gerundive perhaps). But "garvin" defeated us, utterly. Then she rearranged the three words and added a few more:

The garvin was jamling on the trixles.

We still didn't know what a garvin was, and the concept of one -- let alone several -- trixles remained a mystery, and if we'd ever seen the former jamling on the latter we couldn't say. Still, it was amazing how the addition of those simple little words -- "the," "was," and "on" -- suddenly made a kind of "meaning" snap into place. Suddenly, the relationship among the three individual words -- the structure of the information -- was revealed.

An XML document does exactly that with its plain-text contents: it scatters little verbal signposts (like those function words in the nonsense sentence) among the content, imposing on it a structure which is immediately understandable even if what is being structured is not obvious. These signposts are collectively referred to as markup. And here's where those special characters I mentioned above come into play. The most important such characters -- every XML document includes them -- are the <, >, and / (less-than, greater-than, and slash, respectively). Here's an XML-ified version of the Benedict Arnold sentence above:

<sentence><clause>Benedict Arnold didn<punctuation type="apostrophe"/>t cross the Delaware<punctuation type="semi-colon"/></clause><clause>he crossed his country<punctuation type="period"/></clause></sentence>

The markup in this "XML document" is contained within the angle brackets. I'll talk some more about the specifics of this markup in the next section. For now, just notice that the markup breaks up the overall sentence into smaller chunks, in a nested structure. Often this structure is made more obvious for legibility using line breaks and spaces, like this:

<sentence>
   <clause>Benedict Arnold didn<punctuation type="apostrophe"/>t cross the Delaware<punctuation type="semi-colon"/></clause>
   <clause>he crossed his country<punctuation type="period"/></clause>
</sentence>

See? Each clause is subordinate to the overall sentence, and a clause may contain a mixture of plain text and punctuation. (The punctuation could have been left as literal text, rather than defined via markup; I'm using markup here in order to make a different point in a moment.) Furthermore, the markup itself is human-readable: anyone with an elementary understanding of English grammar knows what the words "sentence," "clause," and "punctuation" mean.

So that's lesson 2 about XML: it delimits blocks of content with intelligible, structure-defining markup to add meaning to the content itself.
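
The structure is just as visible to software as it is to a human reader. Here is a minimal sketch, assuming Python and its standard-library xml.etree.ElementTree module (my choice for illustration; any XML parser would do). The outline function is a hypothetical helper of my own, not part of XML itself. It reads the marked-up sentence and lists each chunk of markup, indented to show what is nested inside what:

```python
import xml.etree.ElementTree as ET

# The marked-up Benedict Arnold sentence, as one string of plain text.
DOC = (
    '<sentence>'
    '<clause>Benedict Arnold didn<punctuation type="apostrophe"/>t cross '
    'the Delaware<punctuation type="semi-colon"/></clause>'
    '<clause>he crossed his country<punctuation type="period"/></clause>'
    '</sentence>'
)


def outline(element, depth=0):
    """Return one indented line per piece of markup, showing the nesting."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOC)
structure = outline(root)
print("\n".join(structure))
```

Run as written, this should print "sentence" at the left margin, each "clause" indented beneath it, and each "punctuation" indented beneath its clause: the same nesting the line breaks and spaces made visible above.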

Well-formedness

Each piece of markup enclosed in angle brackets is called a tag. Note that much of the markup in our Benedict Arnold sentence is balanced: there's a <sentence> tag at the beginning paired with a </sentence> tag at the end, and each <clause> tag has a matching </clause> tag. Tags come either in pairs, called the start tag and end tag (the latter containing a / just after the opening <), or in a special standalone "empty" form (like the tags for the punctuation elements, which end in />). Each pair of tags, or each empty tag, identifies a particular "thing" called an element. (The elements in this excerpt are named sentence, clause, and punctuation.) An element's start tag, or an empty tag, may also include attributes -- name-value pairs such as type="apostrophe" -- which provide meaning or content of their own.
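
To make elements and attributes a little more concrete, here is a rough sketch, again assuming Python's standard-library xml.etree.ElementTree module. The PUNCTUATION table and the clause_text function are hypothetical names of my own; nothing in XML says that type="period" stands for a full stop. That meaning is supplied by whoever designs the language. The sketch lists the element names, collects the attribute values, and then uses them to rebuild the original plain-text sentence:

```python
import xml.etree.ElementTree as ET

DOC = (
    '<sentence>'
    '<clause>Benedict Arnold didn<punctuation type="apostrophe"/>t cross '
    'the Delaware<punctuation type="semi-colon"/></clause>'
    '<clause>he crossed his country<punctuation type="period"/></clause>'
    '</sentence>'
)

# Assumed mapping from each "type" attribute value to a literal character.
PUNCTUATION = {"apostrophe": "'", "semi-colon": ";", "period": "."}


def clause_text(clause):
    """Rebuild a clause's plain text, replacing punctuation elements
    with the characters their type attributes stand for."""
    parts = [clause.text or ""]
    for child in clause:
        if child.tag == "punctuation":
            parts.append(PUNCTUATION[child.get("type")])
        # Text that follows an empty element belongs to its "tail".
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(DOC)
element_names = sorted({el.tag for el in root.iter()})
attribute_values = [el.get("type") for el in root.iter("punctuation")]
sentence = " ".join(clause_text(c) for c in root.findall("clause"))

print(element_names)
print(attribute_values)
print(sentence)
```

The last line printed should be the sentence we started with, apostrophe, semi-colon, and period restored.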

All of these are examples of what's referred to as well-formedness: the specific rules with which all XML documents must comply in order to be minimally legitimate XML. Other examples include:

element and attribute names are case-sensitive (a SENTENCE element is not the same as a sentence element), and the corresponding markup is as well;
attribute values must be enclosed in single or double quotation marks; and
most importantly, the nesting of one element within another, as defined by the placement of tags, is precise. Every start tag must be balanced with one end tag, and no overlap of the boundaries between one element and the next is permitted.

Implicit in that last point, by the way, is that each well-formed XML document has one and only one "outermost element," within which all the others are nested. This outermost element is called the root element.
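
You don't have to take these rules on faith; any XML parser will enforce them. The following is a small sketch under the same assumption as before (Python's standard-library xml.etree.ElementTree module, which reports a ParseError when it is handed text that is not well-formed). The is_well_formed function and the sample strings are my own inventions, one per rule discussed above:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One hypothetical sample per rule: only the first follows all of them.
samples = {
    "balanced": '<sentence><clause>ok</clause></sentence>',
    "case mismatch": '<sentence><clause>ok</CLAUSE></sentence>',
    "unquoted attribute": '<punctuation type=period/>',
    "overlapping elements": '<a><b>text</a></b>',
    "two root elements": '<clause>one</clause><clause>two</clause>',
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'not well-formed'}")
```

Only the first sample should pass. The others break, in order, the case-sensitivity rule, the quoting rule, the no-overlap rule, and the single-root-element rule.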

So with lesson 3, you now know that an XML document is a string of plain text, delimited by markup, in a well-structured form including a single root element and others, nested inside one another.

This quick brush-stroke view of XML can't even begin to explain why there's been so much attention paid to XML over the last four years, why you should care about it, or how to use it. But it does sketch for you the answer to the question "What is XML?"... in about 1500 words.