Nobody Asked Me, But...

August 29, 2001

I've been writing the "XML Q&A" column for a year now. An anniversary seems a good occasion to think about the questions I wish I'd been able to answer during that time -- questions no one ever asks.

Let's start with an easy one.

Q: Where do I post a question for consideration in "XML Q&A"?

A: All questions I answer here are posted to O'Reilly Network's XML Forum. (Editor's note: The XML Forum is no longer available. We keep this content for its historical interest.) You have to register to participate in the forum (and yes, posting questions is considered "participating"). Registration is free, though. If you just want to read messages and replies, no registration is required.

It's not exactly that simple. As of this writing, the forum includes over 1800 posts. I don't try to read all of them. Instead, I scroll through those posted in the previous three or four weeks, looking for message subjects which are intriguingly worded or seem to cover areas I haven't covered before. Another criterion is whether someone else on the Forum has already adequately answered the question. (Forum participants don't have to wait for me to add my two cents to the discussion; they can jump in and reply to any message they want.)

I shy away from some subjects: questions about processing XML with Java, Python, COBOL, Perl, and so on -- programming languages and APIs either outside my area of expertise or, heck, just outside of XML itself; questions about obscure XML vocabularies; questions in languages other than English; questions posted to other venues, like the XML-L, XML-DEV, or XSL-List mailing lists; and questions about server configuration.

That leaves plenty of room for things to wonder about, though, from microscopic nooks and crannies in DTD syntax on up to big hairy issues like W3C-sponsored future directions for XML. And if I don't get around to selecting your question for a given month's list of two or three, don't despair: you can always fall back on the collective wisdom of the Internet, other participants in the O'Reilly Network XML Forum, or on any of the subject-specific mailing lists.

Now let's ratchet up the complexity a bit.

Q: I'm designing my own simple XML vocabulary, but I don't understand either DTDs or XML Schema. What can I do?

A: This really drives me crazy -- maybe even crazier than it drives you. There's a terrific, often overlooked answer. Forget validation. Stick with well-formedness.

A little background first: XML shares with its SGML parent the notion of validating a document against some formal structure. This formal structure can be specified in the form of a Document Type Definition (DTD) -- or, more recently, an XML Schema. By declaring a formal structure, you can declare which elements may fit inside which other elements, how many times the contained element may occur, what attributes the element may or must have, and so on.

But the XML 1.0 Recommendation also parted company with SGML by introducing the concept of well-formedness. A well-formed XML document is to a valid document as a simple e-mail message is to a spell-checked one. There's nothing at all wrong with the former for many (I'd argue most) purposes. A simple, non-spell-checked e-mail message still follows some important structural rules, especially in the form of its headers, attachments, and so on. Checking the spelling makes the message content more reliable, more rigorous if you will. But it doesn't make the document necessarily any better.
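To make the distinction concrete, here is a minimal sketch in Python (my choice of language for illustration, not something the column depends on). It relies on the standard library's non-validating parser, which checks well-formedness only and never consults a DTD or schema. The helper name is_well_formed and the sample "recipe" vocabulary are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML; no DTD or schema is consulted."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# A made-up vocabulary with no DTD anywhere: still perfectly usable XML.
good = "<recipe><name>Toast</name><step>Toast the bread.</step></recipe>"
# Overlapping tags break well-formedness, whatever the vocabulary.
bad = "<recipe><name>Toast</recipe></name>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

Note that nothing here knows or cares what a "recipe" is supposed to contain; that kind of knowledge is exactly what a DTD or schema would add.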

I don't mean to downplay reliability and rigor, of course. If your proposed vocabulary involves the movement of money, private or confidential information, state or corporate secrets, and so on, then, yes, you definitely will need to validate your documents at some point.

However, it's not true that you must "get" DTDs or XML Schema in order to "get" XML itself; you don't even need them to design a perfectly functional, elegant vocabulary. You'd never believe it, though, based on a cursory scan of many of the "introductory" XML references and tutorials. The W3C has contributed to the confusion, in a way, by insisting that XHTML documents (for instance) "must" at some point validate against a DTD -- or they're not true XHTML. There are sound reasons for this insistence in a language intended for general-purpose Web use, having to do (for instance) with platform independence and reducing browser bloat. Just don't generalize from XHTML's example to conclude that your own special-purpose vocabularies are somehow illegitimate if they can't be validated because there isn't a DTD or Schema document lying around.

In a slightly different context, The XML FAQ, edited by Peter Flynn, says,

XML allows groups of people or organizations to create their own customized markup applications for exchanging information in their domain (music, chemistry, electronics, hill-walking, finance, surfing, petroleum geology, linguistics, cooking, knitting, stellar cartography, history, engineering, rabbit-keeping, mathematics, genealogy, etc).

Doesn't that sound like a great world -- one in which nearly every imaginable data management and interchange purpose is served by a single markup standard? Unfortunately, the media's tireless emphasis on large-scale, sprawling B-to-B XML applications -- complex (and yes, fully validated) as they must be -- has dimmed the likelihood of such a world ever coming to pass. Here's what I think: The focus on DTDs and XML Schema as the hallmark of so-called real XML has done more to damage XML's widespread use and popularity than all the usual culprits (proliferation of XML-related standards, proprietary extensions, and so on) combined. Maybe that's just me, though.

Now let's move on to one final, less serious (verging on the loopy) question.

Q: I don't want the purpose, structure, and contents of my XML documents to be easily discernible. Any ideas?

A: One of XML's justifiably much-touted virtues is, well, the understandability of XML documents. Consider the following.

<invoice inv_num="A200106180013">
  <customer_num>5738135</customer_num>
  <payment type="visa" card_num="1234567890"/>
  <items>
    <item item_num="098743">
      <quantity>1</quantity>
      <unit_price currency="USD">12.95</unit_price>
    </item>
    <item item_num="321822">
      <quantity>4</quantity>
      <unit_price currency="USD">895.95</unit_price>
    </item>
  </items>
</invoice>

There's little doubt what's going on here. Even if certain details are missing -- who exactly customer #5738135 is, for example, or what product item #321822 might be -- it's pretty obvious that some customer is charging $3,596.75 for something against Visa card #1234567890.
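The $3,596.75 figure is just quantity times unit price, summed over the items. As a rough check, here is a short Python sketch (standard library only; the variable names are mine) that reads the invoice and does the arithmetic, using Decimal so the cents don't drift.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

INVOICE = """<invoice inv_num="A200106180013">
  <customer_num>5738135</customer_num>
  <payment type="visa" card_num="1234567890"/>
  <items>
    <item item_num="098743">
      <quantity>1</quantity>
      <unit_price currency="USD">12.95</unit_price>
    </item>
    <item item_num="321822">
      <quantity>4</quantity>
      <unit_price currency="USD">895.95</unit_price>
    </item>
  </items>
</invoice>"""

root = ET.fromstring(INVOICE)

# Sum quantity * unit_price over every <item> in the invoice.
total = sum(
    (Decimal(item.findtext("quantity")) * Decimal(item.findtext("unit_price"))
     for item in root.iter("item")),
    Decimal("0"),
)
card_type = root.find("payment").get("type")

print(card_type, total)  # visa 3596.75
```

That a dozen lines can recover both the card type and the total is the transparency in question: the names in the document tell the program (and any eavesdropper) exactly where to look.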

The fact that we can figure this out from the above -- its transparency -- results from several of this document's features:

The element and attribute names are in something like a recognizable human language (English, in this case).
The element text content (#PCDATA) and attribute values are in a form which confirms any guesses we have to make about the meaning of element and attribute names.
The document is pretty-printed, with liberal use of newlines and other whitespace which instantly reveals the relationship of one element to another.

This is all well and good if I work at a retail sales counter and you work in order fulfillment or billing, and I just physically hand the document to you over the cubicle wall. It's not so good, though, if I need to transmit the order information over anything other than a tightly secured electronic connection. Maybe we can "secure" this document in a way which can be easily encoded and decoded, without resorting to cumbersome (albeit supremely effective!) encryption measures. This will involve three steps, each of which "breaks" one of the document's transparent features.

First, strip out all the excess whitespace.

<invoice inv_num="A200106180013"><customer_num>5738135</customer_num><payment type="visa" card_num="1234567890"/><items><item item_num="098743"><quantity>1</quantity><unit_price currency="USD">12.95</unit_price></item><item item_num="321822"><quantity>4</quantity><unit_price currency="USD">895.95</unit_price></item></items></invoice>

This alone doesn't secure the contents in any real way. It does make the structure less "visible" to a casual human reader; still, a program or just an XSLT stylesheet could put all that visible structure back into the document again.
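To see how little the whitespace step buys you, here is a hedged Python sketch (standard library only; the function names strip_whitespace and restore_whitespace are hypothetical) which removes the whitespace between tags from a fragment of the invoice and then puts indentation right back.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

PRETTY = """<items>
  <item item_num="098743">
    <quantity>1</quantity>
    <unit_price currency="USD">12.95</unit_price>
  </item>
</items>"""

def strip_whitespace(xml_text):
    """Serialize xml_text with whitespace-only text between tags removed."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.text is not None and not elem.text.strip():
            elem.text = None
        if elem.tail is not None and not elem.tail.strip():
            elem.tail = None
    return ET.tostring(root, encoding="unicode")

def restore_whitespace(xml_text):
    """Re-indent xml_text; the result also gains an XML declaration."""
    return minidom.parseString(xml_text).toprettyxml(indent="  ")

compact = strip_whitespace(PRETTY)
restored = restore_whitespace(compact)

print(compact)
print(restored)
```

The round trip loses nothing that matters: the element nesting survives the stripping intact, so any parser can redraw it on demand.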

Truly securing the structure and data is simply a matter of obscuring them with some convenient algorithm. This algorithm will at a minimum alter the element and attribute names and their contents. As an extremely simple demonstration, consider the old Usenet "ROT-13" trick. The term ROT-13 comes from the device of rotating the letters of the English alphabet 13 positions to the right, as if they were on a big wheel. Capital "A" becomes capital "N," lowercase "c" becomes "p," "s" becomes "f," and so on. Here's an XSLT template which applies this conversion to the element and attribute names:

<xsl:template match="*">
  <xsl:variable name="rot13_elem"
    select="translate(name(),
      'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
      'NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm')"/>
  <xsl:element name="{$rot13_elem}">
    <xsl:for-each select="@*">
      <xsl:variable name="rot13_attr"
        select="translate(name(),
          'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
          'NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm')"/>
      <xsl:attribute name="{$rot13_attr}"><xsl:value-of select="."/></xsl:attribute>
    </xsl:for-each>
    <xsl:apply-templates/>
  </xsl:element>
</xsl:template>
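If you'd like to check the character mapping without firing up an XSLT processor, the same 52-character table can be sketched in a few lines of Python. This is only an illustration of the mapping, not a replacement for the template; str.maketrans plays the role of XPath's translate() here, and the names UPPER, LOWER, ROT13, and rot13 are mine.

```python
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# Same "from" and "to" strings as the translate() calls in the template:
# each alphabet is rotated 13 places, and everything else is left alone.
ROT13 = str.maketrans(
    UPPER + LOWER,
    UPPER[13:] + UPPER[:13] + LOWER[13:] + LOWER[:13],
)

def rot13(name):
    """Rotate the ASCII letters in name by 13; other characters pass through."""
    return name.translate(ROT13)

for name in ("invoice", "inv_num", "customer_num", "unit_price"):
    print(name, "->", rot13(name))
```

Characters outside the table, such as the underscore in inv_num, pass through untouched, which is also how translate() treats them.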

Running this stylesheet against the whitespace-stripped XML document produces the following result tree:

<vaibvpr vai_ahz="A200106180013"><phfgbzre_ahz>5738135</phfgbzre_ahz><cnlzrag glcr="visa" pneq_ahz="1234567890"/><vgrzf><vgrz vgrz_ahz="098743"><dhnagvgl>1</dhnagvgl><havg_cevpr pheerapl="USD">12.95</havg_cevpr></vgrz><vgrz vgrz_ahz="321822"><dhnagvgl>4</dhnagvgl><havg_cevpr pheerapl="USD">895.95</havg_cevpr></vgrz></vgrzf></vaibvpr>

This isn't perfectly opaque yet; the same kind of transformation should be applied to the element text content and attribute values (especially the "visa" and "USD"). But it's darned close to the desired level of senselessness. And to convert it back to its meaningful form at the other end, all you've got to do is run the same or similar XSLT transformation against the result.
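For the curious, here is a hedged sketch of that fuller treatment, in Python rather than XSLT so it can be run anywhere. It rotates element names, attribute names, attribute values, and text content, then runs the same function a second time to recover the original. The function names and the trimmed-down sample document are my own, and the sketch assumes a document without namespaces.

```python
import codecs
import xml.etree.ElementTree as ET

def rot13(text):
    """ROT-13 a string; digits and punctuation pass through unchanged."""
    return codecs.encode(text, "rot_13")

def rot13_tree(elem):
    """Return a copy of elem with names, attribute values, and text rotated."""
    attrs = {rot13(k): rot13(v) for k, v in elem.attrib.items()}
    copy = ET.Element(rot13(elem.tag), attrs)
    copy.text = rot13(elem.text) if elem.text else elem.text
    copy.tail = rot13(elem.tail) if elem.tail else elem.tail
    for child in elem:
        copy.append(rot13_tree(child))
    return copy

# A shortened version of the invoice, already stripped of whitespace.
SOURCE = ('<invoice inv_num="A200106180013">'
          '<payment type="visa" card_num="1234567890"/>'
          '<unit_price currency="USD">12.95</unit_price>'
          '</invoice>')

original = ET.fromstring(SOURCE)
obscured = rot13_tree(original)
round_trip = rot13_tree(obscured)  # ROT-13 is its own inverse

print(ET.tostring(obscured, encoding="unicode"))
print(ET.tostring(round_trip, encoding="unicode")
      == ET.tostring(original, encoding="unicode"))
```

Notice that the digits come through untouched: the card number is just as readable after the transformation as before, one more reason to treat this as a parlor trick rather than protection.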

(If you really want to do something like this to your own XML documents, I don't seriously suggest using the ROT-13 algorithm. It's all right as a demonstration, but would be easily cracked by any respectable decryption tool.)

So that's my anniversary unasked questions column. Next month, it's back to questions someone out there really needs to have answered.