XML.com: XML From the Inside Out

Nobody Asked Me, But...
by John E. Simpson

Q: I don't want the purpose, structure, and contents of my XML documents to be easily discernible. Any ideas?

A: One of XML's justifiably much-touted virtues is, well, the understandability of XML documents. Consider the following.

<invoice inv_num="A200106180013">
<customer_num>5738135</customer_num>
<payment type="visa" card_num="1234567890"/>
<items>
<item item_num="098743">
<quantity>1</quantity>
<unit_price currency="USD">12.95</unit_price>
</item>
<item item_num="321822">
<quantity>4</quantity>
<unit_price currency="USD">895.95</unit_price>
</item>
</items>
</invoice>

There's little doubt what's going on here. Even if certain details are missing -- who exactly customer #5738135 is, for example, or what product item #321822 might be -- it's pretty obvious that some customer is charging $3,596.75 for something against Visa card #1234567890.
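The $3,596.75 figure is simply quantity times unit price, summed over both items. As a quick check, here is a minimal Python sketch (Python is my choice for illustration, not something the invoice format requires; the `invoice_total` function name is hypothetical) that reads the document above and does the arithmetic:

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

INVOICE = """<invoice inv_num="A200106180013">
<customer_num>5738135</customer_num>
<payment type="visa" card_num="1234567890"/>
<items>
<item item_num="098743">
<quantity>1</quantity>
<unit_price currency="USD">12.95</unit_price>
</item>
<item item_num="321822">
<quantity>4</quantity>
<unit_price currency="USD">895.95</unit_price>
</item>
</items>
</invoice>"""


def invoice_total(xml_text):
    """Sum quantity * unit_price over every item in the invoice."""
    root = ET.fromstring(xml_text)
    total = Decimal("0")
    for item in root.iter("item"):
        quantity = int(item.findtext("quantity"))
        # Decimal avoids binary floating-point rounding on currency amounts.
        unit_price = Decimal(item.findtext("unit_price"))
        total += quantity * unit_price
    return total


print(invoice_total(INVOICE))  # 1 * 12.95 + 4 * 895.95 = 3596.75
```

Anyone who intercepts the document can do the same thing, which is rather the point.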

The fact that we can figure this out from the above -- its transparency -- results from several of this document's features:

  • The element and attribute names are in something like a recognizable human language (English, in this case).
  • The element text content (#PCDATA) and attribute values are in a form that confirms any guesses we have to make about the meaning of element and attribute names.
  • The document is pretty-printed, with liberal use of newlines and other whitespace which instantly reveals the relationship of one element to another.

This is all well and good if I work at a retail sales counter and you work in order fulfillment or billing, and I just physically hand the document to you over the cubicle wall. It's not so good, though, if I need to transmit the order information over anything other than a tightly secured electronic connection. Maybe we can "secure" this document in a way which can be easily encoded and decoded, without resorting to cumbersome (albeit supremely effective!) encryption measures. This will involve three steps, each of which "breaks" one of the document's transparent features.

First, strip out all the excess whitespace.

<invoice inv_num="A200106180013"><customer_num>5738135</customer_num><payment type="visa" card_num="1234567890"/><items><item item_num="098743"><quantity>1</quantity><unit_price currency="USD">12.95</unit_price></item><item item_num="321822"><quantity>4</quantity><unit_price currency="USD">895.95</unit_price></item></items></invoice>

This alone doesn't secure the contents in any real way. It does make the structure less "visible" to a casual human reader; still, a program or just an XSLT stylesheet could put all that visible structure back into the document again.
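To show how little effort that takes, here is a rough sketch using Python's standard `xml.dom.minidom` module in place of a stylesheet. The `strip_whitespace` and `pretty` helpers and the shortened `COMPACT` sample are my own illustrative names, not part of any tool mentioned above.

```python
from xml.dom import minidom

# A shortened, whitespace-stripped fragment of the invoice.
COMPACT = '<items><item item_num="098743"><quantity>1</quantity></item></items>'


def pretty(xml_text):
    """Re-indent a document so the nesting is visible again."""
    document = minidom.parseString(xml_text)
    return document.documentElement.toprettyxml(indent="  ")


def strip_whitespace(xml_text):
    """Drop whitespace-only text nodes and serialize on a single line."""
    document = minidom.parseString(xml_text)

    def prune(node):
        for child in list(node.childNodes):
            if child.nodeType == child.TEXT_NODE:
                if not child.data.strip():
                    node.removeChild(child)
            elif child.nodeType == child.ELEMENT_NODE:
                prune(child)

    prune(document.documentElement)
    return document.documentElement.toxml()


restored = pretty(COMPACT)
print(restored)
```

Stripping and re-indenting are inverse operations for a document like this one, so removing whitespace hides nothing from anyone with a parser.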

Truly securing the structure and data is simply a matter of obscuring it with some convenient algorithm. This algorithm will at a minimum alter the element and attribute names and their contents. As an extremely simple demonstration, consider the old Usenet "ROT-13" trick. The term ROT-13 comes from the device of rotating the letters of the English alphabet 13 positions to the right, as if they were on a big wheel. Capital "A" becomes capital "N," lowercase "c" becomes "p," "s" becomes "f," and so on. Here's an XSLT template to do this conversion:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Match each element in the source tree, regardless of name -->
  <xsl:template match="*">

    <!-- Assign a variable for the ROT-13 version of the element's name -->
    <xsl:variable name="rot13_elem"
      select="translate(name(),
        'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
        'NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm')"/>

    <!-- Use the above calculated name as the NEW element's name -->
    <xsl:element name="{$rot13_elem}">

      <!-- Do the same for all attributes for this element;
           attribute values are copied unchanged -->
      <xsl:for-each select="@*">
        <xsl:variable name="rot13_attr"
          select="translate(name(),
            'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
            'NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm')"/>
        <xsl:attribute name="{$rot13_attr}"><xsl:value-of select="."/></xsl:attribute>
      </xsl:for-each>

      <!-- Process all children (elements and text) of the element;
           text nodes are copied by XSLT's built-in template rule -->
      <xsl:apply-templates/>

    </xsl:element>

  </xsl:template>

</xsl:stylesheet>

Running this stylesheet against the whitespace-stripped XML document produces the following result tree:

<vaibvpr vai_ahz="A200106180013"><phfgbzre_ahz>5738135</phfgbzre_ahz><cnlzrag glcr="visa" pneq_ahz="1234567890"/><vgrzf><vgrz vgrz_ahz="098743"><dhnagvgl>1</dhnagvgl><havg_cevpr pheerapl="USD">12.95</havg_cevpr></vgrz><vgrz vgrz_ahz="321822"><dhnagvgl>4</dhnagvgl><havg_cevpr pheerapl="USD">895.95</havg_cevpr></vgrz></vgrzf></vaibvpr>

This isn't perfectly opaque yet; the same kind of transformation should be applied to the element text content and attribute values (especially the "visa" and "USD"). But it's darned close to the desired level of senselessness. And to convert it back to its meaningful form at the other end, all you've got to do is run the same XSLT transformation against the result. ROT-13 is its own inverse: rotating a letter 13 positions twice moves it 26 positions, which puts it right back where it started.
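For readers without an XSLT processor handy, the following Python sketch mirrors the template's logic and adds the missing piece, rotating text content and attribute values as well. It is an illustration under my own assumptions rather than a translation of any particular tool: the `obscure` function, its `include_values` flag, and the `SOURCE` fragment are hypothetical names.

```python
import xml.etree.ElementTree as ET

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"
# Same two alphabets as the translate() call in the stylesheet.
ROT13 = str.maketrans(UPPER + LOWER,
                      UPPER[13:] + UPPER[:13] + LOWER[13:] + LOWER[:13])


def rot13(text):
    """Rotate ASCII letters 13 places; digits and punctuation pass through."""
    return text.translate(ROT13)


def obscure(element, include_values=False):
    """Rewrite an element tree in place.

    Names are always rotated. Attribute values and text are rotated
    only when include_values is true.
    """
    change = rot13 if include_values else (lambda value: value)
    element.tag = rot13(element.tag)
    renamed = {rot13(name): change(value)
               for name, value in element.attrib.items()}
    element.attrib.clear()
    element.attrib.update(renamed)
    if element.text:
        element.text = change(element.text)
    if element.tail:
        element.tail = change(element.tail)
    for child in element:
        obscure(child, include_values)


SOURCE = ('<item item_num="098743"><quantity>1</quantity>'
          '<unit_price currency="USD">12.95</unit_price></item>')

root = ET.fromstring(SOURCE)
obscure(root, include_values=True)
encoded = ET.tostring(root, encoding="unicode")
print(encoded)

# A second pass with the same function restores the original.
obscure(root, include_values=True)
decoded = ET.tostring(root, encoding="unicode")
print(decoded == SOURCE)
```

Note what the translation table leaves alone: digits. Even with values included, "USD" turns into "HFQ" but the prices, item numbers, and (in the full invoice) the card number come through untouched.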

(If you really want to do something like this to your own XML documents, I don't seriously suggest using the ROT-13 algorithm. It's all right as a demonstration, but would be easily cracked by any respectable decryption tool.)
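Just how easily? A fixed rotation has only 26 possible settings, so an eavesdropper can try them all. The sketch below is a toy of my own, not any real decryption tool, and the `shift` and `crack` helpers and the list of guessed words are assumptions for illustration. It does exactly that, and keeps any rotation that turns up a plausible element name.

```python
import string


def shift(text, offset):
    """Rotate ASCII letters by `offset` positions, leaving other characters alone."""
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[offset:] + lower[:offset] + upper[offset:] + upper[:offset])
    return text.translate(table)


def crack(obscured, known_words):
    """Try all 26 rotations; return those that reveal any guessed word."""
    hits = []
    for offset in range(26):
        candidate = shift(obscured, offset)
        if any(word in candidate for word in known_words):
            hits.append((offset, candidate))
    return hits


# One element from the obscured result tree, plus a few guessed names.
OBSCURED = '<cnlzrag glcr="visa" pneq_ahz="1234567890"/>'
hits = crack(OBSCURED, ["invoice", "payment", "customer"])
for offset, candidate in hits:
    print(offset, candidate)
```

The only work the attacker has to do is guess a word or two that might appear in the document, and for business documents that's not much of a hurdle.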

So that's my anniversary unasked questions column. Next month, it's back to questions someone out there really needs to have answered.