Nobody REALLY Asked Me, But...
John E. Simpson is the author of XPath and XPointer
Last August, for the one-year anniversary of my stint as the XML Q&A columnist, I wrote a piece called "Nobody Asked Me, But..." It covered some rather strange questions that no one, to my knowledge, had submitted to the oreillynet.com XML forum. One of those bogus questions asked how to obfuscate the markup in an XML document; while its structure would remain intact (elements nested inside of elements, and so on), the "meaning" of its elements and attributes would no longer be plain.
Then, a couple months ago, I heard from a reader named James. He liked that question and its answer but wanted to know how he could take it further. In this two-year anniversary column, I'll deal with James's question and some others it raises.
Q: How can I use XSLT to mask not only the markup, but the content of my XML document?
A: Just as a reminder, here was the sample XML document from last year's column:
<invoice inv_num="A200106180013">
  <customer_num>5738135</customer_num>
  <payment type="visa" card_num="1234567890"/>
  <items>
    <item item_num="098743">
      <quantity>1</quantity>
      <unit_price currency="USD">12.95</unit_price>
    </item>
    <item item_num="321822">
      <quantity>4</quantity>
      <unit_price currency="USD">895.95</unit_price>
    </item>
  </items>
</invoice>
After applying the XSLT transformation using the simple ROT-13 "encoding" described in last year's column, this document looked as follows (retaining whitespace for something approaching legibility):
<vaibvpr vai_ahz="A200106180013">
  <phfgbzre_ahz>5738135</phfgbzre_ahz>
  <cnlzrag glcr="visa" pneq_ahz="1234567890"/>
  <vgrzf>
    <vgrz vgrz_ahz="098743">
      <dhnagvgl>1</dhnagvgl>
      <havg_cevpr pheerapl="USD">12.95</havg_cevpr>
    </vgrz>
    <vgrz vgrz_ahz="321822">
      <dhnagvgl>4</dhnagvgl>
      <havg_cevpr pheerapl="USD">895.95</havg_cevpr>
    </vgrz>
  </vgrzf>
</vaibvpr>
While this is pretty murky, there are some content cues as to what's being obscured. For example, the "USD" attribute values are pretty good tipoffs that some kind of money is involved. And the numbers -- including the all-important credit card account -- are plainly transparent.
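For readers who would like to reproduce the effect outside of XSLT, here is a minimal Python sketch of the same idea. It is not the stylesheet from last year's column; the helper names rot13 and mask_names are my own inventions for illustration, and the sketch assumes a document with no namespaces. It rotates element and attribute names only, which is exactly why the "USD" values and the card number survive in the clear.

```python
import codecs
import xml.etree.ElementTree as ET

# A shortened copy of the sample invoice (one item instead of two).
SAMPLE = """<invoice inv_num="A200106180013">
  <customer_num>5738135</customer_num>
  <payment type="visa" card_num="1234567890"/>
  <items>
    <item item_num="098743">
      <quantity>1</quantity>
      <unit_price currency="USD">12.95</unit_price>
    </item>
  </items>
</invoice>"""


def rot13(text):
    # The rot_13 codec rotates only the ASCII letters a-z and A-Z;
    # digits, underscores, and everything else pass through unchanged.
    return codecs.encode(text, "rot_13")


def mask_names(elem):
    """Rotate element and attribute names in place, leaving values alone."""
    elem.tag = rot13(elem.tag)
    rotated = {rot13(name): value for name, value in elem.attrib.items()}
    elem.attrib.clear()
    elem.attrib.update(rotated)
    for child in elem:
        mask_names(child)


root = ET.fromstring(SAMPLE)
mask_names(root)
print(ET.tostring(root, encoding="unicode"))
```

Because ROT-13 is its own inverse, calling mask_names a second time on the same tree restores the original names.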
Before tackling James's question outright, consider some of the problems it raised.
What's a "character"?
The ROT-13 algorithm I provided last year dealt with letters only. This sort of works for documents as simple as the one I presented last year, but it fails to consider that an XML element or attribute's name may consist of many more characters than the mere 52 (26 uppercase, 26 lowercase) letters in the so-called Roman alphabet. It may also contain any of the digits 0-9, as well as a restricted set of punctuation: the underscore and colon (respectively, the _ and : characters), plus the hyphen and period. Potentially much worse for a general-purpose routine, the "letters" in the name can come from any of a complete universe of characters outside the Roman alphabet, as well. (See the XML Recommendation for the complete list of legitimate "letters," presented as ranges of Unicode code points.)
Thus, a complete and general-purpose ROT-13 routine would need to take into consideration such markup as
<aperçu2001>
außer_Betrieb="2002-08-28"
<υλικώ>
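To see how much of those three names a letters-only routine would miss, the following Python sketch lists the characters that an ASCII-only ROT-13 passes through untouched. The function name untouched is hypothetical, and Python's rot_13 codec stands in here for whatever 52-letter mapping you happen to be using.

```python
import codecs

NAMES = ["aperçu2001", "außer_Betrieb", "υλικώ"]


def untouched(name):
    """Return the characters that ASCII-only ROT-13 passes through unchanged."""
    rotated = codecs.encode(name, "rot_13")
    return [ch for ch, out in zip(name, rotated) if ch == out]


for name in NAMES:
    # Print the original name, its rotated form, and the characters left alone.
    print(name, "->", codecs.encode(name, "rot_13"), untouched(name))
```

The Greek name comes out of the rotation completely unchanged, since none of its characters is among the 52 Roman letters.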
What does it mean for an XML name to be "well-formed"?
Aside from the whole class of characters which may be used in an XML name, there are restrictions on where in the name certain kinds of characters may appear. For instance, letters can be used anywhere at all in the name; digits, on the other hand, can appear anywhere except as the first character. Colons are supposed to be reserved for use with namespace prefixes, and both the prefix and the portion which follows the colon must separately follow the rules for XML names: jes:elem21 is a legitimate name; jes:21elem is not.
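The positional rules can be roughed out in a few lines of Python. This is only an approximation under stated assumptions: it uses Python's str.isalpha() and str.isalnum() as stand-ins for the XML Recommendation's letter and digit classes, which they do not match exactly, and the function names is_ncname and is_qualified_name are mine, not part of any library.

```python
def is_ncname(part):
    """Rough check for a colon-free XML name (an approximation only)."""
    if not part:
        return False
    first, rest = part[0], part[1:]
    # The first character must be a letter or an underscore.
    if not (first.isalpha() or first == "_"):
        return False
    # Later characters may also be digits, periods, or hyphens.
    return all(ch.isalnum() or ch in "._-" for ch in rest)


def is_qualified_name(name):
    """Check an optionally prefixed name: at most one colon, each side valid."""
    parts = name.split(":")
    if len(parts) > 2:
        return False
    return all(is_ncname(part) for part in parts)


for candidate in ["jes:elem21", "jes:21elem", "aperçu2001"]:
    print(candidate, is_qualified_name(candidate))
```

With these definitions, jes:elem21 passes and jes:21elem fails, because the portion after the colon begins with a digit.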
Why bring this up at all? If the source document to be "encoded" is already at least well-formed, as determined by a parser, then why does the ROT-13 algorithm need to worry? The answer is that the result document needs to be well-formed, too -- at least, if it is to be "decoded" using XSLT at the other end of the transaction. Otherwise, the XSLT processor won't even get to examine it; it will have been rejected out of hand by the lower-level XML parser. Which leads me to the next consideration...
Can I hide the markup characters -- <, >, & -- as well as the XML names and content?
Again, if the recipient of the ROT-13 "encoded" message will be "decoding" it using XSLT, the answer is no. An XSLT transformation operates only on XML input. It's true that the transformation can output plain text such as this:
[invoice inv*num^!A200106180013!]
This is the first tag from the sample document, with square brackets in place of angle brackets, an asterisk standing in for the underscore, a caret replacing the equals sign, and exclamation points instead of quotation marks. Doesn't look like markup, does it? Your XML parser will agree, enthusiastically.
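If you'd like to watch a parser agree, here is a small Python sketch. It builds the disguised text from the sample document's first tag using the substitutions just described, then hands the result to the standard library's ElementTree parser. The names FIRST_TAG, DISGUISE, and accepted are my own, and ElementTree is simply a convenient stand-in for the parser underneath an XSLT processor.

```python
import xml.etree.ElementTree as ET

FIRST_TAG = '<invoice inv_num="A200106180013">'

# Swap each markup character for the stand-in described in the text.
DISGUISE = str.maketrans({"<": "[", ">": "]", "_": "*", "=": "^", '"': "!"})
disguised = FIRST_TAG.translate(DISGUISE)
print(disguised)

try:
    ET.fromstring(disguised)
    accepted = True
except ET.ParseError as err:
    # With no angle brackets there is no root element to find.
    accepted = False
    print("rejected:", err)
```

The parser gives up at once, so nothing further along in the pipeline would ever get a look at the text.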