Nobody REALLY Asked Me, But...
John E. Simpson is the author of XPath and XPointer
Last August, for the one-year anniversary of my stint as the XML Q&A columnist, I wrote a piece called "Nobody Asked Me, But..." It covered some rather strange questions that no one, to my knowledge, had submitted to the oreillynet.com XML forum. One of those bogus questions asked how to obfuscate the markup in an XML document; while its structure would remain intact (elements nested inside of elements, and so on), the "meaning" of its elements and attributes would no longer be plain.
Then, a couple months ago, I heard from a reader named James. He liked that question and its answer but wanted to know how he could take it further. In this two-year anniversary column, I'll deal with James's question and some others it raises.
A: Just as a reminder, here was the sample XML document from last year's column:
<invoice inv_num="A200106180013">
  <customer_num>5738135</customer_num>
  <payment type="visa" card_num="1234567890"/>
  <items>
    <item item_num="098743">
      <quantity>1</quantity>
      <unit_price currency="USD">12.95</unit_price>
    </item>
    <item item_num="321822">
      <quantity>4</quantity>
      <unit_price currency="USD">895.95</unit_price>
    </item>
  </items>
</invoice>
After applying the XSLT transformation using the simple ROT-13 "encoding" described in last year's column, this document looked as follows (retaining whitespace for something approaching legibility):
<vaibvpr vai_ahz="A200106180013">
  <phfgbzre_ahz>5738135</phfgbzre_ahz>
  <cnlzrag glcr="visa" pneq_ahz="1234567890"/>
  <vgrzf>
    <vgrz vgrz_ahz="098743">
      <dhnagvgl>1</dhnagvgl>
      <havg_cevpr pheerapl="USD">12.95</havg_cevpr>
    </vgrz>
    <vgrz vgrz_ahz="321822">
      <dhnagvgl>4</dhnagvgl>
      <havg_cevpr pheerapl="USD">895.95</havg_cevpr>
    </vgrz>
  </vgrzf>
</vaibvpr>
While this is pretty murky, there are some content cues as to what's being obscured. For example, the "USD" attribute values are pretty good tipoffs that some kind of money is involved. And the numbers -- including the all-important credit card account -- are plainly transparent.
Before tackling James's question outright, consider some of the problems it raised.
The ROT-13 algorithm I provided last year dealt with letters only. This sort of works for documents as simple as the one I presented last year. It fails to consider that an XML element or attribute's name may consist of many more characters than the mere 52 (26 uppercase, 26 lowercase) letters in the so-called Roman alphabet. It may also contain any of the digits 0-9, as well as a restricted set of punctuation (the underscore, colon, hyphen, and period -- respectively, the _, :, -, and . characters). Potentially much worse for a general-purpose routine, the "letters" in the name can come from any of a complete universe of characters outside the Roman alphabet, as well. (See the XML Recommendation for the complete list of legitimate "letters," presented as Unicode values.)
Thus, a complete and general-purpose ROT-13 routine would need to take into consideration such markup as
<aperçu2001>
außer_Betrieb="2002-08-28"
<υλικώ>
Aside from the whole class of characters which may be used in an XML name, there are restrictions on where in the name certain kinds of characters may appear. For instance, letters can be used anywhere at all in the name; digits, on the other hand, can appear anywhere except as the first character (the same goes for hyphens and periods). Colons are supposed to be reserved for use with namespace prefixes, and both the prefix and the portion which follows the colon must separately follow the rules for XML names: jes:elem21 is a legitimate name; jes:21elem is not.
Why bring this up at all? If the source document to be "encoded" is already at least well-formed, as determined by a parser, then why does the ROT-13 algorithm need to worry? The answer is that the result document needs to be well-formed, too -- at least, if it is to be "decoded" using XSLT at the other end of the transaction. Otherwise, the XSLT processor won't even get to examine it; it will have been rejected out of hand by the lower-level XML parser. Which leads me to the next consideration...
Could the encoded result be something other than well-formed XML, then? Again, if the recipient of the ROT-13 "encoded" message will be "decoding" it using XSLT, the answer is no. An XSLT transformation operates only on XML input. It's true that the transformation can output plain text such as this:
[invoice inv*num^!A200106180013!]
This is the first tag from the sample document, with square brackets in place of angle brackets, an asterisk standing in for the underscore, a caret replacing the equals sign, and exclamation points instead of quotation marks. Doesn't look like markup, does it? Your XML parser will agree, enthusiastically.
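To make that concrete, here is a minimal sketch of a stylesheet that emits bracketed text of this kind. It is an illustration only, not something taken from last year's column: it writes one start-tag-like line per element, with its attributes, and ignores text content entirely. The character substitutions simply follow the sample line above.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Plain-text output: the serializer adds no markup of its own. -->
  <xsl:output method="text"/>

  <xsl:template match="*">
    <xsl:text>[</xsl:text>
    <!-- Underscores in names become asterisks, as in the sample line. -->
    <xsl:value-of select="translate(name(), '_', '*')"/>
    <xsl:for-each select="@*">
      <xsl:text> </xsl:text>
      <xsl:value-of select="translate(name(), '_', '*')"/>
      <xsl:text>^!</xsl:text>
      <xsl:value-of select="."/>
      <xsl:text>!</xsl:text>
    </xsl:for-each>
    <xsl:text>]&#10;</xsl:text>
    <!-- Recurse into child elements only; text nodes are skipped in this sketch. -->
    <xsl:apply-templates select="*"/>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the sample invoice, the first line of output is the bracketed line shown above. An XSLT processor cannot accept that output as its input document, which is the whole problem.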
I want to cover an answer to James's question in two parts. The first part attempts to address some of the limitations of last year's answer -- allowing for characters beyond the basic 52 Roman-alphabet characters. (As always when I present code fragments here, understand that whitespace such as newlines and indenting is included for presentation purposes only. If your own source documents are "prettied up" this way, be sure to strip out the extraneous whitespace before submitting them to the XSLT stylesheet solution below.)
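For the whitespace that sits between elements, one alternative to stripping it by hand is to let the processor discard it. The fragment below is a sketch under that assumption, not part of the original solution; it relies on the standard XSLT 1.0 xsl:strip-space element, which removes whitespace-only text nodes from the source tree before any template sees them.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Drop whitespace-only text nodes (indentation and newlines between
       tags) from every element in the source tree. -->
  <xsl:strip-space elements="*"/>

  <!-- The encoding templates would follow here. -->
</xsl:stylesheet>
```

This does nothing for line breaks inside real text content, such as a paragraph wrapped across several lines. The XPath normalize-space() function is one way to collapse those before encoding.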
As a reminder, at the heart of the earlier ROT-13 solution were a
series of calls to the XPath translate() function. Each
looked something like this:
translate([string to encode],
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz",
"NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm")
What this does is process the input string (an element/attribute name,
in last year's piece), replacing each occurrence of a character in the
second argument with the corresponding character in the third:
A with N, b with o, and so
on.
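As a quick check of that behavior, the following self-contained stylesheet is a minimal sketch that applies the same translate() call to a literal string. It can be run against any XML document at all, since the input is ignored; the literal 'inv_num' is borrowed from the sample invoice.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Letters are swapped; the underscore is not in the second
         argument, so it passes through untouched. -->
    <xsl:value-of select="translate('inv_num',
        'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
        'NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm')"/>
  </xsl:template>
</xsl:stylesheet>
```

The output is vai_ahz, matching the attribute name in the encoded invoice. Two details of translate() matter for what follows: characters missing from the second argument are copied unchanged, and a character in the second argument with no counterpart in the third is deleted, so the two lists need to stay the same length.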
There's no limit to the lengths of those second and third arguments. Thus, we can extend them to include, say, digits and punctuation marks, as in the following (please disregard any line breaks forced by the page layout; each quoted string is meant to be unbroken):
translate([string to encode],
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789,./;[]\ ~!@#$%^&amp;*()+`-={}|&quot;&lt;&gt;?',
'NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm5678901234*()+`-={}|&quot;&lt;&gt;?,./;[]\ ~!@#$%^&amp;')
Some comments about this modification -- first, I haven't attempted to substitute digits for letters, or punctuation marks for either; this would almost certainly end up producing a non-well-formed result tree. For the same reason, in the portions of the second and third arguments which pertain to punctuation, I've used entity references (such as &amp;amp;) instead of markup-significant literal characters (such as &amp;). (The apostrophe causes special problems in this case, by the way, because it delimits the string arguments inside the attribute value; I've simply excluded it from both the "translate from" and "translate to" arguments.) I've omitted the underscore and colon from both the second and third arguments, because I don't want to foul up the well-formedness of the result tree in any document whose source tree includes those characters in element or attribute names. (The hyphen and period, which are also legal in names, are still in the lists; if your names use them, remove those two characters from both arguments as well, or the encoded names will not be well-formed.) Finally, note that I've included a space as a "character" to be encoded (it appears before the tilde, ~, in the second and third arguments). This will help obscure the breaks between words in the source tree. (All spaces in the source tree will be replaced with left curly braces; conversely, all hyphens will be replaced with spaces.)
Last year's question dealt only with encoding element and attribute
names. James wants to encode content, too. Let's start by setting up a
named template, which can be invoked for encoding both the markup names
and the content (attribute values and text nodes). Keeping the
translate() function in a named template like this eliminates
the need to repeat that ungainly-looking function call every time we need
it:
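<xsl:template name="rotencode">
  <xsl:param name="cleartext"/>
  <!-- Keep each quoted string on a single line: a line break inside
       an attribute value is normalized to a space, which would throw
       the two lists out of alignment. -->
  <xsl:value-of select="translate($cleartext,
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789,./;[]\ ~!@#$%^&amp;*()+`-={}|&quot;&lt;&gt;?',
'NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm5678901234*()+`-={}|&quot;&lt;&gt;?,./;[]\ ~!@#$%^&amp;')"/>
</xsl:template>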
This named template takes one parameter, called
cleartext. Whenever a template in the stylesheet invokes the
named template, it will pass by way of this parameter the string to be
encoded.
For processing elements and their attributes, we now need one template rule. Note the three calls to the rotencode named template (above) -- one for element names, one for attribute names, and one for attribute values:
<xsl:template match="*">
  <!-- Assign a variable for the ROT-13 version of the element's name -->
  <xsl:variable name="rot13_elem">
    <xsl:call-template name="rotencode">
      <xsl:with-param name="cleartext" select="name()"/>
    </xsl:call-template>
  </xsl:variable>
  <!-- Use the above calculated name as the NEW element's name -->
  <xsl:element name="{$rot13_elem}">
    <!-- Process each attribute for this element... -->
    <xsl:for-each select="@*">
      <!-- Set up another variable for the ROT-13 attribute name -->
      <xsl:variable name="rot13_attr">
        <xsl:call-template name="rotencode">
          <xsl:with-param name="cleartext" select="name()"/>
        </xsl:call-template>
      </xsl:variable>
      <!-- Create the attribute with the new name... -->
      <xsl:attribute name="{$rot13_attr}">
        <!-- ...and encode its value, as well. -->
        <xsl:call-template name="rotencode">
          <xsl:with-param name="cleartext" select="."/>
        </xsl:call-template>
      </xsl:attribute>
    </xsl:for-each>
    <!-- Process all children (elements and text) of the element -->
    <xsl:apply-templates/>
  </xsl:element>
</xsl:template>
We also need a whole new template rule for processing all text nodes. This overrides XSLT's default handling of such nodes, which is simply to copy them straight to the result tree, unchanged:
<xsl:template match="text()">
  <xsl:call-template name="rotencode">
    <xsl:with-param name="cleartext" select="."/>
  </xsl:call-template>
</xsl:template>
When you apply this stylesheet to the sample "invoice" document, the XSLT processor produces a result tree which looks as follows:
<vaibvpr vai_ahz="N755651635568">
  <phfgbzre_ahz>0283680</phfgbzre_ahz>
  <cnlzrag glcr="ivfn" pneq_ahz="6789012345"/>
  <vgrzf>
    <vgrz vgrz_ahz="543298">
      <dhnagvgl>6</dhnagvgl>
      <havg_cevpr pheerapl="HFQ">67(40</havg_cevpr>
    </vgrz>
    <vgrz vgrz_ahz="876377">
      <dhnagvgl>9</dhnagvgl>
      <havg_cevpr pheerapl="HFQ">340(40</havg_cevpr>
    </vgrz>
  </vgrzf>
</vaibvpr>
A more document-like, less data-like document might be something like:
<excerpt author="T.S. Eliot">I should have been
a pair of ragged claws</excerpt>
This translates, using the improved algorithm, to:
<rkprecg nhgube="G(F({Ryvbg">V{fubhyq{unir{orra{n{cnve{bs{enttrq{pynjf</rkprecg>
As I said in last August's column, this may be pretty effective at stopping a casual reader of the document. But naturally, it falls down as soon as the reader recognizes the document's ROT-13 nature, because she can fairly easily build a "de-ROT-13" routine to turn the document back into its cleartext form.
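Such a decoding routine might look like the sketch below. It is hypothetical: the variable names clear and coded are invented for illustration, and it assumes the same two character lists as the rotencode template, written with each string on one line and with entity references for the markup-significant characters. One detail is worth noticing. The punctuation portion of the list is shifted 16 places out of 30, so running the encoder a second time does not restore the original the way classic ROT-13 would; the sketch swaps the two lists instead.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The "translate from" and "translate to" lists of the encoder. -->
  <xsl:variable name="clear" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789,./;[]\ ~!@#$%^&amp;*()+`-={}|&quot;&lt;&gt;?'"/>
  <xsl:variable name="coded" select="'NOPQRSTUVWXYZABCDEFGHIJKLMnopqrstuvwxyzabcdefghijklm5678901234*()+`-={}|&quot;&lt;&gt;?,./;[]\ ~!@#$%^&amp;'"/>

  <!-- Rebuild each element under its decoded name. -->
  <xsl:template match="*">
    <xsl:element name="{translate(name(), $coded, $clear)}">
      <!-- Decode every attribute name and value. -->
      <xsl:for-each select="@*">
        <xsl:attribute name="{translate(name(), $coded, $clear)}">
          <xsl:value-of select="translate(., $coded, $clear)"/>
        </xsl:attribute>
      </xsl:for-each>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

  <!-- Decode text content. -->
  <xsl:template match="text()">
    <xsl:value-of select="translate(., $coded, $clear)"/>
  </xsl:template>
</xsl:stylesheet>
```

The encoded document should be fed in without added indentation: a space is itself an encoded character here, so any space introduced by pretty-printing would decode to a hyphen.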
Incidentally, continuing to discuss all this as "ROT-13" encoding is a little misleading. That name derived from the fact that 26 letters could be rotated 13 places to produce a simply coded result. What we've now got rotates 52 letters (including lower- and uppercase variants), 10 digits, and 30 punctuation characters. Thus, this form of the encoding might better be referred to as something like ROT-46, or maybe ROT-26,5,15. (Strictly speaking, the punctuation list as written above is shifted 16 places rather than 15, so unlike classic ROT-13 the encoding is not its own inverse; decoding means swapping the second and third arguments of translate().) If you're interested in pursuing this further on your own, you could rotate the characters an arbitrary number of places -- perhaps driven by a global parameter whose value is passed in from outside the stylesheet.
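One way to approach that exercise is sketched below. It is an assumption-laden starting point rather than a finished solution: the parameter name shift is hypothetical, it is assumed to be a non-negative integer, and only the 52 letters are rotated, to keep the example short. The rotated alphabet is built with the XPath substring() and concat() functions instead of being typed out by hand.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical global parameter; most processors let you set it
       from the command line or the calling program. -->
  <xsl:param name="shift" select="13"/>
  <!-- Reduce the shift to the range 0-25. -->
  <xsl:variable name="n" select="$shift mod 26"/>

  <xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'"/>

  <!-- Rotated alphabet: the tail starting at position n+1, followed
       by the first n letters. -->
  <xsl:variable name="from" select="concat($upper, $lower)"/>
  <xsl:variable name="to" select="concat(
      substring($upper, $n + 1), substring($upper, 1, $n),
      substring($lower, $n + 1), substring($lower, 1, $n))"/>

  <xsl:template match="*">
    <xsl:element name="{translate(name(), $from, $to)}">
      <xsl:for-each select="@*">
        <xsl:attribute name="{translate(name(), $from, $to)}">
          <xsl:value-of select="translate(., $from, $to)"/>
        </xsl:attribute>
      </xsl:for-each>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:value-of select="translate(., $from, $to)"/>
  </xsl:template>
</xsl:stylesheet>
```

With the default of 13 this reproduces the letter mapping used throughout the column. For any other value, decoding means running the same stylesheet with shift set to 26 minus the original value.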
Next month, once again, it's back to the world of real questions faced by real readers. And thanks, as always, for keeping those questions coming.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.