
Strange Transformations

April 24, 2002

John E. Simpson

Q: Can I un-CDATA my CDATA section?

I have some HTML tags embedded in a CDATA section. (I didn't write the source document!) When my XSLT translates the document into HTML for a browser, the tags in the CDATA section marked as <i> are delivered to the browser as &lt;i&gt;.

Is there anything I can do to prevent this translation?

A: You didn't provide a sample document fragment, but I assume from your description that you have to deal with something like the following in your source document:

<true_xmlwrapper>

   <![CDATA[

      <html>

         <head><title>Weird Embedded Markup</title></head>

         <body>

            <h1>Someone thought he was being clever...!</h1>

            <p><em>[etc.]</em></p>

         </body>

      </html>

   ]]>

</true_xmlwrapper>

I assume further that what you'd hope to transform the above into -- the result tree -- would be something like:

<html>

   <head><title>Weird Embedded Markup</title></head>

   <body>

      <h1>Someone thought he was being clever...!</h1>

      <p><em>[etc.]</em></p>

   </body>

</html>

And, if these assumptions are correct, you've probably got an XSLT stylesheet with a template rule such as:

<xsl:template match="true_xmlwrapper">

   <xsl:value-of select="."/>

</xsl:template>

As you've probably discovered, this solves one problem: it strips the opening and closing <![CDATA[ and ]]> delimiters. What it writes out to the result tree, though, isn't the desired nice and neat HTML code but rather the quite ugly

&lt;html&gt;

   &lt;head&gt;&lt;title&gt;Weird Embedded Markup&lt;/title&gt;&lt;/head&gt;

   &lt;body&gt;

      &lt;h1&gt;Someone thought he was being clever...!&lt;/h1&gt;

      &lt;p&gt;&lt;em&gt;[etc.]&lt;/em&gt;&lt;/p&gt;

   &lt;/body&gt;

&lt;/html&gt;

There is a strange kind of correspondence between the desired and actual results. What the actual result tree is saying might be translated as "The angle brackets in the following lines are not to be treated as markup delimiters, but as literal characters." And guess what? That's exactly how the CDATA section in this (or any other) source document suggests markup-significant characters should be treated. Whoever created that document evidently imagined him or herself to be doing the downstream application a favor -- as though by shrouding the embedded HTML markup in a CDATA section it was protected from tampering by alien forces (like one of those blasted XSLT processors). In fact, what wrapping in CDATA did was to announce to any markup-aware application, "This looks like markup but really isn't -- it's not even HTML." Under the circumstances, the assumptions made by the XSLT processor are quite reasonable.

All that said, here's something for you to try. (It's worked for me with both the MSXML and Saxon XSLT processors.) In your XSLT stylesheet, include this top-level element:

<xsl:output method="text"/>

This approach may seem counterintuitive, even weird. After all, if the problem resides in the input side of the transformation, what good would specifying the output's characteristics do?

But in the absence of any xsl:output element at all, the XSLT processor attempts to figure out the stylesheet's intentions by examining the result tree from the transformation. This figuring-out uses a series of tests whose purpose is to determine whether the result tree is HTML (and by default, the version is HTML 4.0, not XHTML); if not, the result tree is assumed to be a well-formed XML general parsed entity. (Such an entity may or may not be a well-formed document. For instance, the root node may contain two child elements.) The four tests of an HTML result tree (and all must be true) are

  • the result tree's root node has an element child (that is, it has a root element);
  • the local name of the root element (discounting any namespace prefix) is "html", in any combination of upper and lower case;
  • the root html element has no namespace URI associated with it; and
  • the only text nodes preceding the result tree's root element are whitespace-only text nodes.

In the case of a document like the one you describe, these tests are almost immaterial: no matter how much it looks like it contains markup, a CDATA section by definition contains only literal text. So by default, there is no "root element" in the above result tree, html or otherwise. There's just a string of literal characters which happens to start with a literal < character. Since the result tree fails the HTML test, the processor guesses the result tree is simply a well-formed general parsed entity consisting, in this case, of a single text node.
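
For contrast, here is a minimal sketch of a stylesheet whose result tree does pass all four tests. It is illustrative only: the template body and the title text are made up, not taken from the question. Because there is no xsl:output element and the first thing written to the result tree is a literal html element in no namespace, a conforming processor should default to the HTML output method.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- No xsl:output here, so the processor has to guess the output method -->

   <xsl:template match="/">
      <!-- Root element: local name "html", no namespace URI, no text before it -->
      <html>
         <head><title>Passes the HTML test</title></head>
         <body>
            <xsl:apply-templates/>
         </body>
      </html>
   </xsl:template>

</xsl:stylesheet>
```

Change the literal element to <html xmlns="http://www.w3.org/1999/xhtml"> and the third test fails, so the default should fall back to XML output.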

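But by specifying method="text", you short-circuit the processor's default behaviors. The text output method writes out the string value of every text node in the result tree with no escaping at all, so the literal < and > characters reach the browser as < and >, where they are read as markup.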
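
Putting the pieces together, a complete stylesheet might look like the following sketch. It assumes the true_xmlwrapper element name from the sample document above; everything else is standard XSLT 1.0.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- Text output method: text nodes are written with no escaping -->
   <xsl:output method="text"/>

   <!-- The string value of the wrapper is the CDATA content, minus delimiters -->
   <xsl:template match="true_xmlwrapper">
      <xsl:value-of select="."/>
   </xsl:template>

</xsl:stylesheet>
```

Bear in mind that the text method writes only text nodes, so any literal result elements produced elsewhere in the same stylesheet would lose their tags in the output.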

(There are two dangers in using this little trick, by the way. First, it's global: you can't apply it selectively to some sections of the source/result trees but not to others. Second, and more importantly, if the "markup" within the CDATA section isn't well-formed, it will simply be passed without complaint to the result tree. If the downstream application meant to consume this result tree is XML- or HTML-aware, you may be faced with disastrous downstream complications.)
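
If the global nature of method="text" is a problem, XSLT 1.0 offers a more selective switch: the disable-output-escaping attribute on xsl:value-of (and xsl:text). This is a different technique from the one described above, and a processor is not required to support it, so treat the following as a sketch to test against your own processor rather than a guaranteed fix. It again assumes the true_xmlwrapper element from the sample document.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- Ordinary HTML output; other templates keep their normal escaping -->
   <xsl:output method="html"/>

   <xsl:template match="true_xmlwrapper">
      <!-- Only this one text node is written without escaping -->
      <xsl:value-of select="." disable-output-escaping="yes"/>
   </xsl:template>

</xsl:stylesheet>
```

The attribute takes effect only when the processor itself serializes the result, and the second danger still applies: malformed "markup" in the CDATA section passes through unchecked.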

Q: I keep losing a trailing space inside my empty-element tags.

To keep my XHTML compatible with older browsers (like Netscape 4.77), my XSLT transformation includes a space before the trailing slash on empty XHTML elements, like this:

<xsl:template match="model/name">

   <em>Model Name: </em> 

   <xsl:apply-templates/><br />

            <!-- Note space ^ -->

</xsl:template>

However, the transformation ends up looking something like

<em>Model Name: </em> Nimbus 2000<br/>
                      <!-- No space ^ -->

That's fine for newer browsers, but older browsers don't recognize <br/> as a <br> tag, and hence ignore it, which is just no good. I've looked at a number of techniques for controlling whitespace in XML (Bob DuCharme's series, for instance), but all of these techniques focus on the content of elements, not the element tags themselves. I recognize that XML has its reasons for handling whitespace the way it does, and that from an XML perspective trying to control whitespace within a tag is a little batty. But does anyone know of a workaround, short of fixing it with, say, a Perl script after the transformation?

A: A Perl script? After the transformation? <shudder/> I mean, I love Perl, but still.... There are a couple of approaches to resolve this issue.

First, remember that an empty element can be represented by a contiguous start tag/end tag pair, like:

<br></br>

So you may be able to put this into the result tree instead of the empty-tag form, <br/> (with or without the space before the slash). One problem with this solution is that some versions of older browsers may interpret this as two br elements in sequence.

A better solution is a variation of the answer to the first question in this month's column. As I described above, the XSLT processor makes an educated guess about the result tree. I don't know why this educated guess is failing to recognize your result tree as HTML 4.0 (which is readable by both older and newer browsers). But you can force the interpretation with this top-level element:

<xsl:output method="html"/>

In this case, for instance, when your stylesheet includes an XML-compliant <br/> tag (again, with or without the space), a compliant processor will output it in the HTML-compliant <br> form.
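
In context, the declaration sits alongside the template rule from the question. The sketch below assumes that the literal result elements (em and br) are in no namespace; the rest of the stylesheet is omitted.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- Force HTML serialization instead of relying on the processor's guess -->
   <xsl:output method="html"/>

   <xsl:template match="model/name">
      <em>Model Name: </em>
      <xsl:apply-templates/>
      <!-- Must be well-formed XML here; the HTML output method writes it as <br> -->
      <br/>
   </xsl:template>

</xsl:stylesheet>
```

One caveat: the HTML output method applies its special rules only to elements with no namespace URI. If your stylesheet places its literal result elements in the XHTML namespace, a conforming processor should still serialize br in its XML form.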

I realize this may introduce an unwanted wrinkle to your problem; it forces the result tree to be not XHTML, just plain old dumb HTML 4.0. Unfortunately we're at a transitional stage in both browser and XHTML development. If I were you, I'd leverage the still-forgiving nature of the newer browsers rather than coding to XHTML strict standards and hoping that older browsers will somehow function as expected. (They often didn't comply with standards in place at the time the browsers were built; it's no wonder they adhere to newer standards even less rigorously.)