April 24, 2002
Q: Can I un-CDATA my CDATA section?
I have some HTML tags embedded in a CDATA section. (I didn't write the source document!)
When my XSLT translates the document into HTML for a browser, the tags in the CDATA section (<i>, for example) are delivered to the browser as &lt;i&gt;. Is there anything I can do to prevent this translation?
A: You didn't provide a sample document fragment, but I assume from your description that you have to deal with something like the following in your source document:
<true_xmlwrapper>
<![CDATA[
<html>
  <head><title>Weird Embedded Markup</title></head>
  <body>
    <h1>Someone thought he was being clever...!</h1>
    <p><em>[etc.]</em></p>
  </body>
</html>
]]>
</true_xmlwrapper>
I assume further that what you'd hope to transform the above into -- the result tree -- would be something like:
<html>
  <head><title>Weird Embedded Markup</title></head>
  <body>
    <h1>Someone thought he was being clever...!</h1>
    <p><em>[etc.]</em></p>
  </body>
</html>
And, if these assumptions are correct, you've probably got an XSLT stylesheet with a template rule such as:
<xsl:template match="true_xmlwrapper">
  <xsl:value-of select="."/>
</xsl:template>
As you've probably discovered, this solves one problem -- it strips the opening <![CDATA[ and closing ]]> delimiters. What it writes out to the result tree, though, isn't the desired nice and neat HTML code but rather the following:
&lt;html&gt;
  &lt;head&gt;&lt;title&gt;Weird Embedded Markup&lt;/title&gt;&lt;/head&gt;
  &lt;body&gt;
    &lt;h1&gt;Someone thought he was being clever...!&lt;/h1&gt;
    &lt;p&gt;&lt;em&gt;[etc.]&lt;/em&gt;&lt;/p&gt;
  &lt;/body&gt;
&lt;/html&gt;
There is a strange kind of correspondence between the desired and actual results. What the actual result tree is saying might be translated as "The angle brackets in the following lines are not to be treated as markup delimiters, but as literal characters." And guess what? That's exactly how the CDATA section in this (or any other) source document suggests markup-significant characters should be treated. Whoever created that document evidently imagined him or herself to be doing the downstream application a favor -- as though by shrouding the embedded HTML markup in a CDATA section it was protected from tampering by alien forces (like one of those blasted XSLT processors). In fact, what wrapping in CDATA did was to announce to any markup-aware application, "This looks like markup but really isn't -- it's not even HTML." Under the circumstances, the assumptions made by the XSLT processor are quite reasonable.
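To make the point about CDATA concrete, here is a small sketch. The element names notes and note are hypothetical, chosen only for illustration. To an XML parser, the two note elements below contain exactly the same character data: a CDATA section is just another way of escaping markup-significant characters.

```xml
<?xml version="1.0"?>
<notes>
  <!-- CDATA form: the angle brackets are literal characters, not tags -->
  <note><![CDATA[<i>emphasis</i>]]></note>
  <!-- Entity-escaped form: same string value as the note above -->
  <note>&lt;i&gt;emphasis&lt;/i&gt;</note>
</notes>
```

In both cases the note element has a single text node as its child and no i element anywhere in the tree, which is why an XSLT processor sees text rather than markup.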
All that said, here's something for you to try. (It's worked for me with both the MSXML and Saxon XSLT processors.) In your XSLT stylesheet, include this top-level element:
<xsl:output method="text"/>
This approach may seem counterintuitive, even weird. After all, if the problem resides in the input side of the transformation, what good would specifying the output's characteristics do?
But in the absence of any xsl:output element at all, the XSLT processor attempts to figure out the stylesheet's intentions by examining the result tree from the transformation. This figuring-out uses a series of tests whose purpose is to determine whether the result tree is HTML (and by default, the version is HTML 4.0, not XHTML); if not, the result tree is assumed to be a well-formed XML general parsed entity. (Such an entity may or may not be a well-formed document. For instance, the root node may contain two child elements.) The four tests of an HTML result tree (and all must be passed) are:
- the result tree's root node has an element child (that is, it has a root element);
- the local name of the root element (discounting any namespace prefix) is "html";
- the root html element has no namespace URI associated with it; and
- the only text nodes preceding the result tree's root element are whitespace-only text nodes.
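As a rough illustration of these tests, the following minimal stylesheet (a sketch, not taken from the question; the title and paragraph text are made up) has no xsl:output element, yet its result tree should pass all four tests: the root node has one element child, its local name is "html", it is in no namespace, and nothing but whitespace precedes it.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- No xsl:output element: the processor picks the output method itself -->
  <xsl:template match="/">
    <!-- Root element named html, with no namespace declared for it -->
    <html>
      <head><title>Default Output Method</title></head>
      <body>
        <p>First line<br/>second line</p>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

A conforming processor should serialize this result as HTML. By contrast, adding a default namespace declaration such as xmlns="http://www.w3.org/1999/xhtml" to the html element would give it a namespace URI, so the third test would fail and the result would be serialized as XML.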
In the case of a document like the one you describe, these tests are almost immaterial: no matter how much it looks like it contains markup, a CDATA section by definition contains only literal text. So by default, there is no "root element" in the above result tree, html or anything else. There's just a string of literal characters which happens to start with a literal < character. Since the result tree fails the HTML test, the processor guesses the result tree is simply a well-formed general parsed entity -- consisting, in this case, of a single text node.
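But by specifying method="text", you short-circuit the processor's default behaviors, instructing it not to make any assumptions at all about the nature of the result tree. The text output method writes out the string value of every text node as is, with no escaping, so the literal < characters reach the browser as real angle brackets.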
(There are two dangers in using this little trick, by the way. First, it's global: you can't apply it selectively to some sections of the source/result trees but not to others. Second, and more importantly, if the "markup" within the CDATA section isn't well-formed, it will simply be passed without complaint to the result tree. If the downstream application meant to consume this result tree is XML- or HTML-aware, you may be faced with disastrous downstream complications.)
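Putting the pieces of this answer together, a complete stylesheet might look like the sketch below. It assumes the source document resembles the true_xmlwrapper fragment shown earlier; with a differently structured source, the match pattern would need to change.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Serialize the result as plain text: < and & are written unescaped -->
  <xsl:output method="text"/>

  <!-- Write out the wrapper's string value, i.e. the former CDATA content -->
  <xsl:template match="true_xmlwrapper">
    <xsl:value-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

Note that any text elsewhere in the source document is also copied through by the built-in template rules, and that literal result elements in the stylesheet would lose their tags under the text method. That is the "global" danger mentioned above.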
Q: I keep losing a trailing space inside my empty-element tags.
To keep my XHTML compatible with older browsers (like Netscape 4.77), my XSLT transformation includes a space before the trailing slash on empty XHTML elements, like this:
<xsl:template match="model/name">
  <em>Model Name: </em>
  <xsl:apply-templates/><br />
  <!-- Note space          ^ -->
</xsl:template>
However, the transformation ends up looking something like
<em>Model Name: </em> Nimbus 2000<br/>
<!-- No space                       ^ -->
That's fine for newer browsers, but older browsers don't recognize the <br/> tag, and hence ignore it, which is just no good. I've looked at a number of techniques for controlling whitespace in XML (Bob DuCharme's series, for instance), but all of these techniques focus on the content of elements, not the element tags themselves. I recognize that XML has its reasons for handling whitespace the way it does, and that from an XML perspective trying to control whitespace within a tag is a little batty. But does anyone know of a workaround, short of fixing it with, say, a Perl script after the transformation?
A: A Perl script? After the transformation? I mean, I love Perl, but still.... There are a couple of approaches to resolve this problem.
First, remember that an empty element can be represented by a contiguous start tag/end tag pair, like:
<br></br>
So you may be able to put a <br></br> pair into the result tree instead of the empty-tag form, <br/> (with or without the space before the slash). One problem with this solution is that some versions of older browsers may interpret the pair as two br elements in sequence.
A better solution is a variation of the answer to the first question in this month's column. As I described above, the XSLT processor makes an educated guess about the result tree. I don't know why this educated guess is failing to recognize your result tree as HTML 4.0 (which is readable by both older and newer browsers). But you can force the interpretation with this top-level element:
<xsl:output method="html"/>
In this case, for instance, when your stylesheet includes an XML-compliant <br/> tag (again, with or without the space), a compliant processor will output it in the HTML-compliant form, <br>.
I realize this may introduce an unwanted wrinkle to your problem; it forces the result tree to be not XHTML, just plain old dumb HTML 4.0. Unfortunately we're at a transitional stage in both browser and XHTML development. If I were you, I'd leverage the still-forgiving nature of the newer browsers rather than coding to XHTML strict standards and hoping that older browsers will somehow function as expected. (They often didn't comply with standards in place at the time the browsers were built; it's no wonder they adhere to newer standards even less rigorously.)
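For reference, here is how the html output method might be combined with the template from the question. This is a sketch only. It assumes the rest of the stylesheet is unchanged and that the literal result elements (em, br) are in no namespace, since the html output method only treats no-namespace elements as HTML.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Force HTML serialization instead of relying on the processor's guess -->
  <xsl:output method="html"/>

  <xsl:template match="model/name">
    <em>Model Name: </em>
    <!-- Written as an XML empty element here; serialized as <br> on output -->
    <xsl:apply-templates/><br/>
  </xsl:template>

</xsl:stylesheet>
```

The stylesheet itself must still be well-formed XML, so the br element keeps its XML empty-element form in the source. Only the serialized output changes.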