Special Characters, Database Mappings
February 26, 2003
I Must Have Ampersands and Apostrophes!
Q: My understanding of "well-formedness" in XML terms is that
characters such as
& (ampersand) and
' (apostrophe or single
quote) are not allowed in the body of XML text. So the following phrase is illegal,
Our organization's approach to staff recruitment & retention
This seems crazy: millions of databases will have these characters somewhere within their text, and it doesn't make sense that all XSLT transformations are just waiting to fail.
I have also read that the only way stop my XSLT from crashing when it finds an ampersand or a quote at this point is to ask the developer of the source XML to ensure that it includes a CDATA statement. This too is worrying as it means my project to load data to my database is now dependent on persuading another overstretched developer to work on my project.
Is there anything I can do in my stylesheet or elsewhere to ensure that my XSLT will work even if the source data contains an ampersand or apostrophe?
A: The XML Q&A column has dealt before with the issue of special
characters -- ampersands,
> symbols, and so on. (See,
for example, "Strange
Transformations"; "DTDs, Industry Markup Languages, and Special Characters"; and "Handling Special Content."
Also be sure to read the reader comments on those columns, which may provide other
alternatives or insights.) I won't repeat here the reasons why these characters are
aside as "special." But in general, you raise two concerns. Let's see if I can
restate them briefly:
- XML's restrictive use of certain special characters will prevent XSLT's use with existing Web sites and database applications.
- An XSLT transformation will break when it encounters any of these special characters in the source tree -- unless they are wrapped in a CDATA section.
Issue #1: I do see the worry. But it's demonstrably false. XML has been around for over five years and embraced in that time by a broad spectrum of Web-application vendors (all of whom care about existing Web sites and many of whom care about interacting with databases) and database vendors and developers (all of whom care about databases, obviously, and nearly all of whom care about the database-to-Web issues). XSLT has been around nearly as long as XML itself and happily coexists in both the Web and database universes.
Which doesn't really answer your concern, although I hope it allays it a little. Put yet another way, the concern might be phrased thus: Why does XSLT require source trees to be well-formed XML? The answer is that XSLT was developed to solve the problem of turning an XML source tree into a result tree, which may be a document in the same or a different XML vocabulary or even just plain text (e.g., a flat file of comma-separated values). The W3C Working Group (WG) with responsibility for specifying XSLT wasn't charged with turning an arbitrary source document -- one unrecognizable as XML -- into such a result tree. I suspect if they had been so charged, they'd still be bickering over the spec's Introduction.
What you can do is this: build (or acquire) a preprocessor to read in the not-quite-XML source, converting all the literal ampersands and what-not into their entity-reference equivalents and writing out true XML. It will require a little up-front work but eliminates the long-term worry.
Issue #2: Yes, an XSLT processor, just like any other application which expects legitimate XML as input, will choke on ampersands, less-than symbols, and so on instead of their entity-reference forms.
If you use a GUI-based XHTML or HTML editor, you may have noticed that you're free
any old character into a document, even the "dangerous" markup-significant ones.
What's more, the editor even shows you a literal ampersand, instead of something horrible
&. If you examine the raw source behind the GUI cosmetics, though,
you'll find entity references scattered around even though you well know you didn't
in yourself. The editor is in effect mediating between the markup- and non-markup-based
worlds in the same way that your preprocessor would need to do.
The point: no, you don't need to wrap your special-character content in CDATA sections. You only need to do so if for some reason the entity-reference forms of these characters are absolutely not available, by any means, at the time of the XML source document's creation.
By the way, the number of restricted characters is really quite small. The ampersand
less-than symbols -- both of which signal to an XML processor, "Markup ahead...!"
-- are pretty much the only ones you need to watch out for. The restriction on the
greater-than symbol is provided for compatibility with older SGML applications; if
not using such an application, you don't need to worry about stray
for the single and double quotes in plain old element (i.e., text node) content, you
escape them at all. The entity-reference forms are provided for use in attribute values,
even there aren't outright required. See section 2.4 of the XML spec, "Character Data and Markup," for the
Database/XML mapping redux
In last month's column I covered
some things you might consider when moving from a database to XML or vice versa. The
particular focus was on database field and column names containing characters which
illegal in XML. For example, a field name might be
Grantee's_ID_#, which can't
be carried over directly to an XML element name because of the apostrophe. The questioner's
solution was to "sanitize" the database names by removing illegal characters or by
replacing them with legal ones. This solution wasn't perfect, though. In particular,
to the prospect of duplicate identifier names.
Ultimately, I suggested a convention which would relegate each field's name to an attribute value, since they are free of the constraints placed on XML identifier names.
In a comment posted shortly after that column's publication, technical writer, editor, and Oracle database guru Jonathan Gennick directed me to an emerging ISO/ANSI standard called SQLX. Billed as the place "where SQL meets XML,", SQLX is a joint effort by representatives of IBM, Oracle, Sybase, Microsoft, and Northrop-Grumman to establish a standard for the "ways in which Database Language SQL can be used in conjunction with XML." As Gennick says, "no need to reinvent the wheel" by proposing some alternative method of mapping identifiers.
Also in XML Q&A
In the current working draft of
the SQLX spec, chapter 7 (beginning on page 31) covers the mapping of identifiers
database and XML data sources. The SQLX approach to the illegal-characters dilemma
technically clever, although it will likely cause befuddlement in some circles. The
idea starts with recognizing that every character -- including apostrophes, ampersands
the like -- can be represented by a hexadecimal Unicode value. For example, a lowercase
a can be represented as a Unicode
x004e, and so on.
Given a database identifier like
Grantee's_ID_#, SQLX proposes first that all
the XML-legal characters be mapped directly. The XML-illegal characters would then
using their Unicode equivalents, preceded and followed by an underscore. Thus, the
Grantee's_ID_# field in a database would become a
Grantee_x0027_s_ID_# element in the XML-document
Many thanks to Gennick for the heads-up on SQLX.