Special Characters, Database Mappings

February 26, 2003

I Must Have Ampersands and Apostrophes!

Q: My understanding of "well-formedness" in XML terms is that characters such as & (ampersand) and ' (apostrophe or single quote) are not allowed in the body of XML text. So the following phrase is illegal, although grammatically correct:

Our organization's approach to staff recruitment & retention is...

This seems crazy: millions of databases will have these characters somewhere within their text, and it doesn't make sense that all XSLT transformations are just waiting to fail.

I have also read that the only way stop my XSLT from crashing when it finds an ampersand or a quote at this point is to ask the developer of the source XML to ensure that it includes a CDATA statement. This too is worrying as it means my project to load data to my database is now dependent on persuading another overstretched developer to work on my project.

Is there anything I can do in my stylesheet or elsewhere to ensure that my XSLT will work even if the source data contains an ampersand or apostrophe?

A: The XML Q&A column has dealt before with the issue of special characters -- ampersands, < and > symbols, and so on. (See, for example, "Strange Transformations"; "DTDs, Industry Markup Languages, and Special Characters"; and "Handling Special Content." Also be sure to read the reader comments on those columns, which may provide other alternatives or insights.) I won't repeat here the reasons why these characters are set aside as "special." But in general, you raise two concerns. Let's see if I can restate them briefly:

XML's restrictive use of certain special characters will prevent XSLT's use with existing Web sites and database applications.
An XSLT transformation will break when it encounters any of these special characters in the source tree -- unless they are wrapped in a CDATA section.

Issue #1: I do see the worry. But it's demonstrably false. XML has been around for over five years and embraced in that time by a broad spectrum of Web-application vendors (all of whom care about existing Web sites and many of whom care about interacting with databases) and database vendors and developers (all of whom care about databases, obviously, and nearly all of whom care about the database-to-Web issues). XSLT has been around nearly as long as XML itself and happily coexists in both the Web and database universes.

Which doesn't really answer your concern, although I hope it allays it a little. Put yet another way, the concern might be phrased thus: Why does XSLT require source trees to be well-formed XML? The answer is that XSLT was developed to solve the problem of turning an XML source tree into a result tree, which may be a document in the same or a different XML vocabulary or even just plain text (e.g., a flat file of comma-separated values). The W3C Working Group (WG) with responsibility for specifying XSLT wasn't charged with turning an arbitrary source document -- one unrecognizable as XML -- into such a result tree. I suspect if they had been so charged, they'd still be bickering over the spec's Introduction.

What you can do is this: build (or acquire) a preprocessor to read in the not-quite-XML source, converting all the literal ampersands and what-not into their entity-reference equivalents and writing out true XML. It will require a little up-front work but eliminates the long-term worry.

Issue #2: Yes, an XSLT processor, just like any other application which expects legitimate XML as input, will choke on ampersands, less-than symbols, and so on instead of their entity-reference forms.

If you use a GUI-based XHTML or HTML editor, you may have noticed that you're free to enter any old character into a document, even the "dangerous" markup-significant ones. What's more, the editor even shows you a literal ampersand, instead of something horrible like &. If you examine the raw source behind the GUI cosmetics, though, you'll find entity references scattered around even though you well know you didn't key them in yourself. The editor is in effect mediating between the markup- and non-markup-based worlds in the same way that your preprocessor would need to do.

The point: no, you don't need to wrap your special-character content in CDATA sections. You only need to do so if for some reason the entity-reference forms of these characters are absolutely not available, by any means, at the time of the XML source document's creation.

By the way, the number of restricted characters is really quite small. The ampersand and less-than symbols -- both of which signal to an XML processor, "Markup ahead...!" -- are pretty much the only ones you need to watch out for. The restriction on the greater-than symbol is provided for compatibility with older SGML applications; if you're not using such an application, you don't need to worry about stray >s. As for the single and double quotes in plain old element (i.e., text node) content, you needn't escape them at all. The entity-reference forms are provided for use in attribute values, and even there aren't outright required. See section 2.4 of the XML spec, "Character Data and Markup," for the details.)

Database/XML mapping redux

In last month's column I covered some things you might consider when moving from a database to XML or vice versa. The particular focus was on database field and column names containing characters which are illegal in XML. For example, a field name might be Grantee's_ID_#, which can't be carried over directly to an XML element name because of the apostrophe. The questioner's solution was to "sanitize" the database names by removing illegal characters or by replacing them with legal ones. This solution wasn't perfect, though. In particular, it led to the prospect of duplicate identifier names.

Ultimately, I suggested a convention which would relegate each field's name to an attribute value, since they are free of the constraints placed on XML identifier names.

In a comment posted shortly after that column's publication, technical writer, editor, and Oracle database guru Jonathan Gennick directed me to an emerging ISO/ANSI standard called SQLX. Billed as the place "where SQL meets XML,", SQLX is a joint effort by representatives of IBM, Oracle, Sybase, Microsoft, and Northrop-Grumman to establish a standard for the "ways in which Database Language SQL can be used in conjunction with XML." As Gennick says, "no need to reinvent the wheel" by proposing some alternative method of mapping identifiers.

Also in XML Q&A

From English to Dutch?

Trickledown Namespaces?

From XML to SMIL

From One String to Many

Getting in Touch with XML Contacts

In the current working draft of the SQLX spec, chapter 7 (beginning on page 31) covers the mapping of identifiers between database and XML data sources. The SQLX approach to the illegal-characters dilemma is technically clever, although it will likely cause befuddlement in some circles. The basic idea starts with recognizing that every character -- including apostrophes, ampersands and the like -- can be represented by a hexadecimal Unicode value. For example, a lowercase a can be represented as a Unicode x0061, capital N as x004e, and so on.

Given a database identifier like Grantee's_ID_#, SQLX proposes first that all the XML-legal characters be mapped directly. The XML-illegal characters would then be mapped using their Unicode equivalents, preceded and followed by an underscore. Thus, the Grantee's_ID_# field in a database would become a Grantee_x0027_s_ID_# element in the XML-document counterpart.

Many thanks to Gennick for the heads-up on SQLX.