Special Characters, Database Mappings
Q: My understanding of "well-formedness" in XML terms is
that characters such as & (ampersand) and
' (apostrophe or single quote) are not allowed in the
body of XML text. So the following phrase is illegal, although
grammatically correct:
Our organization's approach to staff recruitment
& retention is...
This seems crazy: millions of databases will have these characters somewhere within their text, and it doesn't make sense that all XSLT transformations are just waiting to fail.
I have also read that the only way stop my XSLT from crashing when it finds an ampersand or a quote at this point is to ask the developer of the source XML to ensure that it includes a CDATA statement. This too is worrying as it means my project to load data to my database is now dependent on persuading another overstretched developer to work on my project.
Is there anything I can do in my stylesheet or elsewhere to ensure that my XSLT will work even if the source data contains an ampersand or apostrophe?
A: The XML Q&A column has dealt before with the issue of
special characters -- ampersands, < and
> symbols, and so on. (See, for example, "Strange
Transformations"; "DTDs, Industry
Markup Languages, and Special Characters"; and "Handling Special
Content." Also be sure to read the reader comments on those
columns, which may provide other alternatives or insights.) I won't
repeat here the reasons why these characters are set aside as
"special." But in general, you raise two concerns. Let's see
if I can restate them briefly:
Issue #1: I do see the worry. But it's demonstrably false. XML has been around for over five years and embraced in that time by a broad spectrum of Web-application vendors (all of whom care about existing Web sites and many of whom care about interacting with databases) and database vendors and developers (all of whom care about databases, obviously, and nearly all of whom care about the database-to-Web issues). XSLT has been around nearly as long as XML itself and happily coexists in both the Web and database universes.
Which doesn't really answer your concern, although I hope it allays it a little. Put yet another way, the concern might be phrased thus: Why does XSLT require source trees to be well-formed XML? The answer is that XSLT was developed to solve the problem of turning an XML source tree into a result tree, which may be a document in the same or a different XML vocabulary or even just plain text (e.g., a flat file of comma-separated values). The W3C Working Group (WG) with responsibility for specifying XSLT wasn't charged with turning an arbitrary source document -- one unrecognizable as XML -- into such a result tree. I suspect if they had been so charged, they'd still be bickering over the spec's Introduction.
What you can do is this: build (or acquire) a preprocessor to read in the not-quite-XML source, converting all the literal ampersands and what-not into their entity-reference equivalents and writing out true XML. It will require a little up-front work but eliminates the long-term worry.
Issue #2: Yes, an XSLT processor, just like any other application which expects legitimate XML as input, will choke on ampersands, less-than symbols, and so on instead of their entity-reference forms.
If you use a GUI-based XHTML or HTML editor, you may have noticed
that you're free to enter any old character into a document, even the
"dangerous" markup-significant ones. What's more, the editor
even shows you a literal ampersand, instead of something horrible like
&. If you examine the raw source behind the GUI
cosmetics, though, you'll find entity references scattered around even
though you well know you didn't key them in yourself. The editor is in
effect mediating between the markup- and non-markup-based worlds in
the same way that your preprocessor would need to do.
The point: no, you don't need to wrap your special-character content in CDATA sections. You only need to do so if for some reason the entity-reference forms of these characters are absolutely not available, by any means, at the time of the XML source document's creation.
By the way, the number of restricted characters is really quite
small. The ampersand and less-than symbols -- both of which signal to
an XML processor, "Markup ahead...!" -- are pretty much the
only ones you need to watch out for. The restriction on the
greater-than symbol is provided for compatibility with older SGML
applications; if you're not using such an application, you don't need
to worry about stray >s. As for the single and double
quotes in plain old element (i.e., text node) content, you needn't
escape them at all. The entity-reference forms are provided for use in
attribute values, and even there aren't outright required. See section
2.4 of the XML spec, "Character Data and
Markup," for the details.)
In last
month's column I covered some things you might consider when
moving from a database to XML or vice versa. The particular focus was
on database field and column names containing characters which are
illegal in XML. For example, a field name might be
Grantee's_ID_#, which can't be carried over directly to
an XML element name because of the apostrophe. The questioner's
solution was to "sanitize" the database names by removing
illegal characters or by replacing them with legal ones. This solution
wasn't perfect, though. In particular, it led to the prospect of
duplicate identifier names.
Ultimately, I suggested a convention which would relegate each field's name to an attribute value, since they are free of the constraints placed on XML identifier names.
In a comment posted shortly after that column's publication, technical writer, editor, and Oracle database guru Jonathan Gennick directed me to an emerging ISO/ANSI standard called SQLX. Billed as the place "where SQL meets XML,", SQLX is a joint effort by representatives of IBM, Oracle, Sybase, Microsoft, and Northrop-Grumman to establish a standard for the "ways in which Database Language SQL can be used in conjunction with XML." As Gennick says, "no need to reinvent the wheel" by proposing some alternative method of mapping identifiers.
|
Also in XML Q&A | |
In the current
working draft of the SQLX spec, chapter 7 (beginning on page 31)
covers the mapping of identifiers between database and XML data
sources. The SQLX approach to the illegal-characters dilemma is
technically clever, although it will likely cause befuddlement in some
circles. The basic idea starts with recognizing that every character
-- including apostrophes, ampersands and the like -- can be
represented by a hexadecimal Unicode value. For example, a lowercase
a can be represented as a Unicode x0061,
capital N as x004e, and so on.
Given a database identifier like Grantee's_ID_#, SQLX
proposes first that all the XML-legal characters be mapped
directly. The XML-illegal characters would then be mapped using their
Unicode equivalents, preceded and followed by an underscore. Thus, the
Grantee's_ID_# field in a database would
become a Grantee_x0027_s_ID_# element in
the XML-document counterpart.
Many thanks to Gennick for the heads-up on SQLX.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.