Databases and Element Names

January 29, 2003

XML-illegal characters in database-legal field names?

Q: I have a database that contains "unconventional" field names (i.e., Book_&_Page, Grantee's_ID_#). These field names do not meet the requirements for element names, so I am forced to run them through a "sanitizing" function before naming the element nodes. The function replaces or removes the characters offensive to XML. So the Book_&_Page field might become the Book__Page element, and Grantee's_ID_ might become Grantees_ID_no because I've chosen to sanitize simply by removing the offending characters.

Unfortunately, this sanitizing process introduces the possibility that I could end up with elements with the same name, although in the database they are named differently. For example, if there were original fields named Book_&_Page and Book_:_Page, sanitizing by the rule of removing the character would result in Book__Page for both of these fields. (Changing the field names in the database is NOT an option.)

The initial dilemma is deciding what rules to apply to what characters in the sanitizing function. Coming up with a generic set of rules that could apply to multiple datasets over multiple systems seems risky. I am hoping that someone has come up with a way to "wrap" the element name so that these "illegal" characters can be included in the name. Maybe something along the lines of a CDATA section for the content of an element?

In lieu of this so-called "wrapper," there are workarounds -- such as the sanitizing function or using aliases for the field names. Do you have any ideas for how to deal with this?

A: As you've evidently been learning, moving data by hand between XML and relational databases can be frustrating, intellectually stimulating, maybe even rewarding. Mostly, though, it can leave you simply hungry for help.

For almost anything involving getting XML data into or out of databases (relational or otherwise), the commonly accepted starting point is Ron Bourret's XML and Databases site. (You might also want to take a look at his piece on Mapping DTDs to Databases, published in mid-2001 here on XML.com, for specific advice about relating the structure of your database to the structure of the XML documents it needs to emit or consume. Despite the feature's emphasis on -- and presumed familiarity with -- DTDs, it's a wonderfully concise yet complete summary of the issues you'll need to consider.)

As for your specific question, you already know, or at least sense, that there's no getting around the XML Recommendation's basic requirements for element names: the ampersands and pound signs have got to go.

Unfortunately, wrapping a document's content -- including its markup -- in one big CDATA section doesn't solve your problem. Consider a document like the following:

<Book_Info><![CDATA[

   <Grantee's_ID_#>12345</Grantee's_ID_#>

  

<Book_&_Page>1,29</Book_&_Page>]]>



</Book_Info>

This may fool a casual human reader into "seeing" a Book_Info root element with Grantee's_ID_# and Book_&_Page children -- into "seeing," that is, a hierarchical tree of element nodes. (You might consider such a document an example of the trompe l'oeil school of database-to-XML mapping.)

XML is "supposed" to be human-readable, you might argue, but it's also supposed to be readable by software -- XML parsers -- which can construct a hierarchy of nodes from the document. In the case of the above document, the tree will have a single node, the Book_Info root element, containing a string of text (PCDATA) which just happens to contain not only apostrophes, pound signs, and ampersands, but also unescaped greater- and less-than symbols. The parser won't "see" the structure implied by those markup-significance characters, because wrapping them in a CDATA section explicitly instructs the parser to ignore their markup significance.

You might be able to sidestep your difficulty by forgetting about a literal mapping of database field or column names to element names. That is, you could push the database field or column names into attribute values, assigning corresponding element names either arbitrarily or according to some more or less intelligible scheme. For instance, the above document might be recast (via a method like your "sanitizing" script) as:

<table tablename="Book_Info">

<!-- Above assumes table names may present same problem as

field names. -->

   <field1

fieldname="Grantee's_ID_#">12345</field1>

   <field2

fieldname="Book_&_Page">1,29</field2>

</table>

It would then be pretty straightforward to transform this document via XSLT into either a comma-separated values text file or even an SQL statement such as:

INSERT INTO Book_Info (Grantee's_ID_#,

Book_&_Page)

VALUES ("12345", "1,29")

A portion of a stylesheet to accomplish the latter might look like this (comments in boldface):

<!-- Tell the XSLT processor not to be concerned

with outputting

well-formed XML. -->

<xsl:output method="text"/>

<!-- ...and suppress the built-in rules for processing text

nodes

in the source tree - we want text in the output only when

we

explicitly place it there. -->

<xsl:template match="text()"/>



<xsl:template match="table">

<!-- This template rule really works only if there's a

single root

element in the source tree, which maps onto a single table in

the

resulting database. Other structures would almost certainly

require 

 you to build your SQL differently. --> 

    INSERT INTO <xsl:value-of select="@tablename"/> (

   <xsl:for-each select="*">

      <!-- Note test below for the name of the source tree

element

      name at this point. It's not necessary, strictly speaking;

this

      is just to ensure that we're not getting any other data

mixed in

      with the data destined for the database. -->

      <xsl:if test="starts-with(name(), 'field')">

         <xsl:value-of select="@fieldname"/>

         <!-- If this is any field except the last, append

a comma

         to the field name. -->

         <xsl:if test="position() != last()">,

</xsl:if>

      </xsl:if>

   </xsl:for-each>)

   <!-- Building the VALUES clause structured identically

to 

    building the field name list, above. -->

   VALUES (

   <xsl:for-each select="*">

      <xsl:if test="starts-with(name(), 'field')">

         "<xsl:value-of select="."/>"<xsl:if

test="position() != last()">, </xsl:if>

      </xsl:if>

   </xsl:for-each>)

</xsl:template>

(Note: I've added indenting and line breaks for legibility of the XSLT code; if you use the code exactly as above, the resulting SQL will be syntactically valid but look a little strange.)

Numeric datatypes in XML Schema?

Q: I've got a question about the numerical datatypes long, unsignedLong, int, unsignedInt, short, unsignedShort, byte and unsignedByte. Which of these datatypes are subsets of the datatypes float and double?

I know they are all subsets of the datatype decimal. But what's with float and double? Is every long value, unsignedLong value, and so on a valid float value or a valid double value? I think it's obvious that byte and short (signed and unsigned) should be valid float and double values. But I don't know if this is true with int and long (signed and unsigned).

A: For the definitive answer to this question, I turned to Eric van der Vlist, regular XML.com contributor and author of O'Reilly's XML Schema. He summarized it this way:

Against all that we have learned in our math classes, WXS [W3C XML Schema] considers that:

long, unsignedLong, int, unsignedInt, short, unsignedShort, byte, unsignedByte and friends are derived from the "decimal" primitive datatype.

float and double are two other primitive datatypes.

This is because the first category allows comparisons without any risk of error: there is an exact matching between the lexical space (i.e., what you see in the XML document) and the value space (i.e., what the schema processor understands depending on the datatype). But the second category is approximate. (When I write 1.23E4, I have very little chance that it can be expressed as powers of two, which is the value space for float and double.)

Also in XML Q&A

From English to Dutch?

Trickledown Namespaces?

From XML to SMIL

From One String to Many

Getting in Touch with XML Contacts

Another way of looking at this is that all decimal values can be exactly represented, yes, as floats and doubles. But the reverse is not true; there are quite a few (actually, an infinite number of) floats and doubles which cannot be exactly represented as decimals. The former's approximate quality clashes with the latter's exactitude.

Van der Vlist also added a bonus clarification:

You have the same problem with QName, which is not derived from string (as opposed to NCName, which is derived from string), because its value space is different.

In short, although WXS is the product of logical, rational minds, don't assume -- especially when considering datatypes, primitive and otherwise -- that all its principles will necessarily follow the logic of everyday common sense.