Named Character Elements for XML

January 2, 2003

Anthony Coates and Zarella Rendon

Introduction

HTML users are used to having many named character entities available. They can use "&nbsp;" to insert a non-breaking space, "&copy;" to insert a copyright symbol, and "&euro;" to insert the symbol for the new European currency, the euro. XML itself, however, predefines only five entities: "&lt;", "&gt;", "&amp;", "&apos;", and "&quot;". To make any others available, you have to use a DTD that defines them, or you have to define them in the internal DTD subset of your document. Either way, you need a DOCTYPE declaration in your XML documents, which is not appropriate for documents that only need to be well-formed and for which a DTD would add work without adding value.

In particular, XML validation tools often use the presence of a DOCTYPE declaration as an indication that DTD validation should be used. This is a problem if you are using W3C XML Schemas for your validation and only need a DOCTYPE for the purposes of enabling named characters in your XML documents. It means that you will not get the validation results you expect. This is the kind of problem that can lead to a lot of wasted time for inexperienced XML users.
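
For reference, the internal-subset approach described above looks roughly like the following sketch. The three declarations are written by hand here, using the standard HTML 4 code points for these characters; a real document would declare whichever entities it actually uses.

```xml
<?xml version="1.0"?>
<!-- Illustrative only: each entity is declared in the internal DTD subset -->
<!-- and maps a name to a numeric character reference.                     -->
<!DOCTYPE html [
  <!ENTITY nbsp "&#160;">
  <!ENTITY copy "&#169;">
  <!ENTITY euro "&#8364;">
]>
<html>
  <body>
    <p>&copy;&nbsp;2003. My sandwich cost &euro;2.</p>
  </body>
</html>
```

Note that the DOCTYPE declaration is present only to carry the entity declarations, which is exactly the situation that can mislead validation tools.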

In a recent thread on the xml-dev mailing list, the W3C XML Core Working Group expressed its opinion that XML does not need a new mechanism for providing named character entities. The XML Core WG stated that, if you want to use named entities, you simply have to define them in a DTD or in the internal DTD subset. However, Tim Bray offered an alternative suggestion: the right way to give human-readable names to special characters is to define XML elements for them. You can then process these elements at the last moment and replace them with the appropriate numeric character references. The disadvantage of this approach is that it only works in element content, not in attribute values. However, it does allow you to work with purely well-formed XML, without any DTD or DOCTYPE required.
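
As a minimal sketch of this idea (not taken from any existing library), the stylesheet below replaces three character elements with the characters they name and copies everything else through. The prefix "c" and the namespace URI http://example.org/chars are hypothetical, chosen only for illustration.

```xml
<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:c="http://example.org/chars">
  <!-- Assumption: the processor supports US-ASCII output. With that     -->
  <!-- encoding, non-ASCII characters should be serialized as numeric    -->
  <!-- character references.                                             -->
  <xsl:output method="xml" encoding="US-ASCII"/>

  <!-- One template per (hypothetical) character element. -->
  <xsl:template match="c:pound"><xsl:text>&#163;</xsl:text></xsl:template>
  <xsl:template match="c:euro"><xsl:text>&#8364;</xsl:text></xsl:template>
  <xsl:template match="c:nbsp"><xsl:text>&#160;</xsl:text></xsl:template>

  <!-- Identity rule: copy all other nodes and attributes unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>
```

The named templates take effect because a pattern such as c:pound has a higher default priority than the node() pattern used by the identity rule.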

The xmlchar XSLT Library

Following Tim's suggestion, xmlchar is an XSLT library that provides named elements for all of the character entities defined in HTML 4. For example, the following XML file contains the xmlchar elements for the currency symbols for the British pound (£) and the euro (€). It also includes some non-breaking spaces to provide double-spacing between sentences:

<html xmlns:ch="http://xmlchar.sf.net/ns#">
  <body>
    <p>My sandwich cost <ch:pound/>2.</p>
    <p>Really?<ch:nbsp/> You were cheated.<ch:nbsp/>
       My sandwich only cost <ch:euro/>2.</p>
  </body>
</html>

When you apply the xmlchar stylesheets to such a document, you get the following HTML result, which contains the correct "&pound;", "&euro;", and "&nbsp;" entities:

<html>
  <body>
    <p>My sandwich cost &pound;2.</p>
    <p>Really?&nbsp; You were cheated.&nbsp;
       My sandwich only cost &euro;2.</p>
  </body>
</html>

The xmlchar stylesheets are designed to be used in combination with your existing XSLT stylesheets. Simply use <xsl:import> to pull in the html4-all.xsl stylesheet. Any character elements you have added to your document will be converted to the appropriate character entity in the output. The stylesheet that was used for this example is:

<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:import href="xmlchar-1.1/html4-all.xsl"/>
  <xsl:param name="xmlcharNsUri"
    select="'http://xmlchar.sf.net/ns#'"/>
  <xsl:output method="html"/>

  <!-- You need to explicitly call the imported xmlchar -->
  <!-- templates if they would be overridden by the     -->
  <!-- templates in the importing file,                 -->
  <!-- as is the case here.                             -->
  <!-- This can always be avoided, though.              -->
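  <!-- XSLT 1.0 does not allow variable references in match -->
  <!-- patterns, so the namespace URI is written out here.  -->
  <xsl:template
    match="*[namespace-uri() = 'http://xmlchar.sf.net/ns#']">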
    <xsl:apply-imports/>
  </xsl:template>

  <!-- Everything from here copies the non-xmlchar     -->
  <!-- content unchanged from the input to the output. -->
  <xsl:template
    match="text()|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>
  <xsl:template match="*">
    <xsl:text/>
    <xsl:element name="{name()}"
      namespace="{namespace-uri()}">
      <xsl:for-each select="@*">
        <xsl:copy/>
      </xsl:for-each>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>
</xsl:transform>
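
The comment in the stylesheet above notes that the explicit <xsl:apply-imports/> call can be avoided. One way to do so, sketched here and not tested against xmlchar itself, relies on XSLT import precedence: a stylesheet imported later takes precedence over one imported earlier. The sketch assumes the generic copy templates have been moved into a separate file, hypothetically named copy-through.xsl.

```xml
<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Hypothetical file holding the generic copy templates.  -->
  <!-- Imported first, so it has the lowest import precedence. -->
  <xsl:import href="copy-through.xsl"/>
  <!-- Imported second, so the xmlchar templates take precedence -->
  <!-- over the generic "*" template in copy-through.xsl.        -->
  <xsl:import href="xmlchar-1.1/html4-all.xsl"/>
  <xsl:param name="xmlcharNsUri"
    select="'http://xmlchar.sf.net/ns#'"/>
  <xsl:output method="html"/>
</xsl:transform>
```

Because the importing stylesheet no longer defines a template that matches the xmlchar elements, nothing overrides the imported templates and no explicit delegation is needed.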

Preserving xmlchar elements

When an XML document passes through a series of processing stages, the xmlchar elements should normally be preserved until the final stage. If your XSLT stylesheets copy elements by default, the xmlchar elements are preserved automatically.

Otherwise, you can import the xmlchar "copy" stylesheets to make sure that xmlchar elements are copied from input to output. Consider the following example:

<html xmlns:ch="http://xmlchar.sf.net/ns#">
  <head>
    <title>xmlchar 1.0 - Test</title>
  </head>
  <body>
    <h1>xmlchar 1.0 Test</h1>
    <p>My sandwich cost <ch:pound/>2.</p>
    <p>Really?<ch:nbsp/> You were cheated.<ch:nbsp/>
      My sandwich only cost <ch:euro/>2.</p>
  </body>
</html>

This is translated from XHTML+xmlchar into DocBook+xmlchar using the following stylesheet. However, since DocBook and XHTML do not use the same elements, the stylesheet cannot be set to copy content by default, so xmlchar elements will have to be mapped explicitly. To do this, the xmlchar "copy" stylesheet html4-all-copy.xsl is imported into the stylesheet.

<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:import href="xmlchar-1.1/html4-all-copy.xsl"/>
  <xsl:output method="xml"/>
  <xsl:template match="/html">
    <xsl:text/>
    <article xmlns:ch="http://xmlchar.sf.net/ns#">
      <title>
        <xsl:apply-templates select="/html/head/title/node()"/>
      </title>
      <xsl:for-each select="/html/body/h1">
        <section>
          <title>
            <xsl:apply-templates select="current()/node()"/>
          </title>
          <xsl:variable name="theTitle" select="."/>
          <xsl:apply-templates
            select = "/html/body/*[preceding-sibling::h1[text() = $theTitle]]"/>
        </section>
      </xsl:for-each>
    </article>
  </xsl:template>
  <xsl:template match="p">
    <para>
      <xsl:apply-templates/>
    </para>
  </xsl:template>
</xsl:transform>

The result is

<article xmlns:ch="http://xmlchar.sf.net/ns#">
<title>xmlchar 1.0 - Test</title>
<section>
<title>xmlchar 1.0 Test</title>
<para>My sandwich cost <ch:pound/>2.</para>
<para>Really?<ch:nbsp/> You were cheated.<ch:nbsp/>
  My sandwich only cost <ch:euro/>2.</para>
</section>
</article>

in which the xmlchar elements have been preserved as required. In a later part of the document process, the xmlchar elements would be transformed into character entities to produce standard DocBook output.
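
To see why a "copy" stylesheet is needed here, recall that XSLT's built-in rule for elements only processes their children. The xmlchar elements are empty, so without a matching template they would silently disappear from the output. The following is a hypothetical hand-written stand-in for the imported stylesheet, not its actual contents:

```xml
<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:ch="http://xmlchar.sf.net/ns#">
  <!-- Assumed equivalent of the "copy" stylesheet: pass every -->
  <!-- element in the xmlchar namespace through unchanged.     -->
  <xsl:template match="ch:*">
    <xsl:copy/>
  </xsl:template>
</xsl:transform>
```

Since xmlchar elements have no attributes or children, a bare <xsl:copy/> is enough to reproduce them.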

Converting legacy documents with character entities

If you want to convert legacy XML documents containing named HTML 4 character entities to use the xmlchar elements instead, you can use the xmlchar entity definitions. These expand the entities into their matching xmlchar elements.

Warning: the xmlchar entity definitions must not be used on XML documents that contain character entities in attribute values. Doing so produces XML that is not well-formed.
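
The reason is a well-formedness constraint in XML: the replacement text of an entity referenced in an attribute value must not contain "<". The sketch below assumes that each xmlchar entity expands to an empty element, as the conversion described above implies; the actual declarations in the shipped entity file may be written differently. The document is deliberately not well-formed, to show the failing case.

```xml
<!DOCTYPE p [
  <!-- Assumed shape of one xmlchar entity declaration. -->
  <!ENTITY pound "<ch:pound/>">
]>
<p xmlns:ch="http://xmlchar.sf.net/ns#">
  <!-- Allowed: the entity expands to an element in element content. -->
  <span>&pound;2</span>
  <!-- Error: the replacement text contains "<", which is not -->
  <!-- permitted inside an attribute value.                   -->
  <span title="&pound;2">sandwich</span>
</p>
```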

Consider the following legacy document, which contains named character entities. The document has been modified to import the xmlchar entity definitions through a parameter entity in its internal DTD subset.

<!DOCTYPE html [
<!ENTITY % html.4.entities
  SYSTEM "xmlchar-1.1/html4-all.ent">
%html.4.entities;
]>
<html xmlns:ch="http://xmlchar.sf.net/ns#">
  <head>
    <title>xmlchar 1.1 - Test</title>
  </head>
  <body>
    <h1>xmlchar 1.1 Test</h1>
    <p>My sandwich cost &pound;2.</p>
    <p>Really?&nbsp; You were cheated.&nbsp;
      My sandwich only cost &euro;2.</p>
  </body>
</html>

The entities will be expanded into xmlchar elements when this file is parsed. To show that it works, when this file is transformed by the following "copy-through" XSLT stylesheet

<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="ISO-8859-1"/>
  <xsl:template match="/html">
    <xsl:text/>
    <html xmlns:ch="http://xmlchar.sf.net/ns#">
      <xsl:apply-templates/>
    </html>
  </xsl:template>
  <xsl:template
    match="text()|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>
  <xsl:template match="*">
    <xsl:text/>
    <xsl:element name="{name()}"
      namespace="{namespace-uri()}">
      <xsl:for-each select="@*">
        <xsl:copy/>
      </xsl:for-each>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>
</xsl:transform>

the result is

<html xmlns:ch="http://xmlchar.sf.net/ns#">
  <head>
    <title>xmlchar 1.1 - Test</title>
  </head>
  <body>
    <h1>xmlchar 1.1 Test</h1>
    <p>My sandwich cost <ch:pound/>2.</p>
    <p>Really?<ch:nbsp/> You were cheated.<ch:nbsp/>
      My sandwich only cost <ch:euro/>2.</p>
  </body>
</html>

where the character entities have been converted to xmlchar elements as required.

Conclusion

Named character elements provide a natural way to use named special characters in XML documents, although they only work for element content and not for attribute values. The xmlchar XSLT library provides element equivalents for all of the special characters from HTML 4.