Named Character Elements for XML

by Anthony Coates, Zarella Rendon
January 02, 2003

Introduction

HTML users are used to having a large set of named character entities available. They can write "&nbsp;" to insert a non-breaking space, "&copy;" to insert a copyright symbol, and "&euro;" to insert the symbol for the new European currency, the Euro. XML, however, predefines only five entities (&lt;, &gt;, &amp;, &apos;, and &quot;). To make any others available, you have to use a DTD that defines them, or define them yourself in the internal DTD subset of your document. Either way, you need a DOCTYPE declaration in your XML documents. That is a poor fit for documents that only need to be well-formed, where a DTD would add work without adding value.
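
As a minimal illustration of the internal-subset route, the three entities mentioned above could be declared as follows. The document content is invented for the example; the code points are the standard Unicode values for these characters.

```xml
<?xml version="1.0"?>
<!DOCTYPE p [
  <!ENTITY nbsp "&#160;">   <!-- U+00A0 NO-BREAK SPACE -->
  <!ENTITY copy "&#169;">   <!-- U+00A9 COPYRIGHT SIGN -->
  <!ENTITY euro "&#8364;">  <!-- U+20AC EURO SIGN -->
]>
<p>&copy;&nbsp;2003 Example Ltd. Price: &euro;2.</p>
```

Even this small document now carries a DOCTYPE, which is the overhead discussed here.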

In particular, XML validation tools often treat the presence of a DOCTYPE declaration as a signal that DTD validation should be used. This is a problem if you validate with W3C XML Schema and need a DOCTYPE only to enable named characters in your XML documents: you will not get the validation results you expect. This kind of problem can cost inexperienced XML users a lot of time.

In a recent thread on the xml-dev mailing list, the W3C XML Core Working Group expressed its opinion that XML does not need a new mechanism for providing named character entities. The XML Core WG stated that, if you want to use named entities, you simply have to define them in a DTD or in the internal DTD subset. However, Tim Bray offered an alternative suggestion: the right way to give human-readable names to special characters is to define XML elements for them. You can then process these elements at the last moment and replace them with the appropriate numeric character references. The disadvantage of this approach is that it only works in element content, not in attribute values. However, it does allow you to work with purely well-formed XML, with no DTD or DOCTYPE required.
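
To make the idea concrete, here is a minimal XSLT 1.0 sketch of that last-moment replacement. It is an illustration only, not code from any library: the namespace URI http://example.org/chars and the element names c:pound, c:euro, and c:nbsp are hypothetical. It relies on the XSLT 1.0 rule that the xml output method should write characters that the chosen encoding cannot represent as character references.

```xml
<?xml version="1.0"?>
<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:c="http://example.org/chars">
  <!-- c: is a hypothetical namespace used only for this sketch. -->

  <!-- With US-ASCII output, the serializer should write non-ASCII
       characters as numeric character references. -->
  <xsl:output method="xml" encoding="US-ASCII"/>

  <!-- Each named character element becomes the character it names. -->
  <xsl:template match="c:pound"><xsl:text>&#163;</xsl:text></xsl:template>
  <xsl:template match="c:euro"><xsl:text>&#8364;</xsl:text></xsl:template>
  <xsl:template match="c:nbsp"><xsl:text>&#160;</xsl:text></xsl:template>

  <!-- Everything else is copied through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>
```

The named templates win over the copy-through template because a name test such as c:pound has a higher default priority than node().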

The xmlchar XSLT Library

Following Tim's suggestion, xmlchar is an XSLT library which provides named elements for all of the character entities defined in HTML 4. For example, the following XML file contains the xmlchar elements for the currency symbols for the British Pound (£) and the European Euro (€). It also includes some non-breaking spaces to provide double-spacing between sentences:

<html xmlns:ch="http://xmlchar.sf.net/ns#">
  <body>
    <p>My sandwich cost <ch:pound/>2.</p>
    <p>Really?<ch:nbsp/> You were cheated.<ch:nbsp/>
       My sandwich only cost <ch:euro/>2.</p>
  </body>
</html>

When you apply the xmlchar stylesheets to such a document, you get this HTML result, which has the correct HTML "&pound;", "&euro;", and "&nbsp;" entities.

<html>
  <body>
    <p>My sandwich cost &pound;2.</p>
    <p>Really?&nbsp; You were cheated.&nbsp;
       My sandwich only cost &euro;2.</p>
  </body>
</html>

The xmlchar stylesheets are designed to be used in combination with your existing XSLT stylesheets. Simply use <xsl:import> to pull in the html4-all.xsl stylesheet. Any character elements you have added to your document will be converted to the appropriate character entity in the output. The stylesheet that was used for this example is:

<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:import href="xmlchar-1.1/html4-all.xsl"/>
  <xsl:param name="xmlcharNsUri"
    select="'http://xmlchar.sf.net/ns#'"/>
  <xsl:output method="html"/>

  <!-- You need to explicitly call the imported xmlchar -->
  <!-- templates if they would be overridden by the     -->
  <!-- templates in the importing file,                 -->
  <!-- as is the case here.                             -->
  <!-- This can always be avoided, though.              -->
  <xsl:template match="*[namespace-uri() = $xmlcharNsUri]">
    <xsl:apply-imports/>
  </xsl:template>

  <!-- Everything from here copies the non-xmlchar     -->
  <!-- content unchanged from the input to the output. -->
  <xsl:template
    match="text()|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>
  <xsl:template match="*">
    <xsl:text/>
    <xsl:element name="{name()}"
      namespace="{namespace-uri()}">
      <xsl:for-each select="@*">
        <xsl:copy/>
      </xsl:for-each>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>
</xsl:transform>
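
The comment in the stylesheet says the explicit <xsl:apply-imports/> call can be avoided. One way to do so, sketched here under assumptions, is to move the generic copy templates into a separate file and import it before the xmlchar stylesheet. In XSLT 1.0, a stylesheet imported later has higher import precedence than one imported earlier, so the xmlchar templates then take priority for xmlchar elements. The file name copy-through.xsl is hypothetical, and the xmlcharNsUri parameter is simply carried over from the stylesheet above; whether the library itself requires it is not stated in the article.

```xml
<?xml version="1.0"?>
<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Hypothetical file holding the two copy templates shown above.
       Imported first, so it has the lowest import precedence. -->
  <xsl:import href="copy-through.xsl"/>
  <!-- Imported second, so its templates override copy-through.xsl. -->
  <xsl:import href="xmlchar-1.1/html4-all.xsl"/>
  <!-- Carried over unchanged from the article's stylesheet. -->
  <xsl:param name="xmlcharNsUri"
    select="'http://xmlchar.sf.net/ns#'"/>
  <xsl:output method="html"/>
</xsl:transform>
```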

Preserving xmlchar elements

When an XML document passes through a series of processing stages, you will normally want the xmlchar elements to be preserved until the final stage. If your XSLT stylesheets have been designed to copy elements by default, then the xmlchar elements will be preserved as required.

Otherwise, you can import the xmlchar "copy" stylesheets to make sure that xmlchar elements are copied from input to output. Consider the following example:

<html xmlns:ch="http://xmlchar.sf.net/ns#">
  <head>
    <title>xmlchar 1.0 - Test</title>
  </head>
  <body>
    <h1>xmlchar 1.0 Test</h1>
    <p>My sandwich cost <ch:pound/>2.</p>
    <p>Really?<ch:nbsp/> You were cheated.<ch:nbsp/>
      My sandwich only cost <ch:euro/>2.</p>
  </body>
</html>

The following stylesheet translates this document from XHTML+xmlchar into DocBook+xmlchar. Because DocBook and XHTML do not use the same elements, the stylesheet cannot simply copy content by default, so the xmlchar elements have to be handled explicitly. To do this, the stylesheet imports the xmlchar "copy" stylesheet html4-all-copy.xsl.

<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:import href="xmlchar-1.1/html4-all-copy.xsl"/>
  <xsl:output method="xml"/>
  <xsl:template match="/html">
    <xsl:text/>
    <article xmlns:ch="http://xmlchar.sf.net/ns#">
      <title>
        <xsl:apply-templates select="/html/head/title/node()"/>
      </title>
      <xsl:for-each select="/html/body/h1">
        <section>
          <title>
            <xsl:apply-templates select="current()/node()"/>
          </title>
          <xsl:variable name="theTitle" select="."/>
          <xsl:apply-templates
            select = "/html/body/*[preceding-sibling::h1[text() = $theTitle]]"/>
        </section>
      </xsl:for-each>
    </article>
  </xsl:template>
  <xsl:template match="p">
    <para>
      <xsl:apply-templates/>
    </para>
  </xsl:template>
</xsl:transform>

The result is

<article xmlns:ch="http://xmlchar.sf.net/ns#">
<title>xmlchar 1.0 - Test</title>
<section>
<title>xmlchar 1.0 Test</title>
<para>My sandwich cost <ch:pound/>2.</para>
<para>Really?<ch:nbsp/> You were cheated.<ch:nbsp/>
  My sandwich only cost <ch:euro/>2.</para>
</section>
</article>

in which the xmlchar elements have been preserved as required. At a later stage of document processing, the xmlchar elements would be transformed into characters or character references to produce standard DocBook output.
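
The contents of html4-all-copy.xsl are not shown in this article. As a rough sketch of the idea, and not a reproduction of that file, a single rule along these lines would have the same effect. Without such a rule, the built-in template for elements would apply to an element like <ch:pound/>, and since it has no children nothing would be written to the output.

```xml
<?xml version="1.0"?>
<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:ch="http://xmlchar.sf.net/ns#">
  <!-- Assumed sketch only; the real xmlchar "copy" stylesheet may differ. -->
  <!-- Copy any element in the xmlchar namespace to the output as-is. -->
  <xsl:template match="ch:*">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:transform>
```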

Converting legacy documents with character entities

If you want to convert legacy XML documents containing named HTML 4 character entities to use the xmlchar elements instead, you can use the xmlchar entity definitions. These expand the entities into their matching xmlchar elements.

Warning: the xmlchar entity definitions must not be used on XML documents that contain character entities in attribute values. An entity that expands to an element is not allowed inside an attribute value, so such documents are not well-formed.
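
The following deliberately broken document sketches why. The entity definition is hypothetical, written in the style the text describes rather than copied from the xmlchar files, and the img element is invented for the example. XML forbids a "<" in the replacement text of any entity referenced from an attribute value, so a conforming parser must reject the alt attribute.

```xml
<?xml version="1.0"?>
<!DOCTYPE html [
  <!-- Hypothetical definition: the entity expands to an element. -->
  <!ENTITY pound "<ch:pound/>">
]>
<html xmlns:ch="http://xmlchar.sf.net/ns#">
  <body>
    <!-- Fine: in element content the reference becomes a child element. -->
    <p>Price: &pound;2</p>
    <!-- Not well-formed: the replacement text contains "<",
         which is not allowed in an attribute value. -->
    <img src="menu.png" alt="Price: &pound;2"/>
  </body>
</html>
```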

Consider the following legacy document, which contains named character entities. The document has been modified to import the xmlchar entity definitions through its internal DTD subset.

<!DOCTYPE html [
<!ENTITY % html.4.entities
  SYSTEM "xmlchar-1.1/html4-all.ent">
%html.4.entities;
]>
<html xmlns:ch="http://xmlchar.sf.net/ns#">
  <head>
    <title>xmlchar 1.1 - Test</title>
  </head>
  <body>
    <h1>xmlchar 1.1 Test</h1>
    <p>My sandwich cost &pound;2.</p>
    <p>Really?&nbsp; You were cheated.&nbsp;
      My sandwich only cost &euro;2.</p>
  </body>
</html>

The entities will be expanded into xmlchar elements when this file is parsed. To show that it works, when this file is transformed by the following "copy-through" XSLT stylesheet

<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="ISO-8859-1"/>
  <xsl:template match="/html">
    <xsl:text/>
    <html xmlns:ch="http://xmlchar.sf.net/ns#">
      <xsl:apply-templates/>
    </html>
  </xsl:template>
  <xsl:template
    match="text()|comment()|processing-instruction()">
    <xsl:copy/>
  </xsl:template>
  <xsl:template match="*">
    <xsl:text/>
    <xsl:element name="{name()}"
      namespace="{namespace-uri()}">
      <xsl:for-each select="@*">
        <xsl:copy/>
      </xsl:for-each>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>
</xsl:transform>

the result is

<html xmlns:ch="http://xmlchar.sf.net/ns#">
  <head>
    <title>xmlchar 1.1 - Test</title>
  </head>
  <body>
    <h1>xmlchar 1.1 Test</h1>
    <p>My sandwich cost <ch:pound/>2.</p>
    <p>Really?<ch:nbsp/> You were cheated.<ch:nbsp/>
      My sandwich only cost <ch:euro/>2.</p>
  </body>
</html>

where the character entities have been converted to xmlchar elements as required.

Conclusion

Named character elements provide a natural way to use named special characters in XML documents, although they only work for element content and not for attribute values. The xmlchar XSLT library provides element equivalents for all of the special characters from HTML 4.


  • Character entities are not needed
    2003-01-08 04:45:45 Lars Marius Garshol

    What is really needed is better editors. That's all. If editors make it easy to enter and identify obscure characters the whole issue just goes away. And, frankly, I wish it would. Character entities are a pain, and I'm not at all convinced character elements are any nicer.

    • Character entities are not needed
      2003-01-08 07:07:15 Anthony Coates

      I don't disagree with you, but we wanted to deliver a here-and-now solution. A lot of XML is still produced with vi and Notepad, and there is still a need to produce XML on systems that do not support Unicode. Cheers, Tony.

      • Character entities are not needed
        2003-01-12 21:07:52 Eric Schwarzenbach

        I think a better here-and-now solution is to simply stop using weak editors like vi and notepad. There ARE plenty of editors that provide workable solutions for entering such characters. The way they provide it may not be perfectly convenient and optimal but they are workable, and no less convenient, I think, than choosing from a new set of character elements.


        Editplus or any of a dozen other notepad replacements would be one reasonable solution for a low budget PC environment.

      • Character entities are not needed
        2003-01-09 14:21:22 Lars Marius Garshol

        As such I think it has merit, but, still, it doesn't hurt to try to look a bit further ahead. It would be good if people could start pestering their editor vendors for better solutions to this problem.

  • Many elements...
    2003-01-06 07:42:51 David Carlisle

    Having many extra elements (around 2000 more if you did this for MathML rather than XHTML) does cause some problems if you do wish to validate the document at any point using schema or dtd.
    Even if the end result is delivered as (just) well formed XML it is often useful to have a dtd to constrain authoring.


    Some drafts of MathML2 had a <mchar name="..."> element that had similar effect to this xmlchar proposal, but used a single element with an attribute. DocBook has a more or less similar uchar proposal. This has a much less drastic impact on a DTD (especially if the DTD doesn't constrain the name values).



    For these reasons mchar was removed from MathML2 at last call. It was understood that XML core WG would look at the problem.


    The "XML Core WG View" document being the rather unsatisfactory outcome....





    • Many elements...
      2003-01-08 02:24:28 Anthony Coates

      Actually, I would have thought 2000 elements would be manageable these days. The whole of Unicode in one include file would be a problem. However, we built xmlchar so that you don't have to include all of the HTML4 characters unless you want them all. Similarly, nobody is going to use all 2000 MathML characters in one document, so grouping them into, say, 10 groups would make them manageable using the xmlchar approach. I'm a physicist by training, and used LaTeX for years, so I'm quite familiar with what is required for mathematical typesetting. Cheers, Tony.

  • Include xmlchar in XSLT Standard Library?
    2003-01-05 13:02:34 Michael Strasser

    Perhaps xmlchar should be included in the XSLT Standard Library


    (BTW, double spaces between sentences is not good typographical practice. It is one of the many hangovers from typewriters.)



    • Include xmlchar in XSLT Standard Library?
      2003-01-08 02:19:40 Anthony Coates

      I will look into getting cross links between the XSLT Standard Library site & the xmlchar sites. As for double spacing, I'm not American, and I had never seen French spacing (single spacing) until I started reading Web pages, where the lack of sentence markup makes double spacing impractical. I still don't like the visual look of single spacing. Cheers, Tony.

  • More on avoiding human-readable text in attributes
    2003-01-04 14:35:32 Micah Dubinko

    I did a small bit of research on what XHTML 2.0 would look like if all the text currently allowed in attributes (title, alt, etc.) were moved to elements. The results are at:
    http://dubinko.info/writing/elemental/


    My conclusion: this wouldn't be as painful as it first seems, and it would allow something like named character elements to be a core part of the language.


    .micah