XML.com: XML From the Inside Out


Translating XML Documents with xml:tm

January 07, 2004

A Russian translation of this article is available at xmlhack.ru


Sooner or later someone will want to have your XML document translated into another language. In fact, XML documents are much easier to translate than other electronic documents because they separate form from content and conform to a rigorous standard with a defined syntax. There are various approaches to improving the translation process.

Machine Translation

Language technology has had a mixed history over the past 40 years. The early promises of cheap automated translation soon led to disillusionment, and the technology was effectively reduced to a marginal role: providing "gisting" information for certain foreign language texts. There have been significant advances in language technology since then, and we all benefit from these on a day-to-day basis when we use spelling and grammar checkers and complex search engines. Nevertheless, we are still a long way from usable machine translation of free-format text, although there has been some success with very tightly controlled text in very narrow domains.

Translation memory

To reduce translation costs in an environment where documentation changes frequently to reflect improvements and innovation during a product lifecycle, the best answer to date has been translation memory. Compared with machine translation this is a relatively primitive approach to language technology, but it can bring considerable benefits.

Translation memory works by aligning previously translated text in a target language with the source language. This is accomplished either with a manual tool or automatically by using a controlled environment for the translation process. Alignment is usually done at the sentence level, which affords the best usable granularity. The aligned source and target text is held in a repository. The next time the document is updated, the repository is searched to locate any text that has not changed. Where such a sentence is identified, the source language text can be replaced with the target language text. This low-tech method has nevertheless provided benefits in terms of translation consistency and reduced costs.
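The lookup described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the sentence splitter is naive, alignment is assumed to be one-to-one, and the function names (`split_sentences`, `build_memory`, `leverage`) are invented for this example rather than taken from any translation memory product.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_memory(source_text, target_text):
    """Align source and target sentences one-to-one and store them as pairs."""
    source = split_sentences(source_text)
    target = split_sentences(target_text)
    if len(source) != len(target):
        # Real alignment tools handle merged or split sentences; this sketch does not.
        raise ValueError("sentence counts differ; manual alignment is needed")
    return dict(zip(source, target))


def leverage(memory, updated_source_text):
    """Pair each sentence of the updated document with its stored translation.

    Sentences with no exact match get None and still need a translator.
    """
    return [(s, memory.get(s)) for s in split_sentences(updated_source_text)]


memory = build_memory(
    "Press the red button. Wait for the light.",
    "Appuyez sur le bouton rouge. Attendez le voyant.",
)
result = leverage(memory, "Press the red button. Wait for the green light.")
for sentence, translation in result:
    print(sentence, "->", translation)
```

The unchanged first sentence is filled in from the repository, while the edited second sentence finds no exact match and is left for translation.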

The main weakness of this approach is that how a piece of text is translated into a given target language can depend on its context. When text is pulled in from a translation memory repository it does not possess any of the context within which it existed in the original document. Because there is no contextual information regarding the target language text, a translator is still required to proofread the matched text and adapt it if required. The proofreading process, although less expensive than straightforward translation, still consumes time and money.

Translating XML Documents

The approach to translating XML documents to date has been to extract the translatable text and attributes into an external, typically proprietary format, where translation memory matches are performed on the data. On completion of the translation process the newly translated sentences are written to traditional non-standard translation memory repositories. XML in these sorts of environments is treated merely as yet another encoding format.
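As a rough illustration of the extraction step, the following Python sketch pulls text content and selected attribute values out of an XML document using the standard library. The sample document, the function name `extract_translatable`, and the choice of `title` and `alt` as translatable attributes are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def extract_translatable(xml_string, translatable_attrs=("title", "alt")):
    """Collect (element tag, kind, text) for every translatable string.

    Which attributes count as translatable is an assumption here; in
    practice that depends on the document type.
    """
    root = ET.fromstring(xml_string)
    units = []
    for elem in root.iter():  # document order
        if elem.text and elem.text.strip():
            units.append((elem.tag, "text", elem.text.strip()))
        for name in translatable_attrs:
            if name in elem.attrib:
                units.append((elem.tag, "@" + name, elem.attrib[name]))
        # Text that follows an element's end tag belongs to its parent's content.
        if elem.tail and elem.tail.strip():
            units.append((elem.tag, "tail", elem.tail.strip()))
    return units


sample = (
    "<doc><title>User guide</title>"
    '<img alt="Front panel"/>'
    "<p>Press <b>Start</b> to begin.</p></doc>"
)
units = extract_translatable(sample)
for unit in units:
    print(unit)
```

Note how the inline `b` element splits one sentence into three fragments. Once the text leaves the document like this, its surrounding structure is no longer available to the translation tools.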

Special mention must be made here of some important XML-based standards concerning translation technology:

  1. The OASIS XLIFF (XML Localisation Interchange File Format - http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xliff) specification, which provides an XML framework for interchanging translatable text from any native format. More about XLIFF later.

  2. LISA (the Localisation Industry Standards Association) provides many XML-based initiatives under the auspices of its OSCAR working committee (Open Standards for Container/Content Allowing Re-use - http://www.lisa.org/oscar), which include, amongst others, TMX (http://www.lisa.org/tmx/tmx.htm). TMX allows for the interchange of translation memories using XML.

All of these excellent standards address the interchange of information using XML rather than the actual translation of XML documents.
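To give a feel for what such an interchange file looks like, here is a minimal Python sketch that writes aligned sentence pairs as a TMX document. It assumes the element layout of TMX 1.4 (`tmx`, `header`, `body`, `tu`, `tuv`, `seg`); the header values, including the tool name, are placeholders invented for this example, and the output has not been checked against the TMX DTD.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the reserved xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def to_tmx(pairs, src_lang, tgt_lang):
    """Serialize (source, target) sentence pairs as a TMX-style document."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",  # placeholder, not a real tool
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "unknown",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


tmx_text = to_tmx(
    [("Press the red button.", "Appuyez sur le bouton rouge.")],
    src_lang="en",
    tgt_lang="fr",
)
print(tmx_text)
```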


xml:tm is a new approach to the problem of translation for XML documents. It is an XML namespace-based syntax that uses the power of XML to embed additional information within the XML document itself.

At the core of xml:tm is the concept of “text memory”. Text memory is made up of two components:

  1. Author Memory

  2. Translation Memory

Author Memory

XML namespace is used to map a text memory view onto a document. This process is called segmentation. The text memory view works at the sentence level -- the text unit. Each individual xml:tm text unit is allocated a unique identifier. This unique identifier is immutable for the life of the document. As a document goes through its life cycle the unique identifiers are maintained and new ones are allocated as required. This aspect of text memory is called author memory. It can be used to build author memory systems that simplify authoring and improve its consistency.
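Segmentation can be sketched as follows in Python. This is a simplified illustration, not the xml:tm implementation: it uses the `urn:xmlintl-tm-tags` namespace URI and the `e1` / `u1.1` identifier pattern from the example in this article, but the sentence splitter is naive, mixed content (text interleaved with child elements) is ignored, and identifiers are allocated afresh on each run. A real system would have to persist them so that they stay immutable across revisions.

```python
import re
import xml.etree.ElementTree as ET

TM_NS = "urn:xmlintl-tm-tags"  # namespace URI as used in the article's example
ET.register_namespace("tm", TM_NS)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def segment(root):
    """Wrap the leading text of each text-bearing element in tm:te / tm:tu."""
    te_count = 0
    # Snapshot the tree first so the elements added below are not revisited.
    for elem in list(root.iter()):
        if not (elem.text and elem.text.strip()):
            continue
        te_count += 1
        te = ET.Element(f"{{{TM_NS}}}te", id=f"e{te_count}")
        for n, sentence in enumerate(split_sentences(elem.text), start=1):
            tu = ET.SubElement(te, f"{{{TM_NS}}}tu", id=f"u{te_count}.{n}")
            tu.text = sentence
        elem.text = None
        elem.insert(0, te)
    return root


doc = ET.fromstring(
    "<doc><p>It makes extensive use of XML namespace. "
    "The tm stands for text memory.</p><p>Author memory</p></doc>"
)
segmented = segment(doc)
print(ET.tostring(segmented, encoding="unicode"))
```

Each paragraph receives one text element, and each sentence inside it becomes a text unit whose identifier combines the text element number with the sentence's position.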

The following diagram shows how the tm namespace maps onto an existing XML document:

Figure 1: xml:tm mapping diagram

In this diagram "te" stands for "text element" (an XML element that contains text) and "tu" stands for "text unit" (a single sentence or stand alone piece of text).

The following is an example of part of an xml:tm document. The xml:tm elements (those with the tm: prefix) show how xml:tm maps onto an existing XML document:

<?xml version="1.0" encoding="UTF-8" ?>
xmlns:tm="urn:xmlintl-tm-tags" xmlns:xlink="http://www.w3.org/1999/xlink">
<text:p text:style-name="Text body">
  <tm:te id="e1" tuval="2">
    <tm:tu id="u1.1">Xml:tm is a revolutionary technology for dealing with the problems of translation memory for XML documents by using XML techniques to embed memory directly into the XML documents themselves.</tm:tu>
    <tm:tu id="u1.2">It makes extensive use of XML namespace.</tm:tu>
  </tm:te>
</text:p>
<text:p text:style-name="Text body">
  <tm:te id="e2">
    <tm:tu id="u2.1">The “tm” stands for “text memory”.</tm:tu>
    <tm:tu id="u2.2">There are two aspects to text memory:</tm:tu>
  </tm:te>
</text:p>
<text:ordered-list text:continue-numbering="false" text:style-name="L1">
  <text:p text:style-name="P3">
    <tm:te id="e3">
      <tm:tu id="u3.1">Author memory</tm:tu>
    </tm:te>
  </text:p>
  <text:p text:style-name="P3">
    <tm:te id="e4">
      <tm:tu id="u4.1">Translation memory</tm:tu>
    </tm:te>
  </text:p>
</text:ordered-list>

And the composed document:

Figure 2: Composed Document
