XML.com: XML From the Inside Out

Translating XML Documents with xml:tm
by Andrzej Zydron

Matching with xml:tm

xml:tm provides much more focused types of matching than traditional translation memory systems. The following types of matching are available:

  1. Perfect matching

    Author memory provides exact details of any changes to a document. Where text units have not changed since a previously translated version of the document, we can say that we have a “perfect match”. The concept of perfect matching is an important one. With traditional translation memory systems a translator still has to proofread each match, as there is no way to ascertain the appropriateness of the match. Proofreading has to be paid for, typically at 60% of the standard translation cost. With perfect matching there is no need to proofread, which saves on the cost of translation.

  2. In-document leveraged matching

    xml:tm can also be used to find in-document leveraged matches which will be more appropriate to a given document than normal translation memory leveraged matches.

  3. Leveraged matching

    When an xml:tm document is translated the translation process provides perfectly aligned source and target language text units. These can be used to create traditional translation memories, but in a consistent and automatic fashion.

  4. In-document fuzzy matching

    During the maintenance of author memory a note can be made of text units that have only changed slightly. If a corresponding translation exists for the previous version of the source text unit, then the previous source and target versions can be offered to the translator as a type of close fuzzy match.

  5. Fuzzy matching

    The text units contained in the leveraged memory database can also be used to provide fuzzy matches of similar previously translated text. In practice fuzzy matching is of little use to translators except for instances where the text units are fairly long and the differences between the original and current sentence are very small.

  6. Non-translatable text

    In technical documents you can often find a large number of text units that are made up solely of numeric, alphanumeric, punctuation or measurement items. With xml:tm these can be identified during authoring and flagged as non-translatable, thus reducing the word counts. For text units that contain only numbers or measurements it is also possible to automatically convert the decimal and thousands designators as required by the target language.
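The six match types above can be read as a cascade that is tried in order for each text unit. The following Python sketch illustrates that idea only; it is not the Xml-Intl implementation. The function names, the dictionary-based stores, the use of `difflib` for similarity, and the 0.85 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.85  # assumed cut-off; xml:tm does not prescribe a value


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def classify_unit(unit_id, text, previous, in_document, leveraged):
    """Pick a match type for one text unit, following the order of the list above.

    previous    -- {unit_id: (source, target)} from the last translated version
    in_document -- {source: target} for text units already seen in this document
    leveraged   -- {source: target} built from earlier aligned translations
    """
    old = previous.get(unit_id)
    # 1. Perfect match: same unit, unchanged text.
    if old and old[0] == text:
        return "perfect", old[1]
    # 2. In-document leveraged match: identical text elsewhere in the document.
    if text in in_document:
        return "in-document leveraged", in_document[text]
    # 3. Leveraged match: identical text in the leveraged memory.
    if text in leveraged:
        return "leveraged", leveraged[text]
    # 4. In-document fuzzy match: same unit, text changed only slightly.
    if old and similarity(old[0], text) >= FUZZY_THRESHOLD:
        return "in-document fuzzy", old[1]
    # 5. Fuzzy match: the most similar entry in the leveraged memory.
    best = max(leveraged, key=lambda s: similarity(s, text), default=None)
    if best is not None and similarity(best, text) >= FUZZY_THRESHOLD:
        return "fuzzy", leveraged[best]
    return "no match", None
```

Non-translatable text units (item 6) would be filtered out before this cascade runs, since they never reach the translator.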

The following is an example of non-translatable text in xml:tm:

...
<text:list-header>
<text:p text:style-name="P9">
<tm:te id="e41">
<tm:tu id="u41.1"> Some new text with examples of text that does not require translation: </tm:tu>
</tm:te>
</text:p>
</text:list-header>
</text:ordered-list>
<text:p text:style-name="Hanging indent">
<tm:te id="e42">
<tm:tu id="u42.1" type="measure"> 10 mm </tm:tu>
</tm:te>
</text:p>
<text:p text:style-name="Hanging indent">
<tm:te id="e43">
<tm:tu id="u43.1" type="measure"> 10.50 m </tm:tu>
</tm:te>
</text:p>
<text:p text:style-name="Hanging indent">
<tm:te id="e44">
<tm:tu id="u44.1" type="numeric"> 10,000 </tm:tu>
</tm:te>
</text:p>
<text:p text:style-name="P13">
<tm:te id="e45">
<tm:tu id="u45.1" type="numeric"> 9.956 </tm:tu>
</tm:te>
</text:p>
<text:p text:style-name="P13">
<tm:te id="e46">
<tm:tu id="u46.1" type="alphanum"> ABC104/EF </tm:tu>
</tm:te>
</text:p>
...

And an example of the composed text:

Figure 5: Composed Non-Translatable Text
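The `type` values shown in the fragment above (`measure`, `numeric`, `alphanum`) could be assigned by simple pattern tests, and the separator conversion mentioned earlier is then a character swap. The Python sketch below is a rough illustration under stated assumptions: the regular expressions, the short list of units, and the function names are hypothetical, and the source text is assumed to use English-style separators (comma for thousands, full stop for decimals).

```python
import re

# Assumed patterns; a real implementation would need locale-aware rules.
NUMBER = r"\d{1,3}(?:,\d{3})*(?:\.\d+)?|\d+(?:\.\d+)?"
UNITS = r"(?:mm|cm|m|km|g|kg)"  # illustrative subset only


def classify(text):
    """Return 'numeric', 'measure', 'alphanum', or None for translatable text."""
    s = text.strip()
    if re.fullmatch(NUMBER, s):
        return "numeric"
    if re.fullmatch(rf"(?:{NUMBER})\s*{UNITS}", s):
        return "measure"
    # Codes such as part numbers: upper-case letters, digits, a few symbols.
    if re.fullmatch(r"[A-Z0-9/.\-]+", s) and re.search(r"\d", s):
        return "alphanum"
    return None


def localize_number(text, decimal=",", thousands="."):
    """Swap English-style separators for the target language's designators."""
    table = {",": thousands, ".": decimal}
    return re.sub(r"[.,]", lambda m: table[m.group()], text)


for unit in [" 10 mm ", " 10.50 m ", " 10,000 ", " 9.956 ", " ABC104/EF "]:
    kind = classify(unit)
    shown = localize_number(unit) if kind in ("numeric", "measure") else unit
    print(kind, shown.strip())
```

Only `numeric` and `measure` units are passed through `localize_number`; alphanumeric codes are left untouched because a full stop or comma in a part number is not a separator.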

Word Counts

The output from the text extraction process can be used to generate automatic word and match counts by the customer. This puts the customer in control of the word counts, rather than the supplier. This is an important distinction and allows for a tighter control of costs.
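As a rough illustration of customer-side counting, the Python sketch below walks the `tm:tu` elements of a document and totals words separately for translatable and non-translatable units. It makes several assumptions: the namespace URI is a placeholder rather than the real xml:tm URI, words are counted by splitting on whitespace, and the sample document and the `word_counts` function are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TM_NS = "urn:example:tm"  # placeholder; substitute the real xml:tm namespace URI

SAMPLE = f"""<doc xmlns:tm="{TM_NS}">
  <tm:te id="e41"><tm:tu id="u41.1">Some new text for translation:</tm:tu></tm:te>
  <tm:te id="e42"><tm:tu id="u42.1" type="measure">10 mm</tm:tu></tm:te>
  <tm:te id="e44"><tm:tu id="u44.1" type="numeric">10,000</tm:tu></tm:te>
</doc>"""

NON_TRANSLATABLE = {"measure", "numeric", "alphanum"}


def word_counts(xml_text):
    """Count whitespace-separated words per category across all tm:tu elements."""
    counts = Counter()
    root = ET.fromstring(xml_text)
    for tu in root.iter(f"{{{TM_NS}}}tu"):
        words = len("".join(tu.itertext()).split())
        kind = tu.get("type")
        key = "non-translatable" if kind in NON_TRANSLATABLE else "translatable"
        counts[key] += words
    return counts


print(dict(word_counts(SAMPLE)))
```

The same loop could be extended to split the translatable total by match type, so that perfect matches are reported separately from text that needs new translation.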

XLIFF and Online Translation

xml:tm translatable files can be created in XLIFF format. The XLIFF format can then be used to create dynamic web pages for translation. A translator can access these pages via a browser and undertake the whole of the translation process over the Web. This has many potential benefits. It avoids the problems of running filters and the delays inherent in sending data out for translation, such as inadvertent corruption of character encoding or document syntax, as well as simple human workflow problems. Using XML technology it is now possible both to reduce and control the cost of translation and to reduce the time translation takes and improve its reliability.
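To make the XLIFF step concrete, here is a minimal Python sketch that writes extracted text units into an XLIFF skeleton. It is an assumption-laden illustration, not the Xml-Intl tooling: it targets the XLIFF 1.2 element layout (`xliff`, `file`, `body`, `trans-unit`, `source`, `target`), and the `build_xliff` function, its arguments, and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def build_xliff(units, original, source_lang, target_lang):
    """Build an XLIFF 1.2 document from (unit_id, source, target_or_None) tuples."""
    ET.register_namespace("", XLIFF_NS)  # serialize with a default namespace
    xliff = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(xliff, f"{{{XLIFF_NS}}}file", {
        "original": original,
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "xml",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for unit_id, source, target in units:
        tu = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", {"id": unit_id})
        ET.SubElement(tu, f"{{{XLIFF_NS}}}source").text = source
        # Pre-fill the target only when a match supplied a translation.
        if target is not None:
            ET.SubElement(tu, f"{{{XLIFF_NS}}}target").text = target
    return ET.tostring(xliff, encoding="unicode")


print(build_xliff(
    [("u41.1", "Some new text", None), ("u42.1", "Hello", "Hallo")],
    original="manual.xml", source_lang="en", target_lang="de",
))
```

Reusing the xml:tm text unit identifiers as `trans-unit` ids is one way to let the translated text be merged back into the original document without manual alignment.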

Traditional translation scenario:

Figure 6: Traditional Translation Scenario

In the xml:tm translation scenario all processing takes place within the customer's environment:

Figure 7: xml:tm Translation Scenario

An example of a web-based translator environment can be seen at the following URL: http://www.xml-intl.com/demo/trans.html

Benefits of using xml:tm

The following is a list of the main benefits of using the xml:tm approach to authoring and translation:

  • The ability to build consistent authoring systems.

  • Automatic production of authoring statistics.

  • Automatic alignment of source and target text.

  • Aligned texts can be used to populate leveraged matching tm database tables.

  • Perfect translation matching for unchanged text units.

  • In-document leveraged and modified text unit matching.

  • Automatic production of word count statistics.

  • Automatic generation of perfect, leveraged, previously modified or fuzzy matches.

  • Automatic generation of XLIFF files.

  • Protection of the original document structure.

  • The ability to provide online access for translators.

  • Can be used transparently for relay translation.

Summary

xml:tm is a namespace-based technology, built on XML and XLIFF, that is created and maintained by Xml-Intl for the benefit of the XML community. Full details of the xml:tm definitions (XML Document Type Definition and XML Schema) are available from the Xml-Intl web site (http://www.xml-intl.com). Xml-Intl also supplies an implementation of xml:tm using Java and Oracle, which includes linguistically aware database leveraged and fuzzy matching.

There are future plans to incorporate a grammatical namespace in addition to the text memory namespace so that grammatical information can be embedded into XML documents and exchanged between applications.

xml:tm is best suited to enterprise level implementation for corporations with a large annual translation requirement and a content management system. During the implementation process xml:tm is integrated with the customer’s content management system.

The xml:tm approach reduces translation costs in the following ways:

  • Translation memory is held by the customer within the documents.

  • Perfect matching reduces translation costs by eliminating the need for translators to proof these matches.

  • Translation memory matching is much more focused than is the case with traditional translation memory systems, providing better results.

  • It allows for relay translation memory processing via an intermediate language.

  • All translation memory, extraction and merge processing is automatic; there is no need for manual intervention.

  • Translation can take place directly via the customer's web site.

  • All word counts are controlled by the customer.

  • The original XML documents are protected from accidental damage.

  • The system is totally integrated into the XML framework, making maximum use of the capabilities of XML to address authoring and translation.


