Translating XML Documents with xml:tm
by Andrzej Zydron
Matching with xml:tm
xml:tm provides much more focused types of matching than traditional translation memory systems. The following types of matching are available:
-
Perfect matching
Author memory provides exact details of any changes to a document. Where a text unit is unchanged from a previously translated version of the document, we have a “perfect match”. The concept of perfect matching is an important one. With traditional translation memory systems a translator still has to proof each match, because there is no way to ascertain whether the match is appropriate. Proofing has to be paid for, typically at 60% of the standard translation cost. With perfect matching there is no need to proof, which saves on the cost of translation.
-
In-document leveraged matching
xml:tm can also be used to find in-document leveraged matches, which will be more appropriate to a given document than normal translation memory leveraged matches.
-
Leveraged matching
When an xml:tm document is translated the translation process provides perfectly aligned source and target language text units. These can be used to create traditional translation memories, but in a consistent and automatic fashion.
-
In-document fuzzy matching
During the maintenance of author memory a note can be made of text units that have only changed slightly. If a corresponding translation exists for the previous version of the source text unit, then the previous source and target versions can be offered to the translator as a type of close fuzzy match.
-
Fuzzy matching
The text units contained in the leveraged memory database can also be used to provide fuzzy matches of similar previously translated text. In practice fuzzy matching is of little use to translators except for instances where the text units are fairly long and the differences between the original and current sentence are very small.
-
Non-translatable text
In technical documents you can often find a large number of text units that are made up solely of numeric, alphanumeric, punctuation or measurement items. With xml:tm these can be identified during authoring and flagged as non-translatable, thus reducing the word counts. For text units that contain only numbers or measurements it is also possible to automatically convert the decimal and thousands designators as required by the target language.
The following is an example of non-translatable text in xml:tm:
...
<text:list-header>
<text:p text:style-name="P9">
<tm:te id="e41">
<tm:tu id="u41.1">
Some
new text with examples of text that does not require
translation:
</tm:tu>
</tm:te>
</text:p>
</text:list-header>
</text:ordered-list>
<text:p text:style-name="Hanging indent">
<tm:te id="e42">
<tm:tu id="u42.1" type="measure">
10
mm
</tm:tu>
</tm:te>
</text:p>
<text:p text:style-name="Hanging indent">
<tm:te id="e43">
<tm:tu id="u43.1" type="measure">
10.50
m
</tm:tu>
</tm:te>
</text:p>
<text:p text:style-name="Hanging indent">
<tm:te id="e44">
<tm:tu id="u44.1" type="numeric">
10,000
</tm:tu>
</tm:te>
</text:p>
<text:p text:style-name="P13">
<tm:te id="e45">
<tm:tu id="u45.1" type="numeric">
9.956
</tm:tu>
</tm:te>
</text:p>
<text:p text:style-name="P13">
<tm:te id="e46">
<tm:tu id="u46.1" type="alphanum">
ABC104/EF
</tm:tu>
</tm:te>
</text:p>
...
And an example of the composed text:
Figure 5: Composed Non-Translatable Text
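The identification of non-translatable text units and the conversion of number separators can be sketched in a few lines of Python. This is only an illustration: the regular expressions, the list of measurement units and the function names `classify` and `localize_number` are assumptions made for this example and are not taken from the xml:tm definitions. The type values `numeric`, `measure` and `alphanum` are those used in the example above. Punctuation-only text units are not handled here.

```python
import re

# Hypothetical patterns; xml:tm itself may define these categories differently.
NUMERIC = re.compile(r"^[\d.,\s]+$")
MEASURE = re.compile(r"^[\d.,]+\s*(mm|cm|m|km|g|kg|ml|l)$")
ALPHANUM = re.compile(r"^(?=.*\d)[A-Za-z0-9/\-_.]+$")


def classify(text):
    """Return 'numeric', 'measure' or 'alphanum', or None for translatable text."""
    s = " ".join(text.split())  # collapse line breaks and repeated spaces
    if NUMERIC.match(s) and any(c.isdigit() for c in s):
        return "numeric"
    if MEASURE.match(s):
        return "measure"
    if ALPHANUM.match(s):
        return "alphanum"
    return None


def localize_number(number, decimal=",", thousands="."):
    """Swap English-style separators for those of the target language."""
    table = str.maketrans({",": thousands, ".": decimal})
    return number.translate(table)


print(classify("10\nmm"))         # a measurement text unit
print(classify("ABC104/EF"))      # an alphanumeric text unit
print(localize_number("10,000"))  # thousands separator for a German-style target
```

A text unit for which `classify` returns a type can be excluded from the word count, and `localize_number` can be applied to its numeric part when the target document is composed.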
Word Counts
The customer can use the output from the text extraction process to generate word and match counts automatically. This puts the customer in control of the word counts, rather than the supplier. This is an important distinction and allows for a tighter control of costs.
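As a rough sketch of how such counts might be produced, the following Python fragment totals the words in extracted text units by match type. The input format (pairs of text and match type), the match type labels and the whitespace-based definition of a word are all assumptions made for this example; whitespace splitting in particular is not suitable for languages such as Japanese.

```python
from collections import Counter


def word_counts(text_units):
    """Total the words per match type for (text, match_type) pairs."""
    totals = Counter()
    for text, match_type in text_units:
        totals[match_type] += len(text.split())
    return totals


# Hypothetical output of the text extraction process.
units = [
    ("Some new text with examples", "none"),
    ("Press the green button", "perfect"),
    ("10 mm", "non-translatable"),
]
counts = word_counts(units)
for match_type, words in counts.items():
    print(match_type, words)
```

Only the words counted under types that need a translator's attention would then be passed to the supplier for costing.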
XLIFF and Online Translation
xml:tm translatable files can be created in XLIFF format. The XLIFF format can then be used to create dynamic web pages for translation. A translator can access these pages via a browser and undertake the whole of the translation process over the Web. This has many potential benefits. The problems of running filters and of sending data out for translation, such as delays, inadvertent corruption of character encoding or document syntax, and simple human workflow problems, can be avoided altogether. Using XML technology it is now possible to both reduce and control the cost of translation, as well as to reduce the time it takes and improve its reliability.
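To give an idea of what an extracted file could look like, the following Python sketch writes a list of text units to a minimal XLIFF document. It assumes XLIFF 1.2 (the article does not state a version), and the function name `build_xliff`, the file name and the language codes are made up for this example. It does not reproduce the output of the Xml-Intl implementation.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"  # assumed version: XLIFF 1.2


def build_xliff(units, source_lang, target_lang, original):
    """Build a minimal XLIFF document from (id, source text) pairs."""
    ET.register_namespace("", XLIFF_NS)  # serialize with a default namespace
    root = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(root, f"{{{XLIFF_NS}}}file", {
        "original": original,
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "xml",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for unit_id, text in units:
        unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", {"id": unit_id})
        ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = text
    return ET.tostring(root, encoding="unicode")


# Reuse the text unit id from the earlier example as the trans-unit id.
xliff = build_xliff(
    [("u41.1", "Some new text with examples of text that does not require translation:")],
    source_lang="en", target_lang="de", original="manual.xml",
)
print(xliff)
```

Carrying the `tm:tu` id over as the `trans-unit` id is one simple way to let the translated text be merged back into the right text unit.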
Traditional translation scenario:
Figure 6: Traditional Translation Scenario
In the xml:tm translation scenario all processing takes place within the customer's environment:
Figure 7: xml:tm Translation Scenario
An example of a web based translator environment can be seen at the following URL: http://www.xml-intl.com/demo/trans.html
Benefits of using xml:tm
The following is a list of the main benefits of using the xml:tm approach to authoring and translation:
-
The ability to build consistent authoring systems.
-
Automatic production of authoring statistics.
-
Automatic alignment of source and target text.
-
Aligned texts can be used to populate leveraged matching tm database tables.
-
Perfect translation matching for unchanged text units.
-
In-document leveraged and modified text unit matching.
-
Automatic production of word count statistics.
-
Automatic generation of perfect, leveraged, previously modified or fuzzy matches.
-
Automatic generation of XLIFF files.
-
Protection of the original document structure.
-
The ability to provide online access for translators.
-
Can be used transparently for relay translation.
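The order in which the different match types might be tried can be sketched as follows. This is a simplified illustration, not the xml:tm algorithm: it assumes that a text unit keeps its id from one version of the document to the next, it uses Python's `difflib.SequenceMatcher` with an arbitrary threshold of 0.85 as a stand-in for deciding that a text unit has "only changed slightly", and it looks up leveraged matches by exact source text only.

```python
from difflib import SequenceMatcher


def classify_match(unit_id, text, previous, translations, leveraged, threshold=0.85):
    """Return (match type, suggested translation) for one text unit.

    previous:     {id: source text} from the last translated version
    translations: {id: target text} for that version
    leveraged:    {source text: target text} from the leveraged memory
    """
    old = previous.get(unit_id)
    if old is not None and unit_id in translations:
        if old == text:
            return "perfect", translations[unit_id]
        if SequenceMatcher(None, old, text).ratio() >= threshold:
            return "in-document fuzzy", translations[unit_id]
    if text in leveraged:
        return "leveraged", leveraged[text]
    return "none", None


previous = {"u1": "Press the green button.", "u2": "Open the cover."}
translations = {"u1": "Appuyez sur le bouton vert.", "u2": "Ouvrez le capot."}
leveraged = {"Close the cover.": "Fermez le capot."}

print(classify_match("u1", "Press the green button.", previous, translations, leveraged))
print(classify_match("u2", "Open the top cover.", previous, translations, leveraged))
print(classify_match("u9", "Close the cover.", previous, translations, leveraged))
```

Only text units reported as perfect matches could skip proofing; the other suggestions would still be offered to the translator for review.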
Summary
xml:tm is a namespace-based technology, built on XML and XLIFF, that is created and maintained by Xml-Intl for the benefit of the XML community. Full details of the xml:tm definitions (XML Document Type Definition and XML Schema) are available from the Xml-Intl web site (http://www.xml-intl.com). Xml-Intl also supplies an implementation of xml:tm using Java and Oracle, which includes linguistically aware database leveraged and fuzzy matching.
There are future plans to incorporate a grammatical namespace in addition to the text memory namespace so that grammatical information can be embedded into XML documents and exchanged between applications.
xml:tm is best suited to enterprise level implementation for corporations with a large annual translation requirement and a content management system. During the implementation process xml:tm is integrated with the customer’s content management system.
The xml:tm approach reduces translation costs in the following ways:
-
Translation memory is held by the customer within the documents.
-
Perfect matching reduces translation costs by eliminating the need for translators to proof these matches.
-
Translation memory matching is much more focused than is the case with traditional translation memory systems, providing better results.
-
It allows for relay translation memory processing via an intermediate language.
-
All translation memory, extraction and merge processing is automatic; there is no need for manual intervention.
-
Translation can take place directly via the customer's web site.
-
All word counts are controlled by the customer.
-
The original XML documents are protected from accidental damage.
-
The system is totally integrated into the XML framework, making maximum use of the capabilities of XML to address authoring and translation.
What is your experience with maintaining translated documents? Would xml:tm help you? Ask questions or comment on this article in our forum.
- XLIFF to WEb
2007-03-04 10:49:27 hbaraona [Reply]
I read the article and I am interested in learning more about how translation can take place directly via the customer's web site or our internal portal.
Hector Baraona
VP Operations
Telelingua USA
- XLIFF to WEb
2008-03-13 02:46:52 Andrzej Zydron [Reply]
Hi Hector,
Apologies for the delay in replying, but I somehow missed your post. If you visit our web site you can get full details of the web based translator workbench.
Best Regards,
AZ
- xml:tm now an official LISA OSCAR Standard
2007-03-01 03:56:43 Andrzej Zydron [Reply]
xml:tm was adopted by OSCAR as an official standard for the globalization industry on February 26, 2007. xml:tm provides a radical new approach to the task of authoring and translating XML documents. Please visit the xml:tm page on the LISA web site for full details (http://www.lisa.org/standards/xmltm/).
- xml:tm approved as a LISA OSCAR Standard
2006-07-26 01:26:39 Andrzej Zydron [Reply]
xml:tm was approved on 17th July 2006 by the LISA OSCAR Steering Committee for public comment prior to final ratification as a standard. XML-INTL has provided the driving force and technical architecture for this critical Localization Industry Standard. The xml:tm standard was developed by XML-INTL and donated to LISA OSCAR for consideration as an OSCAR standard. "This is a great day for the XML and localization communities. xml:tm provides a radical new way of approaching the authoring and localization of XML documents. It is a perfect companion standard to DITA (Darwin Information Typing Architecture)" stated XML-INTL CTO Andrzej Zydroń.
xml:tm is the vendor-neutral open XML standard for embedding text memory within an XML document. xml:tm leverages the namespace syntax of XML to embed text memory information within the XML document itself. xml:tm provides a radical new approach to the task of authoring and translating XML documents. To learn more about xml:tm, please visit the LISA OSCAR xml:tm page - http://www.lisa.org/standards/xmltm/
- xml:tm approved as a LISA OSCAR Standard
2008-03-12 16:10:47 T2aki [Reply]
I am frustrated by a limitation in a Translation Management System we are about to use to translate DITA XML files. The TMS doesn't allow translators to add or remove in-line DITA tags, like "uicontrol", used in the source. All languages have to have the same number of these tags as English. This is the idea.
For example, if the source reads "Click <uicontrol>OK</uicontrol>", then your translation must have one uicontrol tag, not two. Well, normally this is fine.
But, especially when translating into Japanese, in which you need to add different or localized information, this limitation causes troubles. This limitation might work well to make translations consistent or to verify tag structures in translation. There might be a workaround but the limitation itself is against the localization idea, I think.
Ah, what do you think about this limitation?
- Evaluation of XMl Source for translation
2006-07-10 06:19:29 wbeadle [Reply]
Does anyone know of an independent XML expert who could evaluate an XML English source file used for translations? The translations that come back from our vendor lose formatting, drop characters, and insert others. The translator claims the problem is with the XML being produced by our application. Our application provider claims it is the translator's error.
- Evaluation of XMl Source for translation
2006-07-26 06:48:02 Andrzej Zydron [Reply]
As long as your XML document parses correctly with an XML 1.0 compliant parser such as Apache Xerces, there is nothing wrong with your data. Unless your translation supplier is conversant with XML and parses the document before returning it to you, to check that the document structure and encoding have not been changed, there is every opportunity for corruption of the document to arise.
- xml:tm adopted as proposed LISA OSCAR work item
2005-06-07 10:59:48 Andrzej Zydron [Reply]
xml:tm has been donated to the LISA OSCAR TC (http://www.lisa.org/oscar/) by XML-INTL. It has been accepted as an OSCAR work item, the first stage on the road to being accepted as an official LISA OSCAR standard.
- "inline" formats
2004-11-15 19:08:19 christof@gmail.com [Reply]
thank you very much for this very interesting article. One thing I am interested in is how xml:tm deals with word formats - e.g. if the word "example" in the sentence "this is an example to show how it works" is underlined. are there provisions for tags to communicate this to the translator? doesn't it apply because the format only deals with more or less unformatted CMS content?
Best regards
Christof
- undiscussed issue
2004-02-03 06:49:17 Alexander Kudinov [Reply]
Dear Andrzej,
Thank you for your splendid article! It was of great importance to me because I'm particularly interested in any information on applying XML to the translation process.
Perhaps I've missed something, but there seems to be an issue that wasn't discussed. I mean the situation when a sentence in the source text corresponds to two sentences in the target text, or two sentences in the source text correspond to one sentence in the target text. How does xml:tm address this issue?
Thanks.
Alexander
- undiscussed issue
2004-02-03 07:57:34 Andrzej Zydron [Reply]
Hi Alexander,
Thank you very much for your question. It shows a detailed understanding of the issues involved. If a text unit in the source is translated as two sentences in the target language there is no problem. The target language text unit will contain two sentences that are equivalent in translation to one text unit in the source. Conversely, if two or more source text units are rendered by one or more equivalent target language text units, then within xml:tm there is an attribute for tm:tu elements called "flag". You can use this attribute to specify that the translation for one or more text units has been merged into the preceding one by setting the flag attribute value to "merged". In this way a target language translation can refer to more than one text unit (tm:tu). The only restriction is that you cannot cross text element (tm:te) boundaries.
A full detailed specification of xml:tm is now available at the following URL:
http://www.xml-intl.com/docs/specification/xml-tm.html
This goes into much more detail than could be rendered in the article.
If you have any more questions please do not hesitate to ask.
Regards,
AZ
- undiscussed issue
2004-02-05 06:52:38 Alexander Kudinov [Reply]
Andrzej,
Thank you for your reply. It was a real help and cleared up the issue.
Alexander
- Automated segmentation via CMS
2004-01-13 15:32:53 Don Smith [Reply]
Ah, I think I understand now, but I want to be sure: the segmentation of the sentences that occurs in xml:tm is automated through the CMS and never seen by the content creator.
- Automated segmentation via CMS
2004-01-14 14:17:59 Andrzej Zydron [Reply]
Correct!
- More on Mapping
2004-01-13 02:59:21 Don Smith [Reply]
I'm interested in understanding better the relationship between the original XML document and the mapped xml:tm document you illustrate in Figure 1.
Does xml:tm assume that people create content in XML using customized document types or in xml:tm itself? If the former, then I assume that moving from my own document type to xml:tm is a straight XSLT transformation. In that case, I have a question about sentence segmentation.
Most document types do not use markup to distinguish sentences in the original customized document type. Does xml:tm assume that a customized document type will segment sentences? (Does the <text> element in Figure 1 perform the function of segmenting sentences in the source document type?)
Also, the PDF at http://www.xml-intl.com/docs/xml-tm-whitepaper.pdf appears to be bad since when I try to download it I got a seven page document with nothing in it and a locked-up web browser.
- More on Mapping
2004-01-13 12:47:17 Andrzej Zydron [Reply]
In answer to some of the other points that you raised:
The allocation of the xml:tm namespace should be implemented automatically by a program designed for that purpose. The maintenance of the xml:tm namespace should also be done by program. The xml:tm namespace should not be visible for any authoring or printing operations. It should be stripped out for these purposes as described in my other answer.
Regarding the "text" in Figure 1: it represents the #PCDATA text of the element, as do the "sentence" components. The difference is that there is no identifiable segmentation of the "text" into separate sentences, as would be the case, for example, for a section title.
- More on Mapping
2004-01-13 12:25:10 Andrzej Zydron [Reply]
Thank you very much for your feedback.
xml:tm is best suited for use within a content management system (CMS). The way to use xml:tm within XML documents is to only hold the namespace data within the CMS. When the document is checked out for authoring the namespace should be stripped out. On checking in to the CMS the namespace data is updated by inserting the xml:tm namespace (via the segmentation process) into the new version of the document and comparing the two namespace versions of the document – a process called DOM differencing. Similarly with printing – the namespace should be stripped out when the document is sent for printing via FOP or XEP for instance. In this way the xml:tm namespace is transparent to the authoring or printing environments and does not impinge in any way. Stripping out the xml:tm namespace is a trivial operation using XSLT.
I have checked the PDF file again with IE 6.0, Netscape 7.1 and Mozilla Firebird 0.6.1 and it works in all three browsers. It is rather big, and can cause problems if you try to open it before it has fully downloaded in your browser. Wait until you can see the first page being displayed in the Acrobat browser plugin, otherwise it will complain. Alternatively, try downloading it rather than displaying it directly in your browser, using the "shift+click" option available in most browsers.
If you have any more questions please do not hesitate to post further feedback.
Regards,
A.Zydron
- OASIS Language Translation Techniques
2004-01-08 13:03:22 David Webber [Reply]
The author is clearly not aware of the OASIS CAM specification - which has been employing content references to allow automated crosswalks for two years+ now. Not exactly "revolutionary" - but we'll let that pass as being good marketing 'ink'.
CAM templates also mean you do NOT have to embed all the syntax into the XML instance - a major advantage for eBusiness transaction processing.
You can find out more on the CAM approach at:
http://cam.swiki.net
David Webber,
Chair, OASIS CAM TC.
- OASIS Language Translation Techniques
2005-06-12 05:01:25 Andrzej Zydron [Reply]
Hi David, I have finally had enough time to look at CAM in detail. I was somewhat hampered in this by the lack of published articles about CAM and the specification document itself, along with a slightly chaotic and disappointing web site. I always worry when I see an XML specification written in MS Word.
Having said all that, I think CAM is a very good technology and I plan to use it as the basis of the localization directives specification which I will be involved with (hopefully) shortly. So thank you for bringing it to my attention. Our exchange has been very useful for me.
However, CAM is not a substitute for xml:tm. xml:tm is about embedding and tracking id references in documents, along with DOM differencing. None of this is achievable as a notation with CAM. CAM can be used as the basis for localization directives though. At the moment xml:tm uses the analysis file concept for defining translatable text, inline elements and translatable attributes. CAM offers a better substitute for the analysis file concept as part of a greater localization directives specification initiative.
Best Regards,
AZ
- OASIS Language Translation Techniques
2004-01-10 04:23:22 Andrzej Zydron [Reply]
Hi David,
Thank you very much for your feedback.
There is now a detailed specification for xml:tm available on line at the following URL:
http://www.xml-intl.com/docs/specification/xml-tm.html
as well as a white paper at the following URL:
http://www.xml-intl.com/docs/xml-tm-whitepaper.pdf
These links may provide some more insight concerning the detailed reasons behind the design of xml:tm.
Xml:tm is revolutionary, not in the XML sense – it has a very simple structure, but in the realm of translation memory and its use for XML documents, aligned with the concept of "perfect matching". It leverages the syntax of XML to substantially improve the translation process for XML documents. As far as I am aware (and I have been giving presentations and publishing articles on xml:tm at international symposia for nearly a year) no one has come up with a similar solution for the translation of XML documents.
Even though I participate in two OASIS TCs (Trans WS as a member and XLIFF as an observer) I must admit to not being aware of CAM. This is one of the benefits of publishing articles - you find out about so many interesting things from feedback! It is difficult to keep abreast of all the very useful work being done within OASIS, W3C and LISA's OSCAR. CAM certainly looks very interesting. I will be studying the complex CAM spec in detail and will reply in due course.
Regards,
A.Zydron
