Speaking Your Language

April 19, 2000

Leigh Dodds

This week XML-Deviant reports on discussions concerning foreign language versions of XML documents and schemas. We also cover recent developments from SML-DEV, a group that is developing simplified XML subsets for specific applications.


The simplest questions often have the most interesting answers. Don Park prompted a discussion on internationalization when he asked:

When is it appropriate to use non-English tag names?

The answers were varied. Steve Schafer thought that the question was largely irrelevant as few users will ever directly read an XML file:

When you're talking about information that is exposed to users, then yes, it's important that you carefully decide how to present that information so that it is as accessible as possible to the target audience. But markup is hidden away behind the scenes...

...Personally, I think the quasi-human-readable aspect of XML files is highly overrated.

This is a fair point, as many XML schemas are less than readable, even in one's native language. But we're still a long way from having XML integrated so seamlessly with all our applications that it can fade completely into the background.

In some cases a user may wish to, or be required to, read XML markup. Developers are one obvious set of users who will have to interpret markup correctly. They will have to develop code against a given schema, and must be able to read and understand the XML they generate. This is one important user group which shouldn't be disenfranchised.

If you therefore assume that your markup must be readable, how do you answer Don Park's question? Rick Jelliffe thought that the answer depends on your viewpoint:

If it is a schema writer, I would give one answer (the same answer I have been giving since 1994: the developer of a DTD has a reasonable expectation of the schema's usage and they should try to choose the best names based on their reasonable expectation of the writers and consumers of the documents). On the other hand, if it is someone who might want to build parsers or to promote reduced-XML standards who is asking, I would say "who are we to tell someone what language they should use?" and "please don't set us back 20 years"

As a schema designer you must be aware of your audience. In today's environment this will inevitably include a significant number of non-English speaking users.

Simon St. Laurent observed that, while XML provides full support for Unicode, and hence multiple character sets and languages, it doesn't define how multi-lingual features should be managed:

XML 1.0 opened the door to widespread use of Unicode names all over the globe, but provided no tools for managing such names. DTDs simply don't handle equivalence, and there's no way to validate a Japanese-element-name version against an English-element-name DTD.

Luckily, the problem isn't insoluble, although (as ever) there is more than one solution to choose from, and no single standard.

David Megginson pointed out that Architectural Forms provide a means to assert the equivalency of two elements with different names:

In fact, either Architectural Forms or schema subtyping make it possible to derive localized versions of vocabularies. With AFs, it can look something like this:

  <ville my:form="city">Montréal</ville>

  <Großstadt my:form="city">München</Großstadt>
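To make the idea concrete, here is a minimal sketch in Python of how an application might read both localized elements through their shared form name. It is our illustration, not Megginson's code and not a real Architectural Forms processor. The quoted fragment does not declare the "my" prefix, so the namespace URI and the wrapping "places" element below are assumptions added only to make the fragment parseable.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "my" prefix; the quoted example does not declare one.
MY_NS = "http://example.org/my-architecture"
FORM_ATTR = f"{{{MY_NS}}}form"

# The two elements from the example, wrapped in a hypothetical root element.
DOC = f"""<places xmlns:my="{MY_NS}">
  <ville my:form="city">Montréal</ville>
  <Großstadt my:form="city">München</Großstadt>
</places>"""

def architectural_view(root):
    """Yield (form name, text) for every element that declares a form."""
    for elem in root.iter():
        form = elem.get(FORM_ATTR)
        if form is not None:
            yield form, (elem.text or "").strip()

root = ET.fromstring(DOC)
cities = list(architectural_view(root))
print(cities)  # [('city', 'Montréal'), ('city', 'München')]
```

The application never looks at the localized element names; it only consults the form attribute, which is what lets the French and German markup be treated alike.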

Steve Champeon commented that the same effect could be achieved by simply transforming the document:

It can't be that difficult to convert any SGML/XML document from one vocabulary to another using XSLT or a Perl script. We often use long_descriptive_function_names in Javascript during development and then optimize them later for delivery. Why not do the same with XML documents?
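As a rough illustration of that transformation approach, the sketch below renames elements according to a lookup table. It uses Python's standard ElementTree module rather than the XSLT or Perl mentioned in the discussion, and the mapping table and the French sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from localized element names to an English vocabulary.
TO_ENGLISH = {"lieux": "places", "ville": "city", "Großstadt": "city"}

def translate(xml_text, mapping):
    """Rename every element found in the mapping; leave other names untouched."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = mapping.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")

french = "<lieux><ville>Montréal</ville></lieux>"
english = translate(french, TO_ENGLISH)
print(english)  # <places><city>Montréal</city></places>
```

Only element names change; the text content stays in its original language. A real conversion would also have to deal with attribute names and namespaces, which this sketch ignores.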

Ken North highlighted the work of the XML/EDI group, which has proposed language-neutral Bizcodes:

The XML/edi group has been promoting the use of language-neutral universal reference codes (bizcodes). You rely on semantic information from a central repository, perhaps in combination with AF.

This solution, whilst geared towards standardization of eCommerce transactions, has other benefits. The basic idea is that a central registry holds a list of "codes", each of which identifies a particular business object, for example, an invoice. Individual schemas can then define elements that correspond to the centrally held definitions. The schema designer can use their own element names, which can be in any language. Bizcodes are discussed further in a recent article.
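The registry idea can be sketched in a few lines of Python. Everything here is hypothetical: the codes, the element names, and the table layout are invented for illustration and are not taken from the XML/EDI group's actual proposal.

```python
# Hypothetical central registry: code -> business object. The codes are invented.
REGISTRY = {"X001": "invoice", "X002": "invoice date"}

# Each schema binds its own element names, in any language, to registry codes.
FRENCH_SCHEMA = {"facture": "X001", "dateFacture": "X002"}
GERMAN_SCHEMA = {"Rechnung": "X001", "Rechnungsdatum": "X002"}

def same_concept(name_a, schema_a, name_b, schema_b):
    """True if two element names from different schemas share a registry code."""
    code_a = schema_a.get(name_a)
    return code_a is not None and code_a == schema_b.get(name_b)

print(same_concept("facture", FRENCH_SCHEMA, "Rechnung", GERMAN_SCHEMA))        # True
print(same_concept("facture", FRENCH_SCHEMA, "Rechnungsdatum", GERMAN_SCHEMA))  # False
```

Because equivalence is decided by the shared code rather than by the element name, neither schema has to adopt the other's language.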

Simon St. Laurent believed that whichever solution is selected, it should be integrated with the existing XML infrastructure:

It's an easy process - we just need a way to integrate that process with existing XML processing infrastructures, rather than leaving it as another application-specific variable we 'just have to hope' is supported.

The requirement that your XML should be internationalized will largely depend on your application. But one important point that this discussion has highlighted is that you should be aware of the users of your schemas, both producers and consumers. Whether this leads to support for foreign languages, or simply the use of clear vocabulary in your schemas, there are benefits to be gained.

St. Laurent neatly summarized the situation:

It seems like there's a lot of room to make a long-term improvement in i18n and markup without creating wildly complex situations.


It may soon be your parser which is speaking a different language: Minimal XML (MinXML). MinXML is another project currently being discussed on the SML-DEV mailing list. SML-DEV's first project, "Common XML", was covered in last week's XML-Deviant, "Filling In The Gaps".

MinXML takes a more reductionist approach than Common XML. Instead of providing usage guidelines, it strips the XML specification down to a central core: elements and text. This makes the preliminary MinXML specification short and to the point. The aim is to take a modular approach to the development of an XML framework, which may ultimately yield better integration between its various parts. These are still very early days, however, and the specification itself is still under debate.

The simplicity of MinXML means that implementing a parser is not a hugely complex task. In just over a week, three different Java parsers have been announced, along with a JavaScript parser, JasMin, which is implemented in only 50 lines of code!
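To give a feel for why such parsers can be so small, here is a sketch of a parser for an "elements and text only" language, written in Python. It is not an implementation of the MinXML draft, whose details are still being debated, and it is unrelated to JasMin or the Java parsers. We assume there are no attributes, comments, processing instructions, entity references, or empty-element tags, and that whitespace-only text can be dropped.

```python
import re

# One token is an end tag, a start tag, or a run of text. Nothing else is allowed.
TOKEN = re.compile(r"</([^<>/\s]+)>|<([^<>/\s]+)>|([^<>]+)")

def parse(text):
    """Parse an elements-and-text-only document into nested (name, children) tuples."""
    root = ("#document", [])
    stack = [root]
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected markup at offset {pos}")
        end_name, start_name, chars = m.groups()
        if start_name:
            # Open a new element and make it the current parent.
            node = (start_name, [])
            stack[-1][1].append(node)
            stack.append(node)
        elif end_name:
            # An end tag must close the element that is currently open.
            if len(stack) == 1 or stack[-1][0] != end_name:
                raise ValueError(f"mismatched end tag </{end_name}>")
            stack.pop()
        elif chars.strip():
            # Keep text as-is, but drop whitespace-only runs between elements.
            stack[-1][1].append(chars)
        pos = m.end()
    if len(stack) != 1:
        raise ValueError(f"unclosed element <{stack[-1][0]}>")
    return root[1]

tree = parse("<a><b>hi</b><c></c></a>")
print(tree)  # [('a', [('b', ['hi']), ('c', [])])]
```

The sketch checks that tags nest properly but does not insist on a single root element. Even so, the whole thing fits comfortably in a few dozen lines, which is consistent with the pace at which parsers are being announced.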

It's too early to tell how successful MinXML will be. However, judging by the speed with which parsers are being implemented, it could find support among developers wanting to get "back to basics". The recent popularity of Sean McGrath's PYX notation showed that there is a definite market for some simple, yet powerful, tools.