Practical Internationalization

April 18, 2001

Tim Bray's map.net site provides a representation in 2D and 3D of the Open Directory Project's content. In an earlier article for XML.com, Tim explained TAXI, the architecture of map.net. However, the aspect of map.net that interested me most is its handling of fully internationalized content. Internationalization (often abbreviated as i18n) is notoriously lacking in many applications and is perceived as difficult. I asked Tim about his experiences with map.net, and how easy it really is to build a web application that supports multiple languages and scripts.

Edd Dumbill: What interests me are the considerations of creating an internationalized XML application. The overall architecture required, pitfalls, and 80/20 points.

Tim Bray: Writing a fully internationalized app is more expensive than ignoring the issues, but not that much. What's really expensive is going back and i18n-izing an existing i18n-oblivious app.

It should also be said that using modern technologies like XML and Java makes it harder to ignore, and easier to implement, good internationalization practices.

An example search result from map.net

ED: Map.net has part of its content in non-Western local languages and character sets. What made you decide to cater to these rather than stick with Western languages?

TB: The most practical reason is that the map.net showcase is getting its data from the Open Directory Project, which has a huge number of pages (in the top-level /World category) which are in languages other than English.

As a matter of principle also, I'm one of the people who helped in forcing the close integration of XML and Unicode. Not only is it ethically unreasonable to maintain the delusion that you can do anything serious on the Net in English only, it's also damn bad for business. Here at Antarcti.ca we want to provide our software to the world, including the 75% of it that uses languages that extend beyond the Latin alphabet.

ED: Is browsing technology up to coping with multilingual web sites?

TB: Better than I'd hoped. Both IE5 and Mozilla seem to have their acts pretty well together. IE will sometimes even realize when it doesn't have the right fonts installed, and it will ask if it can go off to Microsoft and get Japanese or Cyrillic or whatever; we haven't figured out exactly which combination of HTTP headers makes this work yet. Once the fonts are installed, it doesn't seem to care whether you send the stuff in Unicode or a native encoding like JIS or KOI8.

If you're not using Unicode, and just reading ordinary HTML pages, the browsers still guess wrong quite a bit, and you have to tell them what to do. If you send UTF-8 along with an HTTP header saying so, I've never seen a modern browser get it wrong.
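To make the point about labelling UTF-8 concrete, here is a minimal Python sketch of a helper that pairs a UTF-8 body with a matching Content-Type header. It is an illustration only, not map.net's code; the function name utf8_html_response and the sample page are assumptions made for this example.

```python
def utf8_html_response(html_text):
    """Return (headers, body) for an HTML page served as UTF-8.

    Hypothetical helper for illustration; not taken from map.net.
    """
    # Encode once, so the declared charset and the actual bytes agree.
    body = html_text.encode("utf-8")
    headers = [
        # The charset parameter is what tells the browser not to guess.
        ("Content-Type", "text/html; charset=utf-8"),
        # Content-Length counts bytes, not characters.
        ("Content-Length", str(len(body))),
    ]
    return headers, body


page = "<p>Привет, 世界</p>"
headers, body = utf8_html_response(page)
```

The body is longer in bytes than the page is in characters, which is why the length is computed after encoding rather than from the original string.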

In terms of the actual rendering and display, even of graphically-challenging or "difficult" languages such as Arabic and Thai, the browsers produce results that look pretty good to my semi-educated eye.

ED: Do you have staff who can read all the languages? How did you know you were getting them right?

TB: Well, we have in-house command of French, several Eastern European languages, Japanese, and Farsi (Iranian) that I know about; also a bit of Arabic. I have done a lot of work in my previous life at Open Text in search issues for different character sets. Outside of the languages that we know about, we can't be sure that we're doing the right thing, but I think the chances are good.

Note the hard issues in search; in Chinese, Japanese, and Korean they don't have helpful spaces between searchable tokens, so the question of what-to-index is rather tricky.
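One common workaround for the missing-spaces problem is to index overlapping pairs of characters (bigrams) for runs of CJK text, while still splitting other scripts into words. The Python sketch below shows that idea under stated assumptions: the interview does not say how map.net tokenizes, the function names are hypothetical, and the code point ranges cover only the main CJK, kana, and Hangul blocks.

```python
def is_cjk(ch):
    """Rough test for characters written without spaces between words."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF       # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF    # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF    # Hangul syllables
    )


def index_tokens(text):
    """Split text into lowercase words plus overlapping CJK bigrams."""
    tokens, run, word = [], [], []

    def flush():
        # Emit any pending space-delimited word.
        if word:
            tokens.append("".join(word).lower())
            word.clear()
        # Emit a pending CJK run as bigrams (or a lone character).
        if len(run) == 1:
            tokens.append(run[0])
        elif run:
            tokens.extend(a + b for a, b in zip(run, run[1:]))
        run.clear()

    for ch in text:
        if is_cjk(ch):
            if word:
                flush()
            run.append(ch)
        elif ch.isalnum():
            if run:
                flush()
            word.append(ch)
        else:
            flush()  # whitespace or punctuation ends the current token
    flush()
    return tokens


tokens = index_tokens("XML 東京都 search")
```

Bigrams over-generate tokens compared with a dictionary-based segmenter, but they need no language-specific word list, which is the usual reason for choosing them.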

ED: What software tools did you employ to manipulate internationalized data? Commercial or open source? How much did you have to write yourself?

TB: Quite a lot of Perl 5.6 with its new, buggy in places, but essentially well-designed UTF-8 handling. We managed to get it to do search tokenization and multilingual sorting, with a certain amount of effort. The run-time is in C and deals with UTF-8, which in a lot of cases you can just ignore and do strcmp() and strcpy() and so on. For the places you can't, we did have to write routines that copy individual characters, expressed as ints, into and out of UTF-8 character buffers.
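The kind of routine described here, moving a single character held as an integer into and out of a UTF-8 buffer, can be sketched as follows. This is written in Python rather than the C of the map.net run-time, the names put_utf8 and get_utf8 are hypothetical, and the decoder is deliberately simple: it checks lead and continuation bytes but does not reject overlong encodings.

```python
def put_utf8(cp, buf):
    """Append code point cp (an int) to bytearray buf; return bytes written."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        out = (cp,)
    elif cp < 0x800:
        out = (0xC0 | (cp >> 6), 0x80 | (cp & 0x3F))
    elif cp < 0x10000:
        out = (0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F))
    else:
        out = (
            0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        )
    buf.extend(out)
    return len(out)


def get_utf8(buf, pos):
    """Decode one code point starting at buf[pos]; return (cp, next_pos)."""
    first = buf[pos]
    if first < 0x80:
        return first, pos + 1
    if first >> 5 == 0b110:
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:
        length, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:
        length, cp = 4, first & 0x07
    else:
        raise ValueError("invalid lead byte")
    if pos + length > len(buf):
        raise ValueError("truncated sequence")
    for b in buf[pos + 1 : pos + length]:
        if b >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp, pos + length


buf = bytearray()
for cp in (0x41, 0xE9, 0x4E16, 0x1F600):
    put_utf8(cp, buf)

# Byte-wise comparison of UTF-8 preserves code point order, which is why
# plain byte comparison (strcmp() in C) is often enough.
words = ["zebra", "école", "東京", "apple"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
```

Sorting by raw bytes gives code point order, not a linguistically correct collation, so multilingual sorting for display still needs separate work.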

We also had to write an amusing Perl script that made .h files to include in our C code to express Perl's idea of what it meant by \W and \w.
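A generator of that sort can be sketched in a few lines. The version below is an assumption-laden stand-in: it is Python rather than Perl, so it captures Python's definition of \w rather than Perl's, and the function names, table name, and header layout are all hypothetical.

```python
import re

WORD = re.compile(r"\w")


def word_ranges(limit=0x10000):
    """Return inclusive (start, end) ranges of code points below limit
    that this regex engine treats as word characters."""
    ranges, start = [], None
    for cp in range(limit):
        if WORD.match(chr(cp)):
            if start is None:
                start = cp          # open a new range
        elif start is not None:
            ranges.append((start, cp - 1))  # close the current range
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges


def make_header(ranges, name="word_ranges"):
    """Render the ranges as a C array suitable for a generated .h file."""
    lines = [
        "/* Generated file: do not edit by hand. */",
        "static const unsigned int %s[][2] = {" % name,
    ]
    lines += ["    {0x%04X, 0x%04X}," % r for r in ranges]
    lines.append("};")
    return "\n".join(lines) + "\n"


# Small demonstration limited to ASCII; a real run would use a higher limit.
ascii_ranges = word_ranges(0x80)
header = make_header(ascii_ranges)
```

Generating the table from the scripting language's own regex engine keeps the two halves of a system in agreement about what counts as a word character, which is the point of the script described above.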

Still, the total amount of this code is not that large; less than a thousand lines.

ED: Have you found any parts or languages which you've just avoided as they would be too costly to implement?

TB: No. Note that some of the Open Directory data is not provided in Unicode (grumble), e.g. /World/Armenian, and we really can't do very much with that stuff.

ED: Has anything surprised you about working with non-Western content?

TB: There's nothing too technically deep once you get the basic architecture right. There are all sorts of irritating details in handling right-to-left rendering and character sets that have their own quotation marks that aren't " or ', but nothing requiring huge innovations.

ED: How much more do you estimate it has cost to implement an internationalized application rather than just stick with Western content? Would you do it again for future applications?

TB: Less than 25% extra. Probably less than 10%. We'd absolutely do it again without an instant's hesitation.

ED: Internationalization always seems to be the poor relation in XML technology -- I attended a Unicode tutorial at XML'99 and found myself only one of four attendees. Is the position deserved? What more could be done to improve awareness of internationalization?

TB: It's not an XML problem, it's a problem of American psychology. Too many smart, good people who are Americans have trouble seeing past their borders and realizing that we Anglophones are a minority in the big picture. I'd say that the population of people using and deploying XML is probably way more i18n-savvy than the average in the computing professions.