Non-ASCII Characters In URIs

What this says is that if you're pointing at an entity and the URL in your system identifier contains some non-ASCII characters, for example société.html, the processor, before trying to use the URI, should convert the string to UTF-8 (which will yield 2 or more bytes for each non-ASCII character). Then, for each of those bytes, it should express the byte as the character "%" followed by two hex digits. In the case of société.html, the character é is #xe9, which in UTF-8 would be the two bytes #xc3, #xa9; thus the processor should encode this URL as soci%c3%a9t%c3%a9.html.
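As a rough illustration of that escaping step, here is a minimal Python sketch. The function name escape_non_ascii is made up for this example and is not taken from the spec or from any particular processor; it assumes the system identifier is already available as a Unicode string and only touches the non-ASCII characters.

```python
def escape_non_ascii(uri: str) -> str:
    """Percent-escape the UTF-8 bytes of every non-ASCII character in uri.

    Hypothetical helper for illustration; ASCII characters pass through as-is.
    """
    pieces = []
    for ch in uri:
        if ord(ch) < 128:
            # ASCII characters are left untouched.
            pieces.append(ch)
        else:
            # Convert the character to UTF-8 (2 or more bytes), then write
            # each byte as "%" followed by two hex digits.
            pieces.extend(f"%{byte:02x}" for byte in ch.encode("utf-8"))
    return "".join(pieces)


print(escape_non_ascii("société.html"))  # soci%c3%a9t%c3%a9.html
```

The sketch writes the hex digits in lower case; upper-case digits denote the same bytes.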

In fact, if you want to be really safe and follow the letter of the law, you (or your editing software) should do this before you store the URI in the XML document, because the letter of the law says that URLs really aren't supposed to contain any non-ASCII characters at all.

This may seem a bit clumsy, but it's consistent with the basic rules that are supposed to be followed by other Web software.


Copyright © 1998, Tim Bray. All rights reserved.