
New domain names for people around the world

Extending domain names and TLDs with letters and characters beyond a-z has been possible since 2003. Mats Dufberg dives into the world of internationalized domain names and explains how Arabic, Chinese and Cyrillic are able to function on the web.

Traditional domain names are limited to the letters a-z, the digits 0-9 and the hyphen '-'. The traditional TLDs (Top Level Domains) are even more limited: they can only consist of the letters a-z, with no digits. To the speakers of most of the world's languages it is obvious that this limitation makes it impossible to create domain names and TLDs that match the words, expressions and names used in their language. As the Internet has become an integral part of life and business for more and more people in the world, the need to use the Internet in languages other than English has become obvious.
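The rule above can be sketched as a small check in Python. This is only an illustration of the character restriction described in this text; the names `LABEL_RE`, `TLD_RE` and `is_traditional_name` are made up for the example, and other rules that apply to real domain names (such as length limits) are deliberately left out.

```python
import re

# Hypothetical helpers illustrating only the character rule from the text.
LABEL_RE = re.compile(r"[A-Za-z0-9-]+")  # letters a-z, digits 0-9, hyphen
TLD_RE = re.compile(r"[A-Za-z]+")        # traditional TLDs: letters only


def is_traditional_name(name: str) -> bool:
    """Return True if every part of the name stays within the traditional range."""
    *rest, tld = name.split(".")
    if TLD_RE.fullmatch(tld) is None:
        return False
    return all(LABEL_RE.fullmatch(label) is not None for label in rest)


print(is_traditional_name("iis.se"))                        # True
print(is_traditional_name("r\u00e4ksm\u00f6rg\u00e5s.se"))  # False
```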

Scripts and writing systems

The letters a-z, and their upper case counterparts A-Z, belong to the so-called Latin script. A script, when it comes to letters and characters, can be defined as a writing system. A writing system can be seen as a collection of alphabets for several languages that use more or less the same letters. For example, Spanish, Swedish and Vietnamese share many letters in their alphabets. The three alphabets are part of the Latin script, but all three have letters that go beyond the a-z range, which is actually true for most languages using the Latin script. Examples of such letters are "ä", "ú" and "ŋ", of which the first two can be seen as "a" and "u", respectively, with a diacritical mark. The third is clearly a letter of its own. A diacritical mark can be seen as a "decoration" of a base letter, at least historically, but for most languages that have letters with diacritical marks, those letters are essential for writing that language. A letter with a diacritical mark, such as "ä" or "ú", is, however, not permitted when domain names are limited to a-z.
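The difference between a base letter with a diacritical mark and a letter of its own can be made visible with Python's standard `unicodedata` module. The sketch below relies on Unicode's canonical decomposition, which is a technical property of the character data and not a statement about how a given language views the letter. The function name `split_diacritics` is made up for the example.

```python
import unicodedata


def split_diacritics(ch: str) -> tuple[str, list[str]]:
    """Split one character into its base letter and the names of its marks."""
    base, *marks = unicodedata.normalize("NFD", ch)
    return base, [unicodedata.name(mark) for mark in marks]


# Escapes are used so the example does not depend on how this file is stored.
for ch in ("\u00e4", "\u00fa", "\u014b"):  # ä, ú, ŋ
    print(ch, split_diacritics(ch))
```

For "ä" and "ú" the result is the base letter "a" or "u" plus one combining mark, while "ŋ" comes back unchanged with an empty list of marks.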

Arabic and Chinese writing

Even if most readers of this text can only read and write Latin-based alphabets, hardly anyone is unaware of other writing systems, such as Arabic or Chinese writing. In those systems, none of the letters or characters fall within the a-z range. We see directly that sticking to traditional domain names severely limits many users of the Internet, maybe even most of them.

ASCII

In the early days of computer history, the support for various letters and characters was limited. Computer systems were adapted for different countries and languages, which made the systems incompatible with each other. One such character set and character encoding, adapted to English and the US, was ASCII (American Standard Code for Information Interchange), on which the original DNS and domain name limitations are based. When it comes to letters, it only contains a-z and the upper case counterparts A-Z. In DNS and domain names, lower case and upper case a-z are treated as equal. Besides a-z, ASCII contains 0-9 and the hyphen '-' plus various separators and symbols not used in domain names, such as ';%$'. There is one special separator that we do use in domain names, the dot '.'. The dot is only used as a separator between the so-called labels in the domain name, e.g. in "iis.se".
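Two of the points above, that the dot separates labels and that upper and lower case a-z are treated as equal, can be sketched in a few lines of Python. The helper names `labels` and `same_domain` are made up for the example, and the comparison is only meant for ASCII domain names.

```python
def labels(name: str) -> list[str]:
    """Split an ASCII domain name into its labels, ignoring case."""
    return name.lower().split(".")


def same_domain(a: str, b: str) -> bool:
    """Compare two ASCII domain names label by label, ignoring case."""
    return labels(a) == labels(b)


print(labels("iis.se"))                 # ['iis', 'se']
print(same_domain("IIS.se", "iis.SE"))  # True
```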

IDN and Unicode

Extending domain names and TLDs with letters and characters beyond a-z has been possible since 2003 with the help of IDNA (Internationalizing Domain Names in Applications), an extension to domain names and DNS. IDNA, in turn, is based on a standard for representing letters, characters and symbols of the languages of the world in computer files. That standard is Unicode (http://unicode.org/). The ambition is that all writing systems, even those of extinct languages, should be listed in the Unicode repertoire. It lists more than 120,000 characters. Not all of those are available for domain names, but almost all characters needed for writing any word or name in any written language are. With that, we have broken the limitation that we started this text with. With the help of IDN we can now create domain names that match almost any word or name in almost any language, at least in theory.

IDN domain names

Traditional domain names, or ASCII domain names, have only one representation. Such a name is always written and used as it is. If the domain name is "iis.se" then we will see it as iis.se in the web browser, in the email address and in DNS (Domain Name System). DNS is the technology that makes domain names work (for more on DNS, see https://en.wikipedia.org/wiki/Domain_Name_System). To understand what it means that traditional domain names have only one representation, let us compare with IDN domain names, i.e. domain names with characters beyond a-z. Let us use an example that I hope the readers will be able to read correctly. The domain name "räksmörgås.se" is a registered domain name. It has three letters beyond the ASCII a-z: the second, the sixth and the ninth. They are "a", "o" and "a", respectively, with decoration. This is an IDN domain name. In DNS that domain name must be encoded, since in DNS we are still limited to the same characters as in traditional domain names. Our example must be encoded as "xn--rksmrgs-5wao1o.se". The encoded form of an IDN domain name is also called an A-label, an ASCII compatible label. The other form, the native form, is also called a U-label, a Unicode label. The encoding always starts with the prefix "xn--", which indicates that the rest must be decoded to make sense.
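The two representations of the example can be reproduced with the `idna` codec that ships with Python. Note that this built-in codec implements the 2003 version of IDNA, so it may treat some characters differently from later revisions of the standard. The function names are made up for the example.

```python
def to_ascii_form(name: str) -> str:
    """Convert a domain name to the encoded form that is used in DNS."""
    return name.encode("idna").decode("ascii")


def to_native_form(name: str) -> str:
    """Convert an encoded domain name back to its native form."""
    return name.encode("ascii").decode("idna")


# Escapes are used so the example does not depend on how this file is stored.
native = "r\u00e4ksm\u00f6rg\u00e5s.se"  # räksmörgås.se

print(to_ascii_form(native))                    # xn--rksmrgs-5wao1o.se
print(to_native_form("xn--rksmrgs-5wao1o.se"))  # räksmörgås.se
```

A traditional name such as "iis.se" passes through both functions unchanged, which is another way of saying that it has only one representation.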

IDN labels

To be completely correct, I have to modify what I have just written a little. If we look at "räksmörgås.se" we can see that it consists of two parts, or two labels, separated by a dot (remember that the dot is a separator between labels). To the right of the dot, we have "se", which is actually just a normal traditional TLD, an ASCII domain. To the left, we have "räksmörgås", which is an IDN name, or IDN label. We have to look at each label to determine whether it is a traditional label or an IDN label. The encoding also operates on the label level, not on the entire domain name. The example has also shown that we can combine traditional and IDN labels. When we write "IDN domain names" we mean domain names with one or more IDN labels.
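That the encoding works label by label can be sketched with Python's built-in `punycode` codec, which produces the part that follows the "xn--" prefix. This is a simplified illustration, not a full IDNA implementation: it assumes the input is already in lower case and in composed form, and it skips the preparation and validation steps that IDNA requires. The function names are made up for the example.

```python
ACE_PREFIX = "xn--"


def encode_label(label: str) -> str:
    """Leave a traditional label untouched; encode an IDN label."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def encode_name(name: str) -> str:
    """Encode a domain name one label at a time."""
    return ".".join(encode_label(label) for label in name.split("."))


# www.räksmörgås.se, written with escapes to keep the letters in composed form.
print(encode_name("www.r\u00e4ksm\u00f6rg\u00e5s.se"))
# www.xn--rksmrgs-5wao1o.se
```

Only the middle label changes. The traditional labels "www" and "se" are returned exactly as they were given.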

IDN as web address

I wrote above that DNS has no direct support for domain names with other characters, so we have to encode them. Does that mean that we always have to use the encoded form? No, we should normally not see it, but we still live in the transition from a world with only traditional domain names to a world where IDN domain names are an important part. When we use our web browser we should be able to browse to "www.räksmörgås.se" without using the A-label, but that requires that our browser supports IDN.
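What an application with IDN support has to do before it can look up the name in DNS can be sketched as follows: take the web address as the user typed it and replace the host name with its encoded form. This is a minimal sketch using Python's standard library and its built-in `idna` codec (the 2003 version of IDNA), not a description of how any particular browser works. The function name is made up for the example, and user names or passwords in the address are not handled.

```python
from urllib.parse import urlsplit, urlunsplit


def with_encoded_host(url: str) -> str:
    """Return the URL with its host name converted to the encoded form."""
    parts = urlsplit(url)
    host = parts.hostname.encode("idna").decode("ascii")
    # Keep an explicit port number if the address has one.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(with_encoded_host("http://www.r\u00e4ksm\u00f6rg\u00e5s.se/"))
# http://www.xn--rksmrgs-5wao1o.se/
```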

Font support

There is, however, one other factor that can prevent us from seeing an IDN domain name in its native form, and that is lack of support for those particular characters or letters on the local computer. To display any character (or letter or symbol) on a device such as a computer screen or a mobile display, the shape or description of that character must be defined in the font used. The font can be seen as a list of character shapes, each connected to a Unicode code point. Even if the computer can recognize the code point itself, the character cannot be displayed if the shape for that code point is missing. Luckily, the more likely it is that you would need the character, the more likely it is that the font has a shape for it. Still, a missing shape can cause your browser to display the A-label instead of the U-label.

We are on the way

More work remains before IDN becomes a natural part of domain names, but at least the tools are there for domain names for people around the world.