This article was published on June 12, 2011

The multilingual web: A year of non-Latin script domain names



May 6th 2010 was a monumental day for the Internet. Why? It was the day the first ever entirely non-Latin script country code top-level domains (ccTLDs) went live. If that just made you blurt out ‘huh?’, let me explain.

The multilingual Web

The World Wide Web has evolved almost beyond all recognition over the past decade – and not just from a technological standpoint. As I wrote last month, the Web has typically been very English-centric, but the non-English speaking Web has been catching up over the past ten years. For example, the use of Arabic online increased by over 2500%, while Chinese and Spanish increased by 12 and 7 times respectively. English didn’t even triple, but it didn’t need to grow by all that much – English was already everywhere.

So whilst the Web has always been a global entity geographically, this wasn’t always reflected from a language perspective. This topic in itself merits multiple posts, so I’ll focus specifically here on domain names.

Domain names defined

Domain names are the Internet’s address system and are used to identify Internet Protocol (IP) resources such as websites. They work because they are interoperable with IP addresses everywhere and are always unique – this means anyone, anywhere around the world, on any network, can arrive at the same destination by typing in the same domain name.

Internationalized domain names

Web content itself has long been compatible with non-Latin based scripts such as Arabic, Hebrew, Chinese and Thai. This is largely because Unicode has been adopted by all the main computing industry leaders such as Apple, HP, IBM, Microsoft, Oracle, Sun…and more. Unicode is a computing industry standard, and its aim is to enable the consistent representation of text, irrespective of the script.

But domain names are different to web content, and because of technical constraints and the need to ensure domain names remain interoperable around the world, the Domain Name System (DNS) has traditionally been restricted to 37 ASCII characters: A-Z, 0-9 and the trusty old hyphen. Internationalized domain names (IDNs) are domains that support one or more non-ASCII characters, such as www.øl.com and 스타벅스코리아.com.
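To make that 37-character restriction concrete, here is a minimal Python sketch that checks whether a single label (one dot-separated part of a domain name) sticks to letters, digits and the hyphen. The function name is_ldh_label is my own invention for illustration. The 63-character limit and the rule that a label can’t start or end with a hyphen come from the general DNS hostname rules rather than from anything discussed above.

```python
import string

# The 37 permitted characters: a-z (case-insensitive), 0-9 and the hyphen.
LDH = set(string.ascii_lowercase + string.digits + "-")


def is_ldh_label(label: str) -> bool:
    """Return True if one domain label uses only the traditional DNS characters.

    is_ldh_label is a hypothetical helper name, not part of any library.
    """
    label = label.lower()  # DNS treats A-Z and a-z as equivalent
    return (
        0 < len(label) <= 63            # DNS labels are 1 to 63 characters long
        and set(label) <= LDH           # nothing outside letters, digits, hyphen
        and not label.startswith("-")   # hostname labels can't begin with a hyphen
        and not label.endswith("-")     # or end with one
    )


print(len(LDH))                 # size of the permitted character set
print(is_ldh_label("example"))  # plain ASCII label
print(is_ldh_label("øl"))       # contains a non-ASCII character
```

A label such as øl fails this check, which is exactly why it needs an ASCII-compatible form before the DNS can carry it.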

The permitted character set of the DNS has precluded the full representation of many languages in their native alphabets (scripts) within domain names. However, ICANN did approve the Internationalizing Domain Names in Applications (IDNA) system many years ago, and this system maps Unicode strings into the valid DNS character set using Punycode.

In short, this allows conversion between Unicode domain names and their ASCII equivalents (prefixed with xn--), thus allowing users to navigate the Internet in their own language. The IDNA system is designed to ensure that the Web doesn’t fragment into a number of localized versions separated by script.
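If you want to see that conversion in action, Python ships with a built-in "idna" codec that implements the original 2003 version of IDNA. The sketch below wraps it in two small helpers, to_ascii and to_unicode (my own names, chosen for illustration), and runs the two example domains mentioned earlier through a round trip. Treat it as a rough demonstration rather than a reference implementation, since registries and browsers may follow newer IDNA rules.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII (xn--) form.

    Uses Python's built-in "idna" codec, which follows IDNA 2003.
    """
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (xn--) domain name back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


for name in ["www.øl.com", "스타벅스코리아.com", "example.com"]:
    ascii_form = to_ascii(name)
    # Print the original, the ASCII form sent to the DNS, and the round trip.
    print(name, "->", ascii_form, "->", to_unicode(ascii_form))
```

Note that a purely ASCII name such as example.com passes through untouched; only labels containing non-ASCII characters pick up the xn-- prefix.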

So, Internationalized Domain Names (IDNs) have been available for registration at the second level for a while, meaning in countries such as Japan you could register a domain using a local script rather than a Latin-based one – however, it would still have been appended with ‘.jp’, rather than a local script equivalent.

And this was the big change that came into effect last year. It became possible to register IDNs for ccTLDs such as السعودية. for Saudi Arabia, and .рф for Russia, and this at last meant domain names – including the country code – could contain non-Latin based characters throughout. This opened up the Internet’s addressing system to the majority of the world’s population, who have little comprehension of Latin-based scripts.

Ensuring interoperability

Did this change mean that some websites are no longer globally accessible? No, the Punycode mapping I discuss above makes it all possible. The conversions between the ASCII and non-ASCII characters of a domain name are achieved using algorithms called ToASCII and ToUnicode. These aren’t applied to the domain name as a whole entity, but instead to individual labels within the domain. So, for example, if the domain name is Bücher.ch, then the two labels are Bücher, and ch. ToASCII is then applied to each of these separately.

Bücher is the German word for ‘books’, and .ch is the ccTLD of Switzerland – you’ll note that the ‘u’ with an umlaut in there isn’t an ASCII character. The second label ‘.ch’ consists purely of ASCII characters, and is left unchanged in the ToASCII process. The first label is then nameprepped to give bücher, and then Punycode is used to give us bcher-kva. It is then prepended with xn--, to produce xn--bcher-kva.ch. This then makes it suitable for use in the DNS. This all sounds very complex, and that’s because it is – but it should hopefully help you understand why it has taken so long to implement a system that allows domain names to contain no Latin-based characters at all. It is a minefield.
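For the curious, the Bücher.ch walk-through can be sketched in a few lines of Python. This is a simplified illustration, not a full ToASCII implementation: the helper names label_to_ascii and domain_to_ascii are mine, and the nameprep step is approximated here with lowercasing plus NFKC normalization, which is enough for this example but skips the other checks real nameprep performs. The Punycode step uses Python’s built-in "punycode" codec.

```python
import unicodedata

ACE_PREFIX = "xn--"  # marks a label as Punycode-encoded


def label_to_ascii(label: str) -> str:
    """Simplified ToASCII for one label (hypothetical helper name)."""
    if label.isascii():
        return label  # purely ASCII labels such as "ch" are left unchanged
    # Assumption: lowercasing + NFKC stands in for the full nameprep step.
    prepped = unicodedata.normalize("NFKC", label.lower())
    # Punycode turns "bücher" into "bcher-kva".
    encoded = prepped.encode("punycode").decode("ascii")
    return ACE_PREFIX + encoded


def domain_to_ascii(domain: str) -> str:
    """Apply the conversion to each label separately, then rejoin with dots."""
    return ".".join(label_to_ascii(label) for label in domain.split("."))


print(domain_to_ascii("Bücher.ch"))  # xn--bcher-kva.ch
```

The key design point is the split on dots: each label is converted on its own, so the ASCII ‘ch’ never goes near the Punycode encoder.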

So, how has the world embraced this new found domain name freedom over the past twelve months? Let’s take a look.

IDN ccTLDs: A year on

At the time of writing, ICANN has evaluated 27 different countries/territories, constituting 38 different IDN strings, and India has 7 approved and delegated already, covering its various regional vernaculars. Only 20 of the 27 countries/territories have thus far been delegated to the DNS root zone, though the others should probably be delegated shortly.

Bytelevel Research produced a world map of IDNs positioned next to their equivalent ccTLDs, which was up-to-date as of March this year. So whilst it’s a little out of date now, it does help to illustrate the spread of countries that have taken to the new system.

EURid recently released a report examining the global use of IDNs. Early figures have indicated a strong public demand, particularly for the Russian .рф domain names which have reached 800,000 registrations as of March this year. The report says:

“The launch of the .рф TLD followed a standard landrush pattern, with a huge spike in registrations in November 2010, which then settled into a steady state. According to the Russian registrar, Reg.Ru, the landrush surpassed even their most optimistic forecasts.”

Registration rates of .рф are currently sitting at around 32,000 per month, which is still below the 40,000 for Russia’s ASCII ccTLD, .ru, but it does indicate the growing popularity of full native script domains. The report continues:

“The rate of domain names in use, known as a delegation rate, is a good indicator of the impact of a new domain, because domain names that are being used are more likely to be renewed. Whereas in November 2010, less than 10% of the .рф domains were delegated, this had grown to 50% by late January 2011 (compared with around 70% in .ru). This is a high level of use, especially considering that lack of email functionality limits the utility of IDNs.”

It’s worth noting that the increasing popularity of .рф is likely to be in part due to a concerted information and marketing campaign. But the EURid report also notes that a lot is down to user preference, with almost three quarters of users finding that .рф domain names are easier to spell and remember compared with Latin-based scripts.

In the case of the Saudi Arabian domain, the introduction of the IDN ccTLD has also had a beneficial effect on demand for the ASCII equivalent ccTLD, .sa. The landrush for السعودية . domains began on 27 September 2010, and according to the report:

“These domain names currently account for 7% of all registrations in Saudi Arabia. Using this percentage and the overall growth of the TLD to date, registrations for 2011 are projected to more than double the average registration rate (2006 to 2009) on the assumption that current registration rates remain constant.”

So using the Saudi Arabian IDN ccTLD as a benchmark, it seems there is still a lot of ground to cover if non-Latin based scripts are to properly catch on in domain names. But, as the report notes, “it seems that its launch has invigorated interest in Saudi domain names as a whole.” This means that the .sa ccTLD has also received a boost in number of people registering domains overall.

Furthermore, as John Yunker, co-producer of the annual Web Globalization Report Card and author of Beyond Borders: Web Globalization Strategies, notes:

“Arabic IDNs in particular face an uphill battle because web browsers offer poor (and inconsistent) support for them.”

So as the Web becomes increasingly multilingual, browsers will have to keep pace to ensure consistent representation of languages across the board – both in terms of Web content and URLs.

India…leading the way?

As Yunker notes, India (.in) is a particularly challenging domain because the country has more than 20 official languages covering multiple scripts. Here are the seven strings that have passed evaluation and been delegated to the DNS root zone so far:

[Image: table of India’s seven delegated IDN strings]
So India, it seems, is really powering ahead in its attempt to provide accessible Internet for its population. China, Singapore, Sri Lanka and Taiwan, too, have all had more than one IDN string evaluated and delegated.

Problems with IDNs

Using Unicode in domain names can cause problems, in that it’s easier to spoof websites by capitalizing on visual similarities between scripts – this is known as an IDN homograph attack. For example, Unicode character U+0430, Cyrillic small letter a, can look identical to Unicode character U+0061, Latin small letter a, used in English. Mozilla actually disabled IDN support by default back in 2005, to protect users from URL spoofing.
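Here is a small Python sketch of why that lookalike problem is so awkward. It compares the two characters mentioned above, then uses a crude heuristic of my own (taking the first word of each character’s official Unicode name as its script) to flag labels that mix scripts. The function name scripts_in and the spoofed ‘paypal’ label are made up for illustration, and real browsers rely on far more thorough checks than this.

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Roughly guess which scripts appear in a label.

    Heuristic assumption: the first word of a character's Unicode name
    (e.g. "LATIN", "CYRILLIC") identifies its script.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


latin_a = "\u0061"     # Latin small letter a
cyrillic_a = "\u0430"  # Cyrillic small letter a

print(unicodedata.name(latin_a))     # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic_a))  # CYRILLIC SMALL LETTER A

genuine = "paypal"
spoofed = "p" + cyrillic_a + "ypal"  # looks the same on screen, but isn't

print(genuine == spoofed)   # False
print(scripts_in(genuine))  # only LATIN
print(scripts_in(spoofed))  # LATIN and CYRILLIC mixed together
```

A label that mixes Latin and Cyrillic like this is a reasonable red flag, though a spoof written entirely in one lookalike script would slip straight past a check this simple.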

And it’s this potential for spoofing that has led to some IDN ccTLDs being rejected in the past year.

Rejections

Greece was rejected for .ελ, because it resembled .EA – which, incidentally, isn’t being used as a ccTLD, but is a two-letter string on the ISO 3166 reserved list.

And Bulgaria was rejected by ICANN way back in May 2010, because its proposed ccTLD – .бг – was visually similar to the .br Brazilian ccTLD. So there are rules in place designed to help make the IDN ccTLD process run as smoothly as possible.

Other barriers

A survey in the EURid report found that 82% of participants highlighted that adding email functionality would improve IDN uptake. Email is a key part of domain functionality, and as long as it is unavailable for IDNs, their usefulness will be limited. The same applies to IDN ccTLDs. The report says:

“The email service on Cyrillic domains operates in the form name@домен.рф, although some hosting companies can already support the email service имя@домен.рф. Work on IDN email compatibility continues. The IETF is currently working on the standards for IDN email and it is anticipated that full email functionality will be available by the end of 2012. Advances in this area will show the extent to which the email issue has been the barrier to uptake of IDNs.”

Other suggestions to improve IDN uptake were “full support by the mobile environment” and the “ability to use IDNs in all applications including WHO IS and web browsers”.

The future of IDN ccTLDs

The significance of the launch of IDN ccTLDs last year shouldn’t be underestimated, as it goes some way towards putting the ‘World’ into WWW.

The changes brought in were all about improving the accessibility of non-Latin based scripts in the Internet’s address system, but making the Web a truly multilingual place is a long and complex process.

It’s too early to say how successful this will be in the long run, but countries such as Russia are showing that there is real potential here to revolutionize the Internet. And with other powerhouses such as India and China also striving to fully localize their domain names, I’d say we’re well on the way to seeing a truly global Web. But there’s still a lot of work to be done.
