Hosting a domain name with compound characters

08 June 2020 - By Stéphane Bortzmeyer

In a previous article, we saw that it is perfectly possible to use compound characters in a domain name. Examples of this are réussir-en.fr, académie-française.fr, and many others. These names are handled like any other domain name and can be used, for instance, in URLs (or web addresses) such as http://réforme-retraites.gouv.fr/. For the end user, they are just like other domain names and have no distinguishing features, unless the software is very old or contains bugs. For a technician configuring the software that hosts these names and the associated services, however, this is not always the case, and they often have to be handled differently.

This article is therefore intended for those technicians, for example the system administrator of an HTTP server that serves a website whose domain name contains these compound characters (also called "diacritical characters" or "Unicode characters"). It focuses on free software, like Apache or Nginx, since such software is essentially the basis of the Internet services infrastructure.

Quick technical reminder

First, let's quickly review these IDNs (Internationalized Domain Names). The principle is that the DNS (Domain Name System) will only handle LDH (Letters-Digits-Hyphens) names, which may contain standard ASCII letters only, meaning that no compound characters are permitted. This was to ensure that old software would never encounter compound characters. Names in Unicode (the technical term is U-label) are therefore encoded in Punycode, an encoding that can represent any name in LDH form (the technical term is A-label). So, académie-française.fr (the U-label) will be represented in Punycode by xn--acadmie-franaise-npb1a.fr (the A-label). In an ideal world, the system administrator would only have to handle names in Unicode. But in the real world, many software programs require the administrator who configures them to use the Punycode format.

Fortunately, there are tools to facilitate the conversion from one format to another. GNU libidn2, for example, comes with a command line tool, idn2, which enables these conversions:

% echo académie-française.fr | idn2

xn--acadmie-franaise-npb1a.fr

% echo xn--ducation-90a.gouv.fr | idn2 -d

éducation.gouv.fr
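
For scripts, the same conversion can be sketched in Python with the standard library alone. This is only an illustration: the helper names to_alabel and to_ulabel are invented for this example, and Python's built-in idna codec implements the first version of the IDN standard, which gives the same result as idn2 for names like the ones below but not for every name.

```python
def to_alabel(name: str) -> str:
    """Convert a domain name to its ASCII (A-label) form."""
    return name.encode("idna").decode("ascii")


def to_ulabel(name: str) -> str:
    """Convert an A-label domain name back to its Unicode (U-label) form."""
    return name.encode("ascii").decode("idna")


print(to_alabel("académie-française.fr"))     # xn--acadmie-franaise-npb1a.fr
print(to_ulabel("xn--ducation-90a.gouv.fr"))  # éducation.gouv.fr
```

For anything beyond a quick check, a dedicated tool such as idn2 remains the safer choice.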

Why the 2 at the end? Because this is now version 2 of the IDN standard. The earlier tool, simply called "idn", managed version 1. Problems may arise with names that behave differently with IDN version 1 and version 2. This is the case for the German ß (eszett), which is used in four .fr names:

% echo außensteckdose.fr | idn2

xn--auensteckdose-cdb.fr

% echo außensteckdose.fr | idn

aussensteckdose.fr

The ß was changed to "ss" in IDN 1, whereas it remains unchanged in IDN 2. Another example in which differences may occur is that of scripts that were not included in the Unicode standard until after the release of IDN version 1. This is the case with Tifinagh script, which simply does not work in IDN 1:

% echo "ⴰⵣⵓⵍ.bortzmeyer.fr" | idn2

xn--4lj0cra7d.bortzmeyer.fr

% echo "ⴰⵣⵓⵍ.bortzmeyer.fr" | idn

idn: idna_to_ascii_4z: String preparation failed

And there is another problem with the eszett, which is that the round-trip (i.e. translating the U-label into an A-label and then from an A-label into a U-label) is not possible in IDN version 1, where the A-label becomes a standard ASCII domain name. This explains, for example, the Python programming language error message "UnicodeError: ('IDNA does not round-trip', b'xn--auensteckdose-cdb', b'aussensteckdose')".
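
This behaviour can be reproduced in Python, whose built-in idna codec follows IDN version 1. The sketch below is an illustration, not a recommended conversion method: it uses the raw punycode codec and adds the xn-- prefix by hand to obtain the A-label, which skips all the validation that a real IDN version 2 library performs.

```python
name = "außensteckdose"

# The "idna" codec follows IDN version 1: ß is mapped to "ss", so the
# result is a plain ASCII label without the xn-- prefix.
print(name.encode("idna"))  # b'aussensteckdose'

# The raw "punycode" codec applies no mapping. With the prefix added by
# hand, this matches the idn2 output for this name.
alabel = b"xn--" + name.encode("punycode")
print(alabel)  # b'xn--auensteckdose-cdb'

# Decoding that A-label with the version 1 codec fails, because
# re-encoding the decoded name gives "aussensteckdose" instead.
try:
    alabel.decode("idna")
    round_trips = True
except UnicodeError as error:
    round_trips = False
    print("decoding failed:", error)
```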

DNS

Now let's get to work and create these names. You can create them in a subdomain of an existing domain (like ⴰⵣⵓⵍ.bortzmeyer.fr above) or register them with a registry that accepts these characters. Not all registries do, and those that do will not necessarily accept every possible character, so check with the registry.

In the first case, it all depends on the software you use to provision your domain names, which may or may not handle Unicode names well. If it does not, A-labels must be used. For example, if you edit a zone file with standard syntax directly, the DNS server will probably not handle Unicode properly and you will have to use the Punycode format in the zone file (hence the usefulness of tools such as idn2, mentioned above). Below is an example, with a comment that shows the Unicode name:

; ⴰⵣⵓⵍ

xn--4lj0cra7d IN CNAME serveur.internautique.fr.
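
If zone files are generated by a script, the A-label and its explanatory comment can be produced together. Here is a minimal Python sketch; the function name cname_record is invented for this example, and the label is assumed to be already lower-case, normalised and to contain at least one non-ASCII character, since the raw punycode codec performs no check at all.

```python
def cname_record(ulabel: str, target: str) -> str:
    """Return two zone-file lines: a comment with the Unicode label,
    then a CNAME record whose owner name is the A-label."""
    # Raw Punycode plus a hand-added xn-- prefix: no mapping, no validation.
    alabel = "xn--" + ulabel.encode("punycode").decode("ascii")
    return f"; {ulabel}\n{alabel} IN CNAME {target}"


print(cname_record("ⴰⵣⵓⵍ", "serveur.internautique.fr."))
```

Run as is, this prints the same two lines as the hand-written example above.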

The DNS actually allows any characters in a domain name, so a Unicode name in an encoding such as UTF-8 would probably be accepted as is by the server. This would confuse applications, which convert the name into Punycode before querying and would therefore ask for a different name.

If you register a name with a domain name registry, you will often go through a registrar. So it all depends on the registrar and its software. I’ve tested two major .fr registrars and in both cases, everything worked fine. The web interface lets me type and read names in their normal Unicode format, which is definitely more user-friendly than Punycode. Note that some registries will require you to indicate which script is used for the name and do not permit script mixes.

I’ve also tested the API of a major domain name registrar and was pleasantly surprised to see that IDNs were handled correctly. I was able to send Unicode (U-labels) and everything worked correctly.

If you host this domain name on your own name servers, this, again, will depend on the software used. You may be required to configure the name server using A-labels. And remember that, contrary to a widely held myth, the DNS has in fact always allowed any characters and is not restricted to ASCII. If café.example is put in a zone file, the name server does not necessarily know whether the U-label should be translated into Punycode or kept as is. Some servers, like BIND, adopt the second behaviour and keep the name as is, which can cause surprises.
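
To see why the two behaviours are not equivalent, here is a small Python sketch of how a name is laid out in DNS messages: a length byte before each label, then a zero byte for the root. The wire_format function is written for this example only and ignores name compression and length limits.

```python
def wire_format(labels: list[bytes]) -> bytes:
    """Encode labels in DNS wire format: each label is preceded by its
    length, and a zero byte (the root) ends the name."""
    return b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"


# The name as a server would store it if it kept the UTF-8 text as is.
utf8_name = wire_format(["café".encode("utf-8"), b"example"])

# The name that an IDN-aware client actually asks for.
ace_name = wire_format([b"xn--" + "café".encode("punycode"), b"example"])

print(utf8_name)  # b'\x05caf\xc3\xa9\x07example\x00'
print(ace_name)   # b'\x0bxn--caf-dma\x07example\x00'
```

For the DNS, these are two unrelated names, so a client querying the second will not find records stored under the first.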

Once the name has been registered, several DNS clients handle IDNs when querying it. This is the case with the classic dig, in version 9.11.

The same applies for kdig, in version 2.7.6.

Drill (version 1.7.0), however, does not handle the name correctly. It can be argued that this is a debugging tool, designed for engineers, and so it does not need to do by itself what can be done with a short Unix shell command, such as converting the name with idn2 beforehand.

Lastly, other DNS clients are implemented in the form of a web page and, for example, the DNS Looking Glass manages the IDNs: see https://dns.bortzmeyer.org/réussir-en.fr.

Whois

You may also want to use other domain name-related services, such as whois. The GNU whois client has no problem with IDNs:

% whois potamochère.fr

%% This is the AFNIC Whois server.

domain: potamochère.fr

domain-ace: xn--potamochre-66a.fr

domain-idn: potamochère.fr

registrar: GANDI

created: 2013-09-09T12:12:45Z

last-update: 2019-08-09T09:26:17Z

 

The same applies to other interfaces for finding information on a domain name, for example via the Web (in this case, at Afnic), where names in Unicode are handled properly.

Web

Obviously, a domain name is not created just to insert information in the DNS. It is intended to be used for services, to create an online presence. Let's take the example of setting up a website. Again, the question of whether attractive U-labels (café-bien-serré.fr) can be used instead of unattractive A-labels (xn--caf-bien-serr-dhbk.fr) will depend on the software used. With Nginx (version 1.16.1), the Punycode format (xn--caf-bien-serr-dhbk.fr) must apparently be used in the server_name directive of the configuration file. Apache (version 2.4.38) allows you to use the normal Unicode format in the ServerName directive. The configuration file can be named with the Unicode name (e.g. www.potamochère.fr.conf), as Apache directives such as Include allow this.

But be wary of various utility programs and scripts written too quickly. The a2ensite script on the Debian operating system only works with LDH names (it does not permit non-ASCII characters). Once the required symbolic links are in place, there is no problem with Unicode. On the other hand, server directives, such as Redirect on Apache, require the A-label (Punycode), otherwise the Unicode name is sent to the clients, some of which, such as curl, do not understand the redirect.
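
When redirect targets are generated by a script, the host name can be converted before it is written into a directive such as Redirect. The following Python sketch shows one way to do it; ascii_url is a name invented for this example, it assumes an absolute URL, does not handle user information, and relies on Python's idna codec (IDN version 1).

```python
from urllib.parse import urlsplit, urlunsplit


def ascii_url(url: str) -> str:
    """Return the URL with its host name converted to A-label form.

    Only the host is converted; a port is preserved, and the path, query
    and fragment are left untouched.
    """
    parts = urlsplit(url)
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(ascii_url("http://www.potamochère.fr/"))  # http://www.xn--potamochre-66a.fr/
```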

What about the web clients for testing? curl and wget cause no problems with Unicode.

 

curl prefers the IDN: even when using the -v (verbose) option, curl continues to display the Unicode format of the name, which is not the case with wget.

Note that all HTTP clients send the name in Punycode format in the Host: header. This doesn't matter as this HTTP dialogue is not seen by users directly. Incidentally, note that it is hard to rely on technical standards to know what should appear in the Host: header, as they are very complex in this respect.
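
As an illustration of what travels on the wire, here is a Python sketch that builds a minimal HTTP/1.1 request by hand. The http_request function is invented for this example, assumes an ASCII path, and is not how any particular client is implemented.

```python
def http_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request for the given host name."""
    # The Host: header carries the A-label, whatever the user typed.
    host_header = host.encode("idna").decode("ascii")
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host_header}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


print(http_request("café-bien-serré.fr").decode("ascii"))
```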

And what about monitoring software like Nagios or Icinga? The parameters of the check_http monitoring plugin require Punycode. If another encoding is used, the name is not converted and is sent as is, which typically causes an HTTP 400 error (Bad Request).

Certificates

What about certificate requests? If you want your website to be authenticated, you need a certificate for your domain name and, depending on the certification authority you use, you will have to request it in the "normal" format (académie-française.fr) or the Punycode format (xn--acadmie-franaise-npb1a.fr). Note that, in the case of the example provided, académie-française.fr, it seems there is no certificate, but I expect there will be one day.

For example, the certification authority Let’s Encrypt does not permit Unicode names ("Domain name contains an invalid character"). Everything must be in Punycode.

The same applies to some very useful services for handling certificates, such as crt.sh, a web interface for accessing Certificate Transparency logs. If Unicode is entered, crt.sh simply indicates that it did not find a certificate; the name should have been entered in Punycode.
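
A script that looks up certificates can therefore convert the name first. The Python sketch below builds a crt.sh search address; the function name crtsh_search_url is invented for this example, and the use of the q query parameter is an assumption about the crt.sh web interface, not something guaranteed here.

```python
from urllib.parse import urlencode


def crtsh_search_url(domain: str) -> str:
    """Return a crt.sh search URL for the A-label form of the domain."""
    alabel = domain.encode("idna").decode("ascii")
    # Assumption: crt.sh takes the name to search for in the "q" parameter.
    return "https://crt.sh/?" + urlencode({"q": alabel})


print(crtsh_search_url("académie-française.fr"))
```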

Email

The problem with email is different to that of the Web. It is an older technology and, since there is no end-to-end communication, it is more difficult to negotiate with your correspondent. Note also that there are two separate problems in email addresses, one for the local part of the name (stéphane, in the hypothetical address stéphane@internet-en-coopération.fr), and one for the domain name. Punycode only applies to the domain name.
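
The split between the two parts can be sketched in Python as follows. The function name ascii_domain_address is invented for this example; it converts only the domain, using Python's idna codec (IDN version 1), and deliberately leaves the local part alone.

```python
def ascii_domain_address(address: str) -> str:
    """Convert only the domain of an email address to A-label form.

    The local part is returned unchanged, so it may still contain
    non-ASCII characters.
    """
    local, _, domain = address.rpartition("@")
    return f"{local}@{domain.encode('idna').decode('ascii')}"


print(ascii_domain_address("stéphane@internet-en-coopération.fr"))
```

The result still contains the é of the local part, which Punycode cannot help with.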

The general framework for email addresses in Unicode is called EAI, which stands for Email Address Internationalization, and has been standardised since 2012. But in practice, it has to be said that it is not very reliable: not many software programs are configured to handle it, and there is little possibility of your IDN domain being used for email. As the web interface of a registrar states when registering an IDN domain, "Please note that email addresses may not work with a domain name containing one or more special characters [sic]".

Conclusion

Ideally the system administrator, like the ordinary user, could handle normal Unicode and never see the Punycode format containing xn--. But this is clearly not the case today, and various reasons relating to Internet inertia and the need not to break pre-existing habits mean that in practice, we need Punycode and must be prepared to see and handle it.
