A brief example of using Afnic Open Data

09 July 2019 - By Stéphane Bortzmeyer

 

Which domain names are derived from a first name?

Afnic distributes open data about .fr domain names at https://opendata.afnic.fr/en/. Here is a brief example of how these data can be used, cross-referenced with another open dataset on first names in France.

Domain name registrants have a vast choice of names. They can derive the domain name from their family name, or choose a descriptive name. If Jean Dupont wants to create a website on gardening, he can choose jean-dupont.fr, or dupont-jardinage.fr or jean-jardinage.fr or a wide range of other names. We shall focus here on domain names based on a first name.

The first question is how to find them. The list of .fr domain names is available at https://opendata.afnic.fr/en. Download "A- domain names fr.zip" (I won't give you the link, it changes every time) and unzip it: you end up with a file in CSV format (in fact, the fields are separated by semicolons, not commas). The two fields that matter to us are the first (the domain name) and the 11th (the date of deletion: if this field is filled in, the domain no longer exists). So we now have the list of domain names under .fr. (We also need to recode the file in UTF-8 because it uses an old character encoding.) We still have to find the names that derive from a first name.
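As a rough sketch of this filtering step: the function name `live_domains`, the file path in the comment and the ISO-8859-1 encoding are assumptions made for illustration, not details taken from the Afnic documentation.

```python
import csv

def live_domains(lines):
    """Yield the domain names whose 11th field (deletion date) is empty."""
    for row in csv.reader(lines, delimiter=";"):
        if len(row) < 11:
            continue  # blank or malformed row
        domain, deleted = row[0].strip(), row[10].strip()
        if domain and not deleted:
            yield domain

# Hypothetical usage; the path and the encoding are assumptions:
# with open("domain-names-fr.csv", encoding="iso-8859-1", newline="") as f:
#     domains = list(live_domains(f))
```

A header row, if the file has one, is dropped by the same test, since its 11th cell holds a column title rather than an empty string.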

Is there a list of first names in France, like the list of domain names? Yes, the French National Institute for Statistics and Economic Studies (INSEE) distributes such a list. Here too we get a zipped file that, once unzipped, gives us a list of first names. Reading the documentation is recommended, because using this file is a bit complicated. A first analysis shows that the file contains 32,704 first names. Now let's find out which domain names are first names.

A first trivial program tells us that 13,518 domain names have been formed in this way, among which the classics marie.fr and jean.fr but also my own first name (stéphane.fr exists), as well as brunehilde.fr and lucrezia.fr. But this is insufficient, because the program only detects domain names that are exactly a first name. We would like to expand the search to domain names that contain a first name.
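Such a trivial program might look like the following sketch. It assumes the domain names have already been converted to Unicode and compares in lower case; the function names are hypothetical.

```python
def label_of(domain):
    """Return the domain name without its trailing '.fr', in lower case."""
    domain = domain.lower()
    return domain[:-3] if domain.endswith(".fr") else domain

def exact_first_name_domains(domains, first_names):
    """Keep only the domains whose label is exactly a first name."""
    names = {name.lower() for name in first_names}  # set gives fast lookups
    return [d for d in domains if label_of(d) in names]
```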

I'll spoil the surprise immediately: it won't work well, because many of the first names are so short that they are found everywhere. The INSEE file includes names such as Al or Bo, but also single letters (negligence of the town clerk?). So we have to reduce the list of first names. Let's start by keeping only the most common ones; some first names are very rare. (The popularity of first names follows a decreasing exponential.) By accepting only the first names given to more than 1,000 people during the period in question, we reduce the number of names to 3,042 but, and this is the important point, they still represent 93.8% of the population.
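The frequency filter can be sketched as follows. The input is assumed to be an iterable of (first name, count) pairs already extracted from the INSEE file; that shape, like the function name, is an assumption for illustration and not the actual file layout.

```python
from collections import Counter

def common_first_names(records, threshold=1000):
    """Return the names given to more than `threshold` people, and the
    share of the total population that those names cover."""
    totals = Counter()
    for name, count in records:
        totals[name.lower()] += count  # merge rows for the same name
    kept = {name for name, total in totals.items() if total > threshold}
    population = sum(totals.values())
    coverage = sum(totals[n] for n in kept) / population if population else 0.0
    return kept, coverage
```

The second value returned is the kind of figure quoted above (the share of the population still covered after filtering).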

This time, we find too many domain names: 34.42% of the total. This is because some of the remaining names are still quite short, which creates many false positives. While lejardindelola.fr does contain the first name Lola, service-catholique-funerailles-boulogne-billancourt.fr is a false positive (it contains the first name Illan). In short, we shall have to move to a more subtle algorithm.

Next step: we still keep only the 3,042 most frequent first names used in the previous test, but we now consider that a domain name is derived from a first name only if one of the following conditions is met:

  • the domain name is equal to a first name (michèle.fr),
  • the first name is more than six letters long and is at the beginning of the domain name (charlesdegaulleroissyparkingaeroport.fr), 
  • the domain name begins with a first name less than six letters long, and is followed by a dash (zora-creation.fr). 

With these rules, we find that 147,094 domain names, or 4.31% of the total, are derived from a first name. There are still false negatives and false positives (like france-boissons.fr, where the first word probably refers to the country and not the first name) but nothing is perfect in data analysis.
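The three rules can be sketched as below. The text does not say how a first name of exactly six letters is handled; since france-boissons.fr is reported as a match, this sketch assumes that six-letter names fall under the dash rule. That threshold, the constant `SHORT_MAX` and the function name are assumptions, not the program actually used.

```python
SHORT_MAX = 6  # assumption: names up to six letters need a following dash

def matching_first_name(label, first_names):
    """Return the first name the label derives from, or None.

    `label` is the domain name without '.fr', in lower case and Unicode;
    `first_names` is a set of lower-case first names.
    """
    # Rule 1: the domain name is equal to a first name.
    if label in first_names:
        return label
    # Rule 3: a short first name at the start, followed by a dash.
    head, dash, _rest = label.partition("-")
    if dash and len(head) <= SHORT_MAX and head in first_names:
        return head
    # Rule 2: a long first name at the start (longest prefix tried first).
    for length in range(len(label) - 1, SHORT_MAX, -1):
        if label[:length] in first_names:
            return label[:length]
    return None
```

Looking up prefixes in a set avoids looping over every first name for every domain.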

Note that there are still some things that could be improved. I did not attempt any fuzzy search, for example, so the name Théophile will not be found in theophile.fr. (The INSEE data are of variable quality in terms of spelling; for example, this particular name is sometimes written Théophile and sometimes Theophile.) Another trap: first names are highly subject to fashion, and the INSEE database goes back to 1900. It might be interesting to leave out first names that were only given in the past.

And so we can now start studying the history of these domain names based on a first name: do they have a better renewal rate than others, for example? But I focused here on what was available as open data.

Thanks to Alexander Mayrhofer, from the .at registry (Austria), for the idea, the explanations and the algorithm. The rest only concerns programmers:

  • The programs were written in Python. 
  • The names in the database distributed by Afnic are encoded in Punycode (for example, stéphane.fr is written xn--stphane-cya.fr). To get the real name, you have to convert them with encodings.idna.ToUnicode(domain).
  • The trivial algorithm for testing all first names against all domain names nests two loops. This is obviously dreadfully inefficient, so I used regular expressions with the Python re module: the program builds a single expression containing all the first names and applies it to each domain in turn.
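The last two points can be illustrated with a short sketch. It decodes a whole domain name with Python's built-in "idna" codec (a swapped-in alternative that handles each label of the name) and builds one regular expression from all the first names. The function names are hypothetical and this is not the original program.

```python
import re

def to_unicode(domain):
    """Decode every xn-- label of an ASCII domain name to Unicode."""
    return domain.encode("ascii").decode("idna")

def build_matcher(first_names):
    """Compile one pattern that matches any of the given first names.

    Longer names come first so that, at a given position, the alternation
    prefers the longest candidate. `first_names` must not be empty.
    """
    ordered = sorted(first_names, key=len, reverse=True)
    return re.compile("|".join(re.escape(name) for name in ordered))

# Hypothetical usage:
# matcher = build_matcher(common_names)
# hits = [d for d in domains if matcher.search(to_unicode(d))]
```

The pattern is compiled once and then applied to each domain, which replaces the inner loop over first names with a single call to `search`.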
