Originally, the domain part of a web address was all ASCII (so no accents, no emojis, no Chinese characters). This was extended a long time ago thanks to something called internationalized domain names (IDN).
Today, in theory, you can use any Unicode character you like as part of a domain name, including emojis. Whether that is wise is something else.
What does the standard say? Given a domain name, we should identify its labels. They are normally separated by dots (.): www.microsoft.com has three labels. But you may also use other Unicode characters as separators (．, 。, ｡). Each label is then processed on its own. If it is all ASCII, then it is left as is. Otherwise, we must convert it to an ASCII encoding called “punycode” after doing the following according to RFC 3454:
- Map characters (section 3 of RFC 3454),
- Normalize (section 4 of RFC 3454),
- Reject forbidden characters (section 5 of RFC 3454),
- Optionally reject based on unassigned code points (section 7 of RFC 3454).
Only then do you get to the punycode algorithm. There are further conditions to be satisfied: for example, the domain name in ASCII cannot exceed 255 bytes.
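To make the steps concrete, here is a rough sketch in Python of the older procedure described above. It leans on the standard library: nameprep from encodings.idna does the mapping, normalization and rejection steps, and the punycode codec does the final encoding. The function names, the 255-byte check as written, and the example name bücher.example are my own illustration, not part of any standard API.

```python
import re
from encodings.idna import nameprep  # RFC 3454 profile used by IDNA 2003

# The dot and the three other separators: U+002E, U+3002, U+FF0E, U+FF61.
SEPARATORS = re.compile("[\u002E\u3002\uFF0E\uFF61]")
MAX_NAME_BYTES = 255  # limit mentioned in the text


def label_to_ascii(label: str) -> str:
    """Convert one label to ASCII (hypothetical helper for illustration)."""
    if label.isascii():
        return label  # all-ASCII labels are left as is
    # Map, normalize, and reject forbidden characters.
    prepared = nameprep(label)
    if prepared.isascii():
        return prepared  # the mapping alone may already yield ASCII
    # Punycode-encode what remains and add the "xn--" prefix.
    return "xn--" + prepared.encode("punycode").decode("ascii")


def domain_to_ascii(domain: str) -> str:
    """Split into labels, convert each one, and rejoin with plain dots."""
    labels = SEPARATORS.split(domain)
    ascii_name = ".".join(label_to_ascii(label) for label in labels)
    if len(ascii_name) > MAX_NAME_BYTES:
        raise ValueError("domain name exceeds 255 bytes in ASCII")
    return ascii_name


print(domain_to_ascii("www.microsoft.com"))
print(domain_to_ascii("bücher\u3002example"))
```

The second call uses the ideographic full stop as a separator, so the result comes back with an ordinary dot. This sketch skips several checks that a real implementation needs, so treat it as a reading aid rather than a conforming library.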
That’s quite a lot of work. The goal is to transcribe each Unicode domain name into an ASCII domain name. You would hope that it would be a well-defined algorithm: given a Unicode domain name, there should be a unique output.
Let us choose a common non-ASCII character, the letter ß, called Eszett. Let me create a link with this character: https://meßagefactory.ca
What happens if you click on this link? The result depends on your browser. If you are using Microsoft Edge, Google Chrome or the Brave browser, you may end up at https://messagefactory.ca/. If you are using Safari or Firefox, you may end up at https://xn--meagefactory-m9a.ca. Of course, your results may vary depending on your exact system. Under iOS (iPhone), I expect that the Safari behaviour will prevail irrespective of your browser.
Not what I expected.
Update: We wrote our own library to process international domain names according to the standard: it is called idna and is part of the ada-url project. Our library produces https://xn--meagefactory-m9a.ca which is the non-transitional (and now correct) answer.
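You can reproduce both destinations with Python's standard library, as a rough illustration. The built-in idna codec follows the older IDNA 2003 rules, where ß is mapped to ss before anything else happens. Applying the punycode codec directly to the label keeps the ß. That second step is only a shortcut that happens to match the non-transitional answer for this particular input: it skips the mapping and validation that a real implementation performs.

```python
label = "meßagefactory"
host = label + ".ca"

# Older behaviour: the built-in "idna" codec (IDNA 2003) maps ß to "ss",
# so the name becomes plain ASCII and no punycode is needed.
older = host.encode("idna").decode("ascii")

# Newer behaviour, approximated: keep ß and punycode-encode the label.
# (Shortcut for illustration only: no mapping or validation is done here.)
newer = "xn--" + label.encode("punycode").decode("ascii") + ".ca"

print(older)  # messagefactory.ca
print(newer)  # xn--meagefactory-m9a.ca
```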
5 thoughts on “International domain names: where does https://meßagefactory.ca lead you?”
Address not found? The link seems to work correctly but doesn’t lead anywhere. Tested on Firefox with my Android phone. Jan 2023.
Is there any real URL that uses these characters? I have not seen one. I have a suspicion that there aren’t any websites at all that use such characters but I don’t know how I’d go about finding them.
Try the German Wordle Game. You find it under Wördle.de
Of course, Microsoft Edge, Google Chrome and Brave are all based on Chromium. And on iOS, no matter which app you use, WebKit is the engine actually responsible for browsing.
“Whether that is wise is something else.”
Actually, it is not something else. It is the whole point of your blog post. The fact that different browsers do not follow the same standard shows the URL is not a wise choice from a marketing or functioning standpoint.
It would be odd for that address to resolve at all! A .ca domain name, AFAIK, can only be the ASCII letters a-z, digits 0-9, hyphen (-), or one of the French accented letters é, ë, ê, è, â, à, æ, ô, œ, ù, û, ü, ç, î, ï, ÿ. This was set by the Canadian Internet Registration Authority (CIRA), which administrates the .ca domain, back in 2012.