Emojis in domain names, punycode and performance

Most domain names are encoded using ASCII (e.g., yahoo.com). However, you can register domain names with almost any character in them. For example, there is a web site at 💩.la called poopla. Yet the underlying infrastructure is basically pure ASCII. To make it work, the text of your domain is first translated into ASCII using a special encoding called ‘punycode‘. The poopla web site is actually at https://xn--ls8h.la/.

Punycode is a tricky format. Thankfully, domain names are made of labels (e.g., in microsoft.com, microsoft is a label) and each label can use at most 63 bytes. In total, a domain name cannot exceed 255 bytes, and that is after encoding it to punycode if necessary.

Some time ago, Colm MacCárthaigh suggested that I look at the performance impact of punycode. To answer the question, I quickly implemented a function that does the job. It is a single function without much fanfare. It is roughly derived from the code in the standard, but it looks simpler to me. Importantly, I do not claim that my implementation is particularly fast.

As a dataset, I use all of the examples from the wikipedia entry of punycode.

GCC 11, Intel Ice Lake 75 ns/string
Apple M2, LLVM 14 27 ns/string

The punycode format is relatively complicated, it was not designed for high speed, so there are limits to how fast one can go. Nevertheless, it should be possible to go faster. Yet it is probably ‘fast enough’ and it is unlikely that punycode encoding would ever be a performance bottleneck if only because domain names are almost always pure ASCII and punycode is unneeded.

 

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

Leave a Reply

Your email address will not be published. The comment form expects plain text. If you need to format your text, you can use HTML elements such strong, blockquote, cite, code and em. For formatting code as HTML automatically, I recommend tohtml.com.

You may subscribe to this blog by email.