Coding of domain names to wire format at gigabytes per second

When you enter in your browser the domain name lemire.me, it eventually gets encoded into a so-called wire format. The name lemire.me contains two labels, one of length 6 (lemire) and one of length two (me). The wire format starts with 6lemire2me: that is, imagining that the name starts with an imaginary dot, all dots are replaced by the length (in bytes) of the upcoming label. The numbers are stored as byte values, not ASCII digits. There is also a final zero byte. RFC 1035 specifies the format:

Each label is represented as a one octet length field followed by that number of octets. Since every domain name ends with the null label of the root, a domain name is terminated by a length byte of zero. The high order two bits of every length octet must be zero, and the remaining six bits of the length field limit the label to 63 octets or less. To simplify implementations, the total length of a domain name (i.e., label octets and label length octets) is restricted to 255 octets or less.

I find the problem interesting because the wire format is an “indexed” format: you can easily iterate over the labels, jumping in constant time over them, knowing the exact length, and so forth. Thus, coding strings into the wire format is a form of indexing.

How quickly can we do it?

You can use conventional code to convert the name to the wire format. You copy non-dot characters and when you encounter a dot, you write the distance to the previous dot (imagining that there is a virtual first dot). In C, it might look as follows (omitting the final zero byte):

 char *src = "lemire.me";
 uint8_t *dst = ...; // buffer
 uint8_t *counter = dst++;
 do {
   while (is_name_character(*src)) {
    *dst++ = (uint8_t)*src; src++;
  }
  *counter = (uint8_t)(dst - counter - 1);
  counter = dst++;
  if (*src != '.') { break; }
  src++;
 } while (true);

Can you do better?

You can use Single instruction, multiple data (SIMD) instructions. You load 16, 32 or 64 characters in wide register. You locate the dots. You construct a new register where the dots are replaced by position indexes and everything else is zeroed. E.g., starting with

.a.bcde.

You mark the location of the dots:

0 0 0xff 0 0 0 0 0xff

You compute the byte-wise AND with the following constant:

0 1 2 3 4 5 6 7

and you get an array where the dots have been replaced by indexes…

0 0 2 0 0 0 0 7

You then do a prefix computation, propagating the values:

0 2 2 7 7 7 7 7

This can be done using a logarithmic algorithm: shifting the values off by 1 byte, comparing, shifting the values off by 2 bytes, comparing, and so forth.

Finally, we can shift by one and subtract the result. Masking off the result so that we are only keeping where the dots are, we get the desired counts:

2 - 5 - - - - 0

You can then blend the results:

2 a 5 b c d e 0

If you use Intel intrinsics, the code might look like this… It is a bit technical so I am not going to explain it further, but the idea is as I explained…

__m128i dot_to_counters(__m128i input) {
  __m128i dots = _mm_cmpeq_epi8(input, _mm_set1_epi8('.'));
  __m128i sequential =
_mm_setr_epi8(-128, -127, -126, -125, -124, -123, -122, -121, -120, -119,
-118, -117, -116, -115, -114, -113);
  __m128i dotscounts = _mm_and_si128(dots, sequential);
  __m128i origdotscounts = dotscounts;
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 1),
    dotscounts);
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 2),
    dotscounts);
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 4),
    dotscounts);
  dotscounts = _mm_min_epi8(_mm_alignr_epi8(zero, dotscounts, 8),
    dotscounts);
  __m128i next = _mm_alignr_epi8(_mm_setzero_si128(), dotscounts, 1);
  dotscounts = _mm_subs_epu8(next, origdotscounts);
  dotscounts = _mm_subs_epu8(dotscounts, _mm_set1_epi8(1));
  return _mm_blendv_epi8(input, dotscounts, dots);
}

I call this strategy  “Prefix-Minimum”. It is essentially data independent. It does not matter where the dots are, the code is always the same. Processors like that a lot!

Prefix-Minimum will be fast, especially if you use wide registers (e.g., 32 bytes with AVX2 instructions). However, it does not offer a direct path to validating the inputs: are the counters in the proper range? Do you have overly long labels?

If you need to reason about where the dots are, the best way is probably to build an index in the spirit of the simdjson library (our original paper is available as PDF). For example, you can load 64 bytes of data, and construct a 64-bit word where each bit correspond to a byte value, and the byte is set to 1 if the corresponding byte is a dot (‘.’), you can then iterate through the bits, as a bitset. The construction of the 64-bit word might look as follows using Intel instructions and AVX2:

__m256i input1 = _mm256_loadu_si256((const __m256i *)src);
__m256i input2 = _mm256_loadu_si256((const __m256i *)(src + 32));
uint32_t dots1 = (uint32_t)_mm256_movemask_epi8(
_mm256_cmpeq_epi8(input1, _mm256_set1_epi8('.')));
uint32_t dots2 = (uint32_t)_mm256_movemask_epi8(
_mm256_cmpeq_epi8(input2, _mm256_set1_epi8('.')));
uint64_t DOTS = (((uint64_t)dots2) << 32) | dots1;

I implemented different strategies using AVX2 intrinsics. We can use a database of popular domain names for testing. Results vary depending on the system. I use GCC 11 and an Ice Lake server.

technique CPU cycles/string instructions/string
conventional 150 240
Prefix-Minimum (256 bits) 44 77
simdjson-like (256 bits) 56 77

My source code is available.

The Prefix-Minimum approach is fastest. Observe how the Prefix-Minimum approach is faster than the simdjson-like approach, despite executing the same number of instructions. That is because it is able to retire more instructions per cycle. The downside of the Prefix-Minimum approach is that it does not validate the label length.

The simdjson-like approach does full validation and is still quite fast: over three times faster than the conventional approach. It is sufficient to break the gigabyte per second barrier (1.7 GB/s in these experiments).

Being able to validate the counter values with the Prefix-Minimum without adding too much overhead would be fantastic: it might be possible to get the best speed without compromising on the validation. It remains an open problem whether it is possible.

I suspect that my implementations can be improved, by at least 20%.

Future work should consider AVX-512. I have the sketch of an implementation using AVX-512 but it is not faster. Porting to ARM NEON would be interesting.

Credit: The work was motivated by the simdzone project by NLnetLabs.

Daniel Lemire, "Coding of domain names to wire format at gigabytes per second," in Daniel Lemire's blog, August 10, 2023.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

8 thoughts on “Coding of domain names to wire format at gigabytes per second”

  1. Back when we were trying to validate UTF-8, I wracked my brain on how to get correct jump distances like this.

    The trick is to exponentiate a permutation of the following form:

    P[i] = i if position i has a dot

    P[i] = i+1 if not

    In SIMD terms that’s a constant vector plus the comparison mask:

    P=
    1 2 3 4 5 6 7 8 + -1 0 -1 0 0 0 0 -1
    = 0 2 2 4 5 6 7 7

    P^2 = 0 2 2 5 6 7 7 7 7

    P^4 = 0 2 2 7 7 7 7 7

    P has a fixed point at each dot, and the other locations fill from the right. The limit is an eigenvector under P.

      1. P^2 means pshufb(P,P); sorry for the rough notation. That’s the shuffle mask equivalent to applying P twice (shuffle masks are composable, subject to certain constraints such as out-of-range entries being replaced by 0).

  2. Only a minor remark, but the wire format for domain names is slightly different. Given lemire.me, the name contains three labels, one of length 6 (lemire), one of length 2 (me) and the root label. The correct wire format is 6lemire2me0 (in URLs the trailing dot is often implicit).

    Additional background info: A resolver starts looking up information by first querying the . (root server), which responds with the name server for .me, which responds with the location for lemire.me..

    1. Thanks. The blog post does not describe the final zero. However, the code does produce a trailing zero. In the case where the input is lemire.me, you get the following (see unit tests):

      input:
      6c 65 6d 69 72 65 2e 6d 65 
      name_to_dnswire_simd output:
      06 6c 65 6d 69 72 65 02 6d 65 00 
      name_to_dnswire output:
      06 6c 65 6d 69 72 65 02 6d 65 00 
      name_to_dnswire_scalar_labels output:
      06 6c 65 6d 69 72 65 02 6d 65 00 
      name_to_dnswire_avx output:
      06 6c 65 6d 69 72 65 02 6d 65 00 
      name_to_dnswire_idx_avx output:
      06 6c 65 6d 69 72 65 02 6d 65 00
      
      1. The non-simd function works in the test because it has a final *counter = 0; whereas the simplified code in the article excludes it and is wrong as a result.

        I sympathize with excluding the label length validation and is_name_char implementation, both clearly for brevity’s sake, but the null terminator is a matter of correctness.

        I (and presumably Jeroen too) would have found your introduction easier to understand since I think of the DNS name wire format the other way (dots are separators between labels, there’s a trailing empty label, labels are serialized with a length prefix byte). From your description, I figured you’d have to add the null terminator after the loop, so when I didn’t see it in your code, I mentally desk checked it to make sure I wasn’t misunderstanding.

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.