Fast base16 encoding

Given binary data, we often need to encode it as ASCII text. Email and much of the web effectively work in this manner.

A popular format for this purpose is base64. Together with Wojciech Muła, I showed that we could achieve excellent speed using vector instructions on commodity processors (2018, 2020). However, base64 is a bit tricky.

A much simpler format is base16: you transcribe each byte into two characters representing its value in hexadecimal notation. Thus the byte value 1 becomes the two characters ‘01’, the byte value 255 becomes ‘ff’, and so forth. In other words, you use one byte (one character) per ‘nibble’: a byte is made of two nibbles, the most-significant 4 bits and the least-significant 4 bits.
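
In code, splitting a byte into its two nibbles is just a shift and a mask:

uint8_t hi = byte >> 4;  // most-significant nibble: first hex character
uint8_t lo = byte & 0xf; // least-significant nibble: second hex character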

How could we encode base16 quickly? A reasonable approach is to use a table: grab one byte from the input, look up the corresponding two output bytes, and write them out immediately:

void encode_scalar(const uint8_t *source, size_t len, char *target) {
  // 256 entries: each 16-bit value holds the two ASCII characters for one
  // byte value (little endian: the first character is in the low byte)
  const uint16_t table[] = {
      0x3030, 0x3130, 0x3230, 0x3330, 0x3430, ...
      0x6366, 0x6466, 0x6566, 0x6666};
  for (size_t i = 0; i < len; i++) {
    uint16_t code = table[source[i]];
    ::memcpy(target, &code, 2);
    target += 2;
  }
}

It requires a 512-byte table (256 two-byte entries), but that is not concerning.
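
The table does not need to be typed out by hand. A small sketch of how it can be generated, assuming a little-endian target (which is what the constants above encode):

uint16_t table[256];
const char *digits = "0123456789abcdef";
for (int i = 0; i < 256; i++) {
  // first character in the low byte, second character in the high byte
  table[i] = (uint16_t)(digits[i >> 4] | (digits[i & 0xf] << 8));
}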

Could we do better?

Milosz Krajewski wrote some good-looking C# code using vector instructions. I wrote what should be the equivalent in x64 C++. We each have routines for 128-bit and 256-bit vectors. My code is for demonstration purposes, but it is essentially correct.

The core idea is not complicated. You grab a vector of bytes. You then expand it out: each nibble must go into its own byte. And then the magic is this: we use a fast vectorized lookup (e.g., the pshufb instruction) to look up each nibble in a 16-byte table containing the characters ‘0’ to ‘9’ and ‘a’ to ‘f’.

Here is the 128-bit code using Intel intrinsics:

  // 16-byte table: nibble value 0..15 -> ASCII hex digit
  __m128i shuf = _mm_set_epi8('f', 'e', 'd', 'c', 'b', 'a', '9', '8', '7', '6',
                              '5', '4', '3', '2', '1', '0');
  size_t i = 0;
  __m128i maskf = _mm_set1_epi8(0xf);
  for (; i + 16 <= len; i += 16) {
    __m128i input = _mm_loadu_si128((const __m128i *)(source + i));
    // least-significant nibbles
    __m128i inputbase = _mm_and_si128(maskf, input);
    // most-significant nibbles
    __m128i inputs4 =
        _mm_and_si128(maskf, _mm_srli_epi16(input, 4));
    // interleave so each input byte becomes (high nibble, low nibble)
    __m128i firstpart = _mm_unpacklo_epi8(inputs4, inputbase);
    __m128i output1 = _mm_shuffle_epi8(shuf, firstpart);
    __m128i secondpart = _mm_unpackhi_epi8(inputs4, inputbase);
    __m128i output2 = _mm_shuffle_epi8(shuf, secondpart);
    _mm_storeu_si128((__m128i *)(target), output1);
    target += 16;
    _mm_storeu_si128((__m128i *)(target), output2);
    target += 16;
  }
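
This loop consumes 16 input bytes and writes 32 output bytes per iteration. The remaining tail (fewer than 16 bytes) can be finished with the scalar routine:

  // handle the last len - i bytes with the scalar fallback
  encode_scalar(source + i, len - i, target);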

The 256-bit code is roughly the same, with one extra instruction to shuffle the bytes to compensate for the fact that 256-bit instructions work ‘per lane’ (in units of 128-bit words). My source code is available.
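
As a sketch (my repository has the tested version), the 256-bit loop might look like this:

  __m256i shuf = _mm256_setr_epi8(
      '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f',
      '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f');
  __m256i maskf = _mm256_set1_epi8(0xf);
  size_t i = 0;
  for (; i + 32 <= len; i += 32) {
    __m256i input = _mm256_loadu_si256((const __m256i *)(source + i));
    // the extra instruction: unpack works per 128-bit lane, so we
    // pre-shuffle the 64-bit words to get the output in the right order
    input = _mm256_permute4x64_epi64(input, 0b11011000);
    __m256i inputbase = _mm256_and_si256(maskf, input);
    __m256i inputs4 = _mm256_and_si256(maskf, _mm256_srli_epi16(input, 4));
    __m256i firstpart = _mm256_unpacklo_epi8(inputs4, inputbase);
    __m256i output1 = _mm256_shuffle_epi8(shuf, firstpart);
    __m256i secondpart = _mm256_unpackhi_epi8(inputs4, inputbase);
    __m256i output2 = _mm256_shuffle_epi8(shuf, secondpart);
    _mm256_storeu_si256((__m256i *)(target), output1);
    target += 32;
    _mm256_storeu_si256((__m256i *)(target), output2);
    target += 32;
  }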

We might therefore expect the 256-bit code to be roughly twice as fast. My results on an Ice Lake processor with GCC 11:

table lookup     0.9 GB/s
128-bit vectors  6.4 GB/s
256-bit vectors  11 GB/s

We are not quite twice as fast, but close enough. I do not find these speeds very satisfying: I expect that less naive code could be faster.

Milosz gets much poorer results in C#: the 256-bit code is barely faster than the 128-bit code. However, he does some relatively complicated computation in the 256-bit code instead of just calling the 256-bit shuffle instruction (vpshufb). (I suspect that he will soon fix this issue if he can.)

Our code would work on ARM as well after minor changes. For AVX-512 or SVE, we may need different approaches. We could add both encoding and decoding.
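
For instance, here is a rough sketch of those minor changes with NEON (AArch64) intrinsics; it is untested and the function name is of my choosing:

#include <arm_neon.h>

void encode_neon(const uint8_t *source, size_t len, char *target) {
  static const uint8_t hexdigits[16] = {'0', '1', '2', '3', '4', '5', '6', '7',
                                        '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'};
  uint8x16_t shuf = vld1q_u8(hexdigits);
  size_t i = 0;
  for (; i + 16 <= len; i += 16) {
    uint8x16_t input = vld1q_u8(source + i);
    uint8x16_t hi = vshrq_n_u8(input, 4);             // high nibbles
    uint8x16_t lo = vandq_u8(input, vdupq_n_u8(0xf)); // low nibbles
    uint8x16x2_t nibbles = vzipq_u8(hi, lo);          // interleave hi, lo
    // vqtbl1q_u8 plays the role of pshufb: each nibble indexes the table
    vst1q_u8((uint8_t *)target, vqtbl1q_u8(shuf, nibbles.val[0]));
    vst1q_u8((uint8_t *)target + 16, vqtbl1q_u8(shuf, nibbles.val[1]));
    target += 32;
  }
}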

8 thoughts on “Fast base16 encoding”

  1. May be worth tossing in a comparison with the ‘traditional’ approach, that is roughly:

    void encode_avx2(const uint8_t *restrict source, size_t len, char *restrict target) {
      __m256i lomask = _mm256_set1_epi8(0x0F);
      __m256i himask = _mm256_set1_epi8(0xF0);
      __m256i letter = _mm256_set1_epi8(9);
      __m256i numberdiff = _mm256_set1_epi8('0');
      __m256i letterdiff = _mm256_set1_epi8('a' - 10);

      for (size_t i = 0; i + 32 <= len; i += 32) {
        __m256i input = _mm256_loadu_si256((const __m256i *)(source + i));
        __m256i sinput = _mm256_permute4x64_epi64(input, 0b11011000);

        __m256i hs = _mm256_and_si256(sinput, himask);
        __m256i hi = _mm256_srli_epi16(hs, 4);
        __m256i lo = _mm256_and_si256(sinput, lomask);

        // nibbles above 9 get the letter offset, the rest the digit offset
        __m256i loletter = _mm256_cmpgt_epi8(lo, letter);
        __m256i hiletter = _mm256_cmpgt_epi8(hi, letter);

        __m256i looffset = _mm256_blendv_epi8(numberdiff, letterdiff, loletter);
        __m256i hioffset = _mm256_blendv_epi8(numberdiff, letterdiff, hiletter);

        __m256i loout = _mm256_add_epi8(lo, looffset);
        __m256i hiout = _mm256_add_epi8(hi, hioffset);

        // the high-nibble character must come first in the output
        __m256i first = _mm256_unpacklo_epi8(hiout, loout);
        __m256i second = _mm256_unpackhi_epi8(hiout, loout);
        _mm256_storeu_si256((__m256i *)(target), first);
        target += 32;
        _mm256_storeu_si256((__m256i *)(target), second);
        target += 32;
      }
    }

    (Warning: completely untested.)

    This is just a fairly rote translation of:

    uint8_t lo = input & 0xF;
    uint8_t hi = (input >> 4) & 0xF;
    lo += (lo > 9) ? ('a' - 10) : '0';
    hi += (hi > 9) ? ('a' - 10) : '0';

    The shuffle approach should be faster: 2× vpshufb versus 2× vpcmpgtb, 2× vpblendvb, and 2× vpaddb.

  2. Man, that memcpy looks criminal there. Why not make your table a table of shorts?
    Then you could write:
    for (; len; len--) {
      *target++ = table[*source++];
    }

    Don’t be shy with pointer arithmetic, either. Most likely the compiler would do the right thing for you, but you can never know.

    And then people doubtful of compilers go further and unroll the loop. Sometimes it makes a difference, sometimes it doesn’t.

    And still this is not half of what you could do.

    Compare the SIMD code with the best non-SIMD code you can write.
    Otherwise, what’s the point?

    1. We want to store the content to a string. The string is not (in my mind) always aligned on a two-byte boundary. Now, if it is, your code is indeed simpler. But the code from the blog post won’t trigger undefined behaviour.

      Is your code faster? You seem to think so, but you do not report any benchmark. I remind you that I publish my code, so you can modify it and see whether you can make it go faster.

      Of course, which compiler you use matters, and which settings.
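
      For reference, a plausible version of that short-table variant (a sketch matching the encode_scalar_short signature in the assembly below) might be:

      void encode_scalar_short(const uint8_t *source, size_t len, uint16_t *target) {
        const uint16_t table[] = {
            0x3030, 0x3130, 0x3230, 0x3330, 0x3430, ...
            0x6366, 0x6466, 0x6566, 0x6666};
        for (; len; len--) {
          *target++ = table[*source++];
        }
      }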

      Using GCC with -O3, your code compiles to:

      encode_scalar_short(unsigned char const*, unsigned long, unsigned short*):
              movdqa  .LC0(%rip), %xmm0
              movl    $26214, %eax
              movw    %ax, -24(%rsp)
              movaps  %xmm0, -40(%rsp)
              testq   %rsi, %rsi
              je      .L1
              xorl    %eax, %eax
      .L3:
              movzbl  (%rdi,%rax), %ecx
              movzwl  -40(%rsp,%rcx,2), %ecx
              movw    %cx, (%rdx,%rax,2)
              addq    $1, %rax
              cmpq    %rsi, %rax
              jne     .L3
      .L1:
              ret
      

      GCC compiles the code from the blog post to:

      encode_scalar(unsigned char const*, unsigned long, char*):
              movdqa  .LC0(%rip), %xmm0
              movl    $26214, %eax
              movw    %ax, -24(%rsp)
              movaps  %xmm0, -40(%rsp)
              testq   %rsi, %rsi
              je      .L9
              xorl    %eax, %eax
      .L11:
              movzbl  (%rdi,%rax), %ecx
              movzwl  -40(%rsp,%rcx,2), %ecx
              movw    %cx, (%rdx,%rax,2)
              addq    $1, %rax
              cmpq    %rax, %rsi
              jne     .L11
      .L9:
              ret
      

      It looks very similar to me.

      It also looks like what you’d expect: load the byte value, look up the short, store the short, increment, and check whether we are at the end of the loop.

      Unrolling might reduce the instruction count, but GCC is not eager to unroll, even with -O3. I would also not be tempted to unroll.

      So I think that the code from the blog post is decent and performs well.

  3. In this case, the memcpy was nicely optimized away. When that does not happen, all bets are off.

    On my system (i5-6200U CPU @ 2.30GHz from 2015)

    size_t i = len & 3;
    len >>= 2;
    for (; len; len--) {
      *target++ = table[*source++];
      *target++ = table[*source++];
      *target++ = table[*source++];
      *target++ = table[*source++];
    }
    for (; i; i--) {
      *target++ = table[*source++];
    }

    runs in .45s.

    Admittedly, one never knows when loop unrolling pays off.

  4. I don’t know what happened with my earlier reply.
    The site made a mess of it.


  5. Hmm. Same thing again. Some text is removed in the middle.
    Strange bug. Maybe there’s some character it doesn’t like.


    With target being a short *.
