How quickly can you remove spaces from a string?

Sometimes programmers want to prune out characters from a string of characters. For example, maybe you want to remove all line-ending characters from a piece of text.

Let me consider the problem where I want to remove all spaces (' ') as well as the linefeed ('\n') and carriage-return ('\r') characters.

How would you do it efficiently?

size_t despace(char * bytes, size_t howmany) {
  size_t pos = 0;
  for(size_t i = 0; i < howmany; i++) {
      char c = bytes[i];
      if (c == '\r' || c == '\n' || c == ' ') {
        continue;
      }
      bytes[pos++] = c;
  }
  return pos;
}

This code will work on all UTF-8 encoded strings… which is the bulk of the strings found on the Internet if you consider that UTF-8 is a superset of ASCII.

That’s simple and should be fast… I had a blast looking at how various compilers process this code. It ends up being a handful of instructions per processed byte.

But we are processing bytes one by one while our processors have a 64-bit architecture. Can we process the data by units of 64-bit words?

There is a somewhat mysterious bit-twiddling expression that returns true whenever your word contains a zero byte:

(((v)-UINT64_C(0x0101010101010101)) & ~(v)&UINT64_C(0x8080808080808080))

All we need to know is that it works. With this tool, we can write a faster function…
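For reference, the expression can be wrapped in a small helper (a sketch; `haszero` is the name the code below relies on):

```c
#include <stdint.h>

// true if and only if at least one byte of v is zero
static inline int haszero(uint64_t v) {
  return ((v - UINT64_C(0x0101010101010101)) &
          ~v & UINT64_C(0x8080808080808080)) != 0;
}
```

Each zero byte, and only a zero byte, ends up setting the corresponding high bit in the intermediate result.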

size_t i = 0, pos = 0;
uint64_t word;
uint64_t mask1 = ~UINT64_C(0) / 255 * (uint64_t)('\r');
uint64_t mask2 = ~UINT64_C(0) / 255 * (uint64_t)('\n');
uint64_t mask3 = ~UINT64_C(0) / 255 * (uint64_t)(' ');

for (; i + 7 < howmany; i += 8) {
    memcpy(&word, bytes + i, sizeof(word));
    uint64_t xor1 = word ^ mask1;
    uint64_t xor2 = word ^ mask2;
    uint64_t xor3 = word ^ mask3;

    if (haszero(xor1) || haszero(xor2) || haszero(xor3)) {
      // check each of the eight bytes by hand?
    } else {
      memmove(bytes + pos, bytes + i, sizeof(word));
      pos += 8;
    }
}

It is going to be faster as long as most blocks of eight characters do not contain any white space. When this occurs, we are basically copying 64-bit words one after the other, along with a moderately expensive check that our superscalar processors can do quickly.
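For completeness, here is what a full version might look like once the byte-by-byte fallback and the tail are filled in (a sketch: the `despace64` name matches the benchmark output quoted in the comments, but the body is my reconstruction):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// zero-byte test from the post
#define haszero(v) \
  (((v)-UINT64_C(0x0101010101010101)) & ~(v)&UINT64_C(0x8080808080808080))

size_t despace64(char *bytes, size_t howmany) {
  uint64_t mask1 = ~UINT64_C(0) / 255 * (uint64_t)('\r');
  uint64_t mask2 = ~UINT64_C(0) / 255 * (uint64_t)('\n');
  uint64_t mask3 = ~UINT64_C(0) / 255 * (uint64_t)(' ');
  size_t i = 0, pos = 0;
  uint64_t word;
  for (; i + 7 < howmany; i += 8) {
    memcpy(&word, bytes + i, sizeof(word));
    if (haszero(word ^ mask1) || haszero(word ^ mask2) ||
        haszero(word ^ mask3)) {
      // slow path: this word contains at least one whitespace byte
      for (size_t j = i; j < i + 8; j++) {
        char c = bytes[j];
        if (c != '\r' && c != '\n' && c != ' ') bytes[pos++] = c;
      }
    } else {
      // fast path: copy the whole word
      memmove(bytes + pos, bytes + i, sizeof(word));
      pos += 8;
    }
  }
  // tail: fewer than 8 bytes remain
  for (; i < howmany; i++) {
    char c = bytes[i];
    if (c != '\r' && c != '\n' && c != ' ') bytes[pos++] = c;
  }
  return pos;
}
```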

As it turns out, we can do better without using 64-bit words.

// table with zeros at indexes corresponding to white-space characters
int jump_table[256] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
  0, 1, 1, 0, 1,  ...};

size_t faster_despace( char* bytes, size_t howmany ) {
  size_t i = 0, pos = 0;
  while( i < howmany )
  {
    bytes[pos] = bytes[i++];
    pos += jump_table[ (unsigned char)bytes[pos] ];
  }
  return pos;
}

This approach proposed by Robin Leffmann is very clever, and it is fast because it avoids penalties due to mispredicted branches.
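Since the table above is elided, note that it can also be generated at startup rather than written out by hand (a sketch; `init_jump_table` is my naming):

```c
#include <stddef.h>

// 0 at whitespace bytes, 1 everywhere else
static int jump_table[256];

static void init_jump_table(void) {
  for (int i = 0; i < 256; i++)
    jump_table[i] = (i == ' ' || i == '\n' || i == '\r') ? 0 : 1;
}

size_t faster_despace(char *bytes, size_t howmany) {
  size_t i = 0, pos = 0;
  while (i < howmany) {
    // always write, then advance the output pointer only if the
    // byte we just wrote is not whitespace
    bytes[pos] = bytes[i++];
    pos += jump_table[(unsigned char)bytes[pos]];
  }
  return pos;
}
```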

Can we do even better? Sure! Ever since the Pentium 4 (in 2001), we have had 128-bit (SIMD) instructions.

Let us solve the same problem with these nifty 128-bit SSE instructions, using the (ugly?) Intel intrinsics…

__m128i spaces = _mm_set1_epi8(' ');
__m128i newline = _mm_set1_epi8('\n');
__m128i carriage = _mm_set1_epi8('\r');
size_t i = 0, pos = 0;
for (; i + 15 < howmany; i += 16) {
    __m128i x = _mm_loadu_si128((const __m128i *)(bytes + i));
    __m128i xspaces = _mm_cmpeq_epi8(x, spaces);
    __m128i xnewline = _mm_cmpeq_epi8(x, newline);
    __m128i xcarriage = _mm_cmpeq_epi8(x, carriage);
    __m128i anywhite = _mm_or_si128(_mm_or_si128(xspaces, 
                xnewline), xcarriage);
    int mask16 = _mm_movemask_epi8(anywhite);
    x = _mm_shuffle_epi8(
          x, _mm_loadu_si128(despace_mask16 + mask16));
    _mm_storeu_si128((__m128i *)(bytes + pos), x);
    pos += 16 - _mm_popcnt_u32(mask16);
}

The code is fairly straightforward if you are familiar with SIMD instructions on Intel processors. I have made no effort to optimize it… so it is possible, even likely, that we could make it run faster. My original SIMD code had a branch, but Nathan Kurz realized that it was best to simplify the code and remove it.

Let us see how fast it runs!

I designed a benchmark using a recent (Skylake) Intel processor over text entries where only a few characters are white space.

regular code           5.5  cycles/byte
using 64-bit words     2.56 cycles/byte
with jump table        1.7  cycles/byte
SIMD (128 bits) code   0.39 cycles/byte
memcpy                 0.08 cycles/byte

So the vectorized code is nearly 14 times faster than the regular code. That’s pretty good.

Yet pruning a few spaces is 5 times slower than copying the data with memcpy. So maybe we can go even faster. How fast could we be?

One hint: our Intel processors can actually process 256-bit registers (with AVX/AVX2 instructions), so it is possible we could go twice as fast. Sadly, 256-bit SIMD instructions on x64 processors work on two independent 128-bit lanes, which makes the algorithmic design more painful.

Leffmann’s approach is not as fast as SIMD instructions, but it is more general and portable… and it is still three times faster than the regular code!

My C code is available.

Further reading: Quickly pruning elements in SIMD vectors using the simdprune library

39 thoughts on “How quickly can you remove spaces from a string?”

  1. I’m interested in what the benchmark is for this

    size_t despace(char * bytes, size_t howmany) {
    char* src = bytes;
    char* dest = bytes;

    for(size_t i = 0; i < howmany; i++) {
    char c = *src++;

    if (c == '\r' || c == '\n' || c == ' ') {
    continue;
    }
    *dest++ = c;
    }
    return dest - bytes;
    }

  2. Nice demo, but why would it be foolish to expect it to be 16x faster? Have you already lost faith in the power of magic vector dust? Relative to memcpy, what do you think the fundamental limit should be here?

    I may be doing something wrong, and I realize it isn’t exactly the point of your illustration, but I get about a 2x speedup if I remove the “if (mask16 == 0)” conditional from your code to make it branchless.

    [email protected]:~/git/despacer$ despacebenchmark
    sse4_despace(buffer, N): 0.818359 cycles / ops
    sse4_despace_branchless(buffer, N): 0.400391 cycles / ops

    As you’d expect, the advantage of the branchless approach grows if you increase the probability of spaces (.94 vs .40 at 2%), and decreases as spaces become rarer (.47 vs .40 at .5%).

    I wouldn’t assume that loop unrolling would have a significant effect here, although I also wouldn’t assume without looking that the compiler isn’t already doing it.

    It’s possible that you could get a further small boost by doing a lookup for the length to advance rather than a popcount, but I’m not sure what the tradeoff of latency vs instruction is going to be here.

    1. Nice demo, but why would it be foolish to expect it to be 16x faster?

      I prefer to be pessimistic so that I am rarely disappointed.

      It’s possible that you could get a further small boost by doing a lookup for the length to advance rather than a popcount, but I’m not sure what the tradeoff of latency vs instruction is going to be here.

      Yes. Good idea.

  3. “It is going to be faster as long as most blocks of eight characters do not contain any white space.”

    This doesn’t seem very true for most data sets.

    Also, I wouldn’t spend all that effort on bitwise checks and register copies, and instead focus on your basic algorithm. You are copying every byte twice. (Unless you are relying on optimization to magically fix that, but that sort of dodges the whole point of coding faster algorithms, in my opinion.) A better approach would be to scan for the beginning and end of blocks of text, and then move the whole block with an efficient overlap-safe move operation.
    I did a simple comparison in perl, and scan and move was 20-25% faster than byte-at-a-time. There will certainly be language differences that may mean you get a different result in C, but these will mostly depend on hidden optimizations that shouldn’t matter for the purpose of finding the fastest algorithm.
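    In C, the scan-and-move approach sketched above might look something like this (my rendering, not the commenter's code; `despace_blocks` is my naming):

```c
#include <stddef.h>
#include <string.h>

static int is_white(char c) { return c == ' ' || c == '\n' || c == '\r'; }

// scan for runs of non-whitespace and move each run as one block
size_t despace_blocks(char *bytes, size_t howmany) {
  size_t i = 0, pos = 0;
  while (i < howmany) {
    while (i < howmany && is_white(bytes[i])) i++;   // skip whitespace
    size_t start = i;
    while (i < howmany && !is_white(bytes[i])) i++;  // find end of the run
    memmove(bytes + pos, bytes + start, i - start);  // overlap-safe move
    pos += i - start;
  }
  return pos;
}
```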

    1. You are copying every byte twice.

      I am not sure what you mean? In the simple scalar code, I read each byte once, and I write most of them back.

      A better approach would be to scan for the beginning and end of blocks of text, and then move the whole block with an efficient overlap-safe move operation.

      I did consider an implementation based on memmove, but you will need many memmove calls in general.

      I did a simple comparison in perl, and scan and move was 20-25% faster than byte-at-a-time. There will certainly be language differences that may mean you get a different result in C, but these will mostly depend on hidden optimizations that shouldn’t matter for the purpose of finding the fastest algorithm.

      I’d very much like to see your code.

      1. When I say you are copying every byte twice, I mean this:
        char c = bytes[i];

        bytes[pos++] = c;
        Explicitly the code is copying the byte somewhere else, and then copying it back. What is actually happening after optimization may well be a different story.

        Here’s the perl comparison:
        —-
        #!/usr/bin/perl

        $text=join(" ",@ARGV);

        $howmany=length($text);

        print "lemire: {", despace_lemire($text,$howmany), "} ";
        $t=time;
        for ($n=0; $n<1000000; ++$n) {
        despace_lemire($text,$howmany);
        }
        print time-$t, "\n";

        print "mine: {", despace_mine($text,$howmany), "} ";
        $t=time;
        for ($n=0; $n<1000000; ++$n) {
        despace_mine($text,$howmany);
        }
        print time-$t, "\n";

        sub despace_lemire {
        my $text=shift(@_);
        my $howmany=shift(@_);
        my $i;
        my $c;
        my $pos=0;
        for ($i=0; $i<$howmany; ++$i) {
        $c=substr($text,$i,1);
        next if ($c eq " ");
        substr($text,$pos++,1)=$c;
        }
        #truncate to new length (perl doesn't copy nulls)
        $text=substr($text,0,$pos);
        return($text);
        }

        sub despace_mine {
        my $text=shift(@_);
        my $howmany=shift(@_);
        my $i=0;
        my $c;
        my $from;
        my $to;

        #skip initial non-space section
        $i=0;
        while(substr($text,$i,1) ne " ") { ++$i; }

        $to=$i;
        #skip whitespace we found
        while(substr($text,$i,1) eq " ") { ++$i; }

        $from=$i;
        while($i < $howmany) {
        #scan to the end of the non-space section
        while(substr($text,$i,1) ne " ") { last if (++$i >= $howmany); }

        #copy
        substr($text,$to,$i-$from) = substr($text,$from,$i-$from);
        $to += $i-$from;

        #skip whitespace
        while(substr($text,$i,1) eq " ") { last if (++$i >= $howmany); }

        $from=$i;
        }
        #truncate to new length (perl doesn’t copy nulls)
        $text=substr($text,0,$to);
        return($text);
        }

        1. When I say you are copying every byte twice, I mean this:
          char c = bytes[i];

          bytes[pos++] = c;
          Explicitly the code is copying the byte somewhere else, and then copying it back. What is actually happening after optimization may well be a different story.

          I would argue that in C, these are semantically equivalent…

          char c = bytes[i];
          bytes[pos] = c;

          and

          bytes[pos] = bytes[i];

          Regarding how this gets compiled… on all processors that I am familiar with, they both get compiled to two “move” instructions. You first have to retrieve “bytes[i]” to a register, then you need to write it to memory. There might be processors that support moving memory from RAM to RAM without going through a register, but I am not familiar with them.

          There are certainly instructions (on x64 processors) that work directly on memory, without loading the data into a register (the bit test instruction is one such instruction). However, it is not necessarily faster and can even be much slower. I believe that in a lot of important cases, the compiler will get it right.

          My understanding is that there is no such instruction on ARM processors; they always need to load the data into a register. So there is a load, an operation, and then a store.

          But let us focus on x64 processors… Is there any difference between how an x64 compiler will handle

          const uint64_t mask = 1111;
          a[0] &= mask;

          and

          a[0] &= 1111;

          ???

          Do you have any example where it produces different machine code?

  4. I wonder if it would be faster to look at 32 bits at a time. At least in English, the bulk of words are 3 to 5 letters long or shorter.

  5. It seems you are not considering alignment.

    Try the following: allocate and memory align a buffer, memcpy the string into it and then do the SIMD job on the memory aligned string.

    (Consider how the cache works: if you try to move 64 bytes starting at address 63, it has to read not 64 bytes but 128 bytes, so be sure to align so that the starting address of the allocated buffer is evenly divisible by 64 – do that by over-allocating and then moving the pointer appropriately.)
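    The over-allocate-and-round trick described above can be sketched as follows (`alloc_aligned64` is my naming; C11's `aligned_alloc` or POSIX's `posix_memalign` accomplish the same thing):

```c
#include <stdint.h>
#include <stdlib.h>

// over-allocate, then round the pointer up to the next 64-byte boundary;
// the caller must free *raw_out, not the returned aligned pointer
char *alloc_aligned64(size_t size, char **raw_out) {
  char *raw = malloc(size + 63);
  if (raw == NULL) return NULL;
  *raw_out = raw;
  return (char *)(((uintptr_t)raw + 63) & ~(uintptr_t)63);
}
```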

      1. Thanks, interesting!

        Not quite the improvement I thought it would be. Yet it definitely did result in an improvement on my machine (Intel(R) Core(TM) i7-3540M CPU @ 3.00GHz)

        align to: 3
        sse42_despace_branchless_lookup(buffer, N): 1.058594 cycles / ops
        align to: 7
        sse42_despace_branchless_lookup(buffer, N): 1.001953 cycles / ops
        align to: 63
        sse42_despace_branchless_lookup(buffer, N): 0.990234 cycles / ops
        align to: 2
        sse42_despace_branchless_lookup(buffer, N): 0.963867 cycles / ops
        align to: 8
        sse42_despace_branchless_lookup(buffer, N): 0.896484 cycles / ops
        align to: 16
        sse42_despace_branchless_lookup(buffer, N): 0.565430 cycles / ops
        align to: 32
        sse42_despace_branchless_lookup(buffer, N): 0.568359 cycles / ops
        align to: 64
        sse42_despace_branchless_lookup(buffer, N): 0.539062 cycles / ops
        align to: 128
        sse42_despace_branchless_lookup(buffer, N): 0.565430 cycles / ops
        align to: 256
        sse42_despace_branchless_lookup(buffer, N): 0.539062 cycles / ops

        1. Let me clarify that. Aligning to 64 – you are completely right. Alignment being a “myth” – not so much, at least on my machine.

          1. Note that this answer will depend on the chip you use for testing. When I implemented a similar algorithm for UTF-8 to UTF-16 transformation, doing extra work to align things was moderately better on Haswell-E (Xeon E5-1650v3), but was no gain on Skylake-T (Core i7 6700T). In general, the newer chips seem to deal with unaligned access better than older ones.

        2. I modified my benchmark so that you can add an alignment offset…

          Results on my Skylake server…

          $ ./despacebenchmark
          pointer alignment = 16 bytes
          memcpy(tmpbuffer,buffer,N):  0.109375 cycles / ops
          countspaces(buffer, N):  3.673828 cycles / ops
          despace(buffer, N):  5.578125 cycles / ops
          faster_despace(buffer, N):  1.740234 cycles / ops
          despace64(buffer, N):  2.542969 cycles / ops
          despace_to(buffer, N, tmpbuffer):  5.585938 cycles / ops
          avx2_countspaces(buffer, N):  0.365234 cycles / ops
          avx2_despace(buffer, N):  3.593750 cycles / ops
          sse4_despace(buffer, N):  0.816406 cycles / ops
          sse4_despace_branchless(buffer, N):  0.386719 cycles / ops
          sse4_despace_trail(buffer, N):  1.539062 cycles / ops
          sse42_despace_branchless(buffer, N):  0.603516 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  0.675781 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  1.707031 cycles / ops
          
          $ ./despacebenchmark  1
          alignment offset = 1
          pointer alignment = 1 bytes
          memcpy(tmpbuffer,buffer,N):  0.109375 cycles / ops
          countspaces(buffer, N):  3.675781 cycles / ops
          despace(buffer, N):  5.576172 cycles / ops
          faster_despace(buffer, N):  1.693359 cycles / ops
          despace64(buffer, N):  2.589844 cycles / ops
          despace_to(buffer, N, tmpbuffer):  5.492188 cycles / ops
          avx2_countspaces(buffer, N):  0.373047 cycles / ops
          avx2_despace(buffer, N):  3.535156 cycles / ops
          sse4_despace(buffer, N):  0.841797 cycles / ops
          sse4_despace_branchless(buffer, N):  0.398438 cycles / ops
          sse4_despace_trail(buffer, N):  1.599609 cycles / ops
          sse42_despace_branchless(buffer, N):  0.605469 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  0.679688 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  1.718750 cycles / ops
          
          $ ./despacebenchmark  4
          alignment offset = 4
          pointer alignment = 4 bytes
          memcpy(tmpbuffer,buffer,N):  0.109375 cycles / ops
          countspaces(buffer, N):  3.679688 cycles / ops
          despace(buffer, N):  5.630859 cycles / ops
          faster_despace(buffer, N):  1.685547 cycles / ops
          despace64(buffer, N):  2.562500 cycles / ops
          despace_to(buffer, N, tmpbuffer):  5.523438 cycles / ops
          avx2_countspaces(buffer, N):  0.373047 cycles / ops
          avx2_despace(buffer, N):  3.603516 cycles / ops
          sse4_despace(buffer, N):  0.843750 cycles / ops
          sse4_despace_branchless(buffer, N):  0.400391 cycles / ops
          sse4_despace_trail(buffer, N):  1.552734 cycles / ops
          sse42_despace_branchless(buffer, N):  0.603516 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  0.681641 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  1.681641 cycles / ops
          

          Results on a much older Ivy Bridge (like yours):

          $ ./despacebenchmark
          pointer alignment = 16 bytes
          memcpy(tmpbuffer,buffer,N):  0.339844 cycles / ops
          countspaces(buffer, N):  14.519531 cycles / ops
          despace(buffer, N):  12.612305 cycles / ops
          faster_despace(buffer, N):  7.248047 cycles / ops
          despace64(buffer, N):  10.275391 cycles / ops
          despace_to(buffer, N, tmpbuffer):  11.271484 cycles / ops
          sse4_despace(buffer, N):  2.039062 cycles / ops
          sse4_despace_branchless(buffer, N):  1.081055 cycles / ops
          sse4_despace_trail(buffer, N):  4.904297 cycles / ops
          sse42_despace_branchless(buffer, N):  1.511719 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  1.554688 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  3.882812 cycles / ops
          
          $ ./despacebenchmark 1
          alignment offset = 1
          pointer alignment = 1 bytes
          memcpy(tmpbuffer,buffer,N):  0.339844 cycles / ops
          countspaces(buffer, N):  13.337891 cycles / ops
          despace(buffer, N):  12.699219 cycles / ops
          faster_despace(buffer, N):  7.292969 cycles / ops
          despace64(buffer, N):  10.285156 cycles / ops
          despace_to(buffer, N, tmpbuffer):  11.269531 cycles / ops
          sse4_despace(buffer, N):  2.214844 cycles / ops
          sse4_despace_branchless(buffer, N):  1.749023 cycles / ops
          sse4_despace_trail(buffer, N):  5.027344 cycles / ops
          sse42_despace_branchless(buffer, N):  2.109375 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  2.187500 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  3.890625 cycles / ops
          
          $ ./despacebenchmark 4
          alignment offset = 4
          pointer alignment = 4 bytes
          memcpy(tmpbuffer,buffer,N):  0.339844 cycles / ops
          countspaces(buffer, N):  14.507812 cycles / ops
          despace(buffer, N):  12.699219 cycles / ops
          faster_despace(buffer, N):  7.214844 cycles / ops
          despace64(buffer, N):  10.214844 cycles / ops
          despace_to(buffer, N, tmpbuffer):  11.289062 cycles / ops
          sse4_despace(buffer, N):  2.162109 cycles / ops
          sse4_despace_branchless(buffer, N):  1.757812 cycles / ops
          sse4_despace_trail(buffer, N):  5.044922 cycles / ops
          sse42_despace_branchless(buffer, N):  2.109375 cycles / ops
          sse42_despace_branchless_lookup(buffer, N):  2.302734 cycles / ops
          sse42_despace_to(buffer, N,tmpbuffer):  3.875000 cycles / ops
          

          There are ways to trigger performance issues due to alignment on recent x64 processors, but you almost have to go out of your way to do it.

  6. It seems you are not memory aligning your buffers.

    Memory align a temporary buffer that you memcpy the input string to. After that do the operation on the temporary buffer.

    (Consider doing a 64 byte move starting on address 63. The CPU would have to read/write not 64 bytes but 2×64 bytes – since the address was not aligned.).

  7. In the naive implementation you copy even the bytes before the first space. They don’t need to be copied because they’re already there. Two loops could make it faster.
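    That suggestion amounts to first scanning for the first whitespace byte, since everything before it is already in place; a possible sketch (`despace_skip_prefix` is my naming):

```c
#include <stddef.h>

size_t despace_skip_prefix(char *bytes, size_t howmany) {
  size_t pos = 0;
  // first loop: bytes before the first whitespace need no copying
  while (pos < howmany && bytes[pos] != ' ' && bytes[pos] != '\n' &&
         bytes[pos] != '\r')
    pos++;
  // second loop: compact the rest as in the naive version
  for (size_t i = pos; i < howmany; i++) {
    char c = bytes[i];
    if (c != '\r' && c != '\n' && c != ' ') bytes[pos++] = c;
  }
  return pos;
}
```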

  8. This is a completely memory bound problem and the “fast” version is not much faster (about 5%) when measured in wall clock time. Using rdtsc to measure CPU perf counters does not take waiting into account.

    The biggest positive benefit of writing code like this (good CPU use in a memory bound problem) is giving the CPU an opportunity to do some hyperthreading while the current CPU thread is waiting for memory.

    I wrote some more insights from my experiences in similar optimization tasks over at HN: https://news.ycombinator.com/item?id=13450245

    I added wall clock time measuring with clock_gettime http://pasteall.org/208511

    Here are the results:

    memcpy(tmpbuffer,buffer,N): 0.125130 cycles / ops 1451124749 nsec (1.451125 sec)
    countspaces(buffer, N): 3.710938 cycles / ops 1672809379 nsec (1.672809 sec)
    despace(buffer, N): 6.603971 cycles / ops 1606222818 nsec (1.606223 sec)
    faster_despace(buffer, N): 1.688668 cycles / ops 1547048647 nsec (1.547049 sec)
    despace64(buffer, N): 3.590088 cycles / ops 1602308741 nsec (1.602309 sec)
    despace_to(buffer, N, tmpbuffer): 6.306158 cycles / ops 1630737083 nsec (1.630737 sec)
    avx2_countspaces(buffer, N): 0.191219 cycles / ops 1442104811 nsec (1.442105 sec)
    avx2_despace(buffer, N): 5.754352 cycles / ops 1579367806 nsec (1.579368 sec)
    sse4_despace(buffer, N): 0.984717 cycles / ops 1486145973 nsec (1.486146 sec)
    sse4_despace_branchless(buffer, N): 0.339487 cycles / ops 1498728832 nsec (1.498729 sec)
    sse4_despace_trail(buffer, N): 1.958952 cycles / ops 1479234359 nsec (1.479234 sec)
    sse42_despace_branchless(buffer, N): 0.563217 cycles / ops 1534317675 nsec (1.534318 sec)
    sse42_despace_branchless_lookup(buffer, N): 0.624996 cycles / ops 1563285415 nsec (1.563285 sec)
    sse42_despace_to(buffer, N,tmpbuffer): 1.747602 cycles / ops 1525936772 nsec (1.525937 sec)

    1. This is a completely memory bound problem and the “fast” version is not much faster (about 5%) when measured in wall clock time. Using rdtsc to measure CPU perf counters does not take waiting into account.

      On recent Intel processors, rdtsc effectively measures wall-clock time, though perhaps more accurately due to its lower overhead.

  9. Assuming this is using the built-in ^ operator, the condition haszero(xor1) ^ haszero(xor2) ^ haszero(xor3) is true if and only if an odd number of the function results are true.

    Re the hard-to-explain zero checking expression, I think a reasonably short & simple way to explain it is to note that for each byte it checks whether that byte was non-negative, and became negative (sign bit set) by the subtraction of 1.

    1. Assuming this is using the built-in ^ operator, the condition haszero(xor1) ^ haszero(xor2) ^ haszero(xor3)) is true if and only if an odd number of the function results are true.

      Thanks, I fixed that. It does not impact performance. In this case, my code passed sanity tests because of the statistics of my test.

      Re the hard-to-explain zero checking expression, I think a reasonably short & simple way to explain it is to note that for each byte it checks whether that byte was non-negative, and became negative (sign bit set) by the subtraction of 1.

      This makes sense. I didn’t want to bother with an explanation in the course of my blog post because *why* it works is not so important.

  10. I’m not at a proper computer atm. Can you try

    1. Load 8 bytes into xmm.
    2. Compare for equality with spaces, etc.
    3. Move the mask to a general purpose register.
    4. Negate bits in that register.
    5. Copy selected bits with PEXT (from bmi2 set) instruction.
    6. Use POPCNT to advance a pointer.

    Thanks,
    Alex

    1. Using PEXT is interesting. We have 16 bytes to deal with. PEXT only operates on 8 bytes. So we need two, with two pointer offsets. It might be fast, but it is not obviously faster than a SIMD-based approach, I think.
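      To illustrate the idea without requiring a BMI2-capable machine, here is a sketch using a software emulation of PEXT (`pext64_emulated` and `despace8_pext` are my naming; real code would call `_pext_u64`, and the sketch assumes a little-endian machine):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// software emulation of the BMI2 PEXT instruction: gather the bits of
// val selected by mask into the low-order bits of the result
static uint64_t pext64_emulated(uint64_t val, uint64_t mask) {
  uint64_t result = 0;
  int out = 0;
  for (int bit = 0; bit < 64; bit++) {
    if (mask & (UINT64_C(1) << bit)) {
      if (val & (UINT64_C(1) << bit)) result |= UINT64_C(1) << out;
      out++;
    }
  }
  return result;
}

// despace 8 bytes at once: build a mask with 0xFF over each kept byte,
// then let PEXT compact the kept bytes; returns how many were kept
static size_t despace8_pext(const char *in /* 8 bytes */, char *out) {
  uint64_t word, keep = 0;
  size_t kept = 0;
  memcpy(&word, in, 8);
  for (int b = 0; b < 8; b++) {
    char c = in[b];
    if (c != ' ' && c != '\n' && c != '\r') {
      keep |= UINT64_C(0xFF) << (8 * b);
      kept++;
    }
  }
  uint64_t packed = pext64_emulated(word, keep);
  memcpy(out, &packed, 8);
  return kept;
}
```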

  11. What is the performance of a simple sorting network using SIMD?

    Would it be better to define the problem as “shift all the whitespace to side of the vector”?

  12. There’s an error in `_mm_loadu_si128(despace_mask16 + mask16)` which has been fixed in the repo. Please update the blog post; this threw me for a loop.

    To define a proper `__m128i` array, use the `_mm_set_epi8` macro which expands to a suitable constant initializer.

    Anyway, that table is huge, 256 KiB. It would probably be faster if split into two 256-entry tables, totaling 8 KiB which would fit into L1 cache.

    1. There’s an error in `_mm_loadu_si128(despace_mask16 + mask16)` which has been fixed in the repo. Please update the blog post; this threw me for a loop.

      I don’t think that there is an error, if there is, please point it out more precisely. In the blog post, I omit the cast which is present in the GitHub repository, but it is a technical detail that despace_mask16 is not of type __m128i * (that is, you can rewrite the code so that it is).

      To define a proper `__m128i` array, use the `_mm_set_epi8` macro which expands to a suitable constant initializer.

      In C, _mm_set_epi8 is not a compile-time constant so you cannot do “__m128i array[] = {_mm_set1_epi8(0)};”. You must be thinking of C++.

      It would probably be faster if split into two 256-entry tables, totaling 8 KiB which would fit into L1 cache.

      I think that the table is 1MB.

    2. Anyway, that table is huge. It would probably be faster if split into two 256-entry tables, totaling 8 KiB which would fit into L1 cache.

      Please see the GitHub repository, function sse4_despace_branchless_mask8. It is a tad slower. There is an extra load, an extra mask, an extra shift and an extra XOR. All of these things appear to add up.

  13. What is / can you explain `despace_mask16`? I don’t understand the bit twiddling hack going on here to give the right constant to the shuffle unit.

    1. @Billy

      What is / can you explain `despace_mask16`?

      We are using the _mm_shuffle_epi8 intrinsic to move the bytes around so that flagged characters are omitted. This intrinsic takes an input register “a” as well as a control mask “b” and it outputs a new vector
      (a[b[0]], a[b[1]], a[b[2]], a[b[3]], …). Computing the mask “b” on the fly is hard given the locations of the characters, so we use a look-up table.

      It is not pretty.
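      For illustration, such a lookup table could be generated at startup like this (a sketch; `build_despace_table` is my naming, and the precomputed table in the repository may differ in its padding):

```c
#include <stdint.h>

// entry for a 16-bit whitespace mask: shuffle indexes that gather the
// bytes whose mask bit is 0 (the kept bytes) to the front
static uint8_t despace_table[1 << 16][16];

static void build_despace_table(void) {
  for (uint32_t mask = 0; mask < (1 << 16); mask++) {
    int pos = 0;
    for (int i = 0; i < 16; i++)
      if (!(mask & (1u << i))) despace_table[mask][pos++] = (uint8_t)i;
    // padding: output bytes past the kept count are never used,
    // since the pointer only advances by 16 - popcount(mask)
    while (pos < 16) despace_table[mask][pos++] = 0;
  }
}
```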

      1. “so we use a look-up table”

        Derp on my part — for some reason I read that initially as supplying that constant to _mm_shuffle_epi8 — I missed the load instruction. Lookup table makes sense, thanks!

  14. It looks to me like the SIMD code won’t copy the last few bytes if the size is not a multiple of 16. Am I missing something?
