Trimming spaces from strings faster with SVE on an Amazon Graviton 3 processor

Programmers sometimes need to trim, or remove, characters, such as spaces from strings. It might be a surprising expensive task. In C/C++, the following function is efficient:

size_t trimspaces(const char *s, size_t len, char *out) {
  char * init_out{out};
  for(size_t i = 0; i < len; i++) {
    *out = s[i];
    out += (s[i] != ' ');
  }
  return out - init_out;
}

Basically, we write all characters from the input, but we only increment the pointer if the input is not a space.

Amazon makes available new ARM-based systems relying on their Graviton 3 processors. These processors support advanced instructions called “SVE”. One very nice family of instructions to ‘compact’ values (effectively, remove unwanted values). I put it to good use when filtering out integer values from arrays.

Unfortunately, the SVE compact instructions cannot be directly applied to the problem of pruning spaces in strings, because they only operate on larger words (e.g., 32-bit words). But, fortunately, it is possible to load bytes directly into 32-bit values so that each byte value occupies 32-bit in memory using intrinsics functions such as svld1sb_u32. As you would expect, you can also do the reverse, and take an array of 32-bit values, and automatically convert it to a byte array (e.g., using svst1b_u32).

Thus I can take my byte array (a string), load it into temporary 32-bit vectors, prune these vectors, and then store the result as a byte array, back to a string. The following C code is a reasonable implementation of this idea:

size_t sve_trimspaces(const char *s, size_t len, char *out) {
  uint8_t *out8 = reinterpret_cast<uint8_t *>(out);
  size_t i = 0;
  for (; i + svcntw() <= len; i += svcntw()) {
   svuint32_t input = svld1sb_u32(svptrue_b32(), (const int8_t *)s + i);
   svbool_t matches = svcmpne_n_u32(svptrue_b32(), input, 32);
   svuint32_t compressed = svcompact_u32(matches, input);
   svst1b_u32(svptrue_b32(), out8, compressed);
   out8 += svcntp_b32(svptrue_b32(), matches);
  }
  if (i < len) {
   svbool_t read_mask = svwhilelt_b32(i, len);
   svuint32_t input = svld1sb_u32(read_mask, (const int8_t *)s + i);
   svbool_t matches = svcmpne_n_u32(read_mask, input, 32);
   svuint32_t compressed = svcompact_u32(matches, input);
   svst1b_u32(read_mask, out8, compressed);
   out8 += svcntp_b32(read_mask, matches);
  }
  return out8 - reinterpret_cast<uint8_t *>(out);
}

Is it faster? Using GCC 12 on a Graviton 3, I get that the SVE approach is 3.6 times faster and it uses 6 times fewer instructions. The SVE code is not six times faster, because it is retiring fewer instructions per cycle. My code is available.

conventional code 1.8 cycles/bytes 7 instructions/byte
SVE code 0.5 cycles/bytes 1.1 instructions/byte

You can achieve a similar result with ARM NEON, but you require large tables to emulate the compression instructions. Large tables lead to code bloat which, at the margin, is not desirable.

Daniel Lemire, "Trimming spaces from strings faster with SVE on an Amazon Graviton 3 processor," in Daniel Lemire's blog, March 10, 2023.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

13 thoughts on “Trimming spaces from strings faster with SVE on an Amazon Graviton 3 processor”

  1. I wonder if it would speed up Unicode whitespace removal or could be twisted into removing all whitespace type of characters (VT, HT, NL, CR, …)?

  2. The documentation I have seen shows that SVE does not need a drain operation, it can instead be predicated to do a partial operation for the remainder elements.

    See slides 32 and 33 of this SVE presentation for an example of what I mean.

    Is there a specific reason why you coded it do work in two steps? I could see this being a naive translation from SIMD (NEON) style to SVE, but I don’t think it is leveraging the full benefits of SVE predication.

    1. See slides 32 and 33 of this SVE presentation for an example of what I mean.

      If you write it in this manner, you will have an additional intrinsic function per loop (and the accompanying instruction) compared to my solution (focus on the main loop).

      Is there a specific reason why you coded it do work in two steps?

      For better performance. I have added the single-loop function to my benchmark, you can test it out and verify that it is slower.

      1. Thanks for the response, I suspected that may be the reason why.

        What if you use the optimization given on slide 34 to elide the i < len comparison? Does it catch back up? Sorry I do not have access right now to test it myself 🙂

        1. I have added the one loop alternative you pointed to. The performance is slightly improved but still very much inferior to the two-loop approach.

          I should point out that my expectation is that my implementation with a loop and a trail is probably too simple and could be optimized further (at the cost of greater code complexity).

          scalar
          cycles/bytes 1.75
          
          sve one loop
          cycles/bytes 0.70
          
          sve one loop alt
          cycles/bytes 0.66
          
          sve
          cycles/bytes 0.47 
          

          So, unfortunately, the pretty code that ARM is displaying on their slides is likely not optimal.

          Of course, this could change with future compilers and/or processors. For example, the compiler could do the unrolling for you.

      2. Also sorry for saying it was a somehow bad or naive translation from neon, clearly not the case.

        Enjoyed the post, thanks for sharing.

        1. Also sorry for saying it was a somehow bad or naive translation from neon, clearly not the case.

          You might be interested in going back in time to how I started with SVE:
          https://lemire.me/blog/2022/06/23/filtering-numbers-quickly-with-sve-on-amazon-graviton-3-processors/

          You might notice that my early code matched the ARM slides more than my recent code.

          So, in some sense, what looked to you like “naive” code is, from my point of view, more sophisticated (optimized) code…i.e., it will run faster.

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.