Implementing ‘strlen’ using SVE

In C, the length of a string in marked by a 0 byte at the end of the string. Thus to determine the length of the string, one must scan it, looking for the 0 byte. Recent ARM processors have a powerful instruction set (SVE) that is well suited for such problems. It allows you to load large registers at once and to do wide comparisons (comparing many bytes at once).

Yet we do not want to read too much data. If you read beyond the string, you could hit another memory page and trigger a segmentation fault. This could crash your program.

Thankfully, SVE comes with a load instruction that would only fault on the ‘first active element’: as long as the first element you are loading is valid, then there is no fault.

With this in mind, a simple algorithm to compute the length of a C string is as follows:

  1. Load a register.
  2. Compare each byte in it to 0.
  3. If any comparison matches, then locate the match and return the corresponding length.
  4. If not, increment by the register size (given by svcntb()), and repeat.

Using intrinsics, the code looks as follows…

size_t sve_strlen(const char *s) {
  size_t len = 0;
  while (true) {
    svuint8_t input = svldff1_u8(svptrue_b8(), (const uint8_t *)s + len);
    svbool_t matches = svcmpeq_n_u8(svptrue_b8(), input, 0);
    if (svptest_any(svptrue_b8(), matches)) {
      return len + svlastb_u8(svbrka_z(matches, matches), svindex_u8(0, 1));
    }
    len += svcntb();
  }
}

In assembly, the code looks as follows…

mainloop:
        ldff1b  { z0.b }, p0/z, [x10, x8]
        add     x8, x8, x9
        cmpeq   p1.b, p0/z, z0.b, #0
        b.eq    .mainloop

I use BRKA (Break after first true condition) to locate the index of the first null character in the register when there is one. It creates a mask with a single true at the location where the nul appear. I then use LASTB (Extract last active element) combined with a constant (created with svindex_u8) which contains the byte values 0,1,2,3…. So if the nul appears at index 3, BRKA sets index 3 as true, then LASTB will extract the value at index 3 in the constant created with svindex_u8. Because the underlying register size is no larger than 256 bytes, we know that svindex_u8 cannot overflow (and wrap around 254,255, 0, 1, …).

Denis Yaroshevskiy proposed an alternative implementation which is simpler in how it handles the computation of the first nul in the register. Instead of my complicated approach, he recommends BRKB (Break before first true condition) which creates a mask where everything true prior to the first null character, and then he recommends calling CNTP (Count active elements).

size_t sve_strlen(const char *s) {
  size_t len = 0;
  while (true) {
    svuint8_t input = svldff1_u8(svptrue_b8(), (const uint8_t *)s + len);
    svbool_t matches = svcmpeq_n_u8(svptrue_b8(), input, 0);
    if (svptest_any(svptrue_b8(), matches)) {
      svbool_t before_nuls = svbrkb_z(svptrue_b8(), matches);
      return len + svcntp_b8(before_nuls, before_nuls);
    }
    len += svcntb();
  }
}

Benchmarking against your system’s strlen function is difficult methodologically. Nevertheless, we can measure the number of instructions retired for sizeable strings as an indication of the relative speed. Using GCC 11 on a graviton 3 system (Amazon), I get the following metrics for sizeable inputs:

system’s strlen 0.23 instructions per byte
SVE 0.15 instructions per byte
SVE (Yaroshevskiy) 0.15 instructions per byte

My methodology is not fine enough to distinguish between my SVE approach and Yaroshevskiy’s proposal. His proposal is simpler, however.

Though I do not advocate adopting SVE as a replacement for strlen at this time, the potential is interesting, considering that I threw together my implementation in minutes.

My source code is available.

Credit: Thanks to Denis Yaroshevskiy for inviting me to look at non-faulting loads.

Update: It turns out that strlen was one of the examples that ARM used in its slides presenting SVE. At a glance, their implementation looks like mine but with more sophistication.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

3 thoughts on “Implementing ‘strlen’ using SVE”

  1. As an aside, how much gain is there in eliminating the edge cases?

    When writing string-use-intense applications, I tended to allocate string buffers of a power-of-two size (256 bytes or less), from a string pool, and reallocate through a free-list. Did this for efficiency in allocation (measured). As a side-effect those page-aligned size-quantized buffers would fit your algorithm without edge cases. Has to be some gain there.

    How far does this go – if you design a string-class to take full advantage through eliminating edge-cases?

    When we allocate a string buffer, we could always zero-fill the buffer (there will always be a zero), or non-zero-fill the buffer (only zero belongs to the written data).

    Most application-strings are short – less than 256 bytes, and mostly less than 80 bytes. How much gain in unrolling the loop?

    Clearly for very long strings, the edge-case hardly matters. But most strings are short – near to the size of a vector-stride.

Leave a Reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.