In C, the end of a string is marked by a 0 byte. Thus, to determine the length of a string, one must scan it, looking for the 0 byte. Recent ARM processors have a powerful instruction set (SVE) that is well suited for such problems. It allows you to load large registers at once and to do wide comparisons (comparing many bytes at once).
Yet we do not want to read too much data: if a read extends past the end of the string into an unmapped memory page, it triggers a segmentation fault and crashes the program.
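As a baseline, the naive scan checks one byte at a time and never reads past the terminator:

```c
#include <stddef.h>

// Byte-at-a-time scan: the safe, scalar baseline that the
// vectorized versions below try to beat.
static size_t scalar_strlen(const char *s) {
  size_t len = 0;
  while (s[len] != 0) {
    len++;
  }
  return len;
}
```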
Thankfully, SVE comes with a load instruction that faults only on the ‘first active element’: as long as the first element you are loading is valid, there is no fault.
With this in mind, a simple algorithm to compute the length of a C string is as follows:
- Load a register.
- Compare each byte in it to 0.
- If any comparison matches, then locate the match and return the corresponding length.
- If not, increment by the register size (given by svcntb()), and repeat.
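The same "load wide, test for zero, advance" pattern can be sketched in portable C, using 8-byte words in place of SVE registers and the classic zero-byte bit trick in place of a wide compare. This is illustrative only: without a first-faulting load, the code must align its wide loads so they stay within the string's pages.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Portable analogue of the SVE loop, 8 bytes at a time.
static size_t wordwise_strlen(const char *s) {
  size_t len = 0;
  // Scan byte by byte until the address is 8-byte aligned; an aligned
  // 8-byte load can never cross a page boundary.
  while (((uintptr_t)(s + len) & 7) != 0) {
    if (s[len] == 0) return len;
    len++;
  }
  for (;;) {
    uint64_t w;
    memcpy(&w, s + len, sizeof(w)); // aligned 8-byte chunk
    // Nonzero iff some byte of w is zero (standard bit trick).
    if ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) {
      // Locate the first zero byte within the chunk.
      for (int i = 0; i < 8; i++) {
        if (s[len + i] == 0) return len + i;
      }
    }
    len += 8;
  }
}
```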
Using intrinsics, the code looks as follows…
```c
size_t sve_strlen(const char *s) {
  size_t len = 0;
  while (true) {
    svuint8_t input = svldff1_u8(svptrue_b8(), (const uint8_t *)s + len);
    svbool_t matches = svcmpeq_n_u8(svptrue_b8(), input, 0);
    if (svptest_any(svptrue_b8(), matches)) {
      return len + svlastb_u8(svbrka_z(matches, matches), svindex_u8(0, 1));
    }
    len += svcntb();
  }
}
```
In assembly, the code looks as follows…
```asm
.mainloop:
        ldff1b  { z0.b }, p0/z, [x10, x8]
        add     x8, x8, x9
        cmpeq   p1.b, p0/z, z0.b, #0
        b.eq    .mainloop
```
I do not discuss the details of my function, but it implicitly assumes that the underlying register size is no larger than 256 bytes, which is guaranteed by the specification.
Benchmarking against your system’s strlen function is methodologically difficult. Nevertheless, we can measure the number of instructions retired on sizeable strings as an indication of relative speed. Using GCC 11 on a Graviton 3 system (Amazon), I get the following metrics:
| implementation | instructions per byte |
|---|---|
| system’s strlen | 0.23 |
| SVE | 0.15 |
Though I do not advocate adopting SVE as a replacement for strlen at this time, the potential is interesting, considering that I threw together my implementation in minutes.
Credit: Thanks to Denis Yaroshevskiy for inviting me to look at non-faulting loads.
Update: It turns out that strlen was one of the examples that ARM used in its slides presenting SVE. At a glance, their implementation looks like mine but with more sophistication.
Hello,
As per the specification, the SVE vector length cannot exceed 2048 bits (256 bytes), so svcntb() will never return more than 256.
Beyond the slide deck you linked, Arm has published several routines here: https://github.com/ARM-software/optimized-routines/tree/master/string/aarch64
Thanks: it is great to find out that my missing check was unnecessary.
As an aside, how much gain is there in eliminating the edge cases?
When writing string-intensive applications, I tended to allocate string buffers of a power-of-two size (256 bytes or less) from a string pool, and reallocate through a free list. I did this for allocation efficiency (measured). As a side effect, those page-aligned, size-quantized buffers would fit your algorithm without edge cases. There has to be some gain there.
How far does this go if you design a string class to take full advantage by eliminating edge cases?
When we allocate a string buffer, we could always zero-fill it (so a zero terminator is always present), or fill it with non-zero bytes (so the only zero belongs to the written data).
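One way such a pool makes the edge case vanish: if a buffer's size is a power of two no larger than a page and the buffer is aligned to its own size, it sits entirely within one page, so a wide load starting at its base can never touch the next page. A hypothetical sketch (the function name and the zero-fill policy are my own, not from the comment above):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical pool-style allocation: a 256-byte buffer aligned to
// 256 bytes lies entirely within one 4096-byte page, so a vector load
// of up to 256 bytes starting at the base cannot cross a page boundary.
char *alloc_string_buffer(void) {
  char *buf = aligned_alloc(256, 256); // C11; size is a multiple of alignment
  if (buf) memset(buf, 0, 256);        // zero-fill: a terminator always exists
  return buf;
}
```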
Most application strings are short: less than 256 bytes, and mostly less than 80 bytes. How much gain is there in unrolling the loop?
Clearly, for very long strings the edge case hardly matters. But most strings are short, close to the size of a single vector stride.