When you start a program, it creates a ‘process’ which owns its memory. Memory is allocated to a process in blocks called ‘pages’. These pages might span 4 kB, 16 kB or more. For a given process, it is safe to read and write within these pages.
In your code, you might allocate a 32-byte array. How much memory does the array require? The answer is that the allocation might require no extra memory, because the process already had the room it needed in its pages, or it might prompt the operating system to grant the process several more pages. Similarly, ‘freeing’ the array does not (generally) return the memory to the operating system.
In general, the operating system and the processor do not care where your program reads and writes within the pages allocated to it. These pages form the ‘segment’ that the process owns. When you access a forbidden page, one that was not allocated to your process, you normally get a segmentation fault. Most of the time, it means that your program crashes.
Interestingly, if I allocate a small array and then read beyond its bounds, the program will often not immediately crash and may even remain correct. Thus you can allocate an array with room for two integers, read three integers from it, and your program might ‘work’. The following function is wrong, but it might very well work fine for years… (and then it will mysteriously crash)
int f() {
  int *data = new int[2];
  data[0] = 1;
  data[1] = 2;
  int x = data[0];
  int y = data[1];
  int z = data[2]; // out of bounds: the array holds only two integers
  delete[] data;
  return x + y;
}
But why would you ever want to read beyond the bounds of the allocated memory? For performance and/or convenience.
Modern processors have vector instructions designed to work on ‘large registers’. For example, recent Intel processors have 512-bit (AVX-512) registers. Such a register can store sixteen standard 32-bit floating-point values. The following code computes the dot product between two vectors very quickly… (read the comments in the code to follow along)
float dot512fma(float *x1, float *x2, size_t length) {
  // create a vector of 16 32-bit floats (zeroed)
  __m512 sum = _mm512_setzero_ps();
  for (size_t i = 0; i < length; i += 16) {
    // load 16 32-bit floats
    __m512 v1 = _mm512_loadu_ps(x1 + i);
    // load 16 32-bit floats
    __m512 v2 = _mm512_loadu_ps(x2 + i);
    // do sum[j] += v1[j] * v2[j] for each lane j (fused multiply-add)
    sum = _mm512_fmadd_ps(v1, v2, sum);
  }
  // reduce: sum all sixteen lanes into a single float
  return _mm512_reduce_add_ps(sum);
}
There is a problem with this code, however. If length is not a multiple of sixteen, the last iteration reads past the end of the arrays. This might or might not crash the program, and it may give wrong results.
What might you do? With earlier processors, you had to handle the leftover elements (the ‘tail’) in an ad hoc manner, as sketched below.
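For instance, the traditional approach is to vectorize the bulk of the work and finish with a scalar loop over the leftover elements. A sketch, using the same AVX-512 intrinsics as above for the vectorized part:

#include <immintrin.h>
#include <cstddef>

float dot_with_scalar_tail(float *x1, float *x2, size_t length) {
  __m512 sum = _mm512_setzero_ps();
  size_t i = 0;
  // vectorized part: 16 floats at a time
  for (; i + 16 <= length; i += 16) {
    __m512 v1 = _mm512_loadu_ps(x1 + i);
    __m512 v2 = _mm512_loadu_ps(x2 + i);
    sum = _mm512_fmadd_ps(v1, v2, sum);
  }
  float total = _mm512_reduce_add_ps(sum);
  // ad hoc scalar tail: up to 15 leftover elements
  for (; i < length; i++) {
    total += x1[i] * x2[i];
  }
  return total;
}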
The solution today is to use masked loads: you load just the data that is safe to read into a wide register. With AVX-512, there is hardly any overhead, except for the computation of the mask itself. You might code it as follows:
float dot512fma(float *x1, float *x2, size_t length) {
  // create a vector of 16 32-bit floats (zeroed)
  __m512 sum = _mm512_setzero_ps();
  size_t i = 0;
  for (; i + 16 <= length; i += 16) {
    // load 16 32-bit floats
    __m512 v1 = _mm512_loadu_ps(x1 + i);
    // load 16 32-bit floats
    __m512 v2 = _mm512_loadu_ps(x2 + i);
    // do sum[j] += v1[j] * v2[j] for each lane j (fused multiply-add)
    sum = _mm512_fmadd_ps(v1, v2, sum);
  }
  if (i < length) {
    // masked load: only the first length-i floats are read,
    // the remaining lanes are automatically set to zero
    __m512 v1 = _mm512_maskz_loadu_ps((1 << (length - i)) - 1, x1 + i);
    __m512 v2 = _mm512_maskz_loadu_ps((1 << (length - i)) - 1, x2 + i);
    // do sum[j] += v1[j] * v2[j] for each lane j (fused multiply-add)
    sum = _mm512_fmadd_ps(v1, v2, sum);
  }
  // reduce: sum all sixteen lanes into a single float
  return _mm512_reduce_add_ps(sum);
}
As you can see, I have added a final branch that works with a computed mask. The mask is computed arithmetically, with an admittedly ugly formula: (1 << (length - i)) - 1. For example, if three floats remain (length - i is 3), the formula gives 0b111, so only the first three lanes are loaded.
It is also possible to use just one loop and to update the mask for the final iteration, as sketched below, but it makes the code slightly harder to understand in my view.
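For reference, a sketch of that single-loop variant: the mask is recomputed at each iteration and is all ones except possibly on the last pass.

#include <immintrin.h>
#include <cstddef>

float dot512fma_oneloop(float *x1, float *x2, size_t length) {
  __m512 sum = _mm512_setzero_ps();
  for (size_t i = 0; i < length; i += 16) {
    size_t remaining = length - i;
    // full mask (0xFFFF) for complete blocks, partial mask at the end
    __mmask16 mask = (remaining >= 16) ? (__mmask16)0xFFFF
                                       : (__mmask16)((1 << remaining) - 1);
    __m512 v1 = _mm512_maskz_loadu_ps(mask, x1 + i);
    __m512 v2 = _mm512_maskz_loadu_ps(mask, x2 + i);
    sum = _mm512_fmadd_ps(v1, v2, sum);
  }
  return _mm512_reduce_add_ps(sum);
}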
In any case, using masks, I achieve high performance: I use fast AVX-512 instructions throughout. It is also convenient: I can write the entire function with the same coding style.
A sensible question is whether these masked loads and stores are really safe with respect to segmentation faults. You can check: repeatedly write and load at the very end of a page with masked instructions, so that the unmasked portion of the register would spill into the next page, and move from page to page until you reach one that is not mapped. I have written a small C++ test along these lines. The masked accesses never fault: the following code always crashes at the last line of the loop, where a plain (unmasked) one-byte access touches the next page and status has the value ‘RIGHT_AFTER’.
uint8_t *data = new uint8_t[1024];
size_t ps = page_size();
// round up to the end of the page:
uintptr_t page_limit = ps - (reinterpret_cast<uintptr_t>(data) % ps) - 1;
__m128i ones = _mm_set1_epi8(1); // register filled with ones
for (int z = 0;; z++) {
  status = RIGHT_BEFORE;
  data[z * ps + page_limit] = 1;
  status = AVX512_STORE;
  _mm_mask_storeu_epi8(data + z * ps + page_limit, 1, ones);
  status = AVX512_LOAD;
  __m128i oneandzeroes = _mm_maskz_loadu_epi8(1, data + z * ps + page_limit);
  status = RIGHT_AFTER;
  data[z * ps + page_limit + 1] = 1;
}
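The snippet relies on a page_size() helper not shown above; on a POSIX system, it might look as follows:

#include <unistd.h>
#include <cstddef>

// ask the operating system for the size of a memory page (often 4096 bytes)
size_t page_size() { return static_cast<size_t>(sysconf(_SC_PAGESIZE)); }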
What about ARM processors? Thankfully, you can do much the same thing with the Scalable Vector Extension (SVE), available on Amazon’s Graviton processors. A dot product might look as follows:
float dotsve(float *x1, float *x2, int64_t length) {
  int64_t i = 0;
  // vector of 32-bit floats, zeroed
  svfloat32_t sum = svdup_n_f32(0);
  // main loop: svcntw() is the number of 32-bit lanes per register
  while (i + svcntw() <= length) {
    svfloat32_t in1 = svld1_f32(svptrue_b32(), x1 + i);
    svfloat32_t in2 = svld1_f32(svptrue_b32(), x2 + i);
    // sum = in1 * in2 + sum, lane by lane (fused multiply-add)
    sum = svmad_f32_m(svptrue_b32(), in1, in2, sum);
    i += svcntw();
  }
  // tail: the predicate (mask) is true only for lanes where i < length,
  // inactive lanes are loaded as zero
  svbool_t while_mask = svwhilelt_b32(i, length);
  do {
    svfloat32_t in1 = svld1_f32(while_mask, x1 + i);
    svfloat32_t in2 = svld1_f32(while_mask, x2 + i);
    sum = svmad_f32_m(svptrue_b32(), in1, in2, sum);
    i += svcntw();
    while_mask = svwhilelt_b32(i, length);
  } while (svptest_any(svptrue_b32(), while_mask));
  // reduce: sum all lanes
  return svaddv_f32(svptrue_b32(), sum);
}
It is the same algorithm. One difference is that SVE has its own intrinsic functions to generate masks (predicates). Furthermore, while AVX-512 allows you to pick different register sizes, SVE hides the register size away, so that your binary code should run irrespective of the register size. SVE also has ‘non-faulting’ loads and other options.
My code is available: both the AVX-512 code and the ARM/SVE2 code (in a separate directory). You may need access to AWS (Amazon) to run the ARM/SVE2 code.
It seems that the footnote is missing, and another sentence is truncated: “It could also discourage the”.
Thanks. These were additional edits that I forgot to discard.
Pretty cool!
Yeah, reading the tail of a stream is often tricky to get both fast and correct. I’ve seen a similar problem with hash functions for short strings and came up with a different approach (probably also applicable here), where we read “garbage” memory but can guarantee it’s “safe” by checking whether we’re close to the page boundary (a sketch follows below). I think the t1ha folks included that in their project. See here if you’re curious: https://marcinzukowski.github.io/hash-tailer/.
And thanks for your awesome blog posts in general! 🙂
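If I understand the linked trick correctly, an over-wide read is harmless as long as it does not cross into the next page, so safety reduces to an address computation. A sketch (the function name is hypothetical):

#include <cstddef>
#include <cstdint>

// a width-byte load starting at p cannot fault if it fits within the page
// holding p: the extra "garbage" bytes live on the same, mapped page
bool full_load_is_safe(const void *p, size_t width, size_t page) {
  uintptr_t addr = reinterpret_cast<uintptr_t>(p);
  return (addr % page) + width <= page;
}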
Thanks Marcin. You may not remember, but we once spent a week together in Germany!
Of course, Dagstuhl, one of my fave places in the world 🙂
https://lemire.me/img/news/2018/dagstuhl.18251.04.jpg
I’ve mitigated this problem by manually calling resize() on vectors to enlarge them sufficiently beyond their size such that over-reads are guaranteed to not segfault, but that’s an expensive solution if it causes a large copy.
A better solution seems to be to align every allocation that might be accessed with vectorized loads to the largest vector size that it will be accessed with, because then it’s guaranteed that reads of that size will not over-run a page. Easy to do in C with aligned_alloc, unfortunately pretty un-ergonomic to do with vectors in C++. C++17’s aligned new helps, but it’s still messy (a sketch follows below): https://stackoverflow.com/questions/60169819/modern-approach-to-making-stdvector-allocate-aligned-memory
I have not considered any potential downsides of having most large allocations aligned to __m512i, if there are any.
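A sketch of that idea, assuming C++17’s std::aligned_alloc and 64-byte (__m512i) alignment:

#include <cstddef>
#include <cstdlib>

// with a 64-byte-aligned base and 64-byte-aligned loads, each load lies
// entirely within one page (pages are a multiple of 64 bytes), so a final
// load whose first element is in bounds cannot touch an unmapped page
float *alloc_for_avx512(size_t n) {
  size_t bytes = n * sizeof(float);
  // std::aligned_alloc requires a size that is a multiple of the alignment
  size_t rounded = (bytes + 63) / 64 * 64;
  return static_cast<float *>(std::aligned_alloc(64, rounded));
}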
Overallocation is certainly a viable approach but you cannot always rely on it being done (by libraries you are using).
What I’ve found works really well for me at my current job is to write the main part of the loop as an inline function taking the __m512/svfloat32_t registers and any mask/predicates as variables. This allows me to not repeat myself for the main part of the loop and the remainders.
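Something like the following sketch, presumably (AVX-512 flavour, with hypothetical names):

#include <immintrin.h>
#include <cstddef>

// the loop body, written once: the mask is a parameter, so the same code
// serves both the full iterations and the remainder
static inline __m512 dot_step(const float *x1, const float *x2,
                              __mmask16 mask, __m512 sum) {
  __m512 v1 = _mm512_maskz_loadu_ps(mask, x1);
  __m512 v2 = _mm512_maskz_loadu_ps(mask, x2);
  return _mm512_fmadd_ps(v1, v2, sum);
}

float dot512fma_shared(const float *x1, const float *x2, size_t length) {
  __m512 sum = _mm512_setzero_ps();
  size_t i = 0;
  for (; i + 16 <= length; i += 16) {
    sum = dot_step(x1 + i, x2 + i, (__mmask16)0xFFFF, sum); // full mask
  }
  if (i < length) { // remainder: partial mask
    sum = dot_step(x1 + i, x2 + i, (__mmask16)((1 << (length - i)) - 1), sum);
  }
  return _mm512_reduce_add_ps(sum);
}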
I agree that it seems like a good design.
But doesn’t that assume that the versions with and without _maskz (on x86) are equivalent in performance? Perhaps they are, but I’d double check 🙂
You should always measure but I believe it is true (so far) and that’s what I claim in the blog post. The computation of the masks is not free, however.
[A little late to the party…] Sometimes, those masked loads/stores can also become hot when the number of processed elements is low (<16, so the main loop never executes). I solved such an issue simply by allocating 16 dummy bytes (for SSE) at the end, to avoid access violations from crossing page boundaries.
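A sketch of that padding trick (names are illustrative):

#include <cstddef>
#include <cstdint>
#include <cstring>

// over-allocate by 16 dummy bytes so that a full 128-bit (SSE) load starting
// at any valid byte of the logical buffer stays within the allocation
uint8_t *alloc_padded(size_t n) {
  uint8_t *buf = new uint8_t[n + 16];
  std::memset(buf + n, 0, 16); // keep the padding deterministic
  return buf;
}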