Most people have a mental model of computation based on the Turing machine. The computer does one operation at a time. For example, maybe it adds two numbers and outputs the result.

In truth, most modern processor cores are superscalar. They execute several instructions per CPU cycle (e.g., 4 instructions). That is above and beyond the fact that many processors have several cores.

Programmers should care about superscalarity because it impacts performance significantly. For example, consider an array of integers. You can compute the gaps between the integers, `y[i+1]=x[i+1]-x[i]`, faster than you can recover the original values from the gaps, `x[i+1]=y[i+1]+x[i]`. That is because the processor can compute several gaps at once whereas it needs to recover the values in sequence (e.g., `x[i]` before `x[i+1]`).
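The contrast is easy to see in a scalar sketch (function names are mine, for illustration): every iteration of the gap loop is independent, while each iteration of the recovery loop must wait for the previous one.

```cpp
#include <cstddef>
#include <cstdint>

// Each iteration is independent: the processor (and compiler) can
// work on several gaps at once.
void compute_gaps(const int32_t *x, int32_t *y, size_t n) {
  for (size_t i = 1; i < n; i++) {
    y[i] = x[i] - x[i - 1];
  }
}

// Each iteration needs the previous result (x[i-1]) before it can
// produce x[i]: a serial dependency chain. The caller must set x[0].
void recover_values(const int32_t *y, int32_t *x, size_t n) {
  for (size_t i = 1; i < n; i++) {
    x[i] = y[i] + x[i - 1];
  }
}
```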

Superscalar execution is truly a wonderful piece of technology. It is amazing that our processors can reorder and regroup instructions without causing any bugs. And though you should be aware of it, it is mostly transparent: there is no need to rewrite your code to benefit from it.

There is another great modern feature that programmers need to be aware of: most modern processors support SIMD instructions. Instead of, say, adding two numbers, they can add two *vectors* of integers together. Recent Intel processors can add eight 32-bit integers using one instruction (`vpaddd`).
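As a sketch, here is the same idea with the 128-bit SSE2 subset, which is available on every x86-64 processor (so it compiles without special flags): one `paddd` instruction adds four 32-bit integers, and AVX2's `vpaddd` widens this to eight. The function name is mine, for illustration.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Add four 32-bit integers at once with a single paddd instruction.
// (AVX2's vpaddd does the same with eight integers per instruction.)
void add_four(const int32_t *a, const int32_t *b, int32_t *out) {
  __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
  __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));
  __m128i sum = _mm_add_epi32(va, vb);  // one SIMD addition
  _mm_storeu_si128(reinterpret_cast<__m128i *>(out), sum);
}
```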

It is even better than it sounds: SIMD instructions are superscalar too… so your processor could possibly add, say, sixteen 32-bit integers in one CPU cycle by executing two instructions at once. And it might yet squeeze in a couple of other instructions in the same CPU cycle!

Vectorization is handy to process images, graphics, arrays of data, and so on. However, unlike superscalar execution, vectorization does not come for free. The processor will not vectorize the computation for you. Thankfully, compilers and interpreters do their best to leverage SIMD instructions.

However, we are not yet at the point where compilers will rewrite your algorithms for you. If your algorithm does not take vectorization into account, it may not be possible for the compiler to help you in this regard.

An important problem when working with databases or search engines is the computation of the intersection between sorted arrays. For example, given {1, 2, 10, 32} and {2, 3, 32}, you want {2, 32}.

If you assume that you are interested in arrays having about the same length, there are clever SIMD algorithms to compute the intersection. Ilya Katsov describes an elegant approach for 32-bit integers. If your integers fit in 16 bits, Schlegel et al. have similar algorithms using special string comparison functions available on Intel processors.

These algorithms are efficient, as long as the two input arrays have similar length… But life is not so easy. In many typical applications, you frequently need to compute the intersection between arrays having vastly different lengths. Maybe one array contains a hundred integers and the other one thousand. In such cases, you should fall back on a standard intersection algorithm based on a binary search (a technique sometimes called “galloping”).
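A minimal sketch of such a search-based intersection (my own simplification, not the paper's algorithm): for each element of the small array, locate it in the large array. I use `std::lower_bound` here; a true galloping search would first double a probe index to narrow the range before bisecting.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Intersect a small sorted array with a much larger sorted array by
// searching for each element of the small array in the large one.
std::vector<int32_t> intersect_search(const std::vector<int32_t> &small_v,
                                      const std::vector<int32_t> &large_v) {
  std::vector<int32_t> out;
  auto pos = large_v.begin();  // searches resume past earlier matches
  for (int32_t key : small_v) {
    pos = std::lower_bound(pos, large_v.end(), key);
    if (pos == large_v.end()) break;  // nothing larger remains
    if (*pos == key) out.push_back(key);
  }
  return out;
}
```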

Or should you fall back? In a recent paper, SIMD Compression and the Intersection of Sorted Integers, we demonstrate the power of a very simple idea to design better intersection algorithms. Suppose that you are given the number 5 and you want to know whether it appears in the list {1,2,4,6,7,8,15,16}. You can try to do it by binary search, or do a sequential scan… or better yet, you can do it with a simple vectorized algorithm:

- First represent your single number as a vector made entirely of this value: 5 becomes {5,5,5,5,5,5,5,5}. Intel processors can do this operation very quickly with one instruction.
- Compare the two vectors {5,5,5,5,5,5,5,5} and {1,2,4,6,7,8,15,16} using one instruction. That is, you can check eight equalities at once, cheaply. In this instance, I would get {false,false,false,false,false,false,false,false}. It remains to check whether the resulting vector contains a true value, which can be done using yet another instruction.
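Here is a minimal SSE2 sketch of that membership test, with four lanes instead of eight since 128-bit SSE2 is available on every x86-64 processor (the function name is mine, not from the paper):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Check whether `key` occurs among four 32-bit integers:
// broadcast the key, compare all lanes at once, then test the mask.
bool contains4(int32_t key, const int32_t *block) {
  __m128i vkey = _mm_set1_epi32(key);  // {key, key, key, key}
  __m128i vblock =
      _mm_loadu_si128(reinterpret_cast<const __m128i *>(block));
  __m128i eq = _mm_cmpeq_epi32(vkey, vblock);  // lane-wise equality
  return _mm_movemask_epi8(eq) != 0;           // any lane matched?
}
```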

With this simple idea, we can accelerate a range of intersection algorithms with SIMD instructions. In our paper, we show that, on practical and realistic problems, you can double the speed of the state-of-the-art.

To learn more, you can grab our paper and check out our C++ code.

**Reference**: Daniel Lemire, "Accelerating intersections with SIMD instructions," in *Daniel Lemire's blog*, March 25, 2015.

**Further reading**: Leonid Boytsov, Efficient Intersections of compressed posting lists thanks to SIMD instructions.

Superscalar architectures are significantly underexploited in practice, both by compilers and by the microarchitectures themselves. While you gain some benefit automatically, the automatic exploitation still misses a large fraction of the internal parallelism available when you have several ALUs per core.

There is an informal discipline around writing C/C++ so that it exposes the ALU parallelism to the CPU while remaining a trivial, semantics-preserving reorganization of the code. Few people actually write code this way, and the idioms lie a bit outside how programmers have learned to write their code; most programmers do not even know how to do it. Nonetheless, coding for ALU saturation reliably delivers 1.5-2x performance relative to what would otherwise be considered optimized C algorithms. This kind of uplift is particularly noticeable on recent Intel cores like Haswell.

The underexploitation of superscalar microarchitectures has become worse as the number and complexity of the ALUs have increased.
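One common idiom of this kind, as a sketch of my own: splitting a reduction into several independent accumulators, so that multiple additions can be in flight at once instead of one serial chain.

```cpp
#include <cstddef>
#include <cstdint>

// Naive sum: a single serial dependency chain through `total`.
int64_t sum_naive(const int32_t *x, size_t n) {
  int64_t total = 0;
  for (size_t i = 0; i < n; i++) total += x[i];
  return total;
}

// Same result with four independent chains: the core's ALUs can
// advance four partial sums in parallel.
int64_t sum_unrolled(const int32_t *x, size_t n) {
  int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += x[i];
    s1 += x[i + 1];
    s2 += x[i + 2];
    s3 += x[i + 3];
  }
  for (; i < n; i++) s0 += x[i];  // leftover elements
  return s0 + s1 + s2 + s3;
}
```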

@Andrew

Agreed.

I would add that you do not need to write in C to optimize for superscalar execution. I bet that you can measure the effects in JavaScript these days…

Sounds interesting 🙂 Where can I find more info on this way of coding?

@Ingo

It is a good question.

Great article. For anyone looking to find out about techniques/examples of creative uses of SIMD code, I have had a lot of luck looking up literature on gaming engines. An example here is a talk at GDC from an engineer at Insomniac Games: https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf

Game engines are kind of ideal for SIMD usage given they do the same work on a large number of objects over and over. Personally I have had pretty decent luck (more hits than misses) in employing SIMD code in general applications to speed things up. One of the really nice things SIMD code does for you is to make you really think about your memory layout. This kind of data oriented design thinking in my experience leads to great speedups even when it turns out that you cannot use SIMD.

I have also found SIMD to be of great use in encoding/decoding columnar data. Given that most applications (even the ones without any disk/network IO) are memory-bandwidth bound, encoding columns of data to make them smaller is often a win. SIMD lends itself pretty well to interesting ways to encode and decode such data on the fly without having to resort to full-blown compression, which in many cases is way too expensive.

Daniel, I’d love to see more articles from you about creative uses of SIMD. You seem to have published a lot of papers where you’ve used SIMD instructions to great effect. Great article, and I hope for more.

I’ve been interested in this paper and library for a while. Unfortunately, the code isn’t portable: it doesn’t compile in MSVC, mostly because of a few lines of inline assembly (instead of intrinsics) and some compiler-specific features (__builtin_expect, etc.). Is there any plan to make this library portable so it can be more widely used?

@Andrew

Are you referring to the rules that allow for auto-vectorization by the compiler, like the ones mentioned in this blog series:

http://blogs.msdn.com/b/nativeconcurrency/archive/2012/04/12/auto-vectorizer-in-visual-studio-11-overview.aspx

@Philippecp

As I do not program using MSVC, I rely on contributors (like you) to help with portability.

We have MSVC support for other projects, such as this one…

https://github.com/lemire/FastPFor

With cmake, it is reasonably easy to make portable builds.

I believe that someone could probably get the project to build under MSVC with a few hours of work. If you are interested in helping out, please get in touch!

@Philippecp

I have checked in a revision. The code no longer relies on inline assembly and __builtin_expect, so it should definitely be easier to use with MSVC.

We still lack the build routines which could be provided with cmake.

Hi,

How about intersecting arrays of 64-bit integers? Any special trick, or just replace `_mm128` instructions with their `_mm256` counterparts, the `f` suffix for `float` with the `d` suffix for `double`, and so on?

Thanks,

Bogdan

Using 64-bit integers reduces the benefits of SIMD instructions. I would first try to reengineer the problem.

Thank you for your answer.

Unfortunately that is not possible — the values come from hashing a very large number of objects where the number of unique objects might exceed the range of a 32-bit integer. But I am still exploring possible options for re-engineering. I think even without a large benefit a SIMD implementation would still be faster than the scalar version.

If you have 64-bit hash values, what is the probability that they will collide on their least significant 32 bits? Is it 1/2^32? Or 2e-8 percent?

The author of XXHash (the one I use) claims that “no bit can be used as a possible predictor for another bit.”

So my initial guess would be 1/2^32.

However, I did a practical test and I got:

1.161% collision rate on the lower 32-bits for 100M unique 64-bit hash values

2.305% collision rate for 200M

3.431% collision rate for 300M

Which leaves the per-pair collision probability somewhere in the 1e-8 percent ballpark.

I am considering whether this is an artifact of how I’m doing the hashing. But this means that the SIMD 32-bit intersection would be somewhat lossy.
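For what it is worth, the measured rates agree with the standard birthday-problem estimate: throwing n uniformly random values into m = 2^32 buckets, the expected fraction of values that lose their distinctness is roughly 1 - (m/n)(1 - e^(-n/m)). A quick sketch to check this (my own helper, not from any library):

```cpp
#include <cmath>

// Expected fraction of n uniformly random values that collide when
// reduced to m = 2^32 buckets: 1 - (m/n) * (1 - exp(-n/m)).
double expected_collision_rate(double n) {
  const double m = 4294967296.0;  // 2^32
  return 1.0 - (m / n) * (1.0 - std::exp(-n / m));
}
```

For n = 100M this gives about 1.15%, close to the measured 1.161%; for n = 200M, about 2.29% against the measured 2.305%. So the hash function looks fine: the losses are inherent to truncating to 32 bits.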

Please see my post: Are 64-bit random identifiers free from collision? https://lemire.me/blog/2019/12/12/are-64-bit-random-identifiers-free-from-collision/