AVX-512: when and how to use these new instructions

Our processors typically do computations using small data stores called registers. On 64-bit processors, 64-bit registers are frequently used. Most modern processors also have vector instructions and these instructions operate on larger registers (128-bit, 256-bit, or even 512-bit). Intel’s new processors have AVX-512 instructions. These instructions are capable of operating on large 512-bit registers. They … Continue reading AVX-512: when and how to use these new instructions

Per-core frequency scaling and AVX-512: an experiment

Intel has fancy new instructions (AVX-512) that are powerful, in part for heavy numerical work. When a core uses these heaviest of these new instructions, the core’s frequency comes down to maintain the power usage within bounds. I wanted to test it out so I wrote a little threaded program. It runs on four threads … Continue reading Per-core frequency scaling and AVX-512: an experiment

AVX-512 throttling: heavy instructions are maybe not so dangerous

Recent Intel processors have fancy instructions operating over 512-bit registers. They are reported to cause a frequency throttling of the core where they are run, and possibly of other cores in some cases. Thus, it has been recommended to avoid AVX-512 instructions. I have written a series of blog posts on the topic trying to … Continue reading AVX-512 throttling: heavy instructions are maybe not so dangerous

Trying harder to make AVX-512 look bad: my quantified and reproducible results

Intel’s latest processors have fancy instructions part of the AVX-512 family. The AVX-512 instructions are useful for numerical work and sophisticated computing (e.g., cryptography, multimedia), but not necessarily useful for mundane tasks. Intel documents that the use of AVX-512 instructions can lower the frequency of the processor. How big the effect is depends on the … Continue reading Trying harder to make AVX-512 look bad: my quantified and reproducible results

The dangers of AVX-512 throttling: a 3% impact

Intel’s latest processors come with powerful new instructions from the AVX-512 family. These instructions operate over 512-bit registers. They use more power than regular (64-bit) instructions. Thus, on some Intel processors, the processor core that is using AVX-512 might run at a lower frequency, to keep the processor from overheating or using too much power. … Continue reading The dangers of AVX-512 throttling: a 3% impact

Vectorizing random number generators for greater speed: PCG and xorshift128+ (AVX-512 edition)

Most people designing random number generators program using regular code. If they are aiming for speed, they probably write functions in C. However, our processors have fast “vectorized” (or SIMD) instructions that can allow you to go faster. These instructions do several operations at once. For example, recent Skylake-X processors from Intel can do eight … Continue reading Vectorizing random number generators for greater speed: PCG and xorshift128+ (AVX-512 edition)

Arbitrary byte-to-byte maps using ARM NEON?

Modern processors have fast instructions that can operate on wide registers (e.g., 128-bit). ARM processors, the kind of processors found in your phone, have such instructions called “NEON”. Sebastian Pop pointed me to some of his work doing fast string transformations using NEON instructions. Sebastian has done some great work to accelerate the PHP interpreter … Continue reading Arbitrary byte-to-byte maps using ARM NEON?

Parsing JSON using SIMD instructions on the Apple A12 processor

Most modern processors have “SIMD instructions“. These instructions operate over wide registers, doing many operations at once. For example, you can easy subtract 16 values from 16 other values in a single SIMD instruction. It is a form of parallelism that can drastically improve the performance of some applications like machine learning, image processing or … Continue reading Parsing JSON using SIMD instructions on the Apple A12 processor

Really fast bitset decoding for “average” densities

Suppose I give you a word and you need to determine the location of the 1-bits. For example, given the word 0b100011001, you would like to get 0,3,4,8. You could check the value of each bit, but that would take too long. A better approach is use the fact that modern processors have fast instructions … Continue reading Really fast bitset decoding for “average” densities

How quickly can you compute the dot product between two large vectors?

A dot (or scalar) product is a fairly simple operation that simply sums the many products: float sum = 0; for (size_t i = 0; i < len; i++) { sum += x1[i] * x2[i]; } return sum; It is nevertheless tremendously important. You know these fancy machine learning algorithms we keep hearing about? If … Continue reading How quickly can you compute the dot product between two large vectors?

Validating UTF-8 strings using as little as 0.7 cycles per byte

Most strings found on the Internet are encoded using a particular unicode format called UTF-8. However, not all strings of bytes are valid UTF-8. The rules as to what constitute a valid UTF-8 string are somewhat arcane. Yet it seems important to quickly validate these strings before you consume them. In a previous post, I … Continue reading Validating UTF-8 strings using as little as 0.7 cycles per byte

Iterating over set bits quickly (SIMD edition)

Suppose that you have a long sequence of bits 10101011100000… you want to visit all the bits set to 1. That is, given 10101011100000, you would like to get the indexes of all the bits set to one: 0,2,4,6,7,8,9. In a recent blog post, I reviewed fast techniques to iterate over the position of the … Continue reading Iterating over set bits quickly (SIMD edition)

How fast can you bit-interleave 32-bit integers? (SIMD edition)

In a previous post, I asked how fast one could interleave the bits between two 32-bit integers. That is, given 0b1011 (11 in decimal) and 0b1100 (12 in decimal), I want the number 0b11011010. On recent (2013) Intel processors, the answer is that you process one pair of integers every two cycles using the pdep … Continue reading How fast can you bit-interleave 32-bit integers? (SIMD edition)

Don’t make it appear like you are reading your own recent writes

Richard Statin recently published a Java benchmark where the performance of a loop varies drastically depending on the size of the arrays involved. The loop is simple: for (int i = 0; i < a.length; ++i) { a[i] += s * b[i]; } If the array size is 1000, the performance is lower than if … Continue reading Don’t make it appear like you are reading your own recent writes

How should you build a high-performance column store for the 2020s?

Though most relational databases (like MySQL) are “row oriented”, in that they keep rows stored together… experience has taught us that pivoting the data arrow so that columns, and not rows, are stored together can be beneficial. This is an old observation that most experienced programmers know about as the array-of-struct versus struct-of-array choice. There … Continue reading How should you build a high-performance column store for the 2020s?

Can your C compiler vectorize a scalar product?

If you have spent any time at all on college-level mathematics, you have probably heard of the scalar product: float scalarproduct(float * array1, float * array2, size_t length) { float sum = 0.0f; for (size_t i = 0; i < length; ++i) { sum += array1[i] * array2[i]; } return sum; } Most people who … Continue reading Can your C compiler vectorize a scalar product?

Intel will add deep-learning instructions to its processors

Some of the latest Intel processors support the AVX-512 family of vector instructions. These instructions operate on blocks of 512 bits (or 64 bytes). The benefit of such wide instructions is that even without increasing the processor clock speed, systems can still process a lot more data. Most code today operators over 64-bit words (8 … Continue reading Intel will add deep-learning instructions to its processors

Faster dictionary decoding with SIMD instructions

A particularly fast and effective compression technique is dictionary coding. Intuitively, it works as follow. Suppose you are given a long document made of millions of words, but containing only 65536 distinct words. You can create a map from words to short integers or indexes (in [0,65536)). So the word “the” might be replaced by … Continue reading Faster dictionary decoding with SIMD instructions