Intel will add deep-learning instructions to its processors

Some of the latest Intel processors support the AVX-512 family of vector instructions. These instructions operate on blocks of 512 bits (or 64 bytes). The benefit of such wide instructions is that even without increasing the processor clock speed, systems can still process a lot more data. Most code today operates on 64-bit words (8 bytes). In theory, keeping everything else constant, you could go 8 times faster by using AVX-512 instructions instead.
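As a rough illustration of where the factor of 8 comes from, the sketch below counts how many values of a given width fit in one 512-bit register. The helper name `lanes` is made up for this example, and the lane count is only an upper bound on the speedup, not a measurement.

```python
REGISTER_BITS = 512  # width of one AVX-512 register


def lanes(element_bits: int) -> int:
    """Return how many elements of the given width fit in one 512-bit register."""
    if element_bits <= 0 or REGISTER_BITS % element_bits != 0:
        raise ValueError("element width must evenly divide 512")
    return REGISTER_BITS // element_bits


# Print the lane count for a few common element widths.
for bits in (64, 32, 16, 8):
    print(f"{bits}-bit elements: {lanes(bits)} per register")
```

With 64-bit words, one instruction covers 8 values at once, which is the theoretical 8x. Narrower types, such as the 8-bit integers often used in deep learning, pack even more values per instruction.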

Of course, not all code can make use of vector instructions… but that’s not relevant. What matters is whether your “hot code” (where the processor spends much of its time) can benefit from them. In many systems, the hot code is made of tight loops that need to run billions of times. Just the kind of code that can benefit from vectorization!

The hottest trend in software right now is “deep learning”. It can be used to classify pictures, recognize speech or play the game of Go. Some say that the quickest “get rich quick” scheme right now is to launch a deep-learning venture, and get bought by one of the big players (Facebook, Google, Apple, Microsoft, Amazon). It is made easier by the fact that companies like Google have open-sourced some of their code, such as TensorFlow.

Sadly for Intel, it has been mostly left out of the game. Nvidia graphics processors are the standard off-the-shelf approach to running deep-learning code. That’s not to say that Intel lacks good technology. But for the kind of brute-force algebra that’s required by deep learning, Nvidia graphics processors are simply a better fit.

However, Intel is apparently preparing a counter-attack, of sorts. In September of this year, it discreetly revealed that its future processors will support dedicated deep-learning instructions. Intel’s AVX-512 family of instructions is divided into sub-families. There will be two new sub-families for deep learning: AVX512_4VNNIW and AVX512_4FMAPS.

12 thoughts on “Intel will add deep-learning instructions to its processors”

    1. Intel has Altera’s acquisition also up their sleeves. They could more easily build learning assists, AI systems, etc. with flexible hardware. The challenge with FPGAs today is the toolchain is at least 10 years behind standard C, Java, etc. compilers. So there is a lot of work to do. We kind of need something like the defunct project from IBM called LIME or Maxeler’s OpenSPL efforts.

      To me the magic of the human mind is that the software and the wetware evolve together. With silicon hardware and software there is a very slow mutual evolution.

  1. How exactly are these kinds of instructions fed from RAM?
    Because if you consume 64B per tick (simplified), I suppose the CPU runs out of data pretty quickly. RAM/cache bandwidth is way lower than this processing speed.

    1. We would need to see what the specifics are… My experience has been that the L1 cache is fast enough that even with AVX-512, cache speed is not a bottleneck. However, if you can’t load up the data in cache fast enough, my experience has been that RAM access speed is already a major bottleneck, even without any fancy instruction.

  2. If this is unspecific vector computation, it means deep learning is a hype label for what we have been enjoying with matlab/R/numpy vector computation for ages? OK, deep learning has high impact, but it is much more specific than “vector computation”.

    1. In other words, how will it compare to GPUs, performance-wise?

      Given that we do not even know what these instructions are, exactly, it is hard to know exactly.

      However, we can tell a few things from basic knowledge of Intel technology. Intel processors cannot compete on raw processing speed with GPUs. It could be that I am wrong, but I don’t think that these instructions will change this picture.

      GPUs are powerful, true… but they are also specialized. This makes them a poor fit for many common problems… whereas Intel’s CPUs are much more broadly applicable.

      So, what if you have problems where deep learning is only part of what your system has to do? Then maybe the GPU is no longer the best solution. Maybe some Intel processor with both general purpose and deep learning capabilities becomes the best bet.

      We should keep in mind that Intel’s money comes (largely) from cloud infrastructures. Intel has to convince people like Amazon and Google to buy its processors. These people do a lot of work besides deep learning.

  3. I was disappointed to not actually see the details of the instructions, but I guess we can always speculate.

    I suspect that these will be tailored towards 512-bit-wide dot-product-and-accumulate operations on int8, int16, fp16 and fp32 data (useful for inference), and possibly instructions to accelerate FFTs and convolutions (similar to the SSE4.2 string instructions, but for convolutions instead of string matching).

    With a 64 wide int8 one per cycle throughput dot product (128 ops), which is more and more feasible for inference (but not training) of neural networks, a 32 core system could perform 128×32 = 4096 int8 ops per cycle, or around 8 TOPS on a 2GHz system. This is less than the 40TOPS the dp4a instruction can get on a Titan X, but it’s at least in the same ballpark. It probably wouldn’t burn too much area either.

    A 32 width fp16 dot product operator (64 flops/cycle) would be at 4TFLOPs, which compares favorably with the 10TFLOPs available on a Titan X, but would take significantly more area and probably need a deeper pipeline.

    Direct convolution instructions would play to the strengths of the SIMD model (eg, the ability to parallel shift like in the string instructions), and may be able to provide impressive performance especially in int8 mode.
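The back-of-the-envelope throughput figures in the last comment can be reproduced with a few lines. This is only a sketch of the commenter's arithmetic: the 32 cores, the 2GHz clock and the per-cycle operation counts are the comment's own speculative assumptions, and the function name `peak_ops_per_second` is made up for this example.

```python
def peak_ops_per_second(ops_per_cycle_per_core: int, cores: int, clock_hz: float) -> float:
    """Theoretical peak: every core retires the given number of ops on every cycle."""
    return ops_per_cycle_per_core * cores * clock_hz


CORES = 32      # assumed core count (from the comment)
CLOCK_HZ = 2e9  # assumed 2GHz clock (from the comment)

# 64-wide int8 dot product: 64 multiplies + 64 adds = 128 ops per cycle.
int8_peak = peak_ops_per_second(128, CORES, CLOCK_HZ)

# 32-wide fp16 dot product: 32 multiplies + 32 adds = 64 flops per cycle.
fp16_peak = peak_ops_per_second(64, CORES, CLOCK_HZ)

print(f"int8 peak: {int8_peak / 1e12:.3f} TOPS")
print(f"fp16 peak: {fp16_peak / 1e12:.3f} TFLOPS")
```

These are peak numbers that assume one such instruction completes every cycle on every core, so real workloads limited by memory bandwidth would land lower.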
