Measuring energy usage: regular code vs. SIMD code

Modern processors have fancy instructions that can do many operations at once using wide registers: SIMD instructions. Intel and AMD processors have 512-bit registers and associated instructions under AVX-512.

You expect these instructions to draw more power (more energy per unit of time). However, they get the job done faster. Do you save energy overall? You should expect so.

Let us consider an example. I can just sum all values in a large array.

// Sum N floats, accumulating in double precision.
float sum(float *data, size_t N) {
  double counter = 0;
  for (size_t i = 0; i < N; i++) {
    counter += data[i];
  }
  return counter;
}

If I leave it as is, the compiler might be tempted to optimize too much, so I instruct it to avoid 'autovectorization': the compiler will not do anything fancy and will generate a plain scalar loop.

I can write the equivalent function using AVX-512 intrinsic functions. The details do not matter too much: the code loads 16 floats at a time, converts them to doubles, and accumulates. Just trust me that it is expected to be faster for sufficiently long inputs.

#include <immintrin.h> // AVX-512 intrinsics (requires AVX512F and AVX512DQ)

float sumvec(float *data, size_t N) {
  __m512d counter = _mm512_setzero_pd();
  for (size_t i = 0; i + 16 <= N; i += 16) {
    __m512 v = _mm512_loadu_ps(&data[i]);
    // Convert the 16 floats to doubles, eight at a time, and accumulate.
    __m512d part1 = _mm512_cvtps_pd(_mm512_extractf32x8_ps(v, 0));
    __m512d part2 = _mm512_cvtps_pd(_mm512_extractf32x8_ps(v, 1));
    counter = _mm512_add_pd(counter, part1);
    counter = _mm512_add_pd(counter, part2);
  }
  double sum = _mm512_reduce_add_pd(counter);
  // Handle the remaining (at most 15) values with a scalar loop.
  for (size_t i = N / 16 * 16; i < N; i++) {
    sum += data[i];
  }
  return sum;
}

Under Linux, we can ask the kernel about energy usage. Your software can query the energy usage of different components, but I query the overall usage per socket (the 'package'). It works well with Intel processors as long as you have privileged access on the system. I wrote a little benchmark that runs both functions.

On a 32-core Ice Lake processor, my results are as follows with a float array spanning about 500 megabytes.

routine      Power (muJ/ns)   Energy per value (muJ/value)
naive code   0.055            0.11
AVX-512      0.061            0.032

So the AVX-512 code uses about 3.5 times less energy overall, despite drawing about 10% more power.

My benchmark also reports the energy usage of the memory subsystem. In this particular case, the energy usage attributed to the memory (DRAM) is low, even negligible, according to the kernel.

My benchmark is naive and should only serve as an illustration. The general principle holds, however: if your tasks complete much faster, you are likely to use less energy overall, even if you are drawing more power (more energy per unit of time).

Daniel Lemire, "Measuring energy usage: regular code vs. SIMD code," in Daniel Lemire's blog, February 19, 2024.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

10 thoughts on “Measuring energy usage: regular code vs. SIMD code”

  1. interesting test and result. Naive ok, but how did you measure overall? Wall outlet?

    And the muJ/s come from the same source? Did you use crocodile plugs?

    Lovely unit btw, muJ/s; much better than Watthour.

  2. The table would be easier to grok if you used headers and moved the units into the header. Perhaps “Power (muJ/s)”, “Energy Per Value (muJ/value)”?

  3. Interesting comparison but the “general principle” does not really hold with quite a range of the most common processors. All the ones that can turbo boost will sacrifice the optimal efficiency configuration for better short-term latency (e.g. doubling energy usage for 20-30% gains). Maybe a more general principle might be that usually the wider you can run a computation with SIMD or multicore, the more efficient you can run it (for the same reason, it’s better to invest the energy budget in more parallelism than into more freq). Does not work for all kinds of computation, though, and improvements might be diminishing.

    1. I think it does hold since nobody mentioned how “complete much faster” is to be achieved, with or without turbo-boost or with or without SIMD.

      Even with turbo-boost disabled results aren’t likely to change relative to each other, since disabling the turbo-boost will also affect the overall CPU performance. And thus will also negatively impact the performance of SIMD.

      So, generally speaking, relative performance will remain the same with or without turbo-boost and with or without SIMD.

  4. Ice Lake is known to downclock when running AVX-512 instructions.

    Ran the benchmark within docker on 7950x3d /w 5200MT/s with the following results:

    Number of RAPL packages: 1
    Initializing RAPL package 0
    Initializing array

    Testing package (CPU socket)
    trial 0
    sum 120040000.000000
    package 0 energy (uj): 3286604, per iter 0.027379 per nano 0.046967
    sumvec 120040000.000000
    package 0 energy (uj): 580185, per iter 0.004833 per nano 0.052460

    trial 1
    sum 120040000.000000
    package 0 energy (uj): 3615551, per iter 0.030120 per nano 0.052015
    sumvec 120040000.000000
    package 0 energy (uj): 528782, per iter 0.004405 per nano 0.051942

    trial 2
    sum 120040000.000000
    package 0 energy (uj): 3515305, per iter 0.029284 per nano 0.050362
    sumvec 120040000.000000
    package 0 energy (uj): 543017, per iter 0.004524 per nano 0.050924

    Testing core (CPU)
    trial 0
    sum 120040000.000000
    package 0 energy (uj): 36588, per iter 0.000305 per nano 0.000524
    sumvec 120040000.000000
    package 0 energy (uj): 8621, per iter 0.000072 per nano 0.000781

    trial 1
    sum 120040000.000000
    package 0 energy (uj): 13885, per iter 0.000116 per nano 0.000200
    sumvec 120040000.000000
    package 0 energy (uj): 686, per iter 0.000006 per nano 0.000064

    trial 2
    sum 120040000.000000
    package 0 energy (uj): 5493, per iter 0.000046 per nano 0.000079
    sumvec 120040000.000000
    package 0 energy (uj): 443, per iter 0.000004 per nano 0.000044

    1. Thanks for publishing your numbers.

      Ice Lake is known to downclock when running AVX-512 instructions.

      That is not my experience. The sustained frequency is not directly affected by AVX-512 per se. On all Intel processors, however, there is a small penalty when you start using AVX-512 instructions, but it is not very significant unless you are benchmarking tiny tasks.

  5. Aloha
    Abstract—On-chip thermal hotspots are becoming one of the primary
    design concerns for next generation processors. Industry chip design
    trends coupled with post-Dennard power density scaling has led to a
    stark increase in localized and application-dependent hotspots. These
    “advanced” hotspots cause a variety of adverse effects if untreated, ranging from dramatic performance loss, incorrect circuit operation, and reduced device lifespan.
    https://sites.tufts.edu/tcal/files/2021/11/HotGauge_IISWC_2021.pdf

    https://sites.tufts.edu/tcal/files/2021/11/HotGauge_IISWC_2021_slides.pdf

    https://youtu.be/61VJ7KJAgnM?si=s2i3FQ3DsYbK1EAB&t=65

    To fix it — https://labpartnering.org/patents/US11671054
    an energy reservoir, the missing aspect of previously attempted adiabatic computational systems.
