Estimating your memory bandwidth

One of the limitations of a computer is its memory bandwidth. For the scope of this article, I define “memory bandwidth” as the maximal number of bytes you can bring from memory to the CPU per unit of time. E.g., if your system has 5 GB/s of bandwidth, you can read up to 5 GB from memory in one second.

To measure this memory bandwidth, I propose reading data sequentially. For example, you may use a function that sums the byte values in a large array. It is not necessary to sum every byte value; you can skip some, because the processor loads memory in units of cache lines. I do not know of a system that uses cache lines smaller than 64 bytes, so reading one value every 64 bytes ought to be enough.

#include <cstddef>
#include <cstdint>

// Sum one byte per cache line over data[start, end).
uint64_t sum(const uint8_t *data,
    size_t start, size_t end, size_t skip) {
  uint64_t sum = 0;
  for (size_t i = start; i < end; i += skip) {
    sum += data[i];
  }
  return sum;
}

A single thread may not be enough to maximize bandwidth usage: your system surely has several cores, so we should use multiple threads. The following C++ code divides the input into consecutive segments and assigns one thread to each segment, dividing up the task as fairly as possible:

std::vector<std::thread> threads;
size_t segment_length = data_volume / threads_count;
size_t cache_line = 64;
for (size_t i = 0; i < threads_count; i++) {
  // The last thread picks up any remainder left by the integer division.
  size_t begin = segment_length * i;
  size_t end = (i + 1 == threads_count) ? data_volume : begin + segment_length;
  threads.emplace_back(sum, data, begin, end, cache_line);
}
for (std::thread &t : threads) {
  t.join();
}
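
For reference, here is a minimal self-contained sketch showing how the pieces might fit together and how the bandwidth figure is derived from the elapsed time. The buffer size and thread count are illustrative choices, not values from this post, and the per-thread sums are collected so that the compiler cannot discard the reads:

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// One byte per cache line over data[start, end), as above.
uint64_t sum(const uint8_t *data,
    size_t start, size_t end, size_t skip) {
  uint64_t s = 0;
  for (size_t i = start; i < end; i += skip) {
    s += data[i];
  }
  return s;
}

int main() {
  const size_t data_volume = size_t(1) << 30; // 1 GiB buffer (illustrative)
  const size_t threads_count = 16;            // illustrative thread count
  const size_t cache_line = 64;
  std::vector<uint8_t> data(data_volume, 1);  // touch the pages up front

  std::vector<uint64_t> results(threads_count);
  std::vector<std::thread> threads;
  size_t segment_length = data_volume / threads_count;

  auto start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < threads_count; i++) {
    size_t begin = segment_length * i;
    size_t end = (i + 1 == threads_count) ? data_volume : begin + segment_length;
    threads.emplace_back([&results, &data, i, begin, end, cache_line] {
      results[i] = sum(data.data(), begin, end, cache_line);
    });
  }
  for (std::thread &t : threads) {
    t.join();
  }
  auto finish = std::chrono::steady_clock::now();

  double seconds = std::chrono::duration<double>(finish - start).count();
  uint64_t total = 0;
  for (uint64_t r : results) { total += r; }
  // Every cache line of the buffer is brought in once, so the volume read
  // is (approximately) the buffer size.
  std::printf("checksum: %llu, estimated bandwidth: %.1f GB/s\n",
              (unsigned long long)total, data_volume / seconds / 1e9);
  return 0;
}

Compile with optimizations (e.g., -O2) and repeat the measurement a few times; the best run is usually the most representative of the hardware limit.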

I ran this code on a server with two Intel Ice Lake processors. The more threads I use, the more bandwidth I get, up to around 15 threads: I start out at 15 GB/s and go up to over 130 GB/s. Once I reach about 20 threads, it is no longer possible to get more bandwidth out of the system. The system has a total of 64 cores over two CPUs. My program does not do any fiddling with pinning threads to cores, and it is not optimized for NUMA. Transparent huge pages are enabled by default on this Linux system.

My benchmark ought to make it easy for the processor to maximize bandwidth usage, so I would not expect more complicated software to hit a bandwidth limit with as few as 20 threads.

My source code is available.

This machine has two NUMA nodes. You can double the bandwidth by running a copy of the benchmark on each of the two NUMA nodes. E.g., under Linux you might call:

numactl --cpunodebind=1 --membind=1 ./bandwidth  & numactl --cpunodebind=0 --membind=0 ./bandwidth
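
Here the --cpunodebind and --membind options confine each copy of the benchmark to the CPUs and the memory of a single NUMA node, so each process streams only from its own node's local memory.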

Be aware that NUMA has some downsides. For example, the communication between NUMA nodes is relatively expensive.

Further reading: Many tools to measure bandwidth. I also wrote a second blog post on this theme: How much memory bandwidth do large Amazon instances offer?

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

5 thoughts on “Estimating your memory bandwidth”

  1. Why 15 threads? What becomes a bottleneck at that point: is it the memory unit, the shared memory controller, the CPU, or something else?

    1. The physical operating frequency of the memory chips and the bit width of the memory bus determine the ultimate limits. For example, DDR4 has a bus width of 64 bits per channel (72 bits with ECC), and its typical data rate at the end of the DDR4 lifecycle is 3200 MT/s. If a server CPU contains a 4-channel memory controller, the total bus width is 256 bits. At 3200 MT/s, the theoretical bandwidth of the memory bus is (64 * 4 / 8) * 3200e6 / 1e9 = 102.4 GB/s. Note that by convention, memory bandwidth is almost always reported in GB/s, not GiB/s. This relationship can be verified directly by downclocking (or overclocking) the memory and then re-running the tests. (The arithmetic here, and in the McCalpin quote below, is recomputed in the sketch at the end of this comment.)

      Next, usable bandwidth depends somewhat on the memory controller itself. From experience, the measured throughput is always somewhat lower, usually around 80% of the theoretical maximum, due to memory controller inefficiencies. Switching between reads and writes in a read-write sequence also has a measurable overhead compared to a read-only test.

      Then the final question is: why 15 threads? 15 is an arbitrary number, so a better question is: why not 1 thread? The answer is that a single CPU core simply does not have sufficient memory-level parallelism and cannot generate enough pending memory requests to utilize the bandwidth. For example, to quote McCalpin:

      (A) For a single thread you are almost always working in a concurrency-limited regime. For example, my Intel Xeon E5-2680 (Sandy Bridge EP) dual-socket systems have an idle memory latency to local memory of 79 ns and a peak DRAM bandwidth of 51.2 GB/s (per socket). If we assume that some cache miss handling buffer must be occupied for approximately one latency per transaction, then queuing theory dictates that you must maintain 79 ns * 51.2 GB/s = 4045 Bytes “in flight” at all times to “fill the memory pipeline” or “tolerate the latency”. This rounds up to 64 cache lines in flight, while a single core only supports 10 concurrent L1 Data Cache misses. In the absence of other mechanisms to move data, this limits a single thread to a read bandwidth of 10 lines * 64 Bytes/line / 79 ns = 8.1 GB/s. –

      Vectorized code can obtain significantly higher single-core memory bandwidth because it touches more cache lines. But for multi-core scaling, this only means that bandwidth saturation occurs earlier, so its usefulness is limited for bandwidth-limited code.

      If anyone is interested in memory bandwidth, they should just read every single article from the last 10 years on John “Dr. Bandwidth” McCalpin’s blog. His STREAM Triad benchmark was highly influential in raising awareness of memory bandwidth in the 1990s; it is still a common test, and McCalpin is still working in the same field today.

      The evolution of single-core bandwidth in multicore processors
      https://sites.utexas.edu/jdm4372/2023/04/25/the-evolution-of-single-core-bandwidth-in-multicore-processors/
      Notes on “non-temporal” (aka “streaming”) stores https://sites.utexas.edu/jdm4372/2018/01/01/notes-on-non-temporal-aka-streaming-stores/
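
      To make this arithmetic easy to reproduce, here is a small sketch that recomputes the figures quoted above; the DDR4 parameters (64-bit channels, 4 channels, 3200 MT/s) and the Sandy Bridge numbers (79 ns latency, 51.2 GB/s per socket, 10 outstanding L1D misses) are taken from this comment, not measured:

      #include <cmath>
      #include <cstdio>

      int main() {
        // Theoretical peak: (bus width in bytes) * channels * (transfers per second).
        double bus_width_bits = 64, channels = 4, transfers_per_s = 3200e6;
        double peak = (bus_width_bits / 8) * channels * transfers_per_s;
        std::printf("theoretical peak: %.1f GB/s\n", peak / 1e9);   // 102.4 GB/s

        // Concurrency needed to sustain that bandwidth:
        // bytes in flight = latency * bandwidth.
        double latency = 79e-9, socket_bw = 51.2e9, cache_line = 64;
        double bytes_in_flight = latency * socket_bw;               // ~4045 bytes
        std::printf("cache lines in flight: %.0f\n",
                    std::ceil(bytes_in_flight / cache_line));       // 64 lines

        // With only 10 outstanding L1D misses, a single core is limited to:
        double single_core = 10 * cache_line / latency;             // ~8.1 GB/s
        std::printf("single-core limit: %.1f GB/s\n", single_core / 1e9);
        return 0;
      }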

    1. Lower memory bandwidth is the most obvious NUMA bottleneck. For local memory accesses on NUMA systems, the CPU uses its own memory controller directly. But to access remote memory owned by another CPU socket, the traffic is first read by the remote CPU and then passed across the NUMA interconnect. Thus, the ideal scenario for NUMA is that a bandwidth-heavy application splits its working set evenly across the sockets (usually using “first-touch” allocation). Since each socket has its own memory controller, memory bandwidth adds up linearly, and the system's memory bandwidth is the per-socket bandwidth multiplied by the number of sockets.

      Problems arise when the application does not split its working set properly. If you have a multi-socket system but all the memory is owned by a single socket, which is a common occurrence in NUMA-unaware code when data initialization is done only in the main thread (and without OS-level workarounds such as forced NUMA interleaving), then all the stress is placed on the memory controller of that single socket. In the worst case, when all sockets are busy at the same time, you only get single-socket bandwidth, nullifying the benefit of NUMA entirely: total bandwidth drops to 50% on 2-socket systems and to 25% on 4-socket systems.

      If only one socket is busy at a time, there is still a significant penalty. Ignoring the obvious latency issue and considering only bandwidth: on Intel CPUs, the QPI/UPI interconnect usually runs at about 50% of the speed of the local memory bus. For example, on Skylake, the UPI bandwidth is 41.6 GB/s (full-duplex), while the local memory bandwidth per node is 128 GB/s (half-duplex), so remote memory bandwidth is around 50% of local memory bandwidth. There is also a second-order effect: extra NUMA interconnect traffic is needed for “snooping” to maintain cache coherency.

      For memory-bandwidth-limited applications, explicit NUMA partitioning (e.g., the first-touch initialization sketched below) is crucial for utilizing the machine fully.
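
      To illustrate the “first-touch” placement mentioned above, here is a minimal sketch (not from the original post): each worker thread writes its own segment before the measurement, so under Linux's default first-touch policy the pages are placed on the NUMA node where that thread runs. In practice you would also pin each thread to a node (with numactl or explicit affinity calls) and have the same thread later read the segment it initialized:

      #include <cstddef>
      #include <cstdint>
      #include <cstring>
      #include <thread>
      #include <vector>

      // First-touch initialization: a page is physically allocated on the NUMA node
      // of the thread that first writes to it, so each worker touches its own segment.
      // The buffer must not have been written by the main thread beforehand
      // (e.g., allocate it with new uint8_t[data_volume]).
      void first_touch_init(uint8_t *data, size_t data_volume, size_t threads_count) {
        std::vector<std::thread> threads;
        size_t segment_length = data_volume / threads_count;
        for (size_t i = 0; i < threads_count; i++) {
          size_t begin = segment_length * i;
          size_t end = (i + 1 == threads_count) ? data_volume : begin + segment_length;
          threads.emplace_back([data, begin, end] {
            std::memset(data + begin, 1, end - begin); // the first write places the pages
          });
        }
        for (std::thread &t : threads) {
          t.join();
        }
      }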
