How much memory bandwidth do large Amazon instances offer?

In my previous post, I described how you can write a C++ program to estimate your read memory bandwidth. It is not very difficult: you allocate a large memory region and you read it as fast as you can. To see how much bandwidth you may have if you use multithreaded applications, you can use multiple threads, where each thread reads a section of the large memory region.

The server I used for that blog post, a two-CPU Intel Ice Lake server, has a maximal bandwidth of about 130 GB/s. You can double this amount of bandwidth with NUMA-aware code, but that requires further engineering.

But you do not have access to my server. What about a big Amazon server? So I spun up an r6i.metal instance from Amazon. These servers support 128 physical threads, have 1 terabyte of RAM (1024 GB), and offer 6.25 GB/s of network bandwidth.

Running my benchmark program on this Amazon server revealed about 115 GB/s of read memory bandwidth. That is without NUMA-aware code or other sophisticated tricks. Plotting the bandwidth against the number of threads reveals that, once again, you need about 20 threads to maximize the memory bandwidth, although you get most of it with only 15 threads.

My source code is available.

Daniel Lemire, "How much memory bandwidth do large Amazon instances offer?," in Daniel Lemire's blog, January 18, 2024.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

6 thoughts on “How much memory bandwidth do large Amazon instances offer?”

  1. Sure, STREAM has copy, scale, add, and triad.

    Another useful test I found is to take an array and set each location N to N+1. Then you can follow it with a simple loop:
    while (p) {
        p = a[p];
    }

    That’s a latency test, but prefetching still helps. To defeat the prefetcher you can do a Knuth shuffle on the array. However, that then becomes a TLB-thrashing test. Either limit the distance of the shuffle, or do a Knuth shuffle per page, to fix that and measure the memory latency. Then do the same per thread to fully exercise the unshared and shared caches in the memory hierarchy.

    1. Sure, STREAM has copy, scale, add, and triad.

      Yes, so it does not have a “just access” test. The simplest STREAM benchmark is “copy”, and it counts both the loads and the stores when computing the bandwidth. So it is not a read-only benchmark, it is a read/write benchmark.

  2. I think this is a very rough estimate. The running times (completion times) of the individual threads vary greatly. When the tested memory region is too large, bank thrashing caused by contention between the threads will result in a smaller measured bandwidth.
