Allocating large blocks of memory: bare-metal C++ speeds

In a previous post, I benchmarked the allocation of large blocks of memory using idiomatic C++. I got a depressing result: the speed could be lower than 2 GB/s. For comparison, the disk in my laptop has greater bandwidth.

Methodologically, I benchmarked the “new” operator in C++ with initialization, using the GNU GCC compiler with the -O2 optimization flag1.

char *buf = new char[s]();

It turns out that you can do better while sticking with C++. We cannot simply call the new operator without initialization because, in general, it does not result in the memory being actually allocated. However, we can allocate the memory and then make sure that we touch every “page” of memory. On modern Intel systems, pages are effectively always at least as large of 4kB, so we can touch the memory every 4kB. We might call this approach “new and touch”.

char * buf = new char[size];
for (size_t i = 0; i < size; i += 4096) buf[i] = 0;
buf[size - 1] = 0;

Such a new-and-touch strategy should be close to “bare-metal” memory allocation speeds. So how fast is it? It depends on the page size. By default, most systems rely on small (4kB) pages. Allocating many small pages is expensive. Thankfully, Linux can be configured to use huge pages, transparently, via a feature called “transparent huge pages”. And it makes a huge difference!

Allocating 512MB Setting 512MB to zero
regular pages (4kB) 3.9 GB/s 30 GB/s
transparent huge pages 20 GB/s 30 GB/s

I am using a recent Linux system (Ubuntu 16.04), a Skylake processor and GNU GCC 8.3 with the -O2 optimization flag. My source code is available.

It is still the case that allocating memory on most systems is a non-trivial cost since they rely on small 4kB pages. There are fast disks available on the market that have more than 4 GB/s of bandwidth.

Credit: Thanks to Travis Downs and others for their insights and comments.

 

1 Downs found that we get far better performance out of the new operator with initialization under GNU GCC with the more agressive -O3 optimization flag. Performance-wise, it should be close to the “new and touch” approach that I am describing.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

7 thoughts on “Allocating large blocks of memory: bare-metal C++ speeds”

  1. My best interpretation of the result is that they are dominated by the Linux kernel’s mmap_sem. That lock protects all address space operations (mmap, munmap, mprotect, etc.) plus a few seemingly unrelated things. And importantly it is a reader-writer-lock. Mmap is considered a writer, so it will be single-threaded. Page fault is considered a reader, so you can fault in pages on all CPUs in parallel.

    In the previous benchmark, you presumably ended up calling mmap with MAP_POPULATE, so the entire memory allocation was done in one large critical section. This benchmark presumably does not, so the kernel will only give you virtual memory without any actual memory backing things. Then your initialization loop is faulting pages in one-by-one.

    Transparant huge pages should be faster once the page fault is handling 2MiB at a time, not 4kiB. Might take a while until this kicks in and you might get different results if the system has been running long enough for fragmentation to kick in. If you cannot find 512 contiguous small pages, you cannot create a huge page.

    Back to the mmap_sem, your userspace code still looks single-threaded. But the kernel page fault handler can grab a free page and return, while another kernel thread on another CPU detects that the free page pool is getting low, runs the LRU list and creates new free pages. With careful profiling you might detect the kernel background work.

    And if my theory is correct, you should get noticeably worse performance if you reboot with a single CPU and rerun the test.

    1. The difference is simply because of the huge overhead of trapping into the kernel to allocate and zero a page (and that overhead has increased due to all the security issues).

      With huge pages the number of page faults reduces dramatically, so we’re limited by the speed of memset. To improve performance various systems use a larger default pagesize: eg. iOS uses 16KB, and some AArch64 servers use 64KB.

      Note pages are always cleared on a pagefault in Linux (even huge pages), unlike on Windows there is no background process which clears pages.

  2. Thank you for your nice article.

    I performed some further experiments by tuning malloc parameters. https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html

    My Baseline

    65536 pages 256 MB calloc 92.287 ms 2.71 GB/s
    65536 pages 256 MB new_and_touch 87.187 ms 2.87 GB/s
    65536 pages 256 MB new_and_memset 106.538 ms 2.35 GB/s
    65536 pages 256 MB new_and_value_init 188.074 ms 1.33 GB/s
    65536 pages 256 MB new_and_value_init_nothrow 191.691 ms 1.30 GB/s
    65536 pages 256 MB memset_existing_allocation 22.110 ms 11.31 GB/s
    65536 pages 256 MB mempy_into_existing_allocation 32.831 ms 7.61 GB/s
    65536 pages 256 MB mmap_populate 52.721 ms 4.74 GB/s

    With mallopt(M_MMAP_MAX, 0); results for 256 MB changed:

    65536 pages 256 MB calloc 92.635 ms 2.70 GB/s
    65536 pages 256 MB new_and_touch 0.876 ms 285.44 GB/s
    65536 pages 256 MB new_and_memset 22.872 ms 10.93 GB/s
    65536 pages 256 MB new_and_value_init 186.411 ms 1.34 GB/s
    65536 pages 256 MB new_and_value_init_nothrow 189.933 ms 1.32 GB/s
    65536 pages 256 MB memset_existing_allocation 23.459 ms 10.66 GB/s
    65536 pages 256 MB mempy_into_existing_allocation 33.645 ms 7.43 GB/s
    65536 pages 256 MB mmap_populate 56.151 ms 4.45 GB/s

    Measurement values for larger allocations don’t show this difference.

    I wonder if there is some temporal influence between the different allocation methods.

  3. If you compare the code generated by the compiler in -O2 vs -O3 (https://godbolt.org/z/t2ifR7) you can see why -O2 lacks a bit in the speed department:

    -O2: Writes one byte element at a time

    -O3: Writes a qword worth of elements at a time (similar to what an optimized memset() would do)

    1. Yes that’s a GCC bug – GCC9.2 still gets it wrong, but it works as expected on GCC10 trunk. It’s never a good idea to emit a simple byte loop.

  4. I have trouble reproducing the results. I am using Ubuntu 18.04, Piledriver processor, GNU gcc 8.3.1.

    $ ./alloc
    131072 pages 512 MB new_and_touch 216.895 ms 2.31 GB/s
    131072 pages 512 MB memset_existing_allocation 91.879 ms 5.44 GB/s
    $ sudo ./run_with_huge_pages.sh ./alloc
    131072 pages 512 MB new_and_touch 107.466 ms 4.65 GB/s
    131072 pages 512 MB memset_existing_allocation 94.135 ms 5.31 GB/s

    I got similar results (i.e. no speed increase) with gcc 7.4.0, and on a Kaby Lake with Fedora 29 and gcc 8.3.1. Could you please help me out on this?

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax