In C++, the most basic memory allocation code is just a call to the new operator:
char *buf = new char[s];
According to a textbook interpretation, we just allocated s bytes.
If you benchmark this line of code, you might find that it is almost entirely free on a per-byte basis for large values of s. But that is because we are cheating: the call to the new operator “virtually” allocates the memory, but you may not yet have actual memory that you can use. As you access the memory buffer, the system may then decide to allocate the memory pages (often in blocks of 4096 bytes). Thus the cost of memory allocation can be hidden. The great thing with a virtual allocation is that if you never access the memory, you may never pay a price for it.
If you actually want to measure the memory allocation in C++, then you need to ask the system to give you s bytes of allocated and initialized memory. You can achieve the desired result in C++ by adding parentheses after the call to the new operator:
char *buf = new char[s]();
Then the operating system actually needs to allocate and initialize memory. It may still cheat in the sense that it may recycle existing blocks of memory or otherwise delay allocation. And I expect that it might do so routinely if the value of s is small. But it gets harder for the system to cheat as s grows larger.
What happens if you allocate hundreds of megabytes in such a manner? The answer depends on the size of the pages. By default, your system probably uses small (4kB) pages. Under Linux, you can enable “transparent huge pages” which dynamically switches to large pages when large blocks of memory are needed. Using larger pages means having to allocate and access fewer pages, so it tends to be cheaper.
In both instances, I get around a couple of gigabytes per second on a recent Linux system (Ubuntu 16.04) running a conventional Skylake processor. For comparison, you can set memory to zero at tens of gigabytes per second and my disk can feed data to the system at more than 2 GB/s. Thus, at least on the system I am currently using, memory allocation is not cheap. My code is available; I use GNU GCC 8.3 with the -O2 optimization flag.
|                        | Allocating 512MB | Setting 512MB to zero |
|------------------------|------------------|-----------------------|
| regular pages (4kB)    | 1.6 GB/s         | 30 GB/s               |
| transparent huge pages | 2.4 GB/s         | 30 GB/s               |
You can do better with different C++ code, see my follow-up post Allocating large blocks of memory: bare-metal C++ speeds.
Further remarks. Of course, you can reuse the allocated memory for greater speeds. The memory allocator in my standard library could possibly do this already when I call the new operator followed by the delete operator in a loop. However, you still need to allocate the memory at some point, if only at the beginning of your program. If your program needs to allocate 32 GB of memory, and you can only do so at 1.4 GB/s, then your program will need to spend 23 seconds on memory allocation alone.
- Several readers have asked why I am ignoring C functions like calloc, malloc and mmap. The reason is simple: this post is focusing on idiomatic C++.
- You might wonder why the memory needs to be initialized. The difficulty has to do with security. The operating system may hand you memory that was previously used by another process, but it must not leak that process's data to you. Thus it needs to erase the data first. To my knowledge, most systems achieve this result by zeroing the new memory.
22 thoughts on “How fast can you allocate a large block of memory in C++?”
What about an actual huge page allocated with mmap instead of operator new? And what’s the allocator under test in your table?
I am using GNU GCC 8.3 under Ubuntu 16.04.6.
In theory, a malloc library could cheat for calloc() and allocate the same read-only zero 4k page for every page in the region. And then still defer allocation for the first real write.
Not a theory. OS X has deferred zeroing each page as long as I can remember: https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/MemoryAlloc.html
I have a Mac, and if it did what I think Wayne means, it could achieve seemingly impossible speeds on my benchmark… yet it is no faster than my Linux box.
That is why I said calloc() instead of a C++ constructor for a char. I am not surprised that libstd++ doesn’t specialize initialization to call calloc() in this case.
Yes, it could. My benchmark should be viewed as a lower bound. It is yet possible that the system is cheating in all sorts of fun ways.
Possibly it can be faster if allocation is done the normal way, without initialization, and you later just “touch” each page (4K or 2M). Touching can also be parallelized, which improves performance (I tested that some time ago).
You are correct, allocating and then touching is faster in my tests.
It is much more basic to allocate memory on the stack instead of the heap. I would expect it to be much faster to allocate since you are just bumping a stack pointer. You will have to increase the maximum stack size though.
good luck allocating 512MB on the stack (I doubt it would be allowed OOTB on Linux, but i’d love to be proven wrong…) !
These are two other possibilities that you could have tested (I was expecting them, in fact), and that come closest to each other:
// malloc + memset (7.7 GB/s)
char *buf1 = (char*)malloc(s);
memset(buf1, 0, s);
// new char[s] + memset (9.4 GB/s)
char *buf1 = new char[s];
memset(buf1, 0, s);
(It is difficult to outperform calloc because the zeroing will be done by the kernel, I guess.)
Is this a reason why redis recommends turning off THP?
cleaned code via github pull request and comments here:
Why not interleave allocation and initialization ?
E.g. you request a large block, but the allocator call (you’d need to write a custom allocator) would block on the first page, not the entire request.
E.g. in the background you would rely on the default allocator’s overcommit behavior, then start async init of the pages and return access as each is completed.
Alternatively if you know your memory usage pattern an arena allocator could work faster here for you (think zero on delete).
The pages have to come from the operating system. If you are getting pages at 3 GB/s and your program needs 30 GB of memory, it is going to take 10 seconds. Writing a custom allocator is not going to solve this problem.
I understand, I think I misunderstood the way the OS releases the pages.
Linux supports page allocation without erasing previous contents as a kernel config somewhere. It would be interesting to compare allocation speed with and without it.
Oh, it’s only for no-MMU systems:
You are probably not getting huge pages, or few of them.
To get hugepages you have to jump through more hoops, e.g., allocate to a 2 MiB boundary, and do madvise on the memory before touching it. You can use mmap directly or one of the aligned allocators.
Better, you can check to see if you got huge pages: I wrote page-info to do that, integrating it is fairly easy (you can find integrations in some of our shared projects, including how to jump through the aforementioned hoops). Note that it only works on Linux.
For this and similar cases, like std::string *p = new std::string[10](), the compiler just generates code to call memset for the whole area before the actual construction (ctor call) of each object.
Maybe you refer to this C++ construction (link to the source code accompanying the post):