C++ remains one of the most popular languages today. One of the benefits of C++ is its built-in STL containers, which offer the standard data structures: vector, list, map, set, and so on. They are clean, well tested, and well documented.
STL containers have a few downsides. Getting the very best performance out of them can be tricky in part because they introduce so much abstraction.
Another potential problem with them is their memory usage. This is a “silent” problem that will only affect you some of the time, but when it does, it may come as a complete surprise.
I was giving a talk once to a group of developers, and we were joking about the memory usage of modern data structures. I was telling the audience that using close to 32 bits per 32-bit value stored in a container was pretty good sometimes. The organizer joked that one should not be surprised to use 32 bytes per 32-bit integer in Java. Actually, I don’t think the organizer was joking… he was being serious… but the audience thought he was joking.
I wrote a blog post showing that each “Integer” in Java stored in an array uses 20 bytes and that each entry in an Integer-Integer map could use 80 bytes.
Java is sometimes ridiculous in its memory usage. C++ is better, thankfully. But it is still not nearly as economical as you might expect.
I wrote a small test. Results will vary depending on your compiler, standard library, the size of the container, and so forth… You should run your own tests… Still, here are some numbers I got on my Mac:
(Table: storage cost in bytes per 32-bit entry, for each container.)
(My Linux box gives slightly different numbers but the conclusion is the same.)
So there is no surprise regarding std::vector. It uses 4 bytes to store each 4-byte element. It is very efficient.
However, both std::set and std::unordered_set use nearly an order of magnitude more memory than would be strictly necessary.
The problem with the level of abstraction offered by C++ is that you can be completely unaware of how much memory you are using.
19 thoughts on “The memory usage of STL containers can be surprising”
There’s also std::forward_list, which has one pointer fewer per node and thus consumes less, as expected (16 bytes per entry on my Linux box).
It looks as though you are counting memory allocated without the overhead of heap structures … which could possibly add up to quite a lot.
Yes, the numbers I offer are “optimistic”. I am assuming that the overhead is a constant quantity that becomes irrelevant if you have enough data.
In practice, you can expect 16 bytes of overhead per heap allocation (64-bit glibc malloc, but most allocators do the same AFAIK). There is a single allocation for the vector/deque, but one per element for the others, so not quite negligible.
At first I was puzzled with your results for unordered_set and deque, but it makes sense.
unordered_set uses singly-linked buckets and stores the hash: that is already 24 bytes per node, plus a table of pointers. deque (like vector) overallocates: depending on the size you will measure different overheads, but 2x isn’t surprising.
Depends on whether the STL code does lots of small allocations, and on the heap implementation. My guess is there is at least one word of overhead per allocation.
That’s a good point. I am not sure how to improve my benchmark to take this into account. Got any ideas? I can track carefully all allocations, I just don’t know how to measure their *actual* cost. Any standard way to do that?
Sorry, that was an incomplete statement. You want to compute the number of allocations, and thus the average size of an allocation. The overhead is usually fixed per allocation, usually at least one (32-bit or 64-bit) word. Depends on the heap manager code. (I have not looked at the GNU library heap code in a few decades. Your runtime library may be different.)
Note also that the general-purpose heap manager has a compute cost on allocation and de-allocation. When the pattern of usage is not random, you can use allocation pools (few/large allocations from the general-purpose heap) and free-lists to make huge reductions in memory and CPU usage.
I only use C++ when I need bare-metal performance, so these optimizations are usually relevant.
If you do a lot of allocation/de-allocation of large numbers of small items, the cost of the heap can be a problem.
The general purpose heap code has to be thread-safe, which adds an additional cost. Application code can often be made thread-safe at a higher level, with lower cost.
Your theme in this post is exactly relevant. I use C++ when I need bare-metal performance. STL is a general-purpose library that puts generality ahead of performance. Frankly, writing exactly-focused array/list/hash classes is pretty easy. So … why would I use STL?
Of course, some parts of even highly performant applications do not need to be especially efficient. Perhaps STL is a good fit there.
If I had deep knowledge of the STL implementation, I would know when the STL code was as efficient as the code I might write. That knowledge would require an investment of time. If I can write an equivalent class in the same time, it is hard to justify the time to study (and periodically re-study) the STL.
If I were not very good at understanding and implementing algorithms, using the STL written by (hopefully) better programmers would be an excellent idea. I would have a very hard time justifying that case, as I believe we are long past the time when average programmers should be using C++.
Which leaves the STL in an odd corner.
It is true in any language that most of the code you write does not need to be particularly fast or memory conscious. The bottlenecks are often very specific.
In C++, at least, you can rewrite the problematic code so it is fast and memory conscious.
Particularly with respect to the unordered_set, there’s the question of whether it’s “abstraction” or “uniformity” that’s the problem. From my (admittedly poor) understanding, the C++ specification of an unordered_set and its behaviour with respect to object storage pretty much requires a hash table with separate chaining in the buckets for an unknown general class. In that case the memory usage is maybe double what you might hope to achieve with really careful optimization. As you say, for a type like an int32_t you can use other representations to use much less memory and still get “as if” behaviour with respect to the guarantees stated in the standard. At which point it becomes a question: should the STL include such specialisations of the STL containers, or would that be better done in a different library (boost, or something different yet again)?
I don’t know the answer to that question, but I do know (from working at companies who have “reimplemented the STL to be more efficient”) that something which, in some corner cases, violates the guarantees in the standard that are discussed all over the place (Stack Overflow answers, etc.) is NOT a nice programming environment.
My instinct would not be to reimplement STL. Rather, I would create new, specialized data structures with their own documentation and API.
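As an illustration of the kind of specialized structure I have in mind, here is a minimal open-addressing set of non-negative 32-bit integers. Everything here (the class name, the use of -1 as an empty marker, the 50% load factor) is illustrative only; at that load factor it stores roughly 8 bytes per entry, with one allocation for the whole table.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative flat hash set for non-negative int32 values.
// One contiguous array of slots; -1 marks an empty slot.
class FlatIntSet {
    std::vector<int32_t> slots;
    std::size_t count = 0;

    // Find the slot holding `key`, or the empty slot where it would go.
    std::size_t probe(int32_t key) const {
        std::size_t mask = slots.size() - 1;               // size is a power of two
        std::size_t i = (uint32_t(key) * 2654435761u) & mask; // multiplicative hash
        while (slots[i] != -1 && slots[i] != key)
            i = (i + 1) & mask;                            // linear probing
        return i;
    }

public:
    FlatIntSet() : slots(16, -1) {}

    bool contains(int32_t key) const { return slots[probe(key)] == key; }

    void insert(int32_t key) {
        if (2 * (count + 1) > slots.size()) {  // keep load factor at or below 50%
            std::vector<int32_t> old = std::move(slots);
            slots.assign(old.size() * 2, -1);
            count = 0;
            for (int32_t v : old)
                if (v != -1) insert(v);        // re-insert into the larger table
        }
        std::size_t i = probe(key);
        if (slots[i] != key) { slots[i] = key; count++; }
    }

    std::size_t size() const { return count; }
};
```

Such a class violates none of the standard's guarantees because it does not pretend to be std::unordered_set; it has its own narrow contract (here: values must be non-negative).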
Of course, the biggest “problem” I see with STL containers is that people either use the wrong one for the job (and performance requirements are part of that), or they fail to use them at all (and implement their own).
The problem in the real (non academic) world with implementing one’s own “super efficient” version, is that people in the private sector switch jobs frequently, and it is often the case that one has to take over code written by someone else. For all its flaws (which are few in my view), the stl is well documented and understood. Joe Blows’ super-set class – less so. The limitations of Joe Blows’ super-set class will have long been forgotten until they manifest as bugs when someone uses the class in a manner which only Joe knew would not be supported.
We have been using the boost libraries for some time, and we find they offer true (cross-platform) portability and substantially reduce bug counts, while maintaining or improving performance. I haven’t noticed memory utilization go up over our legacy code – but then our legacy code had many leaks and corner cases that never worked properly.
You may have noticed that my blog post was not critical at all of STL. I share your feeling. STL is well known, well tested, well documented. *Except* that when it comes to performance-sensitive tasks, STL can be tricky.
I’d like to respond to your criticism of “academic work”. I don’t think it is particularly common for academics to write their own data structure code. That’s probably a lot more common in industry.
Academic software is slow and broken, but that’s not because the authors shy away from things like STL.
Indeed, the memory usage of std::unordered_set can be a serious issue for big data problems. In that case, consider using the sparse_hash_set from sparsepp, which is faster than std::unordered_set and can fit 2 to 3 times more integers in the same memory. https://github.com/greg7mdp/sparsepp
The article is good, but it misses some important points that would make it much better:
The overhead is constant per element, so if you had sets of shorts it would look even worse for the structures with pointers. If you used a struct Point holding 2 doubles, the relative overhead would be smaller.
Also, vector has zero overhead only if you know its size in advance and call reserve. Naive push_back (which is sometimes needed when you do not know how many elements you will get) can cause a lot of waste. The growth factor for gcc is 2, so in theory almost 50% of the memory can be wasted.
I should point out that this was *measured* memory usage (minus the object overhead).
Regarding std::vector, one can use shrink_to_fit (something I did not do here).
I just want to point out that the memory usage of std::set is not linear in its size, and the overhead might be smaller for larger datasets.
I use 1024 elements; for smaller sets, it is indeed likely that the memory usage per element could be greater.
I do expect the memory usage to be linear, for large enough sets (I am guessing that 1024 is “large enough”).
What are the costs per entry when entries are larger? I imagine that the additional memory from a linked list comes from pointers, which stay small as the size of the entries increases. If this is the case, std::list should become more efficient when the entries themselves are large. Meanwhile a hash table might allocate enough space to store 2-4 times as many entries as it actually contains. So I would expect std::unordered_set to keep a ratio similar to 36:4 even for large entries.
I agree that a linked list probably becomes relatively more efficient as the values get large. Note, however, that if your values are large, it becomes inefficient to copy them, so you want to be storing pointers to them instead.