Do not waste time with STL vectors

I spend a lot of time with the C++ Standard Template Library. It is available on diverse platforms, it is fast and it is (relatively) easy to learn. It has been perhaps to conservative at times: we only recently got a standard hash table data structure (with C++11). However, using STL is orders of magnitude safer than dealing with pointers.

I use the vector template in almost all my C++ software. Essentially, you can use it whenever you need a dynamic array. The title of my blog post is word play: please use vectors, but use them properly if you need speed!

When using the vector template in high performance code, we must be aware of several important facts. For example, the capacity (memory usage) of a vector will never decrease, even if you call the clear method, unless you use the swap method. In my opinion, making it difficult to release memory was a poor design decision. (This was fixed in the C++11 language update with the addition of the shrink_to_fit method.) Another problem with the vector template is that whenever you copy a vector, you copy all the data it contains. Again, to avoid this problem, you must use the swap method.

But a trickier problem with STL vectors is that there are many ways to build them. Suppose I want to create an array in C++ containing the numbers from 0 to N in increasing order. Then I want to compute the sum. I hear that mathematicians have come up with a formula for this problem, but your Intel microprocessor does not know this.

  • The conventional approach is to allocate the memory using C++’s new:
    int * bigarray = new int[N];
    for(unsigned int k = 0; k<N; ++k)
      bigarray[k] = k;
    int sum = total(bigarray,N);
    delete [] bigarray;
    return sum;
    
  • Or, you can do the exact equivalent in STL:
    vector<int> bigarray(N);
    for(unsigned int k = 0; k<N; ++k)
      bigarray[k] = k;
    int sum = total(bigarray,N);
    return sum;
    

    It is nicer because you don’t need to remember to recover the memory allocated. There is a price to pay however because when you first construct the vector, the memory is initialized to zero.

  • You can do it in pure STL fashion, with the push_back method:
    vector<int> bigarray;
    for(unsigned int k = 0; k<N; ++k)
      bigarray.push_back(k);
    int sum = total(bigarray,N);
    return sum;
    
  • We can improve the push_back method by using the fact that we know ahead of time how big the array will be. Thus, we can reserve the memory before the push_backs.
    vector<int> bigarray;
    bigarray.reserve(N);
    for(unsigned int k = 0; k<N; ++k)
      bigarray.push_back(k);
    int sum = total(bigarray,N);
    return sum;
    

    It is likely to be slightly faster.

So what is the result? On my Intel Core i7 desktop, I get the following numbers of CPU cycles per integer processed:

method cycles per integer 
C++ new8.5
STL vector8.9
push_back20.6
reserve + push_back12.6

As usual, my code is freely available.

Note that you can cheat and get the same speed as a C++ new by constructing the vector empty, then reserving the memory (reserve) and never actually resizing the array. It is outside the specifications and could potentially be unsafe, but it saves you up to half a CPU cycle per integer.

Conclusion STL vectors can be quite fast but if you use them to store integers, the push_back method can be relatively expensive.

33 thoughts on “Do not waste time with STL vectors”

  1. Hi Daniel,
    Thanks for this post, this is good timing for a project I’m working on. If you intend to investigate this kind of topic in more depth, I’d love to see some analysis of using vector<vector> for 2-d arrays of raw data. In particular I’m trying to understand how this representation affects memory locality, and how I can optimally allocate and traverse column-major blocks of data retrieved from a 3rd-party. The goal is to produce typed data values from the raw bytes of each column. Since each column may have a very different data type, the memory access stride will change when finishing one column and moving to the next, so this already impacts the hardware prefetcher. Are things made worse by the fact that each column may be allocated in very different locations of memory? Last, I have a separate 2-d array of per-cell status indicators, so this compounds the locality challenge.

    Thanks for maintaining this blog, I’ve found it very useful.
    -Robert

  2. In your code, yhy is the following not safe?

    int runtestnoalloc(size_t N, int * bigarray) {
    for(unsigned int k = 0; k<N; ++k)
    bigarray[k] = k;// unsafe

    int sum = 0;
    for(unsigned int k = 0; k<N; ++k)
    sum += bigarray[k];

    return sum;
    }

    The bigarray is never out of scope so shouldn't it be fine? Is it because bigarray[k] can overwrite something in the memory?

  3. @mars

    It is perfectly safe in this instance, but playing with pointers can lead to segmentation faults. How do you know that whoever called this function has allocated enough memory?

    Using STL containers is safer because you can check.

  4. If you really need the raw speed during construction, there’s another variant, which I’ve just sent you via a pull request on GitHub.

    The salient part is construction of the vector via a pair of iterators. But we don’t actually construct the range of values in memory (that would just shift the initialisation problem elsewhere), we use a special iterator type “iota” which generates the values on the fly. Creating this iterator incurs a bit of boilerplate code but it’s a general-purpose class that can easily be recycled.

    This is “good C++” because it uses high-level algorithmic building blocks rather than manual memory management or UB hacks, and furthermore increases the abstraction of the initialisation code (making the loop obsolete).

    Interestingly, this variant is *even faster* than the fastest other code (which is illegal C++ anyway and can’t really be counted) when compiled with optimisation level 3, and it’s on par with the other methods when compiled with O2.

  5. Was this for g++ ? I would have expected the call to push_back() to be inlined away.

    May be the compiler needs to be strongly persuaded.

  6. Ah ! I missed the comment on the first line about g++ and opt flags, hence my question

    I hear icc is very good at inlining away such overhead. However I dont have acess to it.

  7. @srean Even without looking at the assembly I’m all but *certain* that the `push_back` call is inlined – this is a trivial optimisation for g++ and the standard library was *designed* with this optimisation in mind. The cost you see here is not the cost of the call, it’s the cost of the bounds check (`push_back` needs to check whether the internal buffer needs to be resized, even if we reserved memory beforehand). Otherwise `push_back` and the `[]` method would be on par – after all, the latter also implies a call. But it doesn’t have the overhead of the bounds check so we can infer that the time difference between these two methods is entirely explained by said bounds check.

    One word about the iterator method I posted above: I made an interesting mistake: the iterator should have been defined as a random access iterator. That way, the vector constructor of the libstdc++ implementation will pre-allocate an internal buffer of exactly the right size. My iterator implementation fails to provide this information. The mistake is interesting because g++ obviously pre-allocates the memory *anyway* – otherwise the runtime we observe would be impossible. This showcases how incredibly powerful the optimiser is – it infers the missing information about the required buffer size itself.

    In fact, correcting the mistake and implementing the iterator as a random access iterator doesn’t change the performance – conclusive proof that the compiler does the optimisation discussed above.

  8. @srean

    It is entirely possible that another compiler could optimize away the overhead of the push_back. Even so, if you really need speed, you can’t just hope for the best. I recommend you use push_backs sparingly in performance sensitive code.

    (I should stress that this is for vectors of integers. You might get vastly different results for vectors of objects.)

  9. Sure.

    In case I gave another impression, my point was never to argue against your recommendation, but to understand what exactly is the reason behind the slowdown and how can compilers mitigate it and if not why not. Konrad’s comments was very illuminating in that respect.

    I think even with vectors of objects, push_back will be slower than directly mutating the contents of a valid location in the vector.

  10. meh, tbh this kind of data is almost useless for me because I rarely spend my time filling vectors. I spend time reading them, so comparison of 2dvector and hackish 2d vector that is implemented as 1d vector would be more interesting.

    Also you should have mentioned std::move and how it makes STL faster in C++11.

  11. Nope, I was talking about:
    “Another problem with the vector template is that whenever you copy a vector, you copy all the data it contains. Again, to avoid this problem, you must use the swap method.”

  12. Perhaps a typo: the phrase “Suppose I want to create an array in C++ containing the numbers from 0 to N in increasing order” should perhaps have “N-1”.

  13. Consider using iterators to access the vector elements rather than using operator[], which needs to calculate the memory offset based on the position and the size of each element. It probably does it as such: return ptr[position], and the compiler then needs to calculate *((void*)ptr + sizeof(T) * position), which means a multiplication for every access. c++ Pointer arithmetic will do the sizeof multiplication implicitly.

    Consider this alternative:

    vector bigarray(N);
    auto it = bigarray.begin();
    auto end = bigarray.end();
    for(; it < end; ++it)
    *it = k;
    int sum = total(bigarray,N);
    return sum;

    This alternative should simply add sizeof(T) (in this case sizeof(int)) to the pointer. It means we only need one addition, instead of both an addition and a multiplication.

    This should close the gap with using c++ new.

  14. @Valrandir

    which means a multiplication for every access

    I agree that a compiler could implement this way, but modern compilers should know better.

    For one thing, a multiplication by sizeof(T) is a multiplication by a power of 2, which the compiler can implement as a mere shift.

    But even so, in this case the compiler almost certainly avoids even this overhead.

    Feel free however to run benchmarks to verify your conjecture (follow the link to get the code).

  15. I don’t think the person who answered that SO question knows what he’s talking about. The other SO article referenced there (http://stackoverflow.com/questions/4303513/push-back-vs-emplace-back) does support my claim. And from http://en.cppreference.com/w/cpp/container/vector/emplace_back: “Appends a new element to the end of the container. The element is constructed in-place, i.e. no copy or move operations are performed. The constructor of the element is called with exactly the same arguments that are supplied to the function.”

    If you’re constructing in-place and you avoid a temporary variable, that should be faster as you avoid both a second memory allocation and a copy.

  16. I’ve played with your benchmark code, and I can say: that’s a result of the benchmark you’re using.

    emplace_back is meant to avoid having to run a copy constructor on an object…but you’re using it in a benchmark that is adding a bunch of ints, which is a primitive, not an object. As a result, it’s entirely possible that you’re hitting code paths unnecessarily where emplace_back is requiring extra steps for a primitive as it normally requires a new object to be created to be created whereas the simple copy of the primitive is optimized away by the compiler.

    I can make some small changes to your benchmark and have emplace_back come out ahead — like using uint_fast64_t instead of int and removing the reservation (which is not always feasible, as I’m sure you’re aware). Yes, I’m aware that this is not the case you originally presented in your blog.

    Moreover, even with a very simple actual object I can make emplace_back become neck and neck with push_back and pull ahead if you remove reservation:

    class BenchmarkTester {
    public:
    BenchmarkTester()
    : m_myVec(100)
    , m_myBool(false)
    , m_myPair(24.5, std::string(“Happy days are here again”))
    {
    }

    private:
    std::vector m_myVec;
    bool m_myBool;
    std::pair m_myPair;
    };

    With complex objects, emplace_back can start becoming very time-saving. Like all things, it’s a matter of using it when appropriate.

  17. @Jeff

    Please see Why I like the new C++ and specifically the “move semantics” section where I show evidence that push_back is already much faster for large objects in C++11.

    I would really like to see a full benchmark that shows that emplace_back is even faster…

    (Look at my GitHub files, I did check it and I see no benefit to emplace_back… even for large objects.)

  18. I don’t know which GitHub files you’re indicating and don’t really have time to trawl through all of them finding other uses of emplace other than the one file you already pointed me to. But it wasn’t difficult in that one file to make changes (mainly using the simple object that I posted above) that caused emplace_back to be faster than push_back, and I expect that trend to be even more pronounced with increased object complexity and/or size.

    I didn’t actually even create a copy constructor in my class above, and still got slightly faster execution than push_back. If you have copy constructors that actually go through various members of a class and copy all of the data (perhaps running an equality check first to avoid copying if the objects are equal, which itself might take a lot of time), then I would expect emplace_back to not just equal push_back but surge way ahead.

    It’s just clearly not optimized for pushing primitives into a vector, as they aren’t proper objects to begin with.

  19. @Jeff

    Ok, I was able to get a 10% gain with emplace_back under GCC 4.7 using a variation on your class and a 20% with int on a fast destkop.

    Please see my new code there:

    https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2012/06/20/testvectorcpp11.cpp

    20% is not a lot, but it is on a primitive… however, only on my desktop, not on my laptop… However, if I switch to clang 3.2, the difference is even more significant.

    So my conclusion is that you are right that emplace_back is faster, though there might compiler/cpu issues to deal with.

  20. Cool 🙂 Sounds more like what should happen. After all, if emplace_back was slower in the general case, there would hardly have been much impetus to put it in 🙂

  21. Hello,

    I am facing a similar issue with C++ hash_map.
    I am reading one CSV input file and storing each column value (with column name) in a hash_map. These values are added dynamically and this hash_map is stored inside a “Record” object which is also allocated dynamically.
    So effectively, i have lot of memory being allocated dynamically and if the input file is huge in size, it ends up using a lot of memory.
    When the file is processed, i am calling “clear()” method on the hash_map and in addition deleting memory for these columns and the record object too. I have used appropriate destructors too.

    But even after calling clear() and delete(), the memory is not released to the O.S. It shows the same amount of memory being used by my process with the “top” command before calling clear() or delete().
    I tried using swap() trick to swap it with empty hash_map, but still it does not seem to release the memory.

    Am I missing something here ? Please give suggestions.

Leave a Reply

Your email address will not be published. Required fields are marked *