# Do not waste time with STL vectors

I spend a lot of time with the C++ Standard Template Library. It is available on diverse platforms, it is fast and it is (relatively) easy to learn. It has been perhaps too conservative at times: we only recently got a standard hash table data structure (with C++11). However, using the STL is orders of magnitude safer than dealing with pointers.

I use the vector template in almost all my C++ software. Essentially, you can use it whenever you need a dynamic array. The title of my blog post is word play: please use vectors, but use them properly if you need speed!

When using the vector template in high performance code, we must be aware of several important facts. For example, the capacity (memory usage) of a vector never decreases, even if you call the clear method, unless you use the swap method. In my opinion, making it difficult to release memory was a poor design decision. Another problem with the vector template is that whenever you copy a vector, you copy all the data it contains. Again, to avoid this problem, you must use the swap method.

But a trickier problem with STL vectors is that there are many ways to build them. Suppose I want to create an array in C++ containing the numbers from 0 to N in increasing order. Then I want to compute the sum. I hear that mathematicians have come up with a formula for this problem, but your Intel microprocessor does not know this.

• The conventional approach is to allocate the memory using C++’s new:
```
int * bigarray = new int[N];
for(unsigned int k = 0; k < N; ++k)
    bigarray[k] = k;
int sum = total(bigarray, N);
delete[] bigarray;
return sum;
```
• Or, you can do the exact equivalent in STL:
```
vector<int> bigarray(N);
for(unsigned int k = 0; k < N; ++k)
    bigarray[k] = k;
int sum = total(bigarray, N);
return sum;
```

It is nicer because you don’t need to remember to free the memory you allocated. There is a price to pay, however: when you first construct the vector, the memory is initialized to zero.

• You can do it in pure STL fashion, with the push_back method:
```
vector<int> bigarray;
for(unsigned int k = 0; k < N; ++k)
    bigarray.push_back(k);
int sum = total(bigarray, N);
return sum;
```
• We can improve the push_back method by using the fact that we know ahead of time how big the array will be. Thus, we can reserve the memory before the push_backs.
```
vector<int> bigarray;
bigarray.reserve(N);
for(unsigned int k = 0; k < N; ++k)
    bigarray.push_back(k);
int sum = total(bigarray, N);
return sum;
```

It is likely to be slightly faster.

So what is the result? On my Intel Core i7 desktop, I get the following numbers of CPU cycles per integer processed:

| method | cycles per integer |
|---|---|
| C++ new | 8.5 |
| STL vector | 8.9 |
| push_back | 20.6 |
| reserve + push_back | 12.6 |

As usual, my code is freely available.

Note that you can cheat and match the speed of C++’s new by constructing the vector empty, reserving the memory (reserve) and never actually resizing the array. This is outside the specification (formally, undefined behavior) and could potentially be unsafe, but it saves you up to half a CPU cycle per integer.

Conclusion: STL vectors can be quite fast, but if you use them to store integers, the push_back method can be relatively expensive.

1. Hi Daniel,
Thanks for this post, this is good timing for a project I’m working on. If you intend to investigate this kind of topic in more depth, I’d love to see some analysis of using vector<vector> for 2-d arrays of raw data. In particular I’m trying to understand how this representation affects memory locality, and how I can optimally allocate and traverse column-major blocks of data retrieved from a 3rd-party. The goal is to produce typed data values from the raw bytes of each column. Since each column may have a very different data type, the memory access stride will change when finishing one column and moving to the next, so this already impacts the hardware prefetcher. Are things made worse by the fact that each column may be allocated in very different locations of memory? Last, I have a separate 2-d array of per-cell status indicators, so this compounds the locality challenge.

Thanks for maintaining this blog, I’ve found it very useful.
-Robert

Comment by Robert Morton — 20/6/2012 @ 14:46

2. In your code, why is the following not safe?

```
int runtestnoalloc(size_t N, int * bigarray) {
    for(unsigned int k = 0; k < N; ++k)
        bigarray[k] = k; // unsafe

    int sum = 0;
    for(unsigned int k = 0; k < N; ++k)
        sum += bigarray[k];

    return sum;
}
```

bigarray never goes out of scope here, so shouldn't it be fine? Is it because bigarray[k] can overwrite something in memory?

Comment by mars — 20/6/2012 @ 14:49

3. @mars

It is perfectly safe in this instance, but playing with pointers can lead to segmentation faults. How do you know that whoever called this function has allocated enough memory?

Using STL containers is safer because you can check.

Comment by Daniel Lemire — 20/6/2012 @ 14:57

4. If you really need the raw speed during construction, there’s another variant, which I’ve just sent you via a pull request on GitHub.

The salient part is construction of the vector via a pair of iterators. But we don’t actually construct the range of values in memory (that would just shift the initialisation problem elsewhere), we use a special iterator type “iota” which generates the values on the fly. Creating this iterator incurs a bit of boilerplate code but it’s a general-purpose class that can easily be recycled.

This is “good C++” because it uses high-level algorithmic building blocks rather than manual memory management or UB hacks, and furthermore increases the abstraction of the initialisation code (making the loop obsolete).

Interestingly, this variant is *even faster* than the fastest other code (which is illegal C++ anyway and can’t really be counted) when compiled with optimisation level 3, and it’s on par with the other methods when compiled with O2.

Comment by Konrad — 20/6/2012 @ 15:46

5. Was this for g++? I would have expected the call to push_back() to be inlined away.

Maybe the compiler needs to be strongly persuaded.

Comment by srean — 20/6/2012 @ 16:53

6. @srean

I post the code so that you can run your own checks:

https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2012/06/20/testvector.cpp

There is a significant cost to using push_back.

Comment by Daniel Lemire — 20/6/2012 @ 17:45

7. Ah! I missed the comment on the first line about g++ and optimization flags, hence my question.

I hear icc is very good at inlining away such overhead. However, I don’t have access to it.

Comment by srean — 21/6/2012 @ 0:31

8. @srean Even without looking at the assembly I’m all but *certain* that the `push_back` call is inlined – this is a trivial optimisation for g++ and the standard library was *designed* with this optimisation in mind. The cost you see here is not the cost of the call, it’s the cost of the bounds check (`push_back` needs to check whether the internal buffer needs to be resized, even if we reserved memory beforehand). Otherwise `push_back` and the `[]` method would be on par – after all, the latter also implies a call. But it doesn’t have the overhead of the bounds check so we can infer that the time difference between these two methods is entirely explained by said bounds check.

One word about the iterator method I posted above: I made an interesting mistake: the iterator should have been defined as a random access iterator. That way, the vector constructor of the libstdc++ implementation will pre-allocate an internal buffer of exactly the right size. My iterator implementation fails to provide this information. The mistake is interesting because g++ obviously pre-allocates the memory *anyway* – otherwise the runtime we observe would be impossible. This showcases how incredibly powerful the optimiser is – it infers the missing information about the required buffer size itself.

In fact, correcting the mistake and implementing the iterator as a random access iterator doesn’t change the performance – conclusive proof that the compiler does the optimisation discussed above.

Comment by Konrad — 21/6/2012 @ 4:03

Comment by srean — 21/6/2012 @ 5:27

10. @srean

It is entirely possible that another compiler could optimize away the overhead of the push_back. Even so, if you really need speed, you can’t just hope for the best. I recommend you use push_backs sparingly in performance sensitive code.

(I should stress that this is for vectors of integers. You might get vastly different results for vectors of objects.)

Comment by Daniel Lemire — 21/6/2012 @ 10:51

11. Sure.

In case I gave another impression, my point was never to argue against your recommendation, but to understand exactly what causes the slowdown, how compilers can mitigate it, and if they cannot, why not. Konrad’s comments were very illuminating in that respect.

I think even with vectors of objects, push_back will be slower than directly mutating the contents of a valid location in the vector.

Comment by srean — 22/6/2012 @ 21:38

meh, tbh this kind of data is almost useless for me because I rarely spend my time filling vectors. I spend time reading them, so a comparison of a 2-d vector against a hackish 2-d vector implemented as a 1-d vector would be more interesting.

Also you should have mentioned std::move and how it makes STL faster in C++11.

Comment by Ivan — 25/6/2012 @ 10:42

13. @Ivan

You think that std::move would make things faster in this particular case?

Comment by Daniel Lemire — 25/6/2012 @ 10:53

14. Nope, I was talking about:
“Another problem with the vector template is that whenever you copy a vector, you copy all the data it contains. Again, to avoid this problem, you must use the swap method.”

Comment by Ivan — 25/6/2012 @ 10:59

15. ofc, it only works for temporaries if I recall correctly. Rvalue references are mystic to me tbh.

Comment by Ivan — 25/6/2012 @ 11:04