Cost of a thread in C++ under Linux

Almost all our computers are made of several processing cores. Thus it can be efficient to “parallelize” expensive processing in a multicore manner. That is, instead of using a single core to do all of the work, you divide the work among multiple cores. A standard way to approach this problem is to create threads.

A C++ thread object executes some functions, possibly on a thread created by the operating system, and goes away. If you wanted to increment a counter using a C++ thread, you could do it in this manner:

auto mythread = std::thread([] { counter++; });
mythread.join();

It is obviously silly code that you should never use as is, it is a mere illustration. Creating a new thread is not free. Exactly how expensive it might be depends on several parameters. But can we get some rough idea?

For this purpose, I wrote a small benchmark where I just create a thread, increment a counter and let the thread die. It is the time elapsed while waiting for the thread to run its course. My program computes the mean and standard error of the time, as well as the minimum and maximum duration of the test. For simplicity, I am just going to report the means:

system time per thread
Ampere server (Linux, ARM) 200,000 ns
Skylake server (Linux, Intel) 9,000 ns
Rome server (Linux, AMD) 20,000 ns

I am deliberately not going into the details of the compiler, system library, operating system, RAM and all that fun stuff. You should not look at my table and make far reaching conclusions.

What is clear, however, is that creating a thread may cost thousands of CPU cycles. If you have a cheap function that requires only hundreds of cycles, it is almost surely wasteful to create a thread to execute it. The overhead alone is going to set you back.

There are clever ways to amortize the cost of a thread. For example, you may avoid the constant creation of new threads as in my benchmark. Yet to amortize, you still need to have enough work to make it worthwhile.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

24 thoughts on “Cost of a thread in C++ under Linux”

        1. You do not need to use the scientific notation to express your results using a given number of significant digits. Of course, the scientific notation is more precise in this respect, but it is also less readable.

  1. It seems fine.

    The reported times are a blink of an eye: a less than 1% of one millisecond in the case of the Intel server. That’s a small fraction of almost any total process runtime (i.e., much faster than a “hello world” process), so when it comes to amortization, I think you’ll either have enough work, or it didn’t matter anyway.

    What would be interesting would be to an evaluation of the fastest amortized methods of using threads. I still don’t think functions of ~100 cycles are going to see a benefit, just due to inter-core communication costs, but you can probably squeeze out a benefit at 1,000 cycles.

    All this definitely means that extracting performance from threads isn’t easy – it’s a hard engineering problem.

    1. We use thread pools at work quite a lot, and it is really difficult to squeeze out full performance via them.

      According to Martin Thompson (https://mechanical-sympathy.blogspot.com/2011/08/inter-thread-latency.html), his inter core latency is 45ns, however for real physical cores, I have never been able to get below ~75ns on skylake (same numa) which implies a roundtrip time of ~150 (~500-600 cycles). All these are sort of theoretical numbers, so I agree with your assertion that if the task is taking less than < 1000 cycles, you can’t squeeze out performance via threading.

      1. Yeah I think 1,000 is where it starts to become interesting, since now your inter-core RT times are say single-digit % of the total time, so it starts to seem plausible, but it is a hard problem to exploit it: you have virtually no slack.

        Forget even making one kernel call per thread, it has to be zero on average.

        1. Why do kernel calls diminish thread benefits? If other threads can can continue to do useful work, what burden does the kernel call in one thread impose on the others?

    2. I like your scale here. It puts it into terms that engineers can start to reason about as they design their applications and think about introducing concurrency primitives.

    1. I think that’s different. Here I am measuring overhead…

      If you have a small, embarrassingly parallel problem, Amdahl’s law will allow for good results… but if the problem is too small, you won’t get good results.

      Suppose you have four independent jobs to do, but it takes you one hour on the phone to hire an intern to do them. You might be better off doing the four tasks yourself.

      1. I think you can use a variant of Amdahl’s law and this is kind of how I think of it.

        Essentially, the overhead is like the serial portion in Amdahl’s law. So if the overhead is 1000 ns and the work, single threaded is 3000 ns, you use a serial fraction of 0.25. Really, the overhead is not serial, but rather parallel but proportional to the number of threads but that works out to almost the same thing.

        One way of thinking about it is that the serial part is the part where the original thread is doing work, and the other threads are suffering their overhead period, after which there is a parallel phase.

        This derivation is inexact in the sense that some of the overhead also occurs on the original thread, so the overhead is not really all parallel. I fact, I think this makes it kind of very not-Amdahl if most of the overhead ends up on the original thread, but I’m going to submit this comment anyways.

        1. I like the way you put it.
          To me, the “p” parameter in Amdahl’s law is the “total work”, and that work is composed of 1/ overhead and 2/ useful work that can be carried out in parallel.
          Furthermore, the overhead has 2 components: one that is spent on the main thread while spawning another thread (or communicating with other threads) and another that is spent on the worker thread before doing useful work and that is just the cost of acquiring the task (for instance it’s the cost of dequeueing the task).

          I agree with Travis that the cost of spawning threads or communicating with a pool of threads is proportional to the number of workers (i.e. it gets worse when there are more workers).

          Another note, just doing “counter++” causes inter-core traffic and false sharing, so this example might even be susceptible to overhead of the CPU architecture (not even talking about OS or C runtime overhead).

          1. Another note, just doing “counter++” causes inter-core traffic and false sharing

            I am not sure that there is false sharing. My own definition is of false sharing is when two active threads modify the same cache line, but at different locations in the cache line.

            1. That may be the strict definition of false sharing, but the effects are the same: the cache lines must be synchronized between cores and that makes other cores stall when the variable is modified on one core (marked “E” in the MESI protocol). This forces local caches to wait and refresh their copy, only to attempt making the line exclusive locally.

              1. I understand and agree with what you write, but coming back to the terminology, why would it be “false” sharing? My understanding of “false sharing” is that you may have two threads that appear to work on different data… and they are… except that if the cache lines are the same, then it is as if they shared the same data. Hence the qualifier “false” (they are not really sharing).

                BTW I have updated my code with an empty thread benchmark. The empty thread is slightly faster as you would expect.

                1. You are right, I was abusing the meaning of “false sharing”. Indeed in the case of this code, the sharing is very much true 🙂
                  My sentence was ‘just doing “counter++” causes inter-core traffic and false sharing’, so ‘inter-core traffic’ (a.k.a true sharing) is the concern here, although there could also be false sharing if so other variable colocated with “counter” happens to be read/written from other threads (not the case in your code).

                  1. I agree, but I guess the effect is probably very small here because Daniel is only creating one thread at a time and this type of inter-core communication is probably on the order of 100 ns, but the tests results are roughly 10,000 ns per thread in the best case.

                    This kind of stuff becomes much more important once you’ve eliminated the overhead of creating and tearing down a thread for each unit of work, and especially once you’ve eliminated any kernel calls (at which point the 100 ns matters and may be the dominant factor limiting how much you can do in parallel).

  2. I am surprised about such great variability among CPUs with the same kernel, thanks for that. Other OSes would probably be a terrible publicity stunt..

    Could you try the following cheat?

    auto f = std::async(std::launch::deferred, []{counter++;}) ;
    f.get();

    Also, C++ does not have standard thread pools, but the C++17 parallel algorithms have to manage threads automagically somehow. It would be interesting to inspect their overhead. There are several competing implementations though.

  3. I have a recollection from my undergrad years, when I was reading some of the Solaris man pages, that they advised that you shouldn’t create a new thread for a task unless that task had a minimum of 1,000 instructions required, otherwise the overhead wasn’t worth it.

    It’s a bit curious that 20+ years later, despite advances in hardware, your own analysis comes to about the same conclusion. Maybe I shouldn’t be surprised… even if the instructions are 10x faster now (say), I suppose the ratios would all stay the same.

  4. It woud be interesting to compare the energy costs of the three CPUs. ARM is said to be energy efficient, but it took more time for the task.

    So, how much energy per task does the CPUs consume?

  5. Structuring your code to use thread pools could be helpful in some situations. Same purpose as database connections pool. We know that establish a connection is slow, you need to re-use them instead.

Leave a Reply

Your email address will not be published. Required fields are marked *

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax