Reusing a thread in C++ for better performance

In a previous post, I measured the time necessary to start a thread, execute a small job and return.

auto mythread = std::thread([] { counter++; }); // start a thread to do a tiny job
mythread.join(); // wait for the thread to complete

The answer is thousands of nanoseconds. Importantly, that is the time as measured by the main thread: sending the query and getting back the result takes thousands of nanoseconds and thousands of cycles. The work in my case is just incrementing a counter: any more involved task will only increase the overall cost. The C++ standard library also provides std::async to call a function asynchronously and collect the result: it is practically equivalent to starting a new thread and joining it, as I just did.
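
For reference, the std::async version might look as follows. It is a minimal, self-contained sketch: I declare the counter as a global atomic so that the program is well defined.

#include <atomic>
#include <future>

std::atomic<int> counter{0};

int main() {
  // std::launch::async requests a new thread rather than deferred execution.
  auto f = std::async(std::launch::async, [] { counter++; });
  f.get(); // wait for the task to complete, much like join()
}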

Creating a new thread each time is fine if you have a large task that needs to run for milliseconds. However, if you have tiny tasks, it won’t do.

What else could you do? Instead of creating a thread each time, you could create a single long-lived thread. This thread loops and sleeps, waiting to be notified that there is work to be done. I am using the standard C++11 approach, with a mutex and a condition variable.

  std::thread thread = std::thread([this] {
    while (!exiting) {
      std::unique_lock<std::mutex> lock(locking_mutex);
      // Sleep until the main thread signals that work is available or that we should exit.
      cond_var.wait(lock, [this]{return has_work||exiting;});
      if (exiting) {
        break;
      }
      counter++; // the actual work
      has_work = false;
      lock.unlock();
      cond_var.notify_all(); // tell the main thread that the work is done
    }
  });
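
For completeness, here is a sketch of how the main thread might submit a unit of work and wait for its completion under this design. The member names (has_work, exiting, locking_mutex, cond_var) match the snippet above, but the surrounding class and the function name are my own assumptions for illustration.

  // Hypothetical member function of the class that owns the worker thread above.
  void run_work() {
    {
      std::unique_lock<std::mutex> lock(locking_mutex);
      has_work = true; // publish the work while holding the lock
    }
    cond_var.notify_all(); // wake the worker
    std::unique_lock<std::mutex> lock(locking_mutex);
    // Sleep until the worker clears has_work, meaning the work is done.
    cond_var.wait(lock, [this] { return !has_work; });
  }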

It should be faster and overall more efficient: you should expect gains ranging from 2x to 5x. If you use a C++ library with thread pools and/or workers, it likely adopts such an approach, albeit with more functionality and generality. However, the operating system is in charge of waking up the thread and may not do so immediately, so this is unlikely to be the fastest approach.

What else could you do? You could avoid system dependencies as much as possible and simply loop on an atomic variable. The downside of this tight-loop (spinlock) approach is that your thread may fully occupy a processor core while it waits. However, you should expect it to get to work much more quickly.

  std::thread thread = std::thread([this] {
    thread_started.store(true); // let the main thread know that the worker is running
    while (true) {
      // Spin (busy-wait) until there is work to do or we are asked to exit.
      while (!has_work.load()) {
        if (exiting.load()) {
          return;
        }
      }
      counter++; // the actual work
      has_work.store(false); // signal completion to the main thread
    }
  });
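
The submitting side of this variant might look as follows; it is again a sketch, assuming that has_work and exiting are std::atomic<bool> members of the same hypothetical class.

  // Hypothetical member function: hand one unit of work to the spinning worker.
  void run_work() {
    has_work.store(true); // the worker's inner loop will pick this up
    while (has_work.load()) {
      // spin: the worker clears the flag once the work is done
    }
  }

Note that both threads now burn a core while they wait: that is the price paid for the low latency.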

The results will depend crucially on your processor and on your operating system. Let me report the rough numbers I get with an Intel-based Linux box and GNU GCC 8.

new thread each time: 9,000 ns
async call: 9,000 ns
worker with mutexes: 5,000 ns
worker with spinlock: 100 ns

My source code is available.

Published by Daniel Lemire, a computer science professor at the University of Quebec (TELUQ).

15 thoughts on “Reusing a thread in C++ for better performance”

  1. Hi Daniel,
    (Been following this blog for years, yet this is my first comment.)

    Every now and then, and for a very long time, this subject has intrigued me. I get results similar to yours, but that is not my question. The question is why the CPU industry, which is and was mostly driven by the need for more speed and more cores, seems to ignore this exact point: switching between threads. GPUs have hundreds of cores and CPUs already have tens, yet there is no dedicated instruction, similar to HLT (halt), to be woken up by another instruction, nor a dedicated instruction set to time very short sleeps to save power. Such a thing might be very useful: it would boost speed in some cases and save power in others.
    Why does switching between threads in an efficient way seem to be unimportant, or not a priority?

    For me it looks as though it has been decided that this is a software issue, to resolve or to live with. Yet CPU technologies do evolve to speed up specific software problems. Maybe it is hard or wrong to do in hardware; maybe. On the other hand, seeing what was considered hard or impossible 15 or 20 years ago (or even more) in a device you can hold in one hand means one thing: hard and impossible are relative matters, not absolute.
    Is it wrong to begin with? Or just wrong relative to our time, so that it may be seen differently in a few years?

    Daniel, I would love to read your opinion and thoughts about that, maybe in a blog post.

    1. switching between threads in an efficient way seems to be unimportant, or not a priority

      It is very application dependent. In HPC (scientific computing), programs typically pin one thread to each core so that the threads do not disturb each other, while operating systems are optimized to minimize the noise introduced by other applications taking CPU time.

      yet there is no dedicated instruction, similar to HLT (halt), to be woken up by another instruction, nor a dedicated instruction set to time very short sleeps to save power

      Intel processors already offer something like that. The MONITOR and MWAIT instructions track a memory location and put the core in a low-power state until that location is written to. The problem is that they are processor specific and not portable to other platforms.

    2. Hi KasOb,

      There’s definitely a lot going on in CPU technology to reduce the cost of concurrency and context switching:

      - Hyperthreading is probably the most well-known: the CPU exposes a single core (with a single set of execution ports) as a pair of “logical” cores to the OS, which can schedule two different tasks on it; the CPU executes both tasks interleaved, and whenever one task blocks (for instance, due to a cache miss or an atomic memory operation, or if it is spinning on a lock and signals it with _mm_pause), the other task can run. In a more traditional system (no hyperthreading, software scheduler), the cycles that the task spent blocked would simply be “lost” (no useful work happening).
      - New concurrency-related hardware features (lock elision, hardware transactional memory, …) enable faster implementations of locks, semaphores, work queues, and so on. These features are not really consumed directly by most software engineers, as they require very specialized knowledge to use effectively, but libraries of high-performance concurrency primitives tend to leverage them.
      - On Arm Cortex-M CPUs (e.g., ARMv8-M), the NVIC (Nested Vectored Interrupt Controller) supports fairly complex and flexible task configurations. For instance, the RTIC (Real-Time Interrupt-driven Concurrency) framework reduces a program’s scheduling policy (i.e., the relative priorities of the various tasks) to an NVIC configuration at compile time, meaning that all context switching and task management is handled by the hardware rather than by a software scheduler. Cherry on top, RTIC extracts information about which resources are used by each task, both to avoid unnecessary locks (if a task uses a given shared resource but no higher-priority task does, it can safely skip taking and releasing the lock) and to avoid unnecessary blocking (when a task A is in a critical section, only tasks that use some of the same resources are blocked; higher-priority tasks that do not interact with A can still preempt it as needed).
      I’m not aware of any general-purpose OS doing this, though. 🙁

      1. Thank you Nicolas,

        What you described about ARMv8 is in fact very interesting (I didn’t know that). Also, reading that Apple will release Macs with ARM processors in 2021 indicates that the processor technology race is not slowing down; on the contrary, it is picking up pace.

        The cherry on top you mentioned makes sense, IMHO, as a way to simplify multi-reader single-writer implementations (maybe even multi-writer with atomic behaviour!), providing greater efficiency with lower power consumption.

        Thank you again for replying with this information.

  2. In our case, since the operating system closes a thread down in its own time, we quickly ran out of threads using the first approach. Re-using the thread was the only workable solution.

    1. It is intriguing. Did you join your threads and still get the problem? I am hoping that once the call to join succeeds, the thread is gone. Calling detach would be something else… but I hope that “join” actually cleans the thread up…

  3. The spinlock approach is something that should be avoided at all costs. Especially on single-core machines, it will effectively kill the performance of the whole system. I would never ever do that!

  4. Some quick observations you might not be aware of:

    - when spinning on a lock, it’s usually a good idea to emit an instruction signalling that to the CPU (_mm_pause on x86/amd64, yield on Arm): it enables optimisations such as switching to another hyperthread on the same core while waiting for the lock, or going low power (modern CPUs are often bottlenecked by heat management, so going low power can let other, useful work happen at a higher clock frequency); see the sketch after this list
    - good mutex and work-queue implementations already spin for a short while (to optimise away the context switch when the duty cycle is high) before parking the thread (typically using a futex, so that the OS scheduler knows exactly when to wake up a thread as work becomes available); I wasn’t quite able to figure out what GNU libstdc++ does from reading the relevant code, but it seems not to do spin-then-futex for some reason.
    - in more general work-queue use cases, using a spinlock alone is susceptible to priority inversion: if the thread holding the lock gets interrupted in the critical section, the OS might schedule the other threads (which are spinning uselessly) instead of the one holding the lock.
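
    To illustrate the first point, the inner wait loop of the post’s spinning worker could emit the hint like this (a sketch; _mm_pause comes from <immintrin.h> on x86/amd64, and on Arm one would use the yield instruction instead):

      #include <immintrin.h> // provides _mm_pause on x86/amd64

      // Inner wait loop of the spinning worker, now with a pause hint.
      while (!has_work.load()) {
        if (exiting.load()) {
          return;
        }
        _mm_pause(); // tell the CPU that we are spin-waiting
      }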
