Avoid exception throwing in performance-sensitive code

There are various ways in software to handle error conditions. In C or Go, one returns error code. Other programming languages like C++ or Java prefer to throw exceptions. One benefit of using exceptions is that it keeps your code mostly clean since the error-handling code is often separate.

It is debatable whether handling exceptions is better than dealing with error codes. I will happily use one or the other.

What I will object to, however, is the use of exceptions for control flow. It is fine to throw an exception when a file cannot be opened, unexpectedly. But you should not use exceptions to branch on the type of a value.

Let me illustrate.

Suppose that my code expects integers to be always positive. I might then have a function that checks such a condition:

int get_positive_value(int x) {
    if(x < 0) { throw std::runtime_error("it is not positive!"); }
    return x;
}

So far, so good. I am assuming that the exception is normally never thrown. It gets thrown if I have some kind of error.

If I want to sum the absolute values of the integers contained in an array, the following branching code is fine:

    int sum = 0;
    for (int x : a) {
        if(x < 0) {
            sum += -x;
        } else {
            sum += x;
        }
    }

Unfortunately, I often see solutions abusing exceptions:

    int sum = 0;
    for (int x : a) {
        try {
            sum += get_positive_value(x);
        } catch (...) {
            sum += -x;
        }
    }

The latter is obviously ugly and hard-to-maintain code. But what is more, it can be highly inefficient. To illustrate, I wrote a small benchmark over random arrays containing a few thousand elements. I use the LLVM clang 12 compiler on a skylake processor. The normal code is 10000 times faster in my tests!

normal code 0.05 ns/value
exception 500 ns/value

Your results will differ but it is generally the case that using exceptions for control flow leads to suboptimal performance. And it is ugly too!

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

12 thoughts on “Avoid exception throwing in performance-sensitive code”

  1. To make the cost of exceptions more explicit, consider Java. When an exception is thrown, a full backtrace is stored. This involves quite some memory allocation.

    Now if you are in a language with elegant and efficient optionals such as Rust, you can get some very nice code to handle nonstandard cases.
    But things get more tricky when you need to, e.g., parse numbers from a stream. What if you expect a number, but it’s syntax is invalid? In Java, the method then no longer can return a native int, but you would need to trust the hotspot compilers escape analysis to eliminate all overhead from returning a double boxed Optional.
    Some tools (e.g., ELKI https://github.com/elki-project/elki/blob/146bcb9fbc428e9c4bccdfde3e6f17ae18a38ebd/elki-core-util/src/main/java/elki/utilities/io/ParseUtil.java#L162 ) use preallocated exceptions without a stack trace in performance critical paths, where exceptions are somewhat needed to not pay other overheads (boxing) and a lot of boilerplate code to unbox and handle the exceptions yourself.
    I don’t know if something similar would be possible in C?
    The Rust solution here seems very nice and elegant to me.

  2. I think that optimizing for the most common path and then adjusting for the extraordinary path is the best way to do things.

    Exceptions should be rare, building the context for an exception is done dynamically as far as I can remember. There is a fair amount of discussion about exactly this topic here on StackOverflow: https://stackoverflow.com/questions/13835817/are-exceptions-in-c-really-slow

    To build onto the comment of E.S., here is more historical context as to why C++ does not store the context except precisely where an exception occurs: https://stackoverflow.com/questions/3311641/c-exception-throw-catch-optimizations

    The current accepted answer as of May 16, 2022 links to the clang-dev mailing list describing why exceptions are done the way they are: https://lists.llvm.org/pipermail/cfe-dev/2015-March/042035.html

    Lemire does make an excellent point and I wholly agree.

    I have rather harsh opinion on error codes, mostly for the meat space aspect where context may not easily be understood or there is cascade of missing context(s) on the scheme of returning codes up the stack of a call that goes deep. But that is probably a discussion for a later time.

  3. The result of “0.05 ns/value” is highly dubious. If your test machine has 5GHz and you manage to execute 4 instructions per cycle, you still consume 0.05 ns/instruction. The code looks like it would take more than 1 instruction per value.

    You use the result, so the compiler probably doesn’t remove the entire loop. But the core looks simple enough for auto-vectorization. Depending on your view, that may make the comparison unfair.

    Then again, even translated to scalar code I would expect throughput around 1ns/value, still 500x faster than the exception variant.

    1. I am not sure why you are saying that the result is dubious. You would expect most compilers to autovectorize the exception-free routine. That’s the desired output.

      1. I understand his argument that you end up measuring two effects at the same time: the (in-)ability to auto-vectorize this simple code and the overhead of exception handling. And that it would be better to test an example where no auto-vectorization happens when you’re interested in the overhead of using exception traps itself.

          1. I am not an expect on microbenchmarking C.
            Maybe some compiler hint to prevent auto vectorization,
            __attribute__((optimize(“no-tree-vectorize”)))
            and a more expensive computation inside (say, sqrt or cos) helps. And a third version with no if statement, so we can measure the overhead of if/exception vs. branchless. Though I would expect branchless to not make much of a difference because of CPU pipelining. It if we get the 500 ns/value again, we know it’s reliable. Then try with positive/negative input data only to see if the cost is asymmetric. Because if it’s highly asymmetric, it may still be fine to use for low frequency case handling.

      2. Maybe “surprising” would have been a better term. Typically when I see performance numbers far outside my expectation, there is a bug somewhere. In this case the answer is auto-vectorization, so not a bug.

        Whether it is a fair comparison is a matter of debate. Both answers seem defensible. Anyway, thank you for the interesting benchmark.

  4. I randomly stumbled across this article and decided to measure it using Google’s benchmark library. I did it locally and on Quickbench and did not observe the same results as you. You can take a look here:

    https://godbolt.org/z/EohfcxMx6

    Click on the “Quick-bench” link above the source code, and then “Run benchmark” on quick bench website. Alternatively you could build this locally and test, though you do need to ensure you turn of cpu throttling to get accurate numbers.

    If you haven’t already, Niall Douglas’ outcome and status-code are worth checking out (as well as std::expected), Niall has done some interesting benchmarks:

    https://www.boost.org/doc/libs/master/libs/outcome/doc/html/faq.html#is-outcome-suitable-for-fixed-latency-predictable-execution-coding-such-as-for-high-frequency-trading-or-audio

    https://github.com/ned14/outcome
    https://github.com/ned14/status-code

    1. Your benchmark measures the time needed to sum empty arrays. Unsurprisingly, it is a flat tiny cost.

      Try with the following routine.

      for (size_t i = 0; i != 10000; i++) {
      data.push_back((rand() % 1000) - 500);
      }

      Google Benchmark is convenient, but it does not do anything magical. This being said, tools like Google Benchmark do report the effective CPU frequency when they can so any throttling would be detected. In this instance, it is highly unlikely you will experience throttling on a reasonable platform. In any case, the result is so plain that no sophistication is needed to measure it.

      Thanks for the links.

      1. Oops! You are correct, I’ve updated the example here:

        https://godbolt.org/z/Yq3oh9oo7

        On this updated benchmark on Quickbench, the with exception version is 450 times slower than the without exception version.

        I agree that Google benchmark doesn’t do anything magical, but it does make sharing benchmarks much more concise and does offer quite a few nice utility functions/macros that are annoying to re-implement.

Leave a Reply to E.S. Cancel reply

Your email address will not be published. The comment form expects plain text. If you need to format your text, you can use HTML elements such strong, blockquote, cite, code and em. For formatting code as HTML automatically, I recommend tohtml.com.

You may subscribe to this blog by email.