Developers often believe that software performance follows a Pareto distribution: 80% of the running time is spent in 20% of the code. Using this model, you can write most of your code without any care for performance and focus on the narrow pieces of code that are performance sensitive. Engineers like Casey Muratori have rightly criticized this model. You can read Muratori excellent piece on his blog.
It is definitively true that not all of your code requires attention. For example, it is possible that 99% of the time, your code processes correct data. The code that handles errors could be quite slow and it would not impact most of your users.
But the hotspot predicts something more precise: you should be able to just keep all of your code, identify the specific bottlenecks, optimize these bottlenecks, and then get great performance all around. Muratori relies on empirical evidence to falsify the model: many companies embarked in large rewrites of their codebase to optimize it. Why bother with such expenses if they could simply identify the bottlenecks and fix those?
We have tools today called profilers that can tell you roughly where your software spends its time. And if you apply such a tool to your software, you may indeed find massive bottlenecks. It may sometimes work wonders. For example, there was a video game (GTA Online) that was loading a JSON file. Simply optimizing this one bottleneck solved a massive performance issue. It did not make the game more performant, but it made it start much faster. So bottlenecks do exist. We should hunt them down and optimize them. But that’s not what the hotspot model predicts: it predicts that it is all you need to do to get performance. Hit a few bottlenecks and you are done.
Sadly, it does not work.
Let us run down through a few reasons:
- Overall architecture trumps everything. If you first build a bus, you cannot then turn it into a sports car with a few changes. A few years ago, a company came to me and offered me a ton of money if I could make their database engine faster. They had money, but their software was much too slow. At first I was excited by the project, but I started reading their code and doing benchmarks, and then I realized that the entire architecture was wrong. They insisted that they knew where the hotspots were and that they just needed the expertise to optimize these few components. They told me that their code was spending 80% of its time in maybe 100 lines. And that is what profilers said. It is true, formally speaking, that if you could have made these 100 lines of code twice as fast, the code would have run twice as fast… but these lines of code were pulling data from memory and software cannot beat Physics. There are elementary operations that are not time compressible: you cannot move data faster than what is allowed by the hardware. The key point is that if your software does not have a good overall architecture, if it is not well organized for performance, you may have no choice but to rewrite it from the ground up, to re-architecture it.
- As you optimize, the hotspots multiply. Going back to the example of GTA Online, it is easy to find that the program spends 10 seconds loading a 10 MB JSON file. However, the next steps are going to be more difficult. You will find that that finding the bottlenecks become difficult: we are subject to a Heisenberg principle: measuring big effects is easy, measuring small ones becomes impossible because the action of measuring interacts with the software execution. But even if you can find the bottlenecks, they become more numerous with each iteration. Eventually, much of your code needs to be considered.
The effort needed to optimize code grows exponentially. In other words, to multiply the performance by N, you need to 2N optimizations. The more you optimize, the more code you must consider. It is relatively easy to double the performance of an unoptimized piece of code, but much harder to multiply it by 10. You quickly hit walls that can be unsurmountable: the effort needed to double the performance again would just be too much. In effect, we have a long tail effect: though there are clearly easy wins, much of the work lies in the ‘tail’.
And that explain why companies do full rewrites of their code for performance: the effort needed to squeeze more performance from the existing code becomes too much and a complete rewrite is cheaper.
It also means that you should be acutely concerned about performance when you design your software if you want to avoid a rewrite. Knuth told us that premature optimization is the root of all evil, but he meant that before you worry about replacing your switch/case routine with gotos, you should work on the algorithmic design and code architecture. In effect, Knuth was preemptively rejecting hotspot performance engineering, and you should too.
Further content: Federico Lois — Patterns for high-performance C#.
16 thoughts on “Hotspot performance engineering fails”
This also ignores nonlocal effects. It is entirely possible for all of the following to be true simultaneously:
80% of the time is spent in 20% of the code.
There is not much more optimization to be squeezed out of said 20% of the code.
The code can be optimized significantly.
If I write a piece of code that, for instance, temporarily grabs a huge amount of memory and causes the OS to drop most of its page cache as a result, said piece of code can be “fast” and thus not really show up on profiles… but its overall effect on the program can be huge.
This sort of aggressor/victim issue shows up all over the place: cpu cache, branch prediction, TLBs, OS page cache (and paging in general), texture cache, disk accesses, atomic contention, hyperthreading, JITter deoptimizations, etc, etc.
I’ve seen too much code that is far slower than it could be, because the authors have effectively given up on optimizing it further, in turn because they’ve heavily optimized the “hot” section and ignored the “cold” sections, missing that the reason the “hot” section is slow is that it is a victim of the “cold” sections.
I once made a custom hash table implementation as a python function. I used python’s c-api to request a function pointer to pythpn’s builtin hash function (SipHash). This performed well enough, but I found that using XXHash instead made the function even faster.
Then I started to benchmark the entire program rather than just the function, and the XXHash implementation made it slower. I think this is because python uses its own hash quite a lot (lots of things are python dictionaries under the hood) which kept the has in the L1 instruction cache. By the time my custom hash table function came around, fetching the instruction sequence for XXHash was slower than using python’s builtin hash.
So this is exactly as nonnull describes.
Another example is where I benchmarked a python script that did not take into account cache locality by operating one large lists. The slowness did not show up in the profiler. The script took 26 seconds but the cumulative time of all the functions profiled was less than 8. So I rewrote it to perform all the transformations on one item at the time using iterators. That made it 10 times faster.
I like to think of hotspot optimization as a way of getting closer to the local optimum; however not the global optimum.
One important thing about hotspot optimization is that the initial 2x wins are often trivial changes — and the sad part is that people don’t do those initial optimizations. The “trivial thing” is usually something like, “Oh, you forgot to configure your database connection pooling to allow multiple connections”. It’s also often not because they don’t know how to do the trivial optimization, but because they haven’t realized that there is a problem in a first place.
But as you’ve said, there’s only so far you can go with them. If I don’t see an obvious place in the profile to make the small fix, then I’ll stop.
“Premature optimization is the root of all” evil is sometimes attributed to Hoare, but it was Knuth.
A fellow academic and a long time reader/fan of your blog here.
This is a common belief/recommendation that “optimizing the error path is unnecessary because it rarely happens”. Our research shows that slow error paths can lead to what we call “metastable failures”. Please see this paper for a high-level description of these failures, specifically section 2.3, and this paper for in-the-wild examples. Here’s a public example (also cited in the 2nd paper) of a case where slow error path caused a metastable failure: https://engineering.atspotify.com/2013/06/incident-management-at-spotify/
Yes, slow paths can be a vulnerability.
A good idea is decoupling the 80/20 rule from the conclusion that change must be applied to that hot 20% of code.
Another is to broaden the 20/80 rule to data, rather than just code. Profilers can help by highlighting hot data.
Have been confronted with the “premature optimization…” cite more than once as a sloppy excuse for weak code, worst case was (me) “you have there a quadratic runtime, which can be done in linear” – “yeah, already figured, but it was more straight forward to code it this way, and btw premature optimization…” – Haha, well the problem was sleeping for 2 months before it shot back, just when the first real data got processed…
I always understood Knuth’s cite more as a warning not to go down the rabbit hole chasing the last <5% of optimization, which gets harder and harder if the lowing hanging fruits are already fixed. But using it as an excuse not to think about a better approach in the first place is imho just wrong.
Sometimes I wonder if CS courses should emphasize bare metal mechs (computer architecture) & complexity more again, as the knowledge, what a machine really has to do about the data, seems to get thinner. Ofc this is only anecdotical evidence from my side, but it got more of an issue in our local university, when they moved from C to Java as their introduction programming course. “The GC will fix that for me” is already pretty close to “dont make me think”. Oh wait there is ChatGPT for the rescue – isn’t it? 😉
The quote will more context
Is Knuth from, “Structured Programming with go to Statements” on page 8 of 41. In the next paragraph the important part that everyone skips.
The Knuth performance quote gets abused almost as often as programmers optimize the wrong things. It doesn’t look like Knuth is arguing against architectural efficiency or thinking about performance. The context is optimizing existing code needlessly w/o measuring. And that measuring should be a critical part of developing a system, so much so that it should be always-on.
The fatal flaw in this approach is that you can have regions of inexpensive code that nevertheless significantly slow down the program overall. (For instance, forcing the OS to drop its page cache.) I talk about this more in my prior comment.)
If you require prior measurement of “need” to optimize a piece of code, you’ll miss things like this and end up in a penny-wise-pound-foolish local optimum, where everything that takes a lot of time has been overoptimized, and yet you are still slow.
Would extend this issue to the whole (operating) system – in the end everything is fighting over register usage, cache lines, mem pages and so on, and the kernel tries to balance things out from its schedulers. But with wicked code you can raise the kernels runtime significantly (e.g. by sheer syscall pressure or lousy memory handling) – and the whole system may suffer.
The interesting bit here is the fact, that nowadays performance concerns are much more IO-related, while in the past the processing speed of the CPUs was a big limiting factor, too. We kinda put CPUs on steroids, but somehow the peripherals did not keep up. Which puts a major burden on clever resource management on kernel side.
My impression is that CPU performance has long stagnated due to the failure of Dennard scaling, and the repeated failures of the leader (Intel). However, we kept getting more bandwidth.
In the last decade, our disk IO was multiplied by at least 10. My PlayStation 5 has a 5 GB/s disk. The network speeds also improved by a similar factor.
Our single-processor performance surely did not improve in a similar manner after controlling for prices. That is, you can get a 32-core Intel PC today that will be worth about 10 CPUs from ten years ago, but it is going to be expensive. Of course, GPUs came to the rescue, but GPUs are specialized devices.
Yes true, just looking at the last 10ys we had bigger shifts on IO side. IO lost most of its ground in the 90s and early 2000s and has still way to go.
To me it seems, that most of the M1 success story is IO-related, which may give a glimpse on how bad things are on x86, where we instead get more cache to play with (and the 1001th SIMD instruction). Ofc the M1 has very different production needs and costs with its big-die-approach, no clue if that would even scale for broader industry adoption. However, it seems the M1 was needed to wake up x86 vendors from their slumber.
On a sidenote: I am really curious how the risc-v vector extension with its variable registers will turn out, it reads a bit like the re-invention of old Cray machines.
To me it seems, that most of the M1 success story is IO-related,
The M1 can retire 8 instructions per cycle, something that no x64 can do.
The ARM-based Graviton processors on AWS can beat the best of the Intel server processors on a computational task (parsing URLs): https://lemire.me/blog/2023/03/01/arm-vs-intel-on-amazons-cloud/
I am still very excited about Intel processors, especial AVX-512…
While the fact that M1 can retire 8 instructions per cycle and x86 do not, you surely knows this is a very misleading representation.
The historical take of x86 on this subject has always been “well, our instructions do more per sintruction”, which has always been only half-true at best. OTOH, the modern x86 processors convert the x86 instructions to an internal, hidden, instruction architecture, which is what really gets executed. While I believe there is no data sheet that tells us how many of those internal instructions are retired by cycle, simple logic guarantees that the number is higher that the known x86-level instructions per cycles.
So, it is almost certain that Intel and AMD are retiring just as many instructions per cycles as the M1. It’s just that the instructions retired are the internal ones. As proof, one only has to look at the block diagram of such processor: the number of different execution units clearly indicates that they are handling more internal instructions than the IPC numbers indicates.
it is almost certain that Intel and AMD are retiring just as many instructions per cycles as the M1.
I think you meant to refer to micro-operations.
You may subscribe to this blog by email.