I have designed a benchmark that I run in different programming languages. Two languages that I like are Go (from Google) and Java (from Oracle). My expectation would be that Go and Java should have similar performance in due time. Indeed, both are modern garbage-collected languages supported by major corporations with deep pockets.
It used to be that Go was handicapped in my benchmark because the language did not offer a simple way to compute population counts using specialized processor instructions. Go 1.9 changes that with the introduction of the math/bits package. I was hoping for Go to catch up to Java.
So how does Go fare against Java now?
Let us sum this up:
create | count | iterate | |
Java 8 | 5 ms | 0.6 ms | 4 ms |
Go 1.9 | 10 ms | 1.2 ms | 6 ms |
So Go is often running at half the speed compared to Java. The Count benchmark in Go is about two times faster than it was prior to Go 1.9, but it is still far from Java.
What could explain such a difference? If you look at the code, it is all very simple loops.
Modern programming involves many short functions. Programming with short functions makes code review much easier, and it avoids code duplication.
Calling a function can be a relatively expensive process. The system has to copy the data in the right registers, allocate stack memory, and so forth. Moreover, functions are somewhat opaque to the optimizing compiler so that many easy optimizations are simply not possible without function inlining.
As programmers, we usually do not worry about the performance cost of having many function calls because we expect the function calls to be mostly optimized away. One way the system optimizes away function calls is by “inlining” the function… in effect, it “copies” the code in place, so that there is no function call at all. That’s hardly bleeding edge technology in 2017: most optimizing compilers have been doing it for decades.
Inlining is not always useful: when used indiscriminately, it can grow the size of the executables. You do not want to inline large functions. But if you have a long loop that repeatedly calls the same small function, you can expect to greatly benefit by inlining the function in question.
Yet, as far as I can tell, Go is terribly shy about inlining function calls even when it would obviously make sense. We can verify that Go does not inline small hot functions within tight loops in my benchmark by examining the assembly:
$ go test -c && go tool objdump -S -s BenchmarkLemireCreate bitset.test |grep CALL 0x5193d3 e848c9ffff CALL github.com/willf/bitset.(*BitSet).Set(SB)
What the Set function does is simple: it sets a single bit in a single machine word, but Go won’t inline it, possibly because there is a branch involved. We can double the speed of the Create test if we manually inline the Set function calls and do some minor surgery on how the data gets reallocated.
Even when Go apparently inline the function calls, as in the math/bits calls, it seems to surround the single instruction that should be emitted by guarding code. In effect, the processor checks that the instruction in question is supported each and every time it needs to be called. That can probably reduce the performance of the instruction by a factor of two.
Should you care? I think you should. Having half the speed means that you might end up using two cores to solve the same problem in the same time. That’s twice the energy usage! And that is if you are lucky: parallelizing is hardly free from complexity and pitfalls.
Of course, my benchmark is probably not representative of whatever systems people build, but it is also not crazily unrealistic. It is likely representative of many performance bottlenecks.
Go has to get inlining right.
Further reading:
- Proposal: cmd/compile: add a go:inline directive
- cmd/compile: improve inlining cost model
- Mid-stack inlining in the Go compiler
- cmd/compile: enable mid-stack inlining
- cmd/compile: for-loops cannot be inlined
- cmd/compile: inline function calls that return a closure
George Tankersley has a GopherCon 2017 talk (I want to Go fast) about how to get Go to inline small functions. He gets good results, but the process is not pretty. It is clearly a case where he is fighting against the optimizing compiler in a manner that should not be necessary.
Appendix. My results are reproducible.
Let us first run the benchmark using Java 8:
$ git clone https://github.com/lemire/microbenchmarks $ cd microbenchmarks $ mvn install $ java -cp target/microbenchmarks-0.0.1-jar-with-dependencies.jar me.lemire.microbenchmarks.bitset.Bitset Benchmark Mode Samples Score Error Units m.l.m.b.Bitset.construct avgt 5 0.008 ± 0.001 s/op m.l.m.b.Bitset.count avgt 5 0.001 ± 0.000 s/op m.l.m.b.Bitset.iterate avgt 5 0.005 ± 0.000 s/op
Let us run the benchmark using Go 1.9:
$ go get github.com/willf/bitset $ cd ~/go/src/github.com/willf/bitset $ go test -bench=Lemire {0,1,2,3,4,5,6,7,8,9} goos: linux goarch: amd64 pkg: github.com/willf/bitset BenchmarkLemireCreate-2 100 10440889 ns/op BenchmarkLemireCount-2 1000 1271236 ns/op BenchmarkLemireIterate-2 200 6111728 ns/op PASS ok github.com/willf/bitset 4.414s
It’s impressive that you’ve compared the simple things such as count and iterate. Often the benchmarks on the web are comparing a 3 kb static page which often is not a real indication of how fast or slow a programming language actually is.
Inlining is a simple process, why do you think it wasn’t implemented in Go?
I’m not exactly sure why function inlining has been such a low priority for Go. I realize that it is not trivial, but it is not exactly like sending a rocket on the Moon either.
I do not fully grok the culture behind Go.
To be clear, Go is impressive and I have the utmost respect for the people developing it. I just don’t understand what makes them tick.
You should have disassembled the entire function rather than grep for a CALL, as you will find that is present as a fallback if the processor does not support the POPCNT instruction.
cnt += bits.OnesCount64(x)
0x1093534 0fb63d22810c00 MOVZX 0xc8122(IP), DI
0x109353b 4084ff TESTL DI, DI
0x109353e 7407 JE 0x1093547
0x1093540 f3480fb8f6 POPCNT SI, SI
0x1093545 ebd6 JMP 0x109351d
0x1093547 4889442418 MOVQ AX, 0x18(SP)
0x109354c 4889542410 MOVQ DX, 0x10(SP)
0x1093551 48894c2420 MOVQ CX, 0x20(SP)
0x1093556 48893424 MOVQ SI, 0(SP)
0x109355a e801ffffff CALL math/bits.OnesCount64(SB)
0x109355f 488b742408 MOVQ 0x8(SP), SI
0x1093564 488b442418 MOVQ 0x18(SP), AX
0x1093569 488b4c2420 MOVQ 0x20(SP), CX
0x109356e 488b542410 MOVQ 0x10(SP), DX
0x1093573 488b5c2440 MOVQ 0x40(SP), BX
0x1093578 eba3 JMP 0x109351d
You are correct that the POPCNT instruction is issued. Still. Let us dig into it a bit more. Let us study the assembly. The “JMP 0x109351d” part gets us back to the beginning of the loop. SI contains the word I care about. “POPCNT SI, SI” does what I want. DX contains the running count. I understand this part of the code pretty much:
As far as I am concerned there is one load too many, but maybe we get it for free anyhow. So ok. But what do these three instructions before the POPCNT call do?
The “JE 0x1093547” ships us to the CALL that should never be needed. You write that this call is present as “a fallback if the processor does not support the POPCNT instruction.” Are you actually telling me that each and every time, within a tight loop, the processor checks whether the hardware supports POPCNT, reloading the relevant information in a register… each and every time… within a tight loop??? Can’t we assume that if the processor supports POPCNT at the beginning of the loop, it will support it at the following iterations?
In the grand scheme, the check in the loop will make little difference. Why? It is likely that the “support_popcnt” value is going to be in the L1 (processor) cach and branch prediction will likely succeed meaning we won’t see any stalls. The overhead of a couple of extra instructions will be immeasurably small.
Check out [this gist](https://gist.github.com/stuartcarnie/773eaec7b0d549507533a64a6d191a8d).
* `BenchmarkLoopOne` has the `if isSupported` check in the loop and generates an ADD instruction, so it looks almost identical to the `POPCNT` code. This loop executes at about `0.59ns` per iteration.
* `BenchmarkLoopTwo` has no check and simply adds values. This loop executes at about `0.39ns` per iteration.
The difference is about 0.2ns or 200 picoseconds. Light travels about 6cm in that time 🙂
Of course the 3-instruction check is cheap but so is a popcnt (throughput of one instruction per cycle). In fact, you can sum up the population count of an array using one cycle per 64-bit word in C (see https://arxiv.org/pdf/1611.07612.pdf) and that’s not even close to the best you can do which is at least twice better (which at least one C compiler can do on its own).
Meanwhile Go is running at half the speed of Java on this benchmark… Java!
It is perfectly sensible to say that one does not care about my benchmarks… but I think that if you accept my benchmarks, then you have to agree that Go fares poorly at them.
My guess as to what gets Go into this kind of inefficiency is that it queries the CPU features when the program starts… which is fine, but then it has no concept that this state is unchanging afterward.
Go is especially interesting to me because Go applications are impressively fast for a GC language, yet at the same time it’s clear that Go could be *much faster*.
The language’s ethos is full of contradictions. It’s new, but it feels old in so many ways. Go is… dusty. When I looked at the compiler source a couple of years ago, I was surprised to see what looked like Rob Pike’s old Plan9 asm files. At the time Go was seemingly unaware of CPU instructions introduced since 2000 or so – there was no vectorization, no BMI, not even string compare instructions from SSE 4.2.
A lot has improved since then. For example, Klaus Post’s SIMD optimizations of deflate and gzip made it into Go 1.7.
But there’s still a lack of modernity in the language that keeps is slower than it should be. Interrelated problems: the lack of good inlining makes PGO less useful, and not surprisingly Go doesn’t have PGO. The lack of attention to vectorization, on both the front-end (explicit vectorization syntax or hints that a developer could employ), and the back-end wrt auto-vectorization, is another area where Go seems out of step with modern computer science and optimization.
The good news is that Go has lots of headroom, still! They’ve proved that you can build a precompiled language with GC that is much faster than Python and Ruby, and as fast as Java and C# – and that you can also have a much simpler and faster build process and toolchain along with the unusually fast runtime speed.
Go has served as a very useful demonstration of what’s possible, of how much better we can do than Python, Ruby, Java, .NET, etc. I feel like the next step is to show that we can have all the good things about Go along with much faster applications, vectorization, GPU/OpenCL, and the trappings of modernity like generics and much better syntax.
In fact, I think with the right team, a proprietary Go compiler and IDE could thrive in the marketplace. Go can be a lot faster, and some people would be willing to pay for it.
Go has served as a very useful demonstration of what’s possible, of how much better we can do than Python, Ruby, Java, .NET, etc.
I want to stress that I am not, at all, dismissive of Go as an engineering achievement.
Go uses function calls as entry points into the scheduler, there is a small crop of infinite loop bug reports about this. Maybe not doing function inlining is strategic laziness? If you inlined aggressively, that scheduler problem could go from “rare issue” to “common bug”
Thanks for the article. It was really informant. Besides I’d written a half-compiler and an interpreter, I never get near to code generation, so, I’m probably the most noob person to comment your article (really, when you say “right here, it’s clear this is a problem”, it seems old greek language to me…).
I like performance comparisons, but after reading “Ruby has been fast enough for 13 years” (https://news.ycombinator.com/item?id=11707973), I’d faced the truth: It doesn’t matter. My users doesn’t matter if is took 3 or 30 machine instructions more to show the kitty pictures it’s searching for, she/he will not care.
I’m not saying that optimizations is not important, but comparing a language that has been worked on by hundreds (maybe thousands) engineers more than two decades, with another that has about 7 years, is not fair.
To me, Go is “good enough”. My code is simpler, easier to understand and maintain, and makes concurrency light years more enjoyable than Java threads. I’ll not start to talk about memory usage.
Considering what I know about Go core team (and community), they are problems-oriented. When that “inlining lackness” become a real problem, it will be addressed.
It is entirely reasonable to say that Go is fast enough for a given application domain.
I also have great respect for the Go team. Please don’t misunderstand.
Time and engineering effort helps, but I don’t think it is true that I am being unfair to Go by comparing it to Java. Time buys you features but not necessarily more performance. It is not obvious to me that these performance concerns will automatically be addressed if we only give it time.
Software also gets more bloated with time. It does not always get leaner and more efficient. There is tension between adding more features and optimizing performance.
Did Ruby get faster over time?
As for whether Ruby is fast enough… Of course it is… for some things but not others.
Google will not rewrite its search engine in Ruby. Oracle won’t rewrite its database engine in Ruby.
Looking at `go test -gcflags=”-m”` output
extendBitSetMaybe() is considered too complex for inlining, which then makes Set() a non-leaf node, also unsuitable for inlining.
extendBitSetMaybe()’s complexity can be reduced.
And…
./bitset.go:165:15: inlining call to (*BitSet).Set …
./bitset.go:165:15: inlining call to (*BitSet).extendSetMaybe …
Do you happen to have a copy of this simplified code with the same functionality?
Using slice capacity seems not useful, if explicitly tracking the bit length. So can drop a two of the conditions.
Only test failure was in Equal, which assumed len(b.set) and b.length are related. Trivial fix, to range over a subslice of b.set based on b.length.
https://gist.github.com/renthraysk/146d9162b4d4cf58a26d797bddde4965
Maybe you could issue a PR on Will’s code?
Actually looking at cpuprofiles, seems at least 50% of the time is spent on two lines in extendSetMaybe(). The make() and copy().
make() ensuring new memory is zero’d, and then copy() immediately after.
Java resizes arrays using an “intrinsic” that has been hand tuned.
So we can expect go to get much faster in the future then. That’s good!