Go is a new programming language invented by Google engineers. Apparently, it came about because they were tired to wait for their C++ code to compile.

To run Go programs, you need to compile them. However, compilation is so amazingly fast that Go may as well be an interpreted language like Python or JavaScript.

Compared to C++ or Java, Go is a very simple language. I am no expert, but I learned the basics in a few hours. This means that, even though I have only been programming in Go for a few days, I can read most Go programs and instantly understand them. In contrast, even though I have 15 years of experience in C++, I often cannot understand C++ code without effort.

In theory, Go can offer the performance of C++. In practice, Go has performance superior to Python and JavaScript, but far inferior to C++ and Java.

My friend Will Fitzgerald wrote what has become a reference implementation of the bitset data structure in Go. I felt that it could be faster. Will was nice enough to let me contribute some enhancements to his bitset library.

Go makes it easy, or even trivial, to call C functions. So I thought optimizing the code was a simple matter of rewriting the performance sensitive parts in C. Let us see how it worked out.

Will’s bitset implementation is an array of 64-bit integers were some of the bits are set to 1 and others to 0. An expensive operation is to count the number of bits set to 1. It is implemented as a tight loop which repeatedly calls a Hamming weight function written in Go (popcount_2):

for _, word := range b.set {
	cnt += popcount_2(word)
}

The popcount_2 isn’t exactly free:

func popcount_2(x uint64) uint64 {
	x -= (x >> 1) & m1             
	x = (x & m2) + ((x >> 2) & m2) 
	x = (x + (x >> 4)) & m4        
	x += x >> 8                    
	x += x >> 16                   
	x += x >> 32                   
	return x & 0x7f
}

In C, when using GCC-like compiler, we would simply call an intrinsic (__builtin_popcountl). Presumably, it is as fast or faster than anything we can come up with. Go makes it really easy to call a C function:

C.__builtin_popcountl(C.ulong(word)))

Alternatively, we can write the entire function in C and call it from Go:

unsigned int totalpop(void * v, int n) {
    unsigned long * x = (unsigned long *) v;
    unsigned int a = 0;
    int k = 0;
    for(; k < n ; ++k) 
        a+= __builtin_popcountl(x[k]);
    return a;
}

So how do these three alternatives compare? I created a small benchmark (that you will find in the bitset package) and tested all three alternatives.

    microseconds/operation 
Pure Go 12
Calling __builtin_popcountl 130
Entire rewrite in C 9

So while rewriting the entire function in Go helped, repeatedly calling __builtin_popcountl made things much worse. This reflects the fact that calling a C function from Go is expensive. In contrast, calling a C function from C++ is dirt cheap.

Beside the high cost of calling C function, I have also been unable to call SSE intrinsics in Go while using the default Go compiler. Indeed, the function I would really like to call, when it is available, is _lzcnt_u64. I am not sure that it is possible to do so by default.

Conclusion: Go is a wonderful language. In theory, you could ensure that it is fast by calling C functions when needed. In practice, the overhead of C function calls in Go is likely to be a bottleneck when working with functions that run for less than a millisecond. You either have to rewrite your entire functions in C, or live with the poor performance.

18 Comments

  1. There’s a decent amount of overhead to using CGO. It’s definitely not meant for small, quick functions. You can actually see all the intermediate C wrapper code that’s generated by passing the -work flag to “go build”.

    I created a gist for a simple CGO invocation and copied all the files from the work directory:

    https://gist.github.com/benbjohnson/f33106b86036bcc72ba7

    I have had a lot of success with calling into CGO functions that perform all their work without context switching. Go structs also seem to work well when passed to C (as long as they’re not GC’d) and I’ve used some crazy mixes of Go+C+LLVM and it all works fine.

    Comment by Ben Johnson — 14/2/2014 @ 10:58

  2. An alternative for small functions is to use the Go C compiler or even the assembler. Once you get past the slightly quirky coding, it’s quite ‘nice’. Dave Cheney did a simple intro blog on the subject if I remember correctly. Also the pkg source has a few implementations. Look for .goc files and their .go callers.

    Comment by Tony Wilson — 14/2/2014 @ 12:14

  3. @Tony

    I’d be very interested if anyone could improve Will’s bitset class. Can you do better than I did?

    Comment by Daniel Lemire — 14/2/2014 @ 12:39

  4. I’ll give it a Go – just don’t hold your breath; I’m busy for the next few days.

    Comment by Tony Wilson — 14/2/2014 @ 14:10

  5. Going from 12usec to 9usec would be a 33% improvement. However, I found the time of my whole program to be dominated by GC.

    While Will’s library offers InPlace functions, to use them with a reused buffer requires copying which lead to a slower overall program.

    Finally, by not supporting dynamic bitset growth, I was able to avoid all the work that Willf’s library performs when doing Set() and Test(). This trades code flexibility for performance, but has not been a problem in my use-cases.

    Comment by Aaron Blohowiak — 14/2/2014 @ 17:12

  6. The only reason a C++ compiler would outperform Go in this micro-benchmark, and it is a micro-benchmark, is that the built-in compiles to a single CPU instruction on contemporary CPUs. Go doesn’t have built-ins like that. Also, because your C++ code relies on built-ins, it’s actually inlining and unrolling the loop, which Go’s optimizer isn’t doing on its own yet.

    I haven’t tested it, but I’m willing to bet that you’ll get faster if you inline the popcount_2 function, and you’ll get faster still if you unroll the loop manually. The former removes the relatively expensive subroutine call overhead, and the latter amortizes the looping overhead.

    Comment by Munch — 14/2/2014 @ 18:30

  7. Actually, I just tested myself — and surprisingly, it makes only a small difference. I was able to reduce my time/op to 11.8us, which is an improvement, but not terribly significant. It’s also very finicky, and noise from multitasking swamps the improvement.

    Scratch that idea. :)

    Comment by Munch — 14/2/2014 @ 19:04

  8. Sorry; baseline on my machine was 13.3us/op. Missed that important info. :)

    Comment by Munch — 14/2/2014 @ 19:05

  9. @aaron

    Finally, by not supporting dynamic bitset growth, I was able to avoid all the work that Willf’s library performs when doing Set() and Test(). This trades code flexibility for performance, but has not been a problem in my use-cases.

    I am not sure you are gaining anything at all by trading away this flexibility. I would like to see a benchmark.

    The thing is… the software has to check whether the index is in range. What your code probably does is just fail when it is out of range. Will expands the underlying array.

    It is possible that your version if faster, but there is no reason, in principle, that it should be measurably faster.

    Comment by Daniel Lemire — 14/2/2014 @ 22:23

  10. @Munch

    Go doesn’t have built-ins like that.

    And it is a shame. It would take a few hours for the language authors to implement it and it could instantly speed up many algorithms.

    Comment by Daniel Lemire — 14/2/2014 @ 22:26

  11. At last… My tuppence worth can be found at github.com/tHinqa/bitset

    Comment by Tony Wilson — 18/2/2014 @ 20:03

  12. Added the forgotten clz fallback code and tests. Not sure he clz 0 also undefined for the LZCNT cpu instruction but it shouldn’t be.

    Comment by Tony Wilson — 19/2/2014 @ 7:08

  13. he=is Damn autocorrect

    Comment by Tony Wilson — 19/2/2014 @ 7:11

  14. if Damn human error

    Comment by Tony Wilson — 19/2/2014 @ 7:13

  15. Just found that CPUID code is missing a MOVL $1!,AX at the beginning. Will fix and comment again – sorry about that

    Comment by Tony Wilson — 19/2/2014 @ 17:27

  16. Fixed

    Comment by Tony Wilson — 19/2/2014 @ 21:41

  17. POPCNT instruction tested using software.intel.com/en-us/articles/intel-software-development-emulator that I came across catching up on my freshmeat/freecode feed. Nice tool. Allows you to emulate any Intel CPU. Slow though. 365 micros for POPCNT 57 for hamming. Avoid 6.2 if windows; it’s broken. Get the 2013-11 ver.

    Ok. I think that’s my QED to leave the rest to people with real machines. It’s been interesting and I’m considering a package of intrinsics for Go now.

    Comment by Tony Wilson — 20/2/2014 @ 9:22

  18. C++ is a strong incumbent, but I think Go will gradually get there in terms of raw performance.

    What is more interesting and surprising is that Go has ended up eating share from Java and even Python.

    Comment by Vivek Haldar — 22/2/2014 @ 16:28

Sorry, the comment form is closed at this time.

« Blog's main page

Powered by WordPress