Performance overhead when calling assembly from Go

The Go language allows you to call C functions and to rewrite entire functions in assembly. As I have previously documented, calling C functions from Go comes with a significant overhead. It still makes sense, but only for sizeable functions, or when performance is irrelevant.

What about functions written in assembly? To illustrate the performance constraints, I am going to use an example designed by Jason Aten.

Recent Intel processors have an instruction (tzcnt) that counts the “number of trailing zeroes” of an integer. That is, given a non-zero unsigned integer, you count the number of consecutive zeros starting from the least significant bits. For example, all odd integers have no trailing zero, all integers divisible by 4 have at least two trailing zeros and so forth. You might reasonably wonder why anyone would care about such an operation… it is often used to iterate over the 1-bit in a word. It is useful in data compression, indexing and cryptography.

A fast way to compute the number of trailing zeros without special instructions is as follows… (see Leiserson and Prokop, Using de Bruijn Sequences to Index a 1 in a Computer Word, 1998)

table[((x&-x)*0x03f79d71b4ca8b09)>>58]

where table is a short array of bytes that fits in a cache line…

var table = []byte{
	0, 1, 56, 2, 57, 49, 28, 3, 61,
        58, 42, 50, 38, 29, 17, 4, 62, 
        47, 59, 36, 45, 43, 51, 22, 53,
        39, 33, 30, 24, 18, 12, 5, 63, 
        55, 48, 27, 60, 41, 37, 16, 46,
        35, 44, 21, 52, 32, 23, 11,54,
        26, 40, 15, 34, 20, 31, 10, 25,
        14, 19, 9, 13, 8, 7, 6,
}

Such a function is going to be fast, using only a handful of machine instructions and running in a handful of cycles. Still, Intel’s tzcnt instruction is superior, as it is a single instruction of a cost comparable to a single multiplication. Roughly speaking, we could expect tzcnt to be twice as fast.

Go can call tzcnt through a function written in assembly. So it seems that Go can easily make use of such instructions. Sadly no. Based on Aten’s code, I designed a benchmark. I am using a test server with a Skylake processor running at a flat 3.4 GHz, and my benchmark measures the instruction throughput. I think that the results speak for themselves:

pure Go (de Bruijn)3.55 cycles/call
assembly11.5 cycles/call

In this instance, the function that calls tzcnt (and does little else) runs at nearly half the speed of the pure Go function. Evidently, Go does not take the assembly and inline it.

Programmers have asked the Go authors to inline assembly calls, but there seems to be little support from the core Go team for such an approach.

My point is not that you can’t accelerate Go code using functions written in assembly but rather that if the function is tiny, the function-call overhead will make the performance worse. So you will be better off using pure Go.

The source code is available.

4 thoughts on “Performance overhead when calling assembly from Go”

  1. What “more modern” languages have you evaluated with respect to their overhead of calling C or assembly?
    So Go comes at some overhead, what about Rust, Julia, Swift? These should come at zero overhead, as they use LLVM?
    Vala? As it apparently is compiled ‘into’ C code, there shouldn’t be much overhead.
    Dart? Apparently you can use SIMD in Dart somehow. Probably not when compiling into JavaScript though.
    I have always been wondering what language to use for my next project; as I have always been hitting the limits of Java. Go and Rust have been two candidates to learn.

    1. So Go comes at some overhead, what about Rust, Julia, Swift? These should come at zero overhead, as they use LLVM?

      Swift can call C without overhead…

      http://lemire.me/blog/2016/09/29/can-swift-code-call-c-code-without-overhead/

      I don’t know about Julia and Rust.

      Go and Rust have been two candidates to learn.

      Go can be learned quickly. So there is that.

      Swift is more challenging but a lot of fun, and certainly more interesting than Java.

      I don’t know a lot about Rust. Looked ok at a glance.

  2. Note that while tzcnt is “new,” it is merely a variant of the good old bsf instruction with some extra behaviour. bsf was introduced with the 80386.

Leave a Reply

Your email address will not be published. Required fields are marked *