Measuring your system’s performance using software (Go edition)

When programming software, we work with an abstraction of a system. The computer hardware may not know about your functions, your variables, and your data; it may only see bits and instructions. Yet to write efficient software, the programmer needs to be aware of the characteristics of the underlying system. Thankfully, we can use software itself to observe the behavior of the system through experiments.

Between the software and the hardware, there are several layers, such as the compiler and the operating system. A good programmer should take these layers into account when needed. A good programmer must also understand the behavior of their software in terms of these layers.

Benchmarks in Go

To measure the performance, we often measure the time required to execute some function. Because most functions are fast, it can be difficult to precisely measure the time that a function takes if we run it just once. Instead, we can run the function many times and record the total time. We can then divide the total time by the number of executions. It can be difficult to decide how many times we should execute the function: it depends in part on how fast the function is. If a function takes 6 seconds to run, we may not want or need to run it too often. An easier strategy is to specify a minimum duration and repeatedly call the function until we reach or exceed the minimum duration.
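
Here is a minimal sketch of this strategy; the timeIt helper and the 100 ms threshold are arbitrary choices for illustration, not part of any standard API:

package main

import (
    "fmt"
    "time"
)

// timeIt calls f repeatedly until at least minDuration has elapsed,
// then returns the average time per call.
func timeIt(f func(), minDuration time.Duration) time.Duration {
    count := 0
    start := time.Now()
    for time.Since(start) < minDuration {
        f()
        count++
    }
    return time.Since(start) / time.Duration(count)
}

func main() {
    work := func() {
        fact := 1
        for i := 1; i <= 10; i++ {
            fact *= i
        }
        _ = fact // keep the result so the work is not trivially discarded
    }
    fmt.Println("average per call:", timeIt(work, 100*time.Millisecond))
}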

When the function has a short execution time, we often call the benchmark a microbenchmark. We use microbenchmarks to compare different implementations of the same functionality or to better understand the system or the problem. We should always keep in mind that a microbenchmark alone cannot be used to justify a software optimization. Real-world performance depends on multiple factors that are difficult to represent in a microbenchmark.

Importantly, all benchmarks are affected by measurement errors, and by interference from the system. To make matters worse, the distribution of timings may not follow a normal distribution.

All programming languages provide the ability to run benchmarks. In Go, the tools make it easy to write benchmarks: you import the testing package and create a function whose name starts with Benchmark and that takes a single parameter of pointer type *testing.B. For example, the following program benchmarks the time required to compute the factorial of 10 as an integer:

package main

import (
    "fmt"
    "testing"
)

var fact int

func BenchmarkFactorial(b *testing.B) {
    for n := 0; n < b.N; n++ {
        fact = 1
        for i := 1; i <= 10; i++ {
            fact *= i
        }
    }
}

func main() {
    res := testing.Benchmark(BenchmarkFactorial)
    fmt.Println("BenchmarkFactorial", res)
}

If you put functions with such a signature (BenchmarkSomething(b *testing.B)) in the test files of a project, you can run them with the command go test -bench . (the final dot is a pattern matching all benchmarks). To run just one of them, you can specify a pattern such as go test -bench Factorial, which would only run benchmark functions whose names contain the word Factorial.

The b.N field indicates the number of iterations the benchmark loop should execute. The testing package adjusts this value, increasing it until the benchmark runs for at least one second by default.
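
You can change this minimum duration, or request allocation statistics, from the command line; for example (the -benchtime and -benchmem flags are standard go test flags):

$ go test -bench . -benchtime 3s -benchmem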

Measuring memory allocations

In Go, each function call has its own 'stack memory'. As the name suggests, stack memory is allocated and deallocated in a last-in, first-out (LIFO) order. This memory is typically only usable within the function, and it is often limited in size. The other type of memory that a Go program may use is heap memory. Heap memory is allocated and deallocated in an arbitrary order. There is only one heap, shared by all functions.

With the stack memory, there is no risk that the memory may get lost or misused since it belongs to a specific function and can be reclaimed at the end of the function. Heap memory is more of a problem: it is sometimes unclear when the memory should be reclaimed. Programming languages like Go rely on a garbage collector to solve this problem. For example, when we create a new slice with the make function, we do not need to worry about reclaiming the memory. Go automatically reclaims it. However, it may still be bad for performance to constantly allocate and deallocate memory. In many real-world systems, memory management becomes a performance bottleneck.

Thus it is sometimes interesting to include memory usage as part of the benchmark. The Go testing package allows you to measure the number of heap allocations made. Typically, in Go, it roughly corresponds to the number of calls to make and to the number of objects that the garbage collector must handle. The following extended program computes the factorial by storing its computations in dynamically allocated slices:

package main

import (
    "fmt"
    "testing"
)

var fact int

func BenchmarkFactorial(b *testing.B) {
    for n := 0; n < b.N; n++ {
        fact = 1
        for i := 1; i <= 10; i++ {
            fact *= i
        }
    }
}
func BenchmarkFactorialBuffer(b *testing.B) {
    for n := 0; n < b.N; n++ {
        buffer := make([]int, 11)
        buffer[0] = 1
        for i := 1; i <= 10; i++ {
            buffer[i] = i * buffer[i-1]
        }
    }
    b.ReportAllocs()
}

func BenchmarkFactorialBufferLarge(b *testing.B) {
    for n := 0; n < b.N; n++ {
        buffer := make([]int, 100001)
        buffer[0] = 1
        for i := 1; i <= 100000; i++ {
            buffer[i] = i * buffer[i-1]
        }
    }
    b.ReportAllocs()
}

func main() {
    res := testing.Benchmark(BenchmarkFactorial)
    fmt.Println("BenchmarkFactorial", res)
    resmem := testing.Benchmark(BenchmarkFactorialBuffer)
    fmt.Println("BenchmarkFactorialBuffer", resmem, resmem.MemString())
    resmem = testing.Benchmark(BenchmarkFactorialBufferLarge)
    fmt.Println("BenchmarkFactorialBufferLarge", resmem, resmem.MemString())
}

If you run such a Go program, you might get the following result:

BenchmarkFactorial 90887572             14.10 ns/op
BenchmarkFactorialBuffer 88609930               11.96 ns/op        0 B/op              0 allocs/op
BenchmarkFactorialBufferLarge     4408      249263 ns/op   802816 B/op         1 allocs/op

The last function allocates 802816 bytes per operation, unlike the first two. In this instance, if Go determines that data is not referenced after the function returns (an analysis called 'escape analysis'), and if the amount of memory used is sufficiently small, it avoids allocating the memory on the heap, preferring stack memory instead. In the case of the last function, the memory usage is too high, hence there is allocation on the heap rather than the stack.
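
You can ask the compiler to report its escape-analysis decisions with the flag -gcflags=-m (we use the same flag again when discussing inlining). Assuming the program above is saved as factorial.go, you might see diagnostics of this general form (the file name and line numbers are illustrative):

$ go build -gcflags=-m factorial.go
./factorial.go:19:17: make([]int, 11) does not escape
./factorial.go:30:17: make([]int, 100001) escapes to heap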

Measuring memory usage

Your operating system provides memory to a running process in units of pages. The operating system cannot provide memory in smaller units than a page. Thus when you allocate memory in a program, it may either cost no additional pages, if enough pages have already been granted, or it may force the operating system to provide more pages.

The size of a page depends on the operating system and its configuration. It can often vary between 4 kilobytes and 16 kilobytes although much larger pages are also possible (e.g., 1 gigabyte).

A page is a contiguous array of virtual memory addresses. A page may also represent actual physical memory. However, operating systems tend to map only the used pages to physical memory. An operating system may provide a nearly endless supply of pages to a process without ever mapping them to physical memory. Thus it is not simple to ask how much memory a program uses. A program may appear to use a lot of (virtual) memory while not using much physical memory, and conversely.
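
You can observe this distinction from outside the program. On Linux and macOS, for instance, the ps utility reports both the virtual size (VSZ) and the resident set size (RSS, the physical memory actually mapped) of a process; here myprogram is a placeholder for your process name:

$ ps -o vsz,rss -p $(pgrep myprogram)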

The page size impacts both the performance and the memory usage. Allocating pages to a process is not free: it takes some effort. Among other things, the operating system cannot just reuse a memory page from another process as is. Doing so would be a security threat, because you could have indirect access to the data stored in memory by another process. This other process could have held your passwords or other sensitive information in memory. Typically, an operating system has to initialize (e.g., set to zero) a newly assigned page. Furthermore, mapping the pages to their actual physical memory also takes time. To accelerate the mapping process, modern systems use a translation lookaside buffer to keep recently used mappings in cache. When the translation lookaside buffer runs out of space, the page mapping must be recomputed by walking the page tables, an expensive process. Thus large pages may improve the performance of some programs. However, large pages force the operating system to provide memory in larger chunks to a process, potentially wasting precious memory. You can write a Go program which prints out the page size of your system:

package main

import (
    "fmt"
    "os"
)

func main() {
    pageSize := os.Getpagesize()
    fmt.Printf("Page size: %d bytes (%d KB)\n", pageSize, pageSize/1024)
}

Go makes it relatively easy to measure the number of pages allocated to a program by the operating system. Nevertheless, some care is needed. Because Go uses a garbage collector to free allocated memory, there might be a delay between the moment you no longer need some memory and the moment it is actually freed. You may force Go to immediately run the garbage collector with the function call runtime.GC(). You should rarely invoke the garbage collector deliberately in practice, but for our purposes (measuring memory usage), it is useful.

There are several memory metrics. In Go, two of the most useful are HeapSys and HeapAlloc. The first indicates how much memory (in bytes) has been given to the program by the operating system. The second, which is typically lower, indicates how much of that memory is actively in use by the program.

The following program allocates ever larger slices, and then ever smaller slices. In theory, the memory usage should first go up, and then go down:

package main

import (
    "fmt"
    "os"
    "runtime"
)

func main() {
    pageSize := os.Getpagesize()
    var m runtime.MemStats
    runtime.GC()
    runtime.ReadMemStats(&m)
    fmt.Printf(
        "HeapSys = %.3f MiB, HeapAlloc =  %.3f MiB,  %.3f pages\n",
        float64(m.HeapSys)/1024.0/1024.0,
        float64(m.HeapAlloc)/1024.0/1024.0,
        float64(m.HeapSys)/float64(pageSize),
    )
    i := 100
    for ; i < 1000000000; i *= 10 {
        runtime.GC()
        s := make([]byte, i)
        runtime.ReadMemStats(&m)
        fmt.Printf(
            "%.3f MiB, HeapSys = %.3f MiB, HeapAlloc =  %.3f MiB,  %.3f pages\n",
            float64(len(s))/1024.0/1024.0,
            float64(m.HeapSys)/1024.0/1024.0,
            float64(m.HeapAlloc)/1024.0/1024.0,
            float64(m.HeapSys)/float64(pageSize),
        )
    }
    for ; i >= 100; i /= 10 {
        runtime.GC()
        s := make([]byte, i)
        runtime.ReadMemStats(&m)
        fmt.Printf(
            "%.3f MiB, HeapSys = %.3f MiB, HeapAlloc =  %.3f MiB,  %.3f pages\n",
            float64(len(s))/1024.0/1024.0,
            float64(m.HeapSys)/1024.0/1024.0,
            float64(m.HeapAlloc)/1024.0/1024.0,
            float64(m.HeapSys)/float64(pageSize),
        )
    }
    runtime.GC()
    runtime.ReadMemStats(&m)
    fmt.Printf(
        "HeapSys = %.3f MiB, HeapAlloc =  %.3f MiB,  %.3f pages\n",
        float64(m.HeapSys)/1024.0/1024.0,
        float64(m.HeapAlloc)/1024.0/1024.0,
        float64(m.HeapSys)/float64(pageSize),
    )
}

The program calls os.Getpagesize() to get the underlying system's memory page size in bytes as an integer, and assigns it to a variable pageSize. It declares a variable m of type runtime.MemStats, a struct that holds various statistics about the memory allocator and the garbage collector. The program repeatedly calls runtime.GC() to manually trigger a garbage collection cycle, which may free memory and make it eligible to be returned to the operating system. It calls runtime.ReadMemStats(&m) to populate the m variable with the current memory statistics; we can reuse the same variable m from call to call. The purpose of this program is to demonstrate how the memory usage of a Go program changes depending on the size and frequency of memory allocations and deallocations, and how the garbage collector and the runtime affect the release of memory. The program prints the memory usage before and after each allocation, showing how the m.HeapSys, m.HeapAlloc, and m.HeapSys / pageSize values grow and shrink accordingly.

If you run this program, you may observe that the program tends to hold on to memory that you have allocated and since released. It is partly a matter of optimization: acquiring memory takes time, and we wish to avoid returning pages to the operating system only to request them again soon after. It illustrates that it can be challenging to determine how much memory a program really uses.

The program may print something like the following:

$ go run mem.go
HeapSys = 3.719 MiB, HeapAlloc =  0.367 MiB,  238.000 pages
0.000 MiB, HeapSys = 3.719 MiB, HeapAlloc =  0.367 MiB,  238.000 pages
0.001 MiB, HeapSys = 3.719 MiB, HeapAlloc =  0.383 MiB,  238.000 pages
0.010 MiB, HeapSys = 3.688 MiB, HeapAlloc =  0.414 MiB,  236.000 pages
0.095 MiB, HeapSys = 3.688 MiB, HeapAlloc =  0.477 MiB,  236.000 pages
0.954 MiB, HeapSys = 3.688 MiB, HeapAlloc =  1.336 MiB,  236.000 pages
9.537 MiB, HeapSys = 15.688 MiB, HeapAlloc =  9.914 MiB,  1004.000 pages
95.367 MiB, HeapSys = 111.688 MiB, HeapAlloc =  95.750 MiB,  7148.000 pages
953.674 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  954.055 MiB,  68332.000 pages
95.367 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  95.750 MiB,  68332.000 pages
9.537 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  9.914 MiB,  68332.000 pages
0.954 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  1.336 MiB,  68332.000 pages
0.095 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  0.477 MiB,  68332.000 pages
0.010 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  0.414 MiB,  68332.000 pages
0.001 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  0.383 MiB,  68332.000 pages
0.000 MiB, HeapSys = 1067.688 MiB, HeapAlloc =  0.375 MiB,  68332.000 pages
HeapSys = 1067.688 MiB, HeapAlloc =  0.375 MiB,  68332.000 pages

Observe how, at the very beginning and at the very end, over a third of a mebibyte of memory is reported as being in use (HeapAlloc), with 238 pages given to the program. Furthermore, over 68,000 pages remain allocated to the program at the very end, even though no data structure remains in scope within the main function.
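
If you want memory returned to the operating system more eagerly, the runtime/debug package provides FreeOSMemory, which forces a garbage collection and attempts to return as much memory to the operating system as possible. Here is a minimal sketch; the 256 MiB allocation is an arbitrary choice, and the figures you observe will depend on your system and Go version:

package main

import (
    "fmt"
    "runtime"
    "runtime/debug"
)

func main() {
    var m runtime.MemStats
    s := make([]byte, 1<<28) // allocate 256 MiB
    s[0] = 1
    s = nil              // drop the only reference to the slice
    debug.FreeOSMemory() // force a GC and try to return memory to the OS
    runtime.ReadMemStats(&m)
    fmt.Printf("HeapSys = %.1f MiB, HeapReleased = %.1f MiB\n",
        float64(m.HeapSys)/1024/1024, float64(m.HeapReleased)/1024/1024)
}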

Inlining

One of the most powerful optimization techniques that a compiler may apply is function inlining: the compiler brings the body of some called functions directly into the calling functions.

Go makes it easy to tell which functions are inlined. We can also easily request that the compiler not inline a function by adding the line //go:noinline right before it.

Let us consider this program, which contains two benchmarks where we sum all odd integers in a range.

package main

import (
    "fmt"
    "testing"
)

func IsOdd(i int) bool {
    return i%2 == 1
}

//go:noinline
func IsOddNoInline(i int) bool {
    return i%2 == 1
}

func BenchmarkCountOddInline(b *testing.B) {
    for n := 0; n < b.N; n++ {
        sum := 0
        for i := 1; i < 1000; i++ {
            if IsOdd(i) {
                sum += i
            }
        }
    }
}

func BenchmarkCountOddNoinline(b *testing.B) {
    for n := 0; n < b.N; n++ {
        sum := 0
        for i := 1; i < 1000; i++ {
            if IsOddNoInline(i) {
                sum += i
            }
        }
    }
}

func main() {
    res1 := testing.Benchmark(BenchmarkCountOddInline)
    fmt.Println("BenchmarkCountOddInline", res1)
    res2 := testing.Benchmark(BenchmarkCountOddNoinline)
    fmt.Println("BenchmarkCountOddNoinline", res2)
}

In Go, the flag -gcflags=-m tells the compiler to report the main optimizations it does. If you call this program simpleinline.go and compile it with the command go build -gcflags=-m simpleinline.go, you may see the following:

$ go build -gcflags=-m simpleinline.go
./simpleinline.go:8:6: can inline IsOdd
./simpleinline.go:21:12: inlining call to IsOdd
...

If you run the benchmark, you should see that the inlined version is much faster:

$ go run simpleinline.go
BenchmarkCountOddInline  3716786           294.6 ns/op
BenchmarkCountOddNoinline  1388792         864.8 ns/op

Inlining is not always beneficial: in some instances, it can generate large binaries and it may even slow down the software. However, when it is applicable, it can have a large beneficial effect.

Go tries as hard as possible to inline functions, but it has limitations. For example, compilers often find it difficult to inline recursive functions. Let us benchmark two factorial functions, one that is recursive, and one that is not.

package main

import (
    "fmt"
    "testing"
)

var array = make([]int, 1000)

func Factorial(n int) int {
    if n < 0 {
        return 0
    }
    if n == 0 {
        return 1
    }
    return n * Factorial(n-1)
}

func FactorialLoop(n int) int {
    result := 1
    for i := 1; i <= n; i++ {
        result *= i
    }
    return result
}

func BenchmarkFillNoinline(b *testing.B) {
    for n := 0; n < b.N; n++ {
        for i := 1; i < 1000; i++ {
            array[i] = Factorial(i)
        }
    }
}

func BenchmarkFillInline(b *testing.B) {
    for n := 0; n < b.N; n++ {
        for i := 1; i < 1000; i++ {
            array[i] = FactorialLoop(i)
        }
    }
}

func main() {
    res1 := testing.Benchmark(BenchmarkFillNoinline)
    fmt.Println("BenchmarkFillNoinline", res1)
    res2 := testing.Benchmark(BenchmarkFillInline)
    fmt.Println("BenchmarkFillInline", res2)
    fmt.Println(float64(res1.NsPerOp()) / float64(res2.NsPerOp()))
}

Though FactorialLoop and Factorial are equivalent, if you run this program, you should find that the non-recursive function (FactorialLoop) is much faster.
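
You can confirm the compiler's decision with the -gcflags=-m flag introduced earlier; for a recursive function, the compiler typically reports that inlining is impossible (the file name and line number are illustrative):

$ go build -gcflags=-m factorial.go
./factorial.go:10:6: cannot inline Factorial: recursive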

Cache line

Our computers read and write memory using small blocks of memory called "cache lines". The cache line size is usually fixed and small (e.g., 64 or 128 bytes). To attempt to measure the cache-line size, we may use a strided copy: from a large array, we copy every Nth byte to another large array, and we repeat this process N times. Thus if the original array contains 1024 bytes, we always copy 1024 bytes in total, whether N = 1, N = 2, N = 4, or N = 8.

When N is sufficiently large (say N = 16), the problem should be essentially memory bound: the performance is not limited by the number of instructions, but by the system's ability to load and store cache lines. If N is larger than twice the cache line, we can effectively skip one cache line out of two. If N is smaller than the cache line, then every cache line must be accessed. We therefore expect a sufficiently large stride to be significantly faster.

One limitation to this approach is that processors may fetch more cache lines than needed so we may overestimate the size of the cache line. However, unless memory bandwidth is overly abundant, we should expect processors to try to limit the number of cache lines fetched.

Let us run an experiment. For each stride size, we repeat 10 times and record the maximum, the minimum and the average. Consider the following program.

package main

import (
    "fmt"
    "time"
)

const size = 33554432 // 32 MB
func Cpy(arr1 []uint8, arr2 []uint8, slice int) {
    for i := 0; i < len(arr1); i += slice {
        arr2[i] = arr1[i]
    }
}

func AverageMinMax(f func() float64) (float64, float64, float64) {
    var sum float64
    var minimum float64
    var maximum float64

    for i := 0; i < 10; i++ {
        arr1 = make([]uint8, size)
        arr2 = make([]uint8, size)

        v := f()
        sum += v
        if i == 0 || v < minimum {
            minimum = v
        }
        if i == 0 || v > maximum {
            maximum = v
        }
    }
    return sum / 10, minimum, maximum
}

var arr1 []uint8
var arr2 []uint8

func run(size int, slice int) float64 {
    start := time.Now()
    times := 10
    for i := 0; i < times*slice; i++ {
        Cpy(arr1, arr2, slice)
    }
    end := time.Now()
    dur := float64(end.Sub(start)) / float64(times*slice)
    return dur
}

func main() {
    for slice := 16; slice <= 4096; slice *= 2 {
        a, m, M := AverageMinMax(func() float64 { return run(size, slice-1) })
        fmt.Printf("%10d: %10.1f GB/s [%4.1f - %4.1f]\n", slice-1, float64(size)/a, float64(size)/M, float64(size)/m)
    }
}

We may get the following result:

$ go run cacheline.go
        15:       23.6 GB/s [21.3 - 24.4]
        31:       24.3 GB/s [23.8 - 24.5]
        63:       24.2 GB/s [23.6 - 24.6]
       127:       26.9 GB/s [23.8 - 27.9]
       255:       40.8 GB/s [37.8 - 43.6]
       511:      162.0 GB/s [130.4 - 203.4]
      1023:      710.0 GB/s [652.0 - 744.4]
      2047:      976.1 GB/s [967.1 - 983.8]
      4095:     1247.4 GB/s [1147.7 - 1267.0]

We see that the performance increases substantially when the stride goes from 127 to 255. It suggests that the cache line spans 128 bytes. If you run this same benchmark on your own system, you may get a different result.

The results need to be interpreted with care: we are not measuring a copy speed of 1247.4 GB/s. Rather, we can copy large arrays at such a speed if we only copy one byte out of every 4095 bytes.
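
You can often cross-check such an empirical estimate against what the operating system reports; the exact command depends on your system, for example:

$ getconf LEVEL1_DCACHE_LINESIZE   # Linux
$ sysctl hw.cachelinesize          # macOS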

CPU Cache

When programming, we often do not think directly about memory. When we do consider that our data uses memory, we often think of it as homogeneous: memory is like a large uniform canvas upon which the computer writes and reads its data. However, your main memory (RAM) is typically buffered using a small amount of memory that resides close to the processor core (the CPU cache). We often have several layers of cache memory (e.g., L1, L2, L3): L1 is typically small but very fast whereas, for example, L3 is larger but slower.

You can empirically measure the effect of the cache. If you take a small array and shuffle it randomly, you will be moving data primarily within the CPU cache, which is fast. If you take a larger array, you will move data in memory without much help from the cache, a process that is much slower. Thus shuffling ever larger arrays is a way to determine the size of your cache. It may prove difficult to tell exactly how many layers of cache you have and how large each layer is. However, you can usually tell when your array is significantly larger than the CPU cache.

We are going to write a random shuffle function: Shuffle(arr []uint32). It uses the Fisher-Yates shuffle, which involves going through the array in reverse and swapping each element with another randomly chosen from those preceding it. The function uses a seed variable to generate random numbers from a simple mathematical formula: at each step, we update the seed and multiply it by i+1. The function bits.Mul64 calculates the product of two 64-bit numbers and returns the result as two 64-bit values: the most significant bits (hi) and the least significant bits. Because one factor is i+1, the most significant value is necessarily between 0 and i (inclusively). We use this most significant value as the random index. The function then exchanges the elements using multiple assignment. We call this shuffle function several times, on inputs of different sizes, and report the time normalized by the size of the input.

package main

import (
    "fmt"
    "math/bits"
    "time"
)

func Shuffle(arr []uint32) {
    seed := uint64(1234)
    for i := len(arr) - 1; i > 0; i-- {
        seed += 0x9E3779B97F4A7C15
        hi, _ := bits.Mul64(seed, uint64(i+1))
        j := int(hi)
        arr[i], arr[j] = arr[j], arr[i]
    }
}

func AverageMinMax(f func() float64) (float64, float64, float64) {
    var sum float64
    var minimum float64
    var maximum float64

    for i := 0; i < 10; i++ {
        v := f()
        sum += v
        if i == 0 || v < minimum {
            minimum = v
        }
        if i == 0 || v > maximum {
            maximum = v
        }
    }
    return sum / 10, minimum, maximum
}

func run(size int) float64 {
    arr := make([]uint32, size)

    for i := range arr {
        arr[i] = uint32(i + 1)
    }
    start := time.Now()
    end := time.Now()
    times := 0
    for ; end.Sub(start) < 100_000_000; times++ {
        Shuffle(arr)
        end = time.Now()
    }
    dur := float64(end.Sub(start)) / float64(times)
    return dur / float64(size)
}

func main() {
    for size := 4096; size <= 33554432; size *= 2 {
        fmt.Printf("%20d KB ", size/1024*4)
        a, m, M := AverageMinMax(func() float64 { return run(size) })
        fmt.Printf(" %.2f [%.2f, %.2f]\n", a, m, M)
    }
}

A possible output of running this program might be:

$ go run cache.go
                  16 KB  0.70 [0.66, 0.93]
                  32 KB  0.65 [0.64, 0.66]
                  64 KB  0.64 [0.64, 0.66]
                 128 KB  0.64 [0.64, 0.67]
                 256 KB  0.65 [0.64, 0.66]
                 512 KB  0.70 [0.70, 0.71]
                1024 KB  0.77 [0.76, 0.79]
                2048 KB  0.83 [0.82, 0.84]
                4096 KB  0.87 [0.86, 0.90]
                8192 KB  0.92 [0.91, 0.95]
               16384 KB  1.10 [1.06, 1.24]
               32768 KB  2.34 [2.28, 2.52]
               65536 KB  3.90 [3.70, 4.25]
              131072 KB  5.66 [4.80, 9.78]

We see that between 16 KB and 16384 KB, the time per element does not increase much, even though we repeatedly double the input size. However, between 16384 KB and 32768 KB, the time per element doubles, and it keeps rising as the array gets larger. It suggests that the size of the CPU cache is about 16384 KB in this instance.
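
As with the cache line, you may be able to cross-check this estimate against what your operating system reports; the exact commands vary by system, for example:

$ lscpu | grep -i cache          # Linux
$ sysctl -a | grep cachesize     # macOS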

Memory bandwidth

You can only read and write memory up to a maximal speed. Such limits can be difficult to measure. In particular, you may need several cores in a multi-core system to achieve the best possible memory bandwidth. For simplicity, let us consider the maximal read bandwidth.

Many large systems do not have a single bandwidth number. For example, many large systems rely on NUMA: NUMA stands for Non-Uniform Memory Access. In a NUMA system, each processor has its own local memory, which it can access faster than the memory of other processors.

The bandwidth also depends to some extent on the amount of memory requested. If the memory fits in CPU cache, only the first access may be expensive. A very large memory region may not fit in RAM and may require disk storage. Even if it fits in RAM, an overly large memory region might require many memory pages, and accessing all of them may cause page walking due to the limits of the translation lookaside buffer.

If the memory is accessed at random locations, it might be difficult for the system to sustain a maximal bandwidth because the system cannot predict easily where the next memory load occurs. To get the best bandwidth, you may want to access the memory linearly or according to some predictable pattern.

Let us consider the following code:

package main

import (
    "fmt"
    "time"
)

func run() float64 {
    bestbandwidth := 0.0
    arr := make([]uint8, 2*1024*1024*1024) // 2 GB
    for i := 0; i < len(arr); i++ {
        arr[i] = 1
    }
    for t := 0; t < 20; t++ {
        start := time.Now()
        acc := 0
        for i := 0; i < len(arr); i += 64 {
            acc += int(arr[i])
        }
        end := time.Now()
        if acc != len(arr)/64 {
            panic("!!!")
        }
        bandwidth := float64(len(arr)) / end.Sub(start).Seconds() / 1024 / 1024 / 1024
        if bandwidth > bestbandwidth {
            bestbandwidth = bandwidth
        }
    }
    return bestbandwidth
}

func main() {
    for i := 0; i < 10; i++ {
        fmt.Printf(" %.2f GB/s\n", run())
    }
}

The code defines two functions: run and main. The main function is the entry point for the program, and it calls the run function 10 times, printing the result each time. The run function is a custom function that measures the memory bandwidth of the system. It does this by performing the following steps:

  • It declares a variable called bestbandwidth and initializes it to 0.0. This variable stores the highest bandwidth value obtained during the execution of the function.
  • It creates a slice of bytes (uint8) called arr, with a length equivalent to 2 GB, and initializes it with 1s.
  • The measurement loop only accesses every 64th element of the slice, skipping the rest. Given that most systems have a cache-line size of 64 bytes or more, it is enough to touch each cache line.
  • It calculates the bandwidth by dividing the size of the slice (in bytes) by the difference between the end and start times (in seconds), and then dividing by 1024 three times to convert the result to gigabytes per second (GB/s).
  • The code repeats the measurement 20 times and returns the best result, to account for possible variations in system performance. The main function prints the result 10 times, to show the consistency of the measurement.
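
As noted earlier, a single core often cannot saturate the memory subsystem. Below is a sketch of a multi-goroutine variant of the same measurement; the worker count (4) and the 1 GiB array size are arbitrary choices, and in practice you would also repeat the measurement and keep the best result, as above:

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    const workers = 4
    arr := make([]uint8, 1<<30) // 1 GiB, divisible by the worker count
    for i := range arr {
        arr[i] = 1
    }
    chunk := len(arr) / workers
    start := time.Now()
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func(part []uint8) {
            defer wg.Done()
            acc := 0
            // touch one byte per 64-byte cache line
            for i := 0; i < len(part); i += 64 {
                acc += int(part[i])
            }
            _ = acc
        }(arr[w*chunk : (w+1)*chunk])
    }
    wg.Wait()
    seconds := time.Since(start).Seconds()
    fmt.Printf("%.2f GB/s\n", float64(len(arr))/seconds/1024/1024/1024)
}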

Memory latency and parallelism

Latency is often described as the time delay between the beginning of a request and the moment when you are served. Thus if you go to a restaurant, the latency you might be interested in is the time it will take before you can start eating. Latency is distinct from throughput: a restaurant might be able to serve hundreds of customers at once, but still have high latency (long delays for each customer). If you put a lot of data on a very large disk, you can put this disk in a truck and drive the truck between two cities. It could represent a large bandwidth (much data is moved per unit of time), but the latency could be quite poor (hours). Similarly, you could shine a laser at your partner when supper is ready: the information could arrive without much delay even if you are very far away, but you are communicating little information (low throughput).

One way to express this trade-off between latency and throughput is with Little's Law: L = λW where L is the average number of elements in the system, λ is the throughput (long-term average arrival rate of new elements), and W is the latency, or the average amount of time that elements spend waiting. Thus if you want to have L customers at all times in your restaurant, and fewer customers arrive, you should serve the customers with greater delays. And so forth. Little's Law works for our memory subsystems as well: computers can sustain a maximum number of concurrent memory requests, each memory request has a latency, and there is an overall bandwidth. For example, if each request takes 100 ns and the system can keep 10 requests in flight, then on average one request completes every 10 ns. If latency does not improve, we can still improve bandwidth or throughput by increasing the number of requests that can be sustained concurrently. Unfortunately, system designers are often forced to make this choice, and so it is not uncommon to see stagnant or worsening memory latencies despite fast-improving memory bandwidths.

A common illustration of the concept of memory latency is the traversal of a linked list. In computer science, a linked list is a data structure made of nodes, and each node is linked (by a pointer) to the next node. The nodes may not be laid out in memory consecutively, but even if they are, accessing each successive node requires at least a small delay. On current processors, it can often take at least 3 cycles to load data from memory, even if the memory is in cache. Thus determining the length of the list by traversing the whole linked list can take time, and most of this time consists of these successive delays. The following code benchmarks the time required to traverse a linked list made of a million nodes. Though the time varies depending on your system, it may represent a sizeable fraction of a millisecond.

package main

import (
    "fmt"
    "testing"
)

type Node struct {
    data int
    next *Node
}

func build(volume int) *Node {
    var head *Node
    for i := 0; i < volume; i++ {
        head = &Node{i, head}
    }
    return head
}

var list *Node
var N int

func BenchmarkLen(b *testing.B) {
    for n := 0; n < b.N; n++ {
        len := 0
        for p := list; p != nil; p = p.next {
            len++
        }
        if len != N {
            b.Fatalf("invalid length: %d", len)
        }
    }
}

func main() {
    N = 1000000
    list = build(N)
    res := testing.Benchmark(BenchmarkLen)
    fmt.Println("milliseconds: ", float64(res.NsPerOp())/1e6)

    fmt.Println("nanoseconds per el.", float64(res.NsPerOp())/float64(N))
}

In this code, a Node struct is defined with two fields: data is an integer representing the value stored in the node, and next is a pointer to the next node in the linked list. We could also add a pointer to the previous node, but that is not necessary in our case.

The build function creates a singly linked list of nodes from an integer volume given as an argument. It initializes an empty linked list (head is initially nil). It iterates from 0 to volume-1, creating a new node with value i and pointing its next field to the current head. The new node becomes the new head. The function returns the final head of the linked list.

The main function initializes two global variables (list and N) storing respectively the head of the list and the expected length. These values are used by the BenchmarkLen function.

This code demonstrates how to create a linked list, calculate its length, and benchmark the performance of the length calculation. Our length computation is almost entirely bounded (limited) by the memory latency, the time it takes to access the memory. The computations that we are doing (comparisons, increments) are unimportant to the performance. To illustrate our observation, we can try traversing two linked lists simultaneously, as in this example:

package main

import (
    "fmt"
    "testing"
)

type Node struct {
    data int
    next *Node
}

func build(volume int) *Node {
    var head *Node
    for i := 0; i < volume; i++ {
        head = &Node{i, head}
    }
    return head
}

var list1 *Node
var list2 *Node

var N int

func BenchmarkLen(b *testing.B) {
    for n := 0; n < b.N; n++ {
        len := 0
        for p1, p2 := list1, list2; p1 != nil && p2 != nil; p1, p2 = p1.next, p2.next {
            len++
        }
        if len != N {
            b.Fatalf("invalid length: %d", len)
        }
    }
}

func main() {
    N = 1000000
    list1 = build(N)
    list2 = build(N)

    res := testing.Benchmark(BenchmarkLen)
    fmt.Println("milliseconds: ", float64(res.NsPerOp())/1e6)

    fmt.Println("nanoseconds per el.", float64(res.NsPerOp())/float64(N))
}

If you run this new code, you might find that the benchmark results are close to the single-list ones. It is not surprising: the processor is mostly just waiting for the next node, and waiting for two nodes is not much more expensive. For this reason, when programming, you should limit memory accesses as much as possible: use simple arrays when you can instead of linked lists or node-based tree structures.

We would like to work with arbitrarily large data structures, so that we can stress memory accesses outside of the cache. Sattolo's algorithm is a variant of the well-known random shuffle that generates a random cyclic permutation of an array or list. Sattolo's algorithm ensures that the data is permuted using a single cycle: starting with one element in a list of size n, we find that this element is moved to another position, which is itself moved to another position, and so forth, until after n moves, we end up back at our initial position. To apply Sattolo's algorithm to an array of n elements, we go through the indices i from 0 to n-2; for each index i, we choose a random index j such that i < j < n, and we swap the elements at indices i and j. E.g., suppose we have the array [0, 1, 2, 3, 4]. The algorithm might produce the cyclic permutation [2, 0, 3, 4, 1]: starting at index 0, we visit indices 2, 3, 4, and 1 before returning to index 0.

With such a permuted array, we can visit all values exactly once in random order. From an array containing the indexes 0 to n-1 permuted with Sattolo's algorithm, we load the first element, read its value, move to the corresponding index, and so forth. After n operations, we come back to the initial position. Because each operation involves a memory load that depends on the previous one, the traversal is limited by memory latency. We can try to go faster with memory-level parallelism: we pick k positions spread out in the cycle and advance from these k initial positions n/k times through the cycle. Because computers can load many values in parallel, this approach should be faster for larger values of k. However, as k increases, we may see fewer and fewer gains because systems have limited memory-level parallelism and bandwidth. The following program implements this idea.

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// makeCycle creates a cycle of a specified length starting at element 0
func makeCycle(length int) ([]uint64, []uint64) {
    array := make([]uint64, length)
    index := make([]uint64, length)
    // Create a cycle of maximum length within the big array
    for i := 0; i < length; i++ {
        array[i] = uint64(i)
    }

    // Sattolo shuffle
    for i := 0; i+1 < length; i++ {
        swapIdx := rand.Intn(length-i-1) + i + 1
        array[i], array[swapIdx] = array[swapIdx], array[i]
    }

    // Record the sequence of positions visited along the cycle,
    // starting from position 0.
    total := 0
    cur := uint64(0)
    for {
        index[total] = cur
        total++
        cur = array[cur]
        if cur == 0 {
            break
        }
    }
    return array, index
}

// setupPointers sets up pointers based on the given index
func setupPointers(index []uint64, length, mlp int) []uint64 {
    sp := make([]uint64, mlp)
    sp[0] = 0

    totalInc := 0
    for m := 1; m < mlp; m++ {
        totalInc += length / mlp
        sp[m] = index[totalInc]
    }
    return sp
}

func runBench(array []uint64, index []uint64, mlp int) time.Duration {
    length := len(array)
    sp := setupPointers(index, length, mlp)
    hits := length / mlp
    before := time.Now()
    for i := 0; i < hits; i++ {
        for m := 0; m < mlp; m++ {
            sp[m] = array[sp[m]]
        }
    }
    after := time.Now()
    return after.Sub(before)
}

func main() {
    const length = 100000000
    array, index := makeCycle(length)
    fmt.Println("Length:", length*8/1024/1024, "MB")
    base := runBench(array, index, 1)
    fmt.Println("Lanes:", 1, "Time:", base)

    for mlp := 2; mlp <= 40; mlp++ {
        t := runBench(array, index, mlp)
        fmt.Println("Lanes:", mlp, "Speedup:", fmt.Sprintf("%.1f", float64(base)/float64(t)))
    }
}

The function makeCycle creates a cycle of a specified length starting at element 0. It initializes two slices, array and index, both of type []uint64. The array slice represents the elements in the cycle, while the index slice stores the positions visited along the cycle, so that we can easily jump to any point in it. The function initializes array with values from 0 to length-1, applies Sattolo's shuffle to create a random cyclic permutation, records the sequence of visited positions in index, and returns both array and index.

The function setupPointers calculates an increment value (totalInc) based on the length and the number of lanes (mlp), and assigns to sp indices from index that are evenly spread around the cycle.

The function runBench benchmarks the execution time for a given number of lanes (mlp). It initializes a slice sp using setupPointers, then repeatedly updates each pointer in sp by following the indices in array. It measures the execution time and returns it as a time.Duration instance.

The main function first computes the running time for 1 lane, and then reports the gains when using multiple lanes. Overall, this code generates a cycle of specified length, sets up pointers, and benchmarks the execution time for different numbers of lanes. The primary purpose is to explore memory-level parallelism: the runBench function interleaves several independent pointer chases so that many memory loads can be in flight at once. The speedup is calculated by comparing the execution time for different numbers of lanes; the larger the speedup, the more effective the memory-level parallelism. The general principle is that you can often improve the performance of a system that faces high latencies by breaking data dependencies: instead of putting all your data in one long chain, try to have no chain at all or, if you must have chains, use several shorter ones.

Superscalarity and data dependency

Most current processors are superscalar (as opposed to 'scalar'), meaning that they can execute and retire several instructions per CPU cycle. That is, even if you have a single CPU core, there is much parallelism involved. Some processors can retire 8 instructions per cycle or more. Not all code routines benefit equally from superscalar execution. Several factors can limit your processor to few instructions per cycle. Having to wait on memory accesses is one such factor. Another common factor is data dependency: when the next instruction depends on the result of a preceding instruction that has not yet completed, it must wait before it starts executing. To illustrate, consider functions that compute successive differences between elements of an array (e.g., given 5, 7, 6, you might get the initial value 5 followed by 2 and -1), and the reverse operation, which sums up all the differences to recover the original values. You may implement these functions as such:

func successiveDifferences(arr []int) {
    base := arr[0]
    for i := 1; i < len(arr); i++ {
        base, arr[i] = arr[i], arr[i]-base
    }
}

func prefixSum(arr []int) {
    for i := 1; i < len(arr); i++ {
        arr[i] = arr[i] + arr[i-1]
    }
}

Assuming that the compiler does not optimize these functions in a non-trivial manner (e.g., using SIMD instructions), we can reason relatively simply about the performance. For the successive differences, we need approximately one subtraction per element in the array. For the prefix sum, we need approximately one addition per element in the array. The two tasks look quite similar at a glance. However, the data dependencies differ: to compute the difference between any two values in the array, you do not need to have computed the preceding differences, whereas the prefix sum, as we implemented it, requires all preceding sums to have been computed before the next one can be. Let us write a small benchmarking program to test the performance difference:

package main

import (
    "fmt"
    "math/rand"
    "testing"
)

func successiveDifferences(arr []int) {
    base := arr[0]
    for i := 1; i < len(arr); i++ {
        base, arr[i] = arr[i], arr[i]-base
    }
}

func prefixSum(arr []int) {
    for i := 1; i < len(arr); i++ {
        arr[i] = arr[i] + arr[i-1]
    }
}

var array []int

func BenchmarkPrefixSum(b *testing.B) {
    for n := 0; n < b.N; n++ {
        prefixSum(array)
    }
}

func BenchmarkSuccessiveDifferences(b *testing.B) {
    for n := 0; n < b.N; n++ {
        successiveDifferences(array)
    }
}

func main() {
    array = make([]int, 100)
    for i := range array {
        array[i] = rand.Int()
    }
    res2 := testing.Benchmark(BenchmarkSuccessiveDifferences)
    fmt.Println("BenchmarkSuccessiveDifferences", res2)
    res1 := testing.Benchmark(BenchmarkPrefixSum)
    fmt.Println("BenchmarkPrefixSum", res1)

}

Your results will vary depending on your system. However, you should not be surprised if the prefix sum takes more time. On an Apple system, we got the following results:

BenchmarkSuccessiveDifferences 39742334         30.04 ns/op
BenchmarkPrefixSum  8307944            142.8 ns/op

The prefix sum can be several times slower, even though it appears at a glance that it should use a comparable number of instructions. In general, you cannot trust a hasty analysis: just because two functions appear to do a similar amount of work does not mean that they have the same performance. Several factors must be taken into account, including data dependencies.

Branch prediction

In part because processors are superscalar, they have been designed to execute speculatively: when facing a branch, the processor tries to guess the direction that will be taken, and it begins the computation optimistically. When the processor makes the correct prediction, it usually improves the performance, sometimes by a large amount. However, when the processor is unable to accurately predict the branch, branch prediction may become a net negative. Indeed, when the branch is mispredicted, the processor may have to restart the computation from the point where it made the wrong prediction, an expensive process that can waste several CPU cycles. To illustrate, let us first consider a function that copies the content of a slice into another slice of the same size:

func Copy(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        dest[i] = v
    }
}

A more sophisticated function may copy only the odd elements:

func CopyOdd(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        if v&1 == 1 {
            dest[i] = v
        }
    }
}

We may try to copy an array that contains random integers (both odd and even), only odd integers, or only even integers. The following program illustrates:

package main

import (
    "fmt"
    "math/rand"
    "testing"
)

func Copy(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        dest[i] = v
    }
}

func CopyOdd(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        if v&1 == 1 {
            dest[i] = v
        }
    }
}

var array []uint
var dest []uint

func BenchmarkCopyOdd(b *testing.B) {
    for n := 0; n < b.N; n++ {
        CopyOdd(dest, array)
    }
}

func BenchmarkCopy(b *testing.B) {
    for n := 0; n < b.N; n++ {
        Copy(dest, array)
    }
}

func main() {
    array = make([]uint, 10000)
    dest = make([]uint, len(array))

    for i := range array {
        array[i] = uint(rand.Uint32())
    }
    res0 := testing.Benchmark(BenchmarkCopy)
    fmt.Println("BenchmarkCopy (random)", res0)
    res1 := testing.Benchmark(BenchmarkCopyOdd)
    fmt.Println("BenchmarkCopyOdd (random)", res1)
    for i := range array {
        array[i] = uint(rand.Uint32()) | 1
    }
    res2 := testing.Benchmark(BenchmarkCopyOdd)
    fmt.Println("BenchmarkCopyOdd (odd data)", res2)
    for i := range array {
        array[i] = uint(rand.Uint32()) &^ 1
    }
    res3 := testing.Benchmark(BenchmarkCopyOdd)
    fmt.Println("BenchmarkCopyOdd (even data)", res3)
}

On an Apple system, we got the following results:

BenchmarkCopy (random)   414158       2936 ns/op
BenchmarkCopyOdd (random)    55408           19518 ns/op
BenchmarkCopyOdd (odd data)   402670          2975 ns/op
BenchmarkCopyOdd (even data)   402738         2896 ns/op

The last three timings involve the same function; only the input data differs. We find that all timings are similar in this case, except for the benchmark that copies random data: it is several times slower in our tests. The much longer running time is due to the presence of an unpredictable branch in our inner loop. Observe that the same function, subject to the same volume of data, can have vastly different performance characteristics, even though the computational complexity of the function does not change: in all instances, we have linear time complexity. If we expect our data to lead to poor branch prediction, we may reduce the number of branches in the code. The resulting code might be nearly branch free, or branchless. For example, we can use an arithmetic and logical expression to replace a conditional copy:

func CopyOddBranchless(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        dest[i] ^= uint(-(v & 1)) & (v ^ dest[i])
    }
}

Let us review the complicated expression:

  • v & 1: This operation checks if the least significant bit of v is set (i.e., if v is odd).
  • -(v & 1): This negates the result of the previous operation. If v is odd, this becomes -1; otherwise, it becomes 0. However, -1 as an unsigned integer becomes the maximal value, the one with all of the bits set to 1.
  • v ^ dest[i]: This XORs the value of v with the corresponding element in the dest slice.
  • uint(-(v & 1)) & (v ^ dest[i]): If v is odd, it returns the XOR of v with dest[i]; otherwise, it returns 0.
  • Finally, dest[i] ^= uint(-(v & 1)) & (v ^ dest[i]) leaves dest[i] unchanged if v is even, otherwise it replaces with v using the fact that dest[i] ^ (v ^ dest[i]) == v.
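
To convince ourselves that the expression behaves exactly like the conditional copy, we can check a few values directly; this quick sanity check is ours, not part of the original benchmark:

package main

import "fmt"

func main() {
    for _, v := range []uint{2, 3, 10, 11} {
        dest := uint(42) // arbitrary prior content
        dest ^= uint(-(v & 1)) & (v ^ dest)
        fmt.Println(v, "->", dest) // dest stays 42 when v is even, becomes v when odd
    }
}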

We can put this function to good use in a benchmark:

package main

import (
    "fmt"
    "math/rand"
    "testing"
)

func CopyOdd(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        if v&1 == 1 {
            dest[i] = v
        }
    }
}

func CopyOddBranchless(dest []uint, arr []uint) {
    if len(dest) < len(arr) {
        panic("dest is too small")
    }
    for i, v := range arr {
        dest[i] ^= uint(-(v & 1)) & (v ^ dest[i])
    }
}

var array []uint
var dest []uint

func BenchmarkCopyOdd(b *testing.B) {
    for n := 0; n < b.N; n++ {
        CopyOdd(dest, array)
    }
}

func BenchmarkCopyOddBranchless(b *testing.B) {
    for n := 0; n < b.N; n++ {
        CopyOddBranchless(dest, array)
    }
}
func main() {
    array = make([]uint, 10000)
    dest = make([]uint, len(array))
    for i := range array {
        array[i] = uint(rand.Uint32())
    }
    res1 := testing.Benchmark(BenchmarkCopyOdd)
    fmt.Println("BenchmarkCopyOdd (random)", res1)
    res2 := testing.Benchmark(BenchmarkCopyOddBranchless)
    fmt.Println("BenchmarkCopyOddBranchless (random)", res2)
}

On an Apple system, we got:

BenchmarkCopyOdd (random)    60782           19254 ns/op
BenchmarkCopyOddBranchless (random)   166863          7124 ns/op

In this test, the branchless approach is much faster. We should stress that branchless code is not always faster. In fact, we observe in our overall test results that the branchless function is significantly slower than the original when the branch is predictable (e.g., 2896 ns/op vs 7124 ns/op). In actual software, you should try to recognize where you have poorly predicted branches and, in those cases, check whether a branchless approach might be faster. Thankfully, in most projects, most branches are well predicted in practice.

Daniel Lemire, "Measuring your system’s performance using software (Go edition)," in Daniel Lemire's blog, March 17, 2024.
