Golang Benchmarking, Part 1

Basic CPU and Memory Benchmarking with Go

Performance Benchmarks with Go's testing Package and Statistical Analysis with benchstat

---

Franco Posa

Published 2025-01-16 · Updated 2025-01-16


This document is a work in progress.

Goals

We will:

  1. Write a simple benchmark test with the Go testing package
  2. Run benchmarks and analyze CPU and Memory performance
  3. Use the benchstat tool to apply statistical comparisons between benchmark results

1. Write and Run a Simple Benchmark in Go

1.1. A Real-World Example

The example we will use here is simple - a single line in Grafana’s Mimir codebase.

In a project spanning months, we had introduced enormous changes to a query request queuing algorithm and spent much of our focus on the performance of the queue in the context of the larger system of microservices. As the end of the project neared, the system performed well under load tests and we had proven out the correctness of our data structures and algorithms.

With such complex changes it is not just minor bugs we have to watch out for - small mistakes often take the form of correct, but inefficient code. Continuous profiles of the code running in our dev environments allowed us to find a few spots to improve.

While this particular snippet was far from the worst offender, it looked like it could be easy to fix. The code returns a list of strings with length 2, representing a path through a tree structure.

Original

    return append([]string{component}, tenant)

The potential problem is that we are creating a new slice just to hold the first string, then appending the second string to it, instead of just creating a slice with the two strings:

Potential Fix

    return []string{component, tenant}

However, without a benchmark we could not tell if, or by how much, the potential fix would improve performance. It was possible that the compiler would optimize the difference away, creating only a single slice, or at least making only a single allocation.

Reading the compiler’s output is not the most common skill, and doing so still would not answer the “how much” question, so we opted to write a quick benchmark to measure the difference.

1.2. Write a Simple Benchmark

We start with our building blocks - the simplest possible representation of the code we want to benchmark. Further complexity and changes to make it more representative of real-world usage can always come later.

Benchmark functions in Go must have names that start with Benchmark, take b *testing.B as their only argument, and live in a file whose name ends in _test.go.

package benchmark_example

import (
	"fmt"
	"testing"
)

type makeQueuePathFunc func(component, tenant string) []string

func baselineMakeQueuePath(component, tenant string) []string {
	return append([]string{component}, tenant)
}

func noAppendMakeQueuePath(component, tenant string) []string {
	return []string{component, tenant}
}

func BenchmarkMakeQueuePath(b *testing.B) {
	// Each test case pairs a readable name with the implementation to measure.
	var testCases = []struct {
		name     string
		pathFunc makeQueuePathFunc
	}{
		{"baseline", baselineMakeQueuePath},
		{"noAppend", noAppendMakeQueuePath},
	}

	for _, testCase := range testCases {
		// b.Run reports each case as its own sub-benchmark.
		b.Run(fmt.Sprintf("func_%s", testCase.name), func(b *testing.B) {
			// The code under test must run exactly b.N times.
			for i := 0; i < b.N; i++ {
				_ = testCase.pathFunc("component", "tenant")
			}
		})
	}
}
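
Before comparing speed, it can be worth confirming that the two implementations really are interchangeable. The following is a minimal sketch of such a check, written as a standalone program rather than a test. The samePath helper and the list of inputs are illustrative assumptions and are not part of the Mimir codebase.

```go
package main

import (
	"fmt"
	"reflect"
)

// Copies of the two implementations from the benchmark file.
func baselineMakeQueuePath(component, tenant string) []string {
	return append([]string{component}, tenant)
}

func noAppendMakeQueuePath(component, tenant string) []string {
	return []string{component, tenant}
}

// samePath reports whether both implementations build an identical path
// for the given inputs. Hypothetical helper, for illustration only.
func samePath(component, tenant string) bool {
	return reflect.DeepEqual(
		baselineMakeQueuePath(component, tenant),
		noAppendMakeQueuePath(component, tenant),
	)
}

func main() {
	// Illustrative inputs, including empty strings as edge cases.
	inputs := [][2]string{
		{"component", "tenant"},
		{"", "tenant"},
		{"component", ""},
	}
	for _, in := range inputs {
		fmt.Printf("%q: same=%v\n", in, samePath(in[0], in[1]))
	}
}
```

If any input reported a mismatch, the benchmark comparison would be meaningless, since the two functions would not be doing the same work.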

1.3. Run a Simple Benchmark & Analyze Results

Go benchmarks are run via the standard go test command, with specific CLI flags used to control benchmark behavior.

At the most basic, we just need the -bench flag:

-bench regexp
    Run only those benchmarks matching a regular expression.
    By default, no benchmarks are run.
    To run all benchmarks, use '-bench .' or '-bench=.'.
...

Flags for go test are also recognized with the test. prefix, so -bench becomes -test.bench. The prefix is required when running a pre-compiled test binary, so it is easiest to get in the habit of always using it.
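
One way to see where the prefixed names come from is to ask the flag package directly. The sketch below assumes nothing beyond the standard library: testing.Init registers the testing flags on the default flag set, and flag.Lookup finds them under their test. names. The testFlagRegistered helper is a hypothetical name used only for this illustration.

```go
package main

import (
	"flag"
	"fmt"
	"testing"
)

// testFlagRegistered reports whether the testing package has registered a
// flag under the prefixed name "test.<short>". Hypothetical helper.
func testFlagRegistered(short string) bool {
	// testing.Init registers the testing flags on the default flag set.
	// It has no effect if the flags were already registered.
	testing.Init()
	return flag.Lookup("test."+short) != nil
}

func main() {
	for _, short := range []string{"bench", "benchmem", "benchtime"} {
		fmt.Printf("-test.%s registered: %v\n", short, testFlagRegistered(short))
	}
}
```

The short forms such as -bench are a convenience that the go test command translates before invoking the test binary, which is why a pre-compiled binary only understands the prefixed form.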

From the root directory of our package, we run go test -test.bench=.:

[~/repos/benchmark-example] % go test -test.bench=.
goos: linux
goarch: amd64
pkg: benchmark_example
cpu: AMD Ryzen 7 PRO 6860Z with Radeon Graphics
BenchmarkMakeQueuePath/func_baseline-16    15183444    67.30 ns/op
BenchmarkMakeQueuePath/func_noAppend-16    42104808    29.80 ns/op
PASS
ok      benchmark_example       2.393s

The output shows each benchmark's name (the -16 suffix is the GOMAXPROCS value for the run), the number of iterations executed, and the average time per iteration in nanoseconds (ns/op). The final line reports the total time taken for the package.

In this case, we are also interested in the memory allocations of each test scenario. We can add this information to the benchmark output with the -benchmem flag:

-benchmem
    Print memory allocation statistics for benchmarks.
    Allocations made in C or using C.malloc are not counted.

This time, we run go test -test.bench=. -test.benchmem:

[~/repos/benchmark-example] % go test -test.bench=. -test.benchmem
goos: linux
goarch: amd64
pkg: benchmark_example
cpu: AMD Ryzen 7 PRO 6860Z with Radeon Graphics
BenchmarkMakeQueuePath/func_baseline-16    15659572    67.82 ns/op    48 B/op    2 allocs/op
BenchmarkMakeQueuePath/func_noAppend-16    38129188    30.20 ns/op    32 B/op    1 allocs/op
PASS
ok      benchmark_example       2.332s

This benchmark already shows us that the noAppend function is faster than the baseline using append - just over twice as fast with an average of ~30 ns/op vs. baseline at ~68 ns/op. Further, as we suspected, the baseline function using append is making two memory allocations: we can infer this is one allocation for each slice declared. Turns out the compiler is not that smart.
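
The allocation counts can be cross-checked outside of a benchmark run. The sketch below uses testing.AllocsPerRun from the standard library; the sink variable and the allocsPer and sliceDataBytes helpers are assumptions introduced for this illustration. On a 64-bit platform a string value occupies 16 bytes, so a one-element backing array (16 B) plus a two-element backing array (32 B) is consistent with the 48 B/op reported for baseline, while noAppend needs only the 32 B array.

```go
package main

import (
	"fmt"
	"testing"
	"unsafe"
)

func baselineMakeQueuePath(component, tenant string) []string {
	return append([]string{component}, tenant)
}

func noAppendMakeQueuePath(component, tenant string) []string {
	return []string{component, tenant}
}

// sink is a package-level variable. Storing each result here makes the
// returned slice escape to the heap, as it would for a caller that keeps it.
var sink []string

// allocsPer reports the average number of heap allocations per call.
// Hypothetical helper built on testing.AllocsPerRun.
func allocsPer(pathFunc func(component, tenant string) []string) float64 {
	return testing.AllocsPerRun(1000, func() {
		sink = pathFunc("component", "tenant")
	})
}

// sliceDataBytes returns the size in bytes of a backing array of n strings.
func sliceDataBytes(n int) uintptr {
	return uintptr(n) * unsafe.Sizeof("")
}

func main() {
	fmt.Println("baseline allocs per call:", allocsPer(baselineMakeQueuePath))
	fmt.Println("noAppend allocs per call:", allocsPer(noAppendMakeQueuePath))
	fmt.Println("backing array for 1 string (bytes):", sliceDataBytes(1))
	fmt.Println("backing array for 2 strings (bytes):", sliceDataBytes(2))
}
```

The exact count for baseline depends on the compiler version, so treat the numbers printed on your machine as the source of truth rather than the figures in this article.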

This is great news! We have a simple change that can cut the CPU time spent in this hotspot in half. However, there is more we can do to gain information from the benchmarks and increase our confidence that our results will carry over to the real world.

2. Improve the Benchmark Run Configuration: Controlling the Benchmark Iterations

2.1. Understand Go’s Benchmark b.N

Look at our benchmark output - why did the noAppend test run more than twice as many iterations as baseline? The answer lies in b.N.

Recall our benchmark function:

b.Run(fmt.Sprintf("func_%s", testCase.name), func(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = testCase.pathFunc("component", "tenant")
	}
})

The Go benchmarking docs give us a vague answer:

The benchmark function must run the target code b.N times. It is called multiple times with b.N adjusted until the benchmark function lasts long enough to be timed reliably.

This is all well and good, but “timed reliably” does not tell us much. It notably does not indicate that the number of iterations is chosen to give us a defensible comparison between our benchmark scenarios.
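
To make the "called multiple times" behavior concrete, the sketch below runs a benchmark programmatically with testing.Benchmark and records the value of b.N on every invocation. The observeN helper and the sink variable are assumptions made for this illustration. With default settings the run takes roughly a second.

```go
package main

import (
	"fmt"
	"testing"
)

func noAppendMakeQueuePath(component, tenant string) []string {
	return []string{component, tenant}
}

// sink keeps the result alive so the call is not optimized away.
var sink []string

// observeN runs a benchmark in-process and records b.N each time the
// benchmark function is invoked. Hypothetical helper.
func observeN() (seen []int, result testing.BenchmarkResult) {
	result = testing.Benchmark(func(b *testing.B) {
		seen = append(seen, b.N)
		for i := 0; i < b.N; i++ {
			sink = noAppendMakeQueuePath("component", "tenant")
		}
	})
	return seen, result
}

func main() {
	seen, result := observeN()
	fmt.Println("b.N on each invocation:", seen)
	fmt.Println("reported iterations:", result.N)
}
```

The recorded sequence starts at 1 and grows until a run lasts long enough. Only the final invocation is reported, which is why a faster function ends up with a larger iteration count.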

Benchmark outputs are not always as clear as one option running twice as fast as another. Recall your introductory statistics classes if you ever took them, or paid attention: when the difference between the averages of two datasets is small, we can gain more confidence in the statistical significance of the difference by collecting more data points.
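
As a reminder of the arithmetic behind that intuition, and assuming n independent measurements with sample standard deviation s, the standard error of the mean is:

```latex
\mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}}
\qquad\Longrightarrow\qquad
\mathrm{SE}_{4n} = \frac{s}{\sqrt{4n}} = \tfrac{1}{2}\,\mathrm{SE}_{n}
```

In other words, under these assumptions, halving the uncertainty in an average requires roughly four times as many measurements.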

2.2. Control Benchmark Iterations with -benchtime

We could just delete the usage of b.N from our code and hardcode the number of iterations into the test, but it is far more convenient for us and any other engineers we work with to avoid this.

Instead we can control this behavior with the -benchtime flag. Each user can easily change the number of iterations to suit their needs, or omit the flag entirely and let Go choose.

-benchtime t
    Run enough iterations of each benchmark to take t, specified
    as a time.Duration (for example, -benchtime 1h30s).
    The default is 1 second (1s).
    The special syntax Nx means to run the benchmark N times
    (for example, -benchtime 100x).

With an emphasis on collecting more data points to increase our confidence in the results, we can choose a number of iterations greater than the ~40 million that Go chose for us in pursuit of its goal that “the benchmark function lasts long enough to be timed reliably”.

Since our benchmark is small and fast, it does not hurt to overshoot on iterations - why not a nice round 64 million?

[~/repos/benchmark-example] % go test -test.bench=. -test.benchtime=64000000x -test.benchmem
goos: linux
goarch: amd64
pkg: benchmark_example
cpu: AMD Ryzen 7 PRO 6860Z with Radeon Graphics
BenchmarkMakeQueuePath/func_baseline-16    64000000    66.19 ns/op    48 B/op    2 allocs/op
BenchmarkMakeQueuePath/func_noAppend-16    64000000    32.33 ns/op    32 B/op    1 allocs/op
PASS
ok      benchmark_example       6.311s
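
The effect of the Nx syntax can also be observed in-process. The sketch below is for demonstration only and is not how the flag is normally used: it sets test.benchtime through the flag package and then runs a benchmark with testing.Benchmark. The benchmarkWithBenchtime helper, the sink variable, and the choice of 1000 iterations are assumptions made for this illustration.

```go
package main

import (
	"flag"
	"fmt"
	"testing"
)

func noAppendMakeQueuePath(component, tenant string) []string {
	return []string{component, tenant}
}

// sink keeps the result alive so the call is not optimized away.
var sink []string

// benchmarkWithBenchtime sets the test.benchtime flag in-process and then
// runs a benchmark programmatically. Hypothetical helper.
func benchmarkWithBenchtime(value string) (testing.BenchmarkResult, error) {
	// Register the test.* flags; this has no effect if already done.
	testing.Init()
	if err := flag.Set("test.benchtime", value); err != nil {
		return testing.BenchmarkResult{}, err
	}
	result := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = noAppendMakeQueuePath("component", "tenant")
		}
	})
	return result, nil
}

func main() {
	result, err := benchmarkWithBenchtime("1000x")
	if err != nil {
		fmt.Println("could not set benchtime:", err)
		return
	}
	fmt.Println("iterations:", result.N)
}
```

With an Nx value the iteration count is fixed up front, so both scenarios run the same number of times regardless of how fast each one is, matching the 64000000 shown for both rows above.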