Golang Benchmarking, Part 1

Basic CPU and Memory Benchmarking & Analysis with Go

Performance Benchmarks with Go's testing Package and Statistical Tests with benchstat

---

Franco Posa

Published 2025-01-25 · Updated 2025-02-05


Goals

We will:

  1. Write & run a simple performance benchmark test with the Go testing package
  2. Apply statistical tests between benchmark results with the benchstat tool

0. Prerequisites

0.1 Install benchstat

Install the benchstat tool into your $GOPATH/bin:

% go install golang.org/x/perf/cmd/benchstat@latest

Then check that benchstat is available in your $PATH:

% which benchstat
~/go/bin/benchstat

If the command is not found, then you may need to add the $GOPATH/bin directory to your $PATH:

% export PATH=$PATH:$GOPATH/bin

1. Write and Run a Simple Benchmark in Go

1.1. A Real-World Example

The example we will use here is simple - a single line in Grafana’s Mimir codebase.

In a project spanning months, we had introduced extensive changes to a query request queuing algorithm and focused much of our attention on the performance of the queue in the context of the larger system of microservices.

With such complex changes, it is inevitable that some minor issues make it past our test suites and code reviews. We analyzed continuous profiles of the new code running in our dev environments and looked for any performance hotspots that could be improved.

While this particular snippet was far from the worst offender, it looked like it could be easy to fix. The code returns a list of strings with length 2, representing a path through a tree structure.

Original

    return append([]string{component}, tenant)

The apparent issue is that we create a new slice just to hold the first string, then immediately create a second slice with the append operation, instead of creating a single slice containing both strings. The slice literal []string{component} has length and capacity 1, so the append cannot grow it in place - it must allocate a new backing array and copy:

Potential Fix

    return []string{component, tenant}

However, without a benchmark we could not tell if, or by how much, the potential fix would improve performance.
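Allocation counts, at least, can be sanity-checked without a full benchmark. Below is a minimal sketch (illustrative code, not from the Mimir codebase) using the standard library's testing.AllocsPerRun, which reports the average number of heap allocations per call to a function:

package benchmark_example

import "testing"

// sink is package-level so the compiler cannot prove the slices never
// escape and optimize the allocations away entirely.
var sink []string

func TestQueuePathAllocs(t *testing.T) {
	component, tenant := "component", "tenant"

	appendAllocs := testing.AllocsPerRun(1000, func() {
		sink = append([]string{component}, tenant)
	})
	literalAllocs := testing.AllocsPerRun(1000, func() {
		sink = []string{component, tenant}
	})

	// Expect 2 allocs/op for the append version and 1 for the literal.
	t.Logf("append: %.0f allocs/op, literal: %.0f allocs/op", appendAllocs, literalAllocs)
}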

1.2. Write a Simple Benchmark

We start with our building blocks - the simplest possible representation of the code we want to benchmark. Further complexity and changes to make it more representative of real-world usage can always come later.

Benchmark test names in Go must always start with Benchmark and take b *testing.B as their only argument. Also note the use of fmt.Sprintf("func=%s", testCase.name) to name each scenario within the benchmark. This is a standardized key=value format for differentiating Go benchmark scenarios, which allows the output to work with analysis tools like benchstat.

package benchmark_example

import (
	"fmt"
	"testing"
)

// queuePathFunc abstracts the implementations so the benchmark can swap between them.
type queuePathFunc func(component, tenant string) []string

func baselineQueuePath(component, tenant string) []string {
	return append([]string{component}, tenant)
}

func noAppendQueuePath(component, tenant string) []string {
	return []string{component, tenant}
}

func BenchmarkQueuePath(b *testing.B) {
	var testCases = []struct {
		name     string
		pathFunc queuePathFunc
	}{
		{"baseline", baselineQueuePath},
		{"noAppend", noAppendQueuePath},
	}

	for _, testCase := range testCases {
		// Scenario names use the key=value format so benchstat can compare across them.
		b.Run(fmt.Sprintf("func=%s", testCase.name), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				_ = testCase.pathFunc("component", "tenant")
			}
		})
	}
}

1.3. Run the Benchmark & View Results

Go benchmarks are run via the standard go test command, with specific CLI flags used to control benchmark behavior.

At its most basic, we just need the -bench flag:

-bench regexp
    Run only those benchmarks matching a regular expression.
    By default, no benchmarks are run.
    To run all benchmarks, use '-bench .' or '-bench=.'.
...

We also do not want to waste time running any non-benchmark tests. Like the -bench flag for benchmarks, the -run flag selects tests to run by regex, so we want a regex that will match no tests: '^$'. The ^ pattern matches the beginning of a line, and $ matches the end of a line, so the full ^$ pattern will only match a completely empty line, not any test names. The regex pattern is wrapped in single quotes to avoid shell expansion.
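To see why, here is a tiny standalone sketch using Go's standard regexp package to illustrate the pattern:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	re := regexp.MustCompile(`^$`)
	fmt.Println(re.MatchString("TestQueuePath")) // false - a test name is never empty
	fmt.Println(re.MatchString(""))              // true - only the empty string matches
}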

Flags for go test are also recognized with a test. prefix, where -bench becomes -test.bench. The test. prefix is required when running a pre-compiled test binary, so it is easiest to get in the habit of always using the prefix.

From the root directory of our package, we run go test -test.run='^$' -test.bench=.:

[~/repos/benchmark-example] % go test -test.run='^$' -test.bench=.
goos: linux
goarch: amd64
pkg: benchmark_example
cpu: AMD Ryzen 7 PRO 6860Z with Radeon Graphics
BenchmarkQueuePath/func=baseline-16    22597748    52.15 ns/op
BenchmarkQueuePath/func=noAppend-16    39801882    25.97 ns/op
PASS
ok      benchmark_example       2.304s

The output of the benchmark shows us our scenario names, the number of iterations each scenario ran, the average time per operation in nanoseconds, and the total time taken. The -16 suffix on each name is the GOMAXPROCS value in use for the run.

In this case, we are also interested in the memory allocations of each test scenario. We can add this information to the benchmark output with the -benchmem flag:

-benchmem
    Print memory allocation statistics for benchmarks.
    Allocations made in C or using C.malloc are not counted.

This time, we run go test -test.run='^$' -test.bench=. -test.benchmem:

[~/repos/benchmark-example] % go test -test.run='^$' -test.bench=. -test.benchmem
goos: linux
goarch: amd64
pkg: benchmark_example
cpu: AMD Ryzen 7 PRO 6860Z with Radeon Graphics
BenchmarkQueuePath/func=baseline-16    22136233    52.31 ns/op    48 B/op    2 allocs/op
BenchmarkQueuePath/func=noAppend-16    41793476    26.22 ns/op    32 B/op    1 allocs/op
PASS
ok      benchmark_example       2.345s

This benchmark already shows us that the noAppend function is faster than the baseline using append - around twice as fast, averaging ~26 ns/op vs. ~52 ns/op for the baseline. Further, as we suspected, the baseline function using append is making two memory allocations - one for the backing array of each slice. The byte counts corroborate this: on amd64 a string header is 16 bytes, so the one-string array accounts for 16 B and the two-string array for 32 B, matching the baseline's 48 B/op against noAppend's 32 B/op.

This is great news! We have a simple change that can cut the CPU time spent in this hotspot in half. However, not all benchmark results are this clear - often the difference is much smaller and requires a statistical test to determine whether it is significant. There is more we can do to increase our confidence that our results will carry over to the real world.

2. Control Benchmark Iterations

2.1. Understand Go’s Benchmark b.N

Look at our benchmark output - why did the noAppend test run nearly twice as many iterations as baseline? The answer lies in b.N.

Recall our benchmark function:

b.Run(fmt.Sprintf("func=%s", testCase.name), func(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = testCase.pathFunc("component", "tenant")
	}
})

The Go benchmarking docs give us a vague answer:

The benchmark function must run the target code b.N times. It is called multiple times with b.N adjusted until the benchmark function lasts long enough to be timed reliably.

We have all sorts of reasons to want to control the number of iterations in our benchmark runs. We can set a specific round number of iterations just to have a clean and consistent dataset, reduce iterations to speed up long and complex benchmarks, or increase iterations to collect more data points and raise our confidence in the results.

2.2. Control Benchmark Iterations with Test Flags

2.2.1. The -test.benchtime Flag

We could just delete the usage of b.N from our code and hardcode the number of iterations into the test, as sketched below, but it is far more convenient for us and any other engineers we work with to avoid this.
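For illustration, the hardcoded version would look something like this sketch. Note that this is an anti-pattern: Go still divides total elapsed time by b.N when reporting ns/op, so ignoring b.N also corrupts the timing output.

// A sketch of the hardcoded alternative - not recommended.
// Changing the iteration count requires editing and recompiling the test,
// and because the reported ns/op is total time divided by b.N (not by our
// constant), the timing output becomes meaningless.
func BenchmarkQueuePathHardcoded(b *testing.B) {
	const iterations = 48_000_000
	for i := 0; i < iterations; i++ {
		_ = baselineQueuePath("component", "tenant")
	}
}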

Instead, we can control this behavior with the -benchtime flag, and each user can easily change the number of iterations to suit their needs.

-benchtime t
    Run enough iterations of each benchmark to take t, specified
    as a time.Duration (for example, -benchtime 1h30s).
    The default is 1 second (1s).
    The special syntax Nx means to run the benchmark N times
    (for example, -benchtime 100x).

In our test runs up to this point, Go has chosen up to ~40 million iterations in pursuit of its goal that “the benchmark function lasts long enough to be timed reliably”.

With an emphasis on collecting more data points to increase our confidence in the results, we should choose a larger number. Since our benchmark is small and fast, it does not hurt to overshoot a bit on iterations - why not a nice round 48 million?

Add the flag -test.benchtime=48000000x to our go test command:

[~/repos/benchmark-example] % go test \
  -test.run='^$' \
  -test.bench=. \
  -test.benchtime=48000000x \
  -test.benchmem
goos: linux
goarch: amd64
pkg: benchmark_example
cpu: AMD Ryzen 7 PRO 6860Z with Radeon Graphics
BenchmarkQueuePath/func=baseline-16    48000000    52.13 ns/op    48 B/op    2 allocs/op
BenchmarkQueuePath/func=noAppend-16    48000000    26.07 ns/op    32 B/op    1 allocs/op
PASS
ok      benchmark_example       3.759s

Now we have increased the iterations beyond what Go has chosen to ensure the scenarios are “timed reliably”, but Go still does not provide us with a way to make a statistical comparison between the two scenarios.

We will introduce the benchstat tool which is designed for that exact purpose, but first we must produce more data. Like any statistical test, benchstat requires us to have several samples for each benchmark scenario, where each sample is a full set of results from a single run of the benchmark suite.

To collect these samples, we just need to run the same suite multiple times and save the results to a file.

2.2.2. The -test.count Flag and Saving Benchmark Results with tee

The tee command duplicates its standard input to both standard output and a file. In this case we can use tee to watch the progression of the benchmark runs in the terminal at the same time that the results are written to a file for benchstat to analyze. The -a/--append flag ensures that each new run of tee appends the new results to the file rather than overwriting any existing records.

The final piece is adding -test.count to give benchstat enough samples to work with:

[~/repos/benchmark-example] % go test \
  -test.run='^$' \
  -test.bench=. \
  -test.benchtime=48000000x \
  -test.count=10 \
  -test.benchmem | tee -a benchmark-queue-path.txt
goos: linux
goarch: amd64
pkg: benchmark_example
cpu: AMD Ryzen 7 PRO 6860Z with Radeon Graphics
BenchmarkQueuePath/func=baseline-16    48000000       52.54 ns/op    48 B/op    2 allocs/op
BenchmarkQueuePath/func=baseline-16    48000000       52.90 ns/op    48 B/op    2 allocs/op
BenchmarkQueuePath/func=baseline-16    48000000       53.01 ns/op    48 B/op    2 allocs/op
BenchmarkQueuePath/func=baseline-16    48000000       53.11 ns/op    48 B/op    2 allocs/op
BenchmarkQueuePath/func=baseline-16    48000000       52.78 ns/op    48 B/op    2 allocs/op
BenchmarkQueuePath/func=baseline-16    48000000       55.26 ns/op    48 B/op    2 allocs/op
BenchmarkQueuePath/func=baseline-16    48000000       53.58 ns/op    48 B/op    2 allocs/op
BenchmarkQueuePath/func=baseline-16    48000000       53.20 ns/op    48 B/op    2 allocs/op
BenchmarkQueuePath/func=baseline-16    48000000       53.15 ns/op    48 B/op    2 allocs/op
BenchmarkQueuePath/func=baseline-16    48000000       54.02 ns/op    48 B/op    2 allocs/op
BenchmarkQueuePath/func=noAppend-16    48000000       26.61 ns/op    32 B/op    1 allocs/op
BenchmarkQueuePath/func=noAppend-16    48000000       26.60 ns/op    32 B/op    1 allocs/op
BenchmarkQueuePath/func=noAppend-16    48000000       26.39 ns/op    32 B/op    1 allocs/op
BenchmarkQueuePath/func=noAppend-16    48000000       26.30 ns/op    32 B/op    1 allocs/op
BenchmarkQueuePath/func=noAppend-16    48000000       27.16 ns/op    32 B/op    1 allocs/op
BenchmarkQueuePath/func=noAppend-16    48000000       26.97 ns/op    32 B/op    1 allocs/op
BenchmarkQueuePath/func=noAppend-16    48000000       26.59 ns/op    32 B/op    1 allocs/op
BenchmarkQueuePath/func=noAppend-16    48000000       26.43 ns/op    32 B/op    1 allocs/op
BenchmarkQueuePath/func=noAppend-16    48000000       26.19 ns/op    32 B/op    1 allocs/op
BenchmarkQueuePath/func=noAppend-16    48000000       26.68 ns/op    32 B/op    1 allocs/op
PASS
ok      benchmark_example       38.390s

3. Apply Statistical Tests with benchstat

By default, benchstat compares benchmark results with the same name across multiple files - a mode tailored for comparing an identical benchmark suite run against two different versions of a codebase.

In our case, we have a single version of the codebase, and we want to compare benchmark scenarios with different names. The documentation for this is a bit hard to understand, but a working example will help.

We have the scenario names BenchmarkQueuePath/func=baseline-16 and BenchmarkQueuePath/func=noAppend-16, and we want to compare the samples for func=baseline to the samples for func=noAppend. Our use of the standard key=value format in the benchmark names makes this possible with benchstat. The -col option takes a list of keys across which to compare samples - in our case, the key is just /func.

This only works if our scenario names use the key=value format!

[~/repos/benchmark-example] % benchstat -col /func benchmark-queue-path.txt
goos: linux
goarch: amd64
pkg: benchmark_example
cpu: AMD Ryzen 7 PRO 6860Z with Radeon Graphics
             │  baseline   │              noAppend               │
             │   sec/op    │   sec/op     vs base                │
QueuePath-16   53.13n ± 2%   26.59n ± 1%  -49.94% (p=0.000 n=10)

             │  baseline  │              noAppend              │
             │    B/op    │    B/op     vs base                │
QueuePath-16   48.00 ± 0%   32.00 ± 0%  -33.33% (p=0.000 n=10)

             │  baseline  │              noAppend              │
             │ allocs/op  │ allocs/op   vs base                │
QueuePath-16   2.000 ± 0%   1.000 ± 0%  -50.00% (p=0.000 n=10)

This is, again, great news! With p-values of 0.000, the statistical tests show a statistically significant difference in performance: the noAppend function is both faster and uses less memory than the baseline function - by a long shot.

4. Conclusion

We now have a basic understanding of how to write, run, and analyze benchmarks in Go.

Benchmarking is an often-overlooked part of engineering teams’ development workflows, done ad-hoc only when a large performance issue is suspected, or not at all. However, like any other habit, a bit of discipline and repetition can make it a natural part of our workflow. Proactive benchmarking can help us catch performance issues early, and even prevent them from ever reaching production.