# Chapter 11: Benchmarking & Profiling

Go provides excellent tools for measuring and improving performance. Performance optimization is a journey from measurement to understanding to improvement. Go’s tooling makes this journey straightforward and scientific.
Performance matters in Go applications, but premature optimization wastes time. The Go philosophy: write clear code first, measure to find real bottlenecks, then optimize what matters. Go’s runtime is already fast - most code doesn’t need optimization. But when performance counts (high-throughput services, real-time systems, resource-constrained environments), Go gives you the tools to understand and improve it.
This chapter covers the essential performance tools: benchmarks for measurement, pprof for profiling CPU and memory usage, and the trace tool for understanding concurrency. You’ll learn to write meaningful benchmarks, interpret profiling data, identify common performance bottlenecks, and apply targeted optimizations. The goal isn’t making everything fast - it’s making the right things fast enough.
## Writing Benchmarks

### Understanding Benchmarks

Benchmarks measure code performance scientifically. Instead of guessing which approach is faster, you measure both and let the data decide. Go’s testing package makes benchmarking as easy as writing tests - benchmarks live alongside tests in `_test.go` files with the same tooling.
A benchmark function accepts *testing.B and runs the code being measured b.N times. The testing framework automatically determines an appropriate N to get statistically significant results - it starts small and increases until the benchmark runs long enough to measure accurately. You don’t choose N; Go does.
The key insight: benchmarks measure relative performance, not absolute. Comparing two approaches in the same benchmark environment reveals which is faster, but the exact numbers depend on your hardware. Focus on ratios (2x faster, 50% fewer allocations) rather than absolute values (234ns per operation).
Benchmark functions start with the `Benchmark` prefix and take a `*testing.B` parameter:
### Benchmark Best Practices

```go
// In a _test.go file:
func BenchmarkConcat(b *testing.B) {
	strs := []string{"hello", "world", "foo", "bar"}

	// Reset timer after setup so setup cost isn't measured
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		concatWithBuilder(strs)
	}
}
```
Run with `go test -bench=. -benchmem`. Sample output:

```
BenchmarkConcat-8    5000000    234 ns/op    48 B/op    2 allocs/op
```
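The benchmark above calls `concatWithBuilder`, which this chapter never defines. A plausible implementation (the name and signature are assumptions taken from the benchmark), shown next to the naive `+` concatenation it would typically be compared against:

```go
package main

import (
	"fmt"
	"strings"
)

// concatWithBuilder joins strings through a strings.Builder, which grows
// one internal buffer instead of allocating a new string per join.
func concatWithBuilder(strs []string) string {
	var b strings.Builder
	for _, s := range strs {
		b.WriteString(s)
	}
	return b.String()
}

// concatNaive allocates a brand-new string on every += iteration,
// which is what the allocation counts in -benchmem would expose.
func concatNaive(strs []string) string {
	result := ""
	for _, s := range strs {
		result += s
	}
	return result
}

func main() {
	strs := []string{"hello", "world", "foo", "bar"}
	fmt.Println(concatWithBuilder(strs) == concatNaive(strs)) // true: same result, different cost
}
```

Both produce identical output; only a benchmark reveals the difference in time and allocations.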
## Memory Benchmarking

### Why Memory Matters

Memory allocations are often the hidden performance killer in Go programs. Every allocation means work for the garbage collector. Frequent allocations in hot code paths can trigger GC more often, causing latency spikes and reducing throughput. Understanding allocation patterns is as important as understanding CPU usage.
The `-benchmem` flag adds allocation statistics to benchmark output: bytes allocated per operation and number of allocations. These numbers reveal optimization opportunities. An algorithm might seem fast but allocate heavily - reducing allocations often speeds up the algorithm and reduces GC pressure simultaneously.

Allocation-free code is the gold standard for performance-critical paths. This doesn’t mean avoiding all allocations - it means being intentional about them. Preallocate buffers, reuse objects with `sync.Pool`, work with `[]byte` instead of strings, and return values instead of pointers for small types. These techniques dramatically reduce allocation rates.
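As a minimal sketch of the preallocation idea (function names here are illustrative, not from the chapter): passing the final capacity to `make` means `append` never has to grow and re-copy the backing array.

```go
package main

import "fmt"

// buildGrowing lets append grow the slice on demand: each time capacity
// is exceeded, a larger backing array is allocated and elements copied.
func buildGrowing(n int) []int {
	var out []int
	for i := 0; i < n; i++ {
		out = append(out, i)
	}
	return out
}

// buildPrealloc allocates the backing array exactly once, up front.
func buildPrealloc(n int) []int {
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, i)
	}
	return out
}

func main() {
	a := buildGrowing(1000)
	b := buildPrealloc(1000)
	fmt.Println(len(a), len(b), cap(b)) // 1000 1000 1000
}
```

A `-benchmem` comparison of the two would show one allocation for the preallocated version versus several for the growing one.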
## Using pprof

### Understanding pprof

pprof is Go’s built-in profiler for identifying CPU hotspots and memory bottlenecks. Unlike benchmarks, which measure specific functions in isolation, pprof analyzes entire programs to show where time is spent and memory is allocated. It answers the critical question: “What should I optimize?”
CPU profiling samples your program periodically (100 times per second by default) to record which functions are executing. After collection, pprof aggregates the data to show time spent per function, including time spent in called functions (cumulative) versus time in the function itself (flat). The top functions by cumulative time are your optimization targets.
Memory profiling tracks allocations, showing which functions allocate the most bytes and objects. This reveals unexpected allocation patterns - maybe a function called rarely allocates huge amounts, or a frequently-called function has small but numerous allocations. Both problems have different solutions, and pprof helps you identify them.
```sh
# CPU profiling
go test -cpuprofile=cpu.prof -bench=.
go tool pprof cpu.prof

# Memory profiling
go test -memprofile=mem.prof -bench=.
go tool pprof mem.prof

# Common pprof commands:
#   top10          - show top 10 functions
#   list funcName  - show source with annotations
#   web            - open interactive graph in browser
```

### HTTP pprof
```go
import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/ handlers on the default mux
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// Your application...
}

// Access at:
// http://localhost:6060/debug/pprof/
// http://localhost:6060/debug/pprof/heap
// http://localhost:6060/debug/pprof/goroutine
```

## Common Optimizations
### Applying Targeted Improvements

Once profiling reveals bottlenecks, you apply targeted optimizations. The following patterns appear repeatedly in performance-critical Go code. They’re not appropriate everywhere - use them where profiling shows they matter, not preemptively.
These optimizations share a theme: reduce allocations, minimize copying, and leverage Go’s efficient primitives. `sync.Pool` reuses temporary objects. Preallocation eliminates growth overhead. Value receivers avoid pointer indirection for small types. Working with bytes instead of strings avoids conversions. Each technique has specific use cases where it shines.

The art of optimization is knowing when to apply these patterns. A function called once per request doesn’t need `sync.Pool`. A slice of 10 items doesn’t need preallocation. A 64-byte struct passed by value is fine. Profile first, understand the bottleneck, then apply the appropriate technique.
### Avoiding Allocations
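This section’s example did not survive extraction; a minimal sketch of two techniques named above - reusing buffers through `sync.Pool` and appending digits with `strconv` instead of formatting through `fmt` - might look like this (the `formatPair` helper is an illustration, not code from the book):

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// bufPool hands out reusable byte slices so hot paths don't allocate a
// fresh buffer on every call.
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 0, 64) },
}

// formatPair renders "key=123" into a pooled buffer. strconv.AppendInt
// writes digits directly into the slice, avoiding the intermediate
// string that fmt.Sprintf would allocate.
func formatPair(key string, n int64) string {
	buf := bufPool.Get().([]byte)[:0]
	buf = append(buf, key...)
	buf = append(buf, '=')
	buf = strconv.AppendInt(buf, n, 10)
	s := string(buf) // one allocation for the final result
	bufPool.Put(buf) // return the buffer for reuse
	return s
}

func main() {
	fmt.Println(formatPair("count", 42)) // count=42
}
```

The payoff only shows up under load: in a single call the pool is pure overhead, which is exactly the “profile first” point made above.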
## Trace Tool

### Understanding Execution Traces

The trace tool visualizes program execution over time, showing goroutine scheduling, GC activity, and system interactions. Unlike pprof, which aggregates data, traces show the timeline of events - you can see exactly when goroutines run, block, and communicate. This is invaluable for understanding concurrency issues.
Traces excel at revealing concurrency problems that profiling misses. Is your program underutilizing CPUs because goroutines block on channels? Are goroutines creating contention for locks? Is the GC pausing your application at critical moments? The trace timeline makes these patterns visible.
The interactive trace viewer shows multiple timelines: per-processor goroutine execution, heap size, GC events, and goroutine creation/blocking. Click events to see details, zoom in on interesting periods, and correlate across timelines. Common insights: goroutines spending too much time blocked, inadequate parallelism, or GC triggering too frequently. The trace points to root causes that profiling data alone can’t reveal.
```sh
# Generate trace
go test -trace=trace.out -bench=.

# View trace
go tool trace trace.out
```

The trace shows:
- Goroutine execution timeline
- GC events
- Syscalls
- Network blocking
## Key Takeaways

- Benchmark first - measure before optimizing
- Use `-benchmem` - track allocations, not just time
- Profile in production - pprof over HTTP
- Preallocate - slices and maps when size is known
- Avoid allocations - `sync.Pool`, `strconv`, `[]byte`
- Trace for concurrency - `go tool trace`
## Exercise

### Optimize a Slow Function
Given a function that counts word frequencies in a text, identify and fix the performance bottlenecks. The optimized version should be at least 3x faster.
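The starter function is not reproduced in this extract. A hypothetical slow version - invented here for illustration, with the kind of per-iteration allocation a benchmark and pprof would flag - could look like:

```go
package main

import (
	"fmt"
	"strings"
)

// wordFrequencies (hypothetical starter code) counts word occurrences,
// but rebuilds each key with fmt.Sprintf - a needless allocation on
// every word that -benchmem would immediately surface.
func wordFrequencies(text string) map[string]int {
	freq := make(map[string]int)
	for _, w := range strings.Fields(text) {
		key := fmt.Sprintf("%s", strings.ToLower(w)) // wasteful: ToLower already returns a string
		freq[key]++
	}
	return freq
}

func main() {
	fmt.Println(wordFrequencies("the quick fox and the dog")["the"]) // 2
}
```

Write a benchmark for it, run with `-benchmem`, then eliminate the redundant work and compare the numbers.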
Next up: Chapter 12: Patterns & Gotchas