Fixing Zig benchmark where std.mem.doNotOptimizeAway was ignored
This is my second devlog for bench, a fast and accurate benchmarking library for Zig. Yet another hobby project of mine.
Previously, I shared how I fixed the inaccuracy of the metrics. This time, I tested it against a classic algorithm: Fibonacci. That is where I found this new bug.
I wrote a simple recursive implementation and an iterative one. Then I set up the benchmark.
const std = @import("std");
const bench = @import("bench");
fn fibNaive(n: u64) u64 {
if (n <= 1) return n;
return fibNaive(n - 1) + fibNaive(n - 2);
}
fn fibIterative(n: u64) u64 {
if (n == 0) return 0;
var a: u64 = 0;
var b: u64 = 1;
for (2..n + 1) |_| {
const c = a + b;
a = b;
b = c;
}
return b;
}
pub fn main() !void {
const allocator = std.heap.smp_allocator;
const opts = bench.Options{
.sample_size = 100,
.warmup_iters = 3,
};
const m_naive = try bench.run(allocator, "fibNaive", fibNaive, .{@as(u64, 30)}, opts);
const m_iter = try bench.run(allocator, "fibIterative", fibIterative, .{@as(u64, 30)}, opts);
try bench.report(.{
.metrics = &.{ m_naive, m_iter },
.baseline_index = 0, // naive as baseline
});
}

I compiled it in ReleaseFast mode and ran it. The result confused me.
$ zig build fibonacci
Benchmark Summary: 2 benchmarks run
├─ fibNaive 0.43ns [baseline]
│ └─ cycles: 2 instructions: 4 ipc: 2.00 miss: 0
└─ fibIterative 0.43ns 1.00x faster
└─ cycles: 2 instructions: 4 ipc: 2.00 miss: 0

0.43 nanoseconds for both implementations.
For context, a modern CPU clock cycle is roughly 0.25ns. My benchmark claimed
that calculating the 30th Fibonacci number was happening in less than 2 CPU
cycles. fibNaive alone should have performed over a million recursive calls.
Unless I had accidentally invented a new type of physics, the compiler was lying to me.
Finding the Root Cause
The function fibNaive is a “pure function”. Its output depends only on its
input. I was passing 30 as a constant literal.
The Zig compiler uses the LLVM backend, and LLVM is very smart. It saw that the
input was always 30 and decided to run the calculation during compilation.
This is called Constant Folding & Propagation.
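The effect is easy to reproduce in isolation. Here is a minimal sketch (the function name and constants are mine, just for illustration): a pure function called with a comptime-known argument can be folded away entirely in ReleaseFast, while the same function fed a value only known at runtime cannot.

```zig
const std = @import("std");

// A pure function: the result depends only on the argument.
fn square(x: u64) u64 {
    return x * x;
}

pub fn main() void {
    // `7` is comptime-known, so in ReleaseFast LLVM is free to
    // replace this whole call with the constant 49.
    const folded = square(7);

    // The argument count is only known at runtime, so the
    // multiplication has to survive into the generated code.
    const n: u64 = std.os.argv.len;
    const not_folded = square(n);

    std.debug.print("{} {}\n", .{ folded, not_folded });
}
```

Comparing the assembly of the two call sites (for example with objdump) should show the first collapsed to a constant while the second keeps the arithmetic.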
I needed to prove this. I decided to look at the assembly code.
Getting the assembly output was a bit tricky. I could not just pass flags to
zig build. I had to use objdump on the final binary.
objdump -d -C --no-show-raw-insn zig-out/bin/fibonacci-bench | less

I searched for the main loop and found the smoking gun.
mov $0xcb228,%eax

0xcb228 in hex is 832040 in decimal. That is the 30th Fibonacci number. The
compiler deleted my function entirely and just replaced it with the answer. I
was benchmarking how fast the CPU could move a number into a register.
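As an aside, when compiling a single file directly (outside of zig build), the compiler can emit assembly itself via -femit-asm; a sketch, assuming a reasonably recent Zig toolchain (the file names here are illustrative):

```shell
# Emit optimized assembly for one file, no objdump needed.
zig build-obj -O ReleaseFast -femit-asm=fib.s examples/fibonacci.zig

# Look for the folded constant (832040 in decimal; objdump showed
# the same value as hex, 0xcb228).
grep -n "832040" fib.s
```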
The Failed Attempts
I thought this would be an easy fix. The standard advice in benchmarking is to
use a “black box” function that tricks the compiler. In Zig, this is
std.mem.doNotOptimizeAway.
var args = .{ 30 };
std.mem.doNotOptimizeAway(&args);
try execute(function, args);

I ran it again. The result was still 0.43ns.
The problem is that doNotOptimizeAway prevents the compiler from deleting the
variable, but it does not hide the value. The compiler saw that args was
initialized to 30. It saw that doNotOptimizeAway read the memory but did not
change it. So when I called execute, the compiler still knew the value was
30.
The Fix: volatile
To defeat the compiler optimizations, I had to be aggressive. I needed to force
a volatile load.
A volatile load tells the compiler that the memory at a specific address might
change at any time. It could be changed by hardware, another thread, or cosmic
rays. The compiler is forced to read the value from memory right before using
it. It cannot assume the value is 30.
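The difference between a plain and a volatile read, in isolation (variable names are mine):

```zig
const std = @import("std");

pub fn main() void {
    var x: u64 = 30;

    // A normal load: the optimizer is free to assume this is 30,
    // because nothing in the program visibly changes x.
    const plain = x;

    // A volatile load: the compiler must emit a real memory read
    // and cannot assume the value; it could have changed "behind
    // its back" (hardware, another thread, cosmic rays).
    const vp: *volatile u64 = &x;
    const forced = vp.*;

    std.debug.print("{} {}\n", .{ plain, forced });
}
```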
The Zig documentation actually warns against this:
If you see code that is using volatile for something other than Memory Mapped Input/Output, it is probably a bug.
This is good advice for application logic (volatile is not thread-safe). But for benchmarking, we want to trick the compiler. We are effectively lying to LLVM, pretending our stack variable is a hardware register that could change unpredictably.
Implementing this in Zig was interesting because of types. The literal 30 is a
comptime_int. Volatile pointers need a real runtime type like u64.
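The type issue in isolation, as a small sketch:

```zig
const std = @import("std");

pub fn main() void {
    // A bare literal is a comptime_int: it exists only at compile
    // time and has no address, so it cannot sit behind a volatile
    // pointer.
    const n = 30;

    // Coercing it to a concrete runtime type (u64) gives it real
    // storage that a volatile pointer can point at.
    var m: u64 = n;
    const vp: *volatile u64 = &m;

    std.debug.print("{}\n", .{vp.*});
}
```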
I had to write some code to inspect the function arguments and build a runtime tuple.
/// Constructs the runtime argument tuple based on function parameters and input args.
fn createRuntimeArgs(function: anytype, args: anytype) RuntimeArgsType(@TypeOf(function), @TypeOf(args)) {
const TupleType = RuntimeArgsType(@TypeOf(function), @TypeOf(args));
var runtime_args: TupleType = undefined;
// We only need the length here to iterate
const fn_params = getFnParams(@TypeOf(function));
inline for (0..fn_params.len) |i| {
runtime_args[i] = args[i];
}
return runtime_args;
}
/// Computes the precise Tuple type required to hold the arguments.
fn RuntimeArgsType(comptime FnType: type, comptime ArgsType: type) type {
const fn_params = getFnParams(FnType);
const args_fields = @typeInfo(ArgsType).@"struct".fields;
comptime var types: [fn_params.len]type = undefined;
inline for (fn_params, 0..) |p, i| {
if (p.type) |t| {
types[i] = t;
} else {
types[i] = args_fields[i].type;
}
}
return std.meta.Tuple(&types);
}
/// Helper to unwrap function pointers and retrieve parameter info
fn getFnParams(comptime FnType: type) []const std.builtin.Type.Fn.Param {
return @typeInfo(unwrapFnType(FnType)).@"fn".params;
}

Then I use it like this inside the run function:
var runtime_args = createRuntimeArgs(function, args);
const volatile_args_ptr: *volatile @TypeOf(runtime_args) = &runtime_args;
for (0..options.warmup_iters) |_| {
try execute(function, volatile_args_ptr.*);
}

By reading from volatile_args_ptr, LLVM treats the data as “unknown”. It has
no choice but to generate the code to call the function.
The Result
After applying the fix, the numbers finally made sense.
$ zig build fibonacci
Benchmark Summary: 2 benchmarks run
├─ fibNaive 1.76ms [baseline]
│ └─ cycles: 8.1M instructions: 27.8M ipc: 3.41 miss: 1
└─ fibIterative 3.83ns 459134.36x faster
└─ cycles: 16 instructions: 83 ipc: 5.18 miss: 0

fibNaive went from 0.43ns to 1.76ms. The recursive version is slow, and the
iterative version is fast.
Checking the assembly again confirmed it. The magic number was gone, replaced by
actual call instructions.
This was a good reminder that compilers are often smarter than we expect. When benchmarking, you have to fight for your right to run slow code.
Update: 2025-12-09
After publishing this, I shared it on the
Ziggit forum.
The community pointed out that while volatile works, it forces a hardware
memory load which technically changes what is being measured (adding memory
latency that wouldn’t exist in a hot loop).
It turns out std.mem.doNotOptimizeAway can work, but it requires a very
specific setup to actually blind the compiler.
I created a simplified reproduction to prove exactly when constant folding happens. You can verify this on Godbolt.
const std = @import("std");
export fn fib(n: u64) u64 {
if (n == 0) return 0;
var a: u64 = 0;
var b: u64 = 1;
for (2..n + 1) |_| {
const c = a + b;
a = b;
b = c;
}
return b;
}
export fn run_with_u64() void {
const n: u64 = 30;
std.mem.doNotOptimizeAway(n); // Passes by value
const result = fib(n);
std.mem.doNotOptimizeAway(result);
}
export fn run_with_mutable_pointer() void {
const n: u64 = 30;
var x = n; // Create mutable copy
std.mem.doNotOptimizeAway(&x); // Pass address to "clobber" memory
const result = fib(x);
std.mem.doNotOptimizeAway(result);
}

The assembly output for run_with_u64 still shows the hardcoded answer
(832040), meaning DNOA failed:
run_with_u64:
push rbp
mov rbp, rsp
mov eax, 30
mov eax, 832040 ; <=== Answer hardcoded.
pop rbp
ret

But run_with_mutable_pointer generates real logic:
run_with_mutable_pointer:
push rbp
mov rbp, rsp
mov qword ptr [rbp - 8], 30
; ... loop logic ...
add rcx, -2
mov eax, esi
and eax, 7
cmp rcx, 7
; ...

The key is passing a pointer to a mutable (var) variable. This hits
the inline assembly “memory” clobber constraint, forcing LLVM to assume the
value at that address has changed, without the hardware cost of volatile.
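Conceptually, the pointer variant boils down to an empty inline-assembly statement that takes the address as an input and declares a "memory" clobber. A rough sketch of the idea (not the exact stdlib implementation; syntax as of Zig ~0.13):

```zig
/// Pretend to read `ptr` and to modify arbitrary memory, without
/// emitting any actual instructions. After this, LLVM must assume
/// the pointed-to value may have changed, so later reads of it
/// cannot be constant-folded.
fn blind(ptr: anytype) void {
    asm volatile (""
        :
        : [p] "r" (ptr),
        : "memory"
    );
}
```

Because the asm body is empty, there is no runtime cost beyond the lost optimization, which is exactly what a benchmark harness wants.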
Based on this, I updated bench to remove the volatile cast and use the
mutable pointer trick instead.
diff --git a/examples/fibonacci.zig b/examples/fibonacci.zig
index 489eb81..9434580 100644
--- a/examples/fibonacci.zig
+++ b/examples/fibonacci.zig
@@ -26,8 +26,8 @@ pub fn main() !void {
.sample_size = 100,
.warmup_iters = 3,
};
- const m_naive = try bench.run(allocator, "fibNaive", fibNaive, .{@as(u64, 30)}, opts);
- const m_iter = try bench.run(allocator, "fibIterative", fibIterative, .{@as(u64, 30)}, opts);
+ const m_naive = try bench.run(allocator, "fibNaive", fibNaive, .{30}, opts);
+ const m_iter = try bench.run(allocator, "fibIterative", fibIterative, .{30}, opts);
try bench.report(.{
.metrics = &.{ m_naive, m_iter },
diff --git a/src/root.zig b/src/root.zig
index 05f2620..72ae0b8 100644
--- a/src/root.zig
+++ b/src/root.zig
@@ -47,10 +47,10 @@ pub fn run(allocator: Allocator, name: []const u8, function: anytype, args: anyt
// ref: https://pyk.sh/blog/2025-12-08-bench-fixing-constant-folding
var runtime_args = createRuntimeArgs(function, args);
- const volatile_args_ptr: *volatile @TypeOf(runtime_args) = &runtime_args;
+ std.mem.doNotOptimizeAway(&runtime_args);
for (0..options.warmup_iters) |_| {
- try execute(function, volatile_args_ptr.*);
+ try execute(function, runtime_args);
}
// We need to determine a batch_size such that the total execution time of the batch
@@ -63,7 +63,7 @@ pub fn run(allocator: Allocator, name: []const u8, function: anytype, args: anyt
while (true) {
timer.reset();
for (0..batch_size) |_| {
- try execute(function, volatile_args_ptr.*);
+ try execute(function, runtime_args);
}
const duration = timer.read();
@@ -89,7 +89,7 @@ pub fn run(allocator: Allocator, name: []const u8, function: anytype, args: anyt
for (0..options.sample_size) |i| {
timer.reset();
for (0..batch_size) |_| {
- try execute(function, volatile_args_ptr.*);
+ try execute(function, runtime_args);
}
const total_ns = timer.read();
// Average time per operation for this batch
@@ -142,7 +142,7 @@ pub fn run(allocator: Allocator, name: []const u8, function: anytype, args: anyt
try perf.capture();
for (0..options.sample_size) |_| {
for (0..batch_size) |_| {
- try execute(function, volatile_args_ptr.*);
+ try execute(function, runtime_args);
}
}
try perf.stop();
@@ -168,6 +168,7 @@ pub fn run(allocator: Allocator, name: []const u8, function: anytype, args: anyt
inline fn execute(function: anytype, args: anytype) !void {
const FnType = unwrapFnType(@TypeOf(function));
const return_type = @typeInfo(FnType).@"fn".return_type.?;
+
// Conditional execution based on whether the function can fail
if (@typeInfo(return_type) == .error_union) {
const result = try @call(.auto, function, args);

Tags: zig, bench