Fixing a Zig benchmark where std.mem.doNotOptimizeAway was ignored

This is my second devlog for bench, a fast and accurate benchmarking library for Zig. Yet another hobby project of mine.

Previously, I shared how I fixed the inaccuracy of the metrics. This time, I tested it against a classic algorithm: Fibonacci. That is where I found this new bug.

I wrote a simple recursive implementation and an iterative one. Then I set up the benchmark.

Zig
const std = @import("std");
const bench = @import("bench");

fn fibNaive(n: u64) u64 {
    if (n <= 1) return n;
    return fibNaive(n - 1) + fibNaive(n - 2);
}

fn fibIterative(n: u64) u64 {
    if (n == 0) return 0;

    var a: u64 = 0;
    var b: u64 = 1;
    for (2..n + 1) |_| {
        const c = a + b;
        a = b;
        b = c;
    }

    return b;
}

pub fn main() !void {
    const allocator = std.heap.smp_allocator;
    const opts = bench.Options{
        .sample_size = 100,
        .warmup_iters = 3,
    };
    const m_naive = try bench.run(allocator, "fibNaive", fibNaive, .{@as(u64, 30)}, opts);
    const m_iter = try bench.run(allocator, "fibIterative", fibIterative, .{@as(u64, 30)}, opts);

    try bench.report(.{
        .metrics = &.{ m_naive, m_iter },
        .baseline_index = 0, // naive as baseline
    });
}

I compiled it in ReleaseFast mode and ran it. The result confused me.

Shell
$ zig build fibonacci
Benchmark Summary: 2 benchmarks run
├─ fibNaive         0.43ns       [baseline]
│  └─ cycles: 2 instructions: 4 ipc: 2.00       miss: 0
└─ fibIterative     0.43ns       1.00x faster
   └─ cycles: 2 instructions: 4 ipc: 2.00       miss: 0

0.43 nanoseconds for both implementations.

For context, a modern CPU clock cycle is roughly 0.25ns. My benchmark claimed that calculating the 30th Fibonacci number took less than 2 CPU cycles, even though fibNaive should have performed over a million recursive calls.

Unless I accidentally invented a new type of physics, the compiler was lying to me.

Finding the Root Cause

The function fibNaive is a “pure function”: its output depends only on its input. I was passing 30 as a constant literal.

The Zig compiler uses the LLVM backend, and LLVM is very smart. It saw that the input was always 30 and decided to run the calculation at compile time. This is called constant folding and propagation.

I needed to prove this. I decided to look at the assembly code.

Getting the assembly output was a bit tricky. I could not just pass flags to zig build. I had to use objdump on the final binary.

Shell
objdump -d -C --no-show-raw-insn zig-out/bin/fibonacci-bench | less

I searched for the main loop and found the smoking gun.

Plain Text
mov    $0xcb228,%eax

0xcb228 in hex is 832040 in decimal. That is the 30th Fibonacci number. The compiler deleted my function entirely and just replaced it with the answer. I was benchmarking how fast the CPU could move a number into a register.
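
If you want to double-check the conversion, printf in the shell prints a hex literal in decimal (plain POSIX printf, nothing bench-specific):

```shell
$ printf '%d\n' 0xcb228
832040
```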

The Failed Attempts

I thought this would be an easy fix. The standard advice in benchmarking is to use a “black box” function that tricks the compiler. In Zig, this is std.mem.doNotOptimizeAway.

Zig
var args = .{ 30 };
std.mem.doNotOptimizeAway(&args);
try execute(function, args);

I ran it again. The result was still 0.43ns.

The problem is that doNotOptimizeAway prevents the compiler from deleting the variable, but it does not hide the value. The compiler saw that args was initialized to 30. It saw that doNotOptimizeAway read the memory but did not change it. So when I called execute, the compiler still knew the value was 30.

The Fix: volatile

To defeat the compiler optimizations, I had to be aggressive. I needed to force a volatile load.

A volatile load tells the compiler that the memory at a specific address might change at any time. It could be changed by hardware, another thread, or cosmic rays. The compiler is forced to read the value from memory right before using it. It cannot assume the value is 30.
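
In isolation, the trick looks like this; a minimal sketch (the names here are illustrative, not from bench):

```zig
var n: u64 = 30; // must be a runtime type; a comptime_int has no address
const p: *volatile u64 = &n; // coercing *u64 to *volatile u64 is allowed
const v = p.*; // forces a real load; the compiler cannot assume v == 30
```

Every read through p now compiles to an actual memory access.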

The Zig documentation actually warns against this:

If you see code that is using volatile for something other than Memory Mapped Input/Output, it is probably a bug.

This is good advice for application logic (volatile is not thread-safe). But for benchmarking, we want to trick the compiler. We are effectively lying to LLVM, pretending our stack variable is a hardware register that could change unpredictably.

Implementing this in Zig was interesting because of the type system. The literal 30 is a comptime_int, but a volatile pointer needs a real runtime type like u64.

I had to write some code to inspect the function arguments and build a runtime tuple.

Zig
/// Constructs the runtime argument tuple based on function parameters and input args.
fn createRuntimeArgs(function: anytype, args: anytype) RuntimeArgsType(@TypeOf(function), @TypeOf(args)) {
    const TupleType = RuntimeArgsType(@TypeOf(function), @TypeOf(args));
    var runtime_args: TupleType = undefined;

    // We only need the length here to iterate
    const fn_params = getFnParams(@TypeOf(function));

    inline for (0..fn_params.len) |i| {
        runtime_args[i] = args[i];
    }
    return runtime_args;
}

/// Computes the precise Tuple type required to hold the arguments.
fn RuntimeArgsType(comptime FnType: type, comptime ArgsType: type) type {
    const fn_params = getFnParams(FnType);
    const args_fields = @typeInfo(ArgsType).@"struct".fields;
    comptime var types: [fn_params.len]type = undefined;
    inline for (fn_params, 0..) |p, i| {
        if (p.type) |t| {
            types[i] = t;
        } else {
            types[i] = args_fields[i].type;
        }
    }
    return std.meta.Tuple(&types);
}

/// Helper to unwrap function pointers and retrieve parameter info
fn getFnParams(comptime FnType: type) []const std.builtin.Type.Fn.Param {
    return @typeInfo(unwrapFnType(FnType)).@"fn".params;
}
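
getFnParams calls unwrapFnType, which is not shown above. Based on how it is used, it presumably accepts either a function type or a pointer to one; a sketch consistent with the @typeInfo tags used elsewhere in this post:

```zig
/// Unwraps *const fn (...) ... to fn (...) ..., or returns the type unchanged.
fn unwrapFnType(comptime FnType: type) type {
    return switch (@typeInfo(FnType)) {
        .@"fn" => FnType,
        .pointer => |ptr| ptr.child,
        else => @compileError("expected a function or function pointer"),
    };
}
```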

Then I use it like this inside the run function:

Zig
var runtime_args = createRuntimeArgs(function, args);
const volatile_args_ptr: *volatile @TypeOf(runtime_args) = &runtime_args;

for (0..options.warmup_iters) |_| {
    try execute(function, volatile_args_ptr.*);
}

By reading from volatile_args_ptr, LLVM treats the data as “unknown”. It has no choice but to generate the code to call the function.

The Result

After applying the fix, the numbers finally made sense.

Shell
$ zig build fibonacci
Benchmark Summary: 2 benchmarks run
├─ fibNaive         1.76ms     [baseline]
│  └─ cycles: 8.1M      instructions: 27.8M     ipc: 3.41       miss: 1
└─ fibIterative     3.83ns     459134.36x faster
   └─ cycles: 16        instructions: 83        ipc: 5.18       miss: 0

fibNaive went from 0.43ns to 1.76ms. The recursive version is slow, and the iterative version is fast.

Checking the assembly again confirmed it. The magic number was gone, replaced by actual call instructions.

This was a good reminder that compilers are often smarter than we expect. When benchmarking, you have to fight for your right to run slow code.

Update: 2025-12-09

After publishing this, I shared it on the Ziggit forum. The community pointed out that while volatile works, it forces a hardware memory load which technically changes what is being measured (adding memory latency that wouldn’t exist in a hot loop).

It turns out std.mem.doNotOptimizeAway can work, but it requires a very specific setup to actually blind the compiler.

I created a simplified reproduction to prove exactly when constant folding happens. You can verify this on Godbolt.

Zig
const std = @import("std");

export fn fib(n: u64) u64 {
    if (n == 0) return 0;
    var a: u64 = 0;
    var b: u64 = 1;
    for (2..n + 1) |_| {
        const c = a + b;
        a = b;
        b = c;
    }
    return b;
}

export fn run_with_u64() void {
    const n: u64 = 30;
    std.mem.doNotOptimizeAway(n); // Passes by value
    const result = fib(n);
    std.mem.doNotOptimizeAway(result);
}

export fn run_with_mutable_pointer() void {
    const n: u64 = 30;
    var x = n; // Create mutable copy
    std.mem.doNotOptimizeAway(&x); // Pass address to "clobber" memory
    const result = fib(x);
    std.mem.doNotOptimizeAway(result);
}

The assembly output for run_with_u64 still shows the hardcoded answer (832040), meaning doNotOptimizeAway had no effect:

Plain Text
run_with_u64:
        push    rbp
        mov     rbp, rsp
        mov     eax, 30
        mov     eax, 832040       ; <=== Answer hardcoded.
        pop     rbp
        ret

But run_with_mutable_pointer generates real logic:

Plain Text
run_with_mutable_pointer:
        push    rbp
        mov     rbp, rsp
        mov     qword ptr [rbp - 8], 30
        ; ... loop logic ...
        add     rcx, -2
        mov     eax, esi
        and     eax, 7
        cmp     rcx, 7
        ; ...

The key is passing a pointer to a mutable (var) variable. This triggers the inline assembly “memory” clobber, which forces LLVM to assume the value at that address may have changed, without the hardware cost of volatile.

Based on this, I updated bench to remove the volatile cast and use the mutable pointer trick instead.

Diff
diff --git a/examples/fibonacci.zig b/examples/fibonacci.zig
index 489eb81..9434580 100644
--- a/examples/fibonacci.zig
+++ b/examples/fibonacci.zig
@@ -26,8 +26,8 @@ pub fn main() !void {
         .sample_size = 100,
         .warmup_iters = 3,
     };
-    const m_naive = try bench.run(allocator, "fibNaive", fibNaive, .{@as(u64, 30)}, opts);
-    const m_iter = try bench.run(allocator, "fibIterative", fibIterative, .{@as(u64, 30)}, opts);
+    const m_naive = try bench.run(allocator, "fibNaive", fibNaive, .{30}, opts);
+    const m_iter = try bench.run(allocator, "fibIterative", fibIterative, .{30}, opts);

     try bench.report(.{
         .metrics = &.{ m_naive, m_iter },
diff --git a/src/root.zig b/src/root.zig
index 05f2620..72ae0b8 100644
--- a/src/root.zig
+++ b/src/root.zig
@@ -47,10 +47,10 @@ pub fn run(allocator: Allocator, name: []const u8, function: anytype, args: anyt

     // ref: https://pyk.sh/blog/2025-12-08-bench-fixing-constant-folding
     var runtime_args = createRuntimeArgs(function, args);
-    const volatile_args_ptr: *volatile @TypeOf(runtime_args) = &runtime_args;
+    std.mem.doNotOptimizeAway(&runtime_args);

     for (0..options.warmup_iters) |_| {
-        try execute(function, volatile_args_ptr.*);
+        try execute(function, runtime_args);
     }

     // We need to determine a batch_size such that the total execution time of the batch
@@ -63,7 +63,7 @@ pub fn run(allocator: Allocator, name: []const u8, function: anytype, args: anyt
     while (true) {
         timer.reset();
         for (0..batch_size) |_| {
-            try execute(function, volatile_args_ptr.*);
+            try execute(function, runtime_args);
         }
         const duration = timer.read();

@@ -89,7 +89,7 @@ pub fn run(allocator: Allocator, name: []const u8, function: anytype, args: anyt
     for (0..options.sample_size) |i| {
         timer.reset();
         for (0..batch_size) |_| {
-            try execute(function, volatile_args_ptr.*);
+            try execute(function, runtime_args);
         }
         const total_ns = timer.read();
         // Average time per operation for this batch
@@ -142,7 +142,7 @@ pub fn run(allocator: Allocator, name: []const u8, function: anytype, args: anyt
             try perf.capture();
             for (0..options.sample_size) |_| {
                 for (0..batch_size) |_| {
-                    try execute(function, volatile_args_ptr.*);
+                    try execute(function, runtime_args);
                 }
             }
             try perf.stop();
@@ -168,6 +168,7 @@ pub fn run(allocator: Allocator, name: []const u8, function: anytype, args: anyt
 inline fn execute(function: anytype, args: anytype) !void {
     const FnType = unwrapFnType(@TypeOf(function));
     const return_type = @typeInfo(FnType).@"fn".return_type.?;
+
     // Conditional execution based on whether the function can fail
     if (@typeInfo(return_type) == .error_union) {
         const result = try @call(.auto, function, args);