Fixing Zig Microbenchmark Accuracy
This is the first dev log for bench, a tiny benchmarking library I’m building for Zig. My goal is to create a zero-dependency tool that can measure performance accurately, from heavy I/O operations down to single CPU instructions.
However, I recently ran into a major issue when trying to benchmark extremely fast operations like integer addition or bitwise shifts. Here is a breakdown of the problem and how I implemented adaptive batching to fix it.
The Timer Resolution Problem
The initial implementation of bench was simple. It ran the target function in
a loop, measuring the elapsed time for each iteration individually.
```zig
for (0..options.sample_size) |i| {
    timer.reset();
    try @call(.auto, function, args);
    samples[i] = timer.read();
}
```

This works fine for functions that take milliseconds to run. But when I tried to
benchmark a simple add operation (which should take < 1ns), the results were
wildly inaccurate. The reporter claimed it took 40ns.
The root cause is timer noise.
System timers (like clock_gettime on Linux) have finite resolution and non-zero latency.
Reading the clock itself takes time, often around 20-40ns. If the function you
are measuring takes 0.5ns, you aren’t measuring the function, you are measuring
the overhead of the stopwatch.
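To get a rough number for that stopwatch cost on a given machine, it is enough to time a loop of back-to-back clock reads. Here is a minimal standalone sketch (not part of bench; the iteration count is arbitrary):

```zig
const std = @import("std");

pub fn main() !void {
    var timer = try std.time.Timer.start();
    const reads = 1_000_000;

    // Time a tight loop of back-to-back clock reads; each iteration
    // pays the cost of one timer.read() call.
    var last: u64 = 0;
    timer.reset();
    for (0..reads) |_| {
        last = timer.read();
    }
    const total = timer.read();
    std.mem.doNotOptimizeAway(last);

    std.debug.print("avg clock read: ~{d} ns\n", .{total / reads});
}
```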
I wrote a quick proof-of-concept to confirm this. I compared the library’s measurement against a manual loop of 10,000 iterations:
1. Naive Measurement (current implementation): Reported median 42.00 ns
2. Batched Measurement (simulated fix): Real cost/op 0.35 ns

The discrepancy was massive. The library was reporting results 120x slower than reality.
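The batched side of that comparison is nothing more than a manual loop: run the operation many times under a single timer read, then divide. A rough standalone sketch of the idea (the simple add and the 10,000-iteration count mirror the PoC, but the exact code here is illustrative):

```zig
const std = @import("std");

pub fn main() !void {
    const iterations = 10_000;
    var timer = try std.time.Timer.start();

    var acc: usize = 0;
    timer.reset();
    for (0..iterations) |i| {
        // The operation under test: a simple wrapping add.
        acc +%= i;
    }
    const total = timer.read();
    // Note: in release modes the optimizer may still fold a loop this
    // simple; the real library guards against that separately.
    std.mem.doNotOptimizeAway(acc);

    const per_op = @as(f64, @floatFromInt(total)) / @as(f64, iterations);
    std.debug.print("cost per op: {d:.2} ns\n", .{per_op});
}
```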
Implementing Adaptive Batching
To match the accuracy of tools like nanobench, I implemented an adaptive upscaling strategy. Instead of measuring a single call, the library now automatically calculates how many times it needs to run the function to reach a measurable time threshold (target: 1ms).
I added a calibration step before the main sampling loop:
```zig
const min_sample_time_ns = 1_000_000; // 1ms
var batch_size: u64 = 1;
var timer = try Timer.start();

while (true) {
    timer.reset();
    for (0..batch_size) |_| {
        // Keep the optimizer from eliding the call or constant-folding its arguments.
        std.mem.doNotOptimizeAway(function);
        std.mem.doNotOptimizeAway(args);
        try @call(.auto, function, args);
    }
    const duration = timer.read();
    if (duration >= min_sample_time_ns) break;

    // Scale up the batch until the total runtime is measurable.
    if (duration == 0) {
        batch_size *= 10;
    } else {
        // Calculate the exact multiplier needed to reach the threshold.
        const ratio = @as(f64, @floatFromInt(min_sample_time_ns)) / @as(f64, @floatFromInt(duration));
        batch_size *= @as(u64, @intFromFloat(std.math.ceil(ratio)));
    }
}
```

Now, if a function is too fast, bench will automatically scale up to run it
10,000 or 100,000 times in a tight loop, measure the total duration, and then
divide by the batch size to get the per-operation cost.
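With the batch size calibrated, each sample in the main loop is just a whole-batch timing divided by the batch size. Something along these lines (a hypothetical helper, not bench's exact internals):

```zig
const std = @import("std");

/// Hypothetical helper: time `batch_size` calls to `function` and
/// return the per-operation cost in nanoseconds.
fn sampleBatch(
    function: anytype,
    args: anytype,
    batch_size: u64,
    timer: *std.time.Timer,
) !f64 {
    timer.reset();
    for (0..batch_size) |_| {
        try @call(.auto, function, args);
    }
    const total_ns = timer.read();

    // Per-operation cost: total batch time divided by the batch size.
    return @as(f64, @floatFromInt(total_ns)) / @as(f64, @floatFromInt(batch_size));
}
```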
Moving to Sub-Nanosecond Precision
With adaptive batching, we are now dealing with measurements like 0.35ns. The
previous Metrics struct used u64 to store nanoseconds. Storing 0 for an
operation that actually takes time isn’t useful, so I had to refactor the entire
metrics engine to use f64.
```zig
pub const Metrics = struct {
    name: []const u8,

    // Time (f64 to support sub-nanosecond precision)
    min_ns: f64,
    max_ns: f64,
    mean_ns: f64,
    median_ns: f64,
    // ...
};
```

This required updating the calculations for mean, variance, and standard
deviation, but the result is worth it. We can now accurately detect the
difference between 0.5ns (simple add) and 1.0ns (dependent add), which is
critical for low-level optimizations.
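The updated statistics code is straightforward once everything stays in f64. A minimal sketch of the kind of computation involved (names are illustrative, not necessarily bench's exact fields or functions):

```zig
const std = @import("std");

/// Compute mean and standard deviation over per-operation samples (in ns).
fn meanAndStdDev(samples: []const f64) struct { mean: f64, std_dev: f64 } {
    std.debug.assert(samples.len > 0);
    const n = @as(f64, @floatFromInt(samples.len));

    var sum: f64 = 0;
    for (samples) |s| sum += s;
    const mean = sum / n;

    var sq_diff: f64 = 0;
    for (samples) |s| {
        const d = s - mean;
        sq_diff += d * d;
    }

    // Population variance; sub-nanosecond values survive because the
    // whole pipeline stays in f64.
    return .{ .mean = mean, .std_dev = @sqrt(sq_diff / n) };
}
```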
The core measurement engine is now robust enough for micro-benchmarks.
Next, I plan to improve the reporting. Right now, it just dumps a table to
stdout. I want to add a generic Reporter interface so users can output JSON
for CI pipelines or CSV for plotting.
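One possible shape for that interface is the type-erased vtable pattern Zig's standard library uses (e.g. std.mem.Allocator). Nothing here is implemented yet, so treat it as a sketch:

```zig
pub const Reporter = struct {
    ptr: *anyopaque,
    reportFn: *const fn (ptr: *anyopaque, metrics: Metrics) anyerror!void,

    pub fn report(self: Reporter, metrics: Metrics) anyerror!void {
        return self.reportFn(self.ptr, metrics);
    }
};
```

A JSON or CSV reporter would then wrap its own writer and supply a matching reportFn.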
I’m also looking into “relative” assertions. Instead of asserting that a
function takes less than 100ns (which is flaky across different machines), I
want to assert that fast_algo is at least 2x faster than slow_algo within
the same run.
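In API terms that would be a comparison over the Metrics of two benchmarks from the same run; a hypothetical sketch, with names and semantics far from final:

```zig
/// Hypothetical: fail unless `fast` is at least `factor` times faster
/// than `slow`, judged by median per-operation time from the same run.
pub fn expectFasterBy(fast: Metrics, slow: Metrics, factor: f64) error{TooSlow}!void {
    if (fast.median_ns * factor > slow.median_ns) return error.TooSlow;
}
```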
| Tags | zig, bench |
|---|---|