Fixing Zig Microbenchmark Accuracy
This is the first dev log for bench, a tiny benchmarking library I’m building for Zig. My goal is to create a zero-dependency tool that can measure performance accurately, from heavy I/O operations down to single CPU instructions.
However, I recently ran into a major issue when trying to benchmark extremely fast operations like integer addition or bitwise shifts. Here is a breakdown of the problem and how I implemented adaptive batching to fix it.
The Timer Resolution Problem
The initial implementation of bench was simple. It ran the target function in
a loop, measuring the elapsed time for each iteration individually.
```zig
for (0..options.sample_size) |i| {
    timer.reset();
    try @call(.auto, function, args);
    samples[i] = timer.read();
}
```

This works fine for functions that take milliseconds to run. But when I tried to
benchmark a simple add operation (which should take < 1ns), the results were
wildly inaccurate. The reporter claimed it took 40ns.
The root cause is timer noise.
System timers (like clock_gettime on Linux) have finite resolution and non-zero latency.
Reading the clock itself takes time, often around 20-40ns. If the function you
are measuring takes 0.5ns, you aren’t measuring the function, you are measuring
the overhead of the stopwatch.
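To get a rough number for that stopwatch cost on a given machine, it is enough to time a loop of back-to-back clock reads. Here is a minimal standalone sketch (not part of bench; the iteration count is arbitrary):

```zig
const std = @import("std");

pub fn main() !void {
    var timer = try std.time.Timer.start();
    const reads = 1_000_000;

    // Time a tight loop of back-to-back clock reads; each iteration
    // pays the cost of one timer.read() call.
    var last: u64 = 0;
    timer.reset();
    for (0..reads) |_| {
        last = timer.read();
    }
    const total = timer.read();
    std.mem.doNotOptimizeAway(last);

    std.debug.print("avg clock read: ~{d} ns\n", .{total / reads});
}
```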
I wrote a quick proof-of-concept to confirm this. I compared the library’s measurement against a manual loop of 10,000 iterations:
1. Naive Measurement (current implementation): Reported median 42.00 ns
2. Batched Measurement (simulated fix): Real cost/op 0.35 ns

The discrepancy was massive. The library was reporting results 120x slower than reality.
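The batched side of that comparison is nothing more than a manual loop: run the operation many times under a single timer read, then divide. A rough standalone sketch of the idea (the simple add and the 10,000-iteration count mirror the PoC, but the exact code here is illustrative):

```zig
const std = @import("std");

pub fn main() !void {
    const iterations = 10_000;
    var timer = try std.time.Timer.start();

    var acc: usize = 0;
    timer.reset();
    for (0..iterations) |i| {
        // The operation under test: a simple wrapping add.
        acc +%= i;
    }
    const total = timer.read();
    // Note: in release modes the optimizer may still fold a loop this
    // simple; the real library guards against that separately.
    std.mem.doNotOptimizeAway(acc);

    const per_op = @as(f64, @floatFromInt(total)) / @as(f64, iterations);
    std.debug.print("cost per op: {d:.2} ns\n", .{per_op});
}
```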
Implementing Adaptive Batching
To match the accuracy of tools like nanobench, I implemented an adaptive upscaling strategy. Instead of measuring a single call, the library now automatically calculates how many times it needs to run the function to reach a measurable time threshold (target: 1ms).
I added a calibration step before the main sampling loop:
```zig
const min_sample_time_ns = 1_000_000; // 1ms
var batch_size: u64 = 1;
var timer = try Timer.start();

while (true) {
    timer.reset();
    for (0..batch_size) |_| {
        // Keep the optimizer from eliding the call or constant-folding its arguments.
        std.mem.doNotOptimizeAway(function);
        std.mem.doNotOptimizeAway(args);
        try @call(.auto, function, args);
    }
    const duration = timer.read();
    if (duration >= min_sample_time_ns) break;

    // Scale up the batch until the total runtime is measurable.
    if (duration == 0) {
        batch_size *= 10;
    } else {
        // Calculate the exact multiplier needed to reach the threshold.
        const ratio = @as(f64, @floatFromInt(min_sample_time_ns)) / @as(f64, @floatFromInt(duration));
        batch_size *= @as(u64, @intFromFloat(std.math.ceil(ratio)));
    }
}
```

Now, if a function is too fast, bench will automatically scale up to run it
10,000 or 100,000 times in a tight loop, measure the total duration, and then
divide by the batch size to get the per-operation cost.
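With the batch size calibrated, each sample in the main loop is just a whole-batch timing divided by the batch size. Something along these lines (a hypothetical helper, not bench's exact internals):

```zig
const std = @import("std");

/// Hypothetical helper: time `batch_size` calls to `function` and
/// return the per-operation cost in nanoseconds.
fn sampleBatch(
    function: anytype,
    args: anytype,
    batch_size: u64,
    timer: *std.time.Timer,
) !f64 {
    timer.reset();
    for (0..batch_size) |_| {
        try @call(.auto, function, args);
    }
    const total_ns = timer.read();

    // Per-operation cost: total batch time divided by the batch size.
    return @as(f64, @floatFromInt(total_ns)) / @as(f64, @floatFromInt(batch_size));
}
```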
Moving to Sub-Nanosecond Precision
With adaptive batching, we are now dealing with measurements like 0.35ns. The
previous Metrics struct used u64 to store nanoseconds. Storing 0 for an
operation that actually takes time isn’t useful, so I had to refactor the entire
metrics engine to use f64.
```zig
pub const Metrics = struct {
    name: []const u8,

    // Time (f64 to support sub-nanosecond precision)
    min_ns: f64,
    max_ns: f64,
    mean_ns: f64,
    median_ns: f64,
    // ...
};
```

This required updating the calculations for mean, variance, and standard
deviation, but the result is worth it. We can now accurately detect the
difference between 0.5ns (simple add) and 1.0ns (dependent add), which is
critical for low-level optimizations.
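The updated statistics code is straightforward once everything stays in f64. A minimal sketch of the kind of computation involved (names are illustrative, not necessarily bench's exact fields or functions):

```zig
const std = @import("std");

/// Compute mean and standard deviation over per-operation samples (in ns).
fn meanAndStdDev(samples: []const f64) struct { mean: f64, std_dev: f64 } {
    std.debug.assert(samples.len > 0);
    const n = @as(f64, @floatFromInt(samples.len));

    var sum: f64 = 0;
    for (samples) |s| sum += s;
    const mean = sum / n;

    var sq_diff: f64 = 0;
    for (samples) |s| {
        const d = s - mean;
        sq_diff += d * d;
    }

    // Population variance; sub-nanosecond values survive because the
    // whole pipeline stays in f64.
    return .{ .mean = mean, .std_dev = @sqrt(sq_diff / n) };
}
```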
The core measurement engine is now robust enough for micro-benchmarks.
Next, I plan to improve the reporting. Right now, it just dumps a table to
stdout. I want to add a generic Reporter interface so users can output JSON
for CI pipelines or CSV for plotting.
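One possible shape for that interface is the type-erased vtable pattern Zig's standard library uses (e.g. std.mem.Allocator). Nothing here is implemented yet, so treat it as a sketch:

```zig
pub const Reporter = struct {
    ptr: *anyopaque,
    reportFn: *const fn (ptr: *anyopaque, metrics: Metrics) anyerror!void,

    pub fn report(self: Reporter, metrics: Metrics) anyerror!void {
        return self.reportFn(self.ptr, metrics);
    }
};
```

A JSON or CSV reporter would then wrap its own writer and supply a matching reportFn.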
I’m also looking into “relative” assertions. Instead of asserting that a
function takes less than 100ns (which is flaky across different machines), I
want to assert that fast_algo is at least 2x faster than slow_algo within
the same run.
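In API terms that would be a comparison over the Metrics of two benchmarks from the same run; a hypothetical sketch, with names and semantics far from final:

```zig
/// Hypothetical: fail unless `fast` is at least `factor` times faster
/// than `slow`, judged by median per-operation time from the same run.
pub fn expectFasterBy(fast: Metrics, slow: Metrics, factor: f64) error{TooSlow}!void {
    if (fast.median_ns * factor > slow.median_ns) return error.TooSlow;
}
```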
| Tags | zig, bench |
|---|---|