Trying Out SIMD & SWAR in Zig
I usually spend my time breaking smart contracts or building tools in Rust. But lately I have been experimenting with Zig. Zig is a lower-level language that gives you a lot of control over memory and execution.
I decided to write a JSON parser in Zig to learn the language better. The first thing a JSON parser needs to do is skip whitespace. This sounds simple but JSON files can be full of spaces, tabs and newlines. If you cannot skip them fast your parser will be slow.
Here are my notes on trying to write the fastest whitespace skipper I could in Zig.
The Problem
JSON comes in many shapes. Sometimes it is “minified” which means it has no extra spaces at all.
{"key":"value"}
Sometimes it is “compact” where there is just a single space after a colon.
{"key": "value"}
And usually it is “pretty printed” with indentation. This can be 2 spaces, 4 spaces or even tabs.
{
    "key": "value"
}
I need a function that handles all these cases efficiently.
My Setup
I used my own benchmarking tool called pyk/bench. It is a library I wrote for Zig that helps me to do microbenchmarking. I recently updated it to output Markdown tables which makes it perfect for this note.
I started with the most basic implementation:
pub fn skipWhitespaceIf(buffer: []const u8, start: usize) usize {
    var i = start;
    while (i < buffer.len) {
        const char = buffer[i];
        if (char == ' ' or char == '\t' or char == '\n' or char == '\r') {
            i += 1;
        } else {
            break;
        }
    }
    return i;
}
I also wrote a version using a switch statement to see if the compiler optimized it differently.
pub fn skipWhitespaceSwitch(buffer: []const u8, start: usize) usize {
    var i = start;
    while (i < buffer.len) {
        const char = buffer[i];
        switch (char) {
            ' ', '\t', '\n', '\r' => i += 1,
            else => break,
        }
    }
    return i;
}
Here is the benchmark:
const std = @import("std");
const builtin = @import("builtin");
const bench = @import("bench");

pub fn skipWhitespaceIf(buffer: []const u8, start: usize) usize {
    var i = start;
    while (i < buffer.len) {
        const char = buffer[i];
        if (char == ' ' or char == '\t' or char == '\n' or char == '\r') {
            i += 1;
        } else {
            break;
        }
    }
    return i;
}

pub fn skipWhitespaceSwitch(buffer: []const u8, start: usize) usize {
    var i = start;
    while (i < buffer.len) {
        const char = buffer[i];
        switch (char) {
            ' ', '\t', '\n', '\r' => i += 1,
            else => break,
        }
    }
    return i;
}

const SkipWhitespaceFn = *const fn (buffer: []const u8, start: usize) usize;

const Implementation = struct {
    name: []const u8,
    func: SkipWhitespaceFn,
};

const implementations = [_]Implementation{
    .{ .name = "skipWhitespaceIf", .func = skipWhitespaceIf },
    .{ .name = "skipWhitespaceSwitch", .func = skipWhitespaceSwitch },
};

const TestCase = struct {
    name: []const u8,
    ws_len: usize,
};

const test_configs = [_]TestCase{
    .{ .name = "No Whitespace", .ws_len = 0 },
    .{ .name = "Single Whitespace", .ws_len = 1 },
    .{ .name = "Single Indent", .ws_len = 4 },
    .{ .name = "Two Indent", .ws_len = 8 },
    .{ .name = "Three Indent", .ws_len = 12 },
};

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    inline for (test_configs) |config| {
        std.debug.print("\n# {s}\n\n", .{config.name});
        // Allocate large buffer (~1KB) to force loop iterations; only prefix matters
        const buf_len = 1024;
        const buffer = try allocator.alloc(u8, buf_len);
        defer allocator.free(buffer);
        // Fill the buffer: ws_len spaces followed by non-whitespace
        @memset(buffer[0..config.ws_len], ' ');
        @memset(buffer[config.ws_len..], 'a');
        var metrics: [implementations.len]bench.Metrics = undefined;
        inline for (implementations, 0..) |impl, i| {
            metrics[i] = try bench.run(allocator, impl.name, impl.func, .{ buffer, 0 }, .{});
        }
        try bench.report(.{ .metrics = &metrics, .baseline_index = 0 });
    }
}
The report:
# No Whitespace
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :--------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 1.52 ns | 1.00x | 1066688 | 658.8M/s | 7.0 | 19.0 | 2.71 | 0.0 |
| `skipWhitespaceSwitch` | 1.51 ns | 1.00x | 666680 | 661.0M/s | 7.0 | 19.0 | 2.71 | 0.0 |
# Single Whitespace
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :--------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 1.73 ns | 1.00x | 1066688 | 577.8M/s | 8.0 | 27.0 | 3.37 | 0.0 |
| `skipWhitespaceSwitch` | 1.73 ns | 1.00x | 600000 | 578.3M/s | 8.0 | 27.0 | 3.37 | 0.0 |
# Single Indent
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :--------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 2.54 ns | 1.00x | 733348 | 393.4M/s | 11.8 | 51.0 | 4.32 | 0.0 |
| `skipWhitespaceSwitch` | 2.54 ns | 1.00x | 433342 | 393.3M/s | 11.9 | 51.0 | 4.29 | 0.0 |
# Two Indent
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :--------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 3.81 ns | 1.00x | 480000 | 262.4M/s | 17.8 | 83.0 | 4.68 | 0.0 |
| `skipWhitespaceSwitch` | 3.79 ns | 1.00x | 300006 | 263.6M/s | 18.0 | 83.0 | 4.60 | 0.0 |
# Three Indent
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :--------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 5.27 ns | 1.00x | 400008 | 189.7M/s | 24.4 | 115.0 | 4.72 | 0.0 |
| `skipWhitespaceSwitch` | 5.20 ns | 1.01x | 233338 | 192.2M/s | 23.5 | 115.0 | 4.89 | 0.0 |

These two functions perform about the same. The instruction counts are identical in every case, which suggests the compiler is smart enough to turn the if chain and the switch into the same machine code.
Trying SIMD
To make this faster I wanted to use SIMD. This stands for Single Instruction Multiple Data. It allows the CPU to compare many bytes at once instead of one by one.
Zig makes this easy with @Vector. You can load a chunk of memory into a vector
and perform operations on all of it at the same time.
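To make the idea concrete before the full function, here is a minimal sketch on a single 8-byte chunk. The helper name firstNonSpace is hypothetical and only checks for the space character. It shows the three steps the real code relies on: one comparison across all lanes, packing the boolean result into an integer mask, and counting trailing zeros to find the first byte of interest.

```zig
const std = @import("std");

// Hypothetical helper for illustration: index of the first non-space byte
// in an 8-byte chunk, or 8 when every byte is a space.
fn firstNonSpace(bytes: [8]u8) usize {
    const V = @Vector(8, u8);
    const chunk: V = bytes; // arrays coerce to vectors of the same length
    // One comparison checks all 8 lanes and yields a vector of 8 bools.
    const eq_space = chunk == @as(V, @splat(' '));
    // Pack the bools into an integer: bit k is 1 when byte k is a space
    // (assuming a little-endian target).
    const space_mask: u8 = @bitCast(eq_space);
    // The lowest 0 bit of space_mask marks the first non-space byte.
    return @ctz(~space_mask);
}

test "first non-space byte in a chunk" {
    const chunk = [8]u8{ ' ', ' ', ' ', '{', '"', 'a', '"', ':' };
    try std.testing.expect(firstNonSpace(chunk) == 3);
}
```

When every byte is a space the inverted mask is zero, and @ctz on a zero u8 returns 8, which is why the function can double as an "all whitespace" check.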
Here is my first attempt using 32-byte vectors.
pub fn skipWhitespaceSimd(buffer: []const u8, start: usize) usize {
    const VectorSize = 32;
    const V = @Vector(VectorSize, u8);
    var i = start;
    while (i + VectorSize <= buffer.len) {
        const chunk: V = buffer[i..][0..VectorSize].*;
        const is_space = chunk == @as(V, @splat(' '));
        const is_tab = chunk == @as(V, @splat('\t'));
        const is_nl = chunk == @as(V, @splat('\n'));
        const is_cr = chunk == @as(V, @splat('\r'));
        const is_ws = is_space | is_tab | is_nl | is_cr;
        // Bit k of the mask is 1 when byte k is NOT whitespace
        const mask: u32 = @bitCast(~is_ws);
        if (mask == 0) {
            i += VectorSize;
        } else {
            const offset = @ctz(mask);
            return i + offset;
        }
    }
    // Fallback for the end of the buffer
    while (i < buffer.len) {
        switch (buffer[i]) {
            ' ', '\t', '\n', '\r' => i += 1,
            else => break,
        }
    }
    return i;
}
This looks powerful but it has a flaw. Loading and comparing a 32-byte vector has a fixed cost no matter how little whitespace there is. When there is no whitespace at all, the CPU spends more time on that setup than the scalar loop needs to look at a single byte.
The benchmark report:
# No Whitespace
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :--------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 1.30 ns | 1.00x | 1050000 | 768.7M/s | 7.0 | 19.0 | 2.71 | 0.0 |
| `skipWhitespaceSwitch` | 1.30 ns | 1.00x | 666680 | 771.0M/s | 7.0 | 19.0 | 2.71 | 0.0 |
| `skipWhitespaceSimd` | 1.52 ns | 0.86x | 600012 | 659.9M/s | 8.0 | 35.0 | 4.37 | 0.0 |
# Single Whitespace
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :--------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 1.73 ns | 1.00x | 533344 | 577.7M/s | 9.0 | 27.0 | 3.00 | 0.0 |
| `skipWhitespaceSwitch` | 1.73 ns | 1.00x | 533344 | 578.4M/s | 9.0 | 27.0 | 3.00 | 0.0 |
| `skipWhitespaceSimd` | 1.52 ns | 1.14x | 600012 | 659.9M/s | 8.0 | 35.0 | 4.37 | 0.0 |
# Single Indent
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :--------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 3.03 ns | 1.00x | 300006 | 329.9M/s | 15.0 | 51.0 | 3.40 | 0.0 |
| `skipWhitespaceSwitch` | 3.03 ns | 1.00x | 600012 | 330.4M/s | 15.0 | 51.0 | 3.40 | 0.0 |
| `skipWhitespaceSimd` | 1.52 ns | 2.00x | 600000 | 659.4M/s | 8.0 | 35.0 | 4.37 | 0.0 |
# Two Indent
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :--------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 4.76 ns | 1.00x | 200000 | 210.0M/s | 23.0 | 83.0 | 3.61 | 0.0 |
| `skipWhitespaceSwitch` | 4.76 ns | 1.00x | 220000 | 210.2M/s | 23.0 | 83.0 | 3.61 | 0.0 |
| `skipWhitespaceSimd` | 1.52 ns | 3.13x | 600000 | 656.6M/s | 8.0 | 35.0 | 4.37 | 0.0 |
# Three Indent
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :--------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 6.49 ns | 1.00x | 150003 | 154.1M/s | 31.0 | 115.0 | 3.71 | 0.0 |
| `skipWhitespaceSwitch` | 6.49 ns | 1.00x | 150000 | 154.2M/s | 31.0 | 115.0 | 3.71 | 0.0 |
| `skipWhitespaceSimd` | 1.52 ns | 4.27x | 600000 | 658.4M/s | 8.0 | 35.0 | 4.37 | 0.0 |

Trying SWAR
Since SIMD felt “heavy” I tried a technique called SWAR. This stands for SIMD
Within A Register. It uses standard 64-bit integers (u64) to process 8 bytes
at a time using bitwise math.
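The logic is a bit complex. You XOR the word with the target byte repeated eight times, which turns every matching byte into zero, and then use a subtraction trick to detect the zero bytes. I learned this trick from aqrit’s despacer.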
pub fn skipWhitespaceSwar(buffer: []const u8, start: usize) usize {
    var i = start;
    const len = buffer.len;
    // Process 8 bytes per iteration for as long as the whole word is whitespace.
    while (i + 8 <= len) {
        // Load 8 bytes as a little-endian u64, so byte 0 of the chunk is the lowest byte.
        const word = std.mem.readInt(u64, buffer[i..][0..8], .little);
        // Check for space (0x20): XOR with a repeated 0x20 puts 0x00 in every byte that was a space.
        const xor_space = word ^ 0x2020202020202020;
        // Standard "has zero byte" check: (v - 0x01) & ~v & 0x80 sets the high bit of zero bytes.
        // It can also flag a 0x01 byte that sits directly above a zero byte.
        const is_space = (xor_space -% 0x0101010101010101) & ~xor_space & 0x8080808080808080;
        // Tab (0x09), LF (0x0A) and CR (0x0D) are not a contiguous range, because 0x0B and 0x0C
        // are not JSON whitespace, so each one gets its own XOR as in the C code.
        const xor_tab = word ^ 0x0909090909090909;
        const xor_lf = word ^ 0x0A0A0A0A0A0A0A0A;
        const xor_cr = word ^ 0x0D0D0D0D0D0D0D0D;
        // Same zero-byte check for these three.
        const is_tab = (xor_tab -% 0x0101010101010101) & ~xor_tab & 0x8080808080808080;
        const is_lf = (xor_lf -% 0x0101010101010101) & ~xor_lf & 0x8080808080808080;
        const is_cr = (xor_cr -% 0x0101010101010101) & ~xor_cr & 0x8080808080808080;
        // Combine: if a byte is whitespace, its high bit in the result will be 1.
        const is_ws = is_space | is_tab | is_lf | is_cr;
        // If all bytes are whitespace, is_ws is 0x8080808080808080.
        if (is_ws == 0x8080808080808080) {
            i += 8;
        } else {
            break;
        }
    }
    // Scalar cleanup for whatever is left.
    while (i < len) {
        switch (buffer[i]) {
            ' ', '\t', '\n', '\r' => i += 1,
            else => break,
        }
    }
    return i;
}
This was faster than the basic scalar loop once a full 8-byte word of whitespace was available (the Two Indent and Three Indent cases), but for shorter runs the bit manipulation is pure overhead and it was slower.
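A caveat worth checking before relying on the per-byte flags: the cheap zero-byte test is exact as a yes/no answer to "is any byte zero?", but the borrow from the subtraction can also flag a 0x01 byte that sits directly above a zero byte. After the XOR with 0x20, '!' (0x21) becomes 0x01, so as far as the arithmetic goes a word of seven spaces followed by '!' comes out as all whitespace. The sketch below reproduces that on a single word and shows an exact per-byte variant. The names zeroBytesFast, zeroBytesExact and allWhitespaceExact are hypothetical helpers for illustration, not part of the benchmarked code, and the exact variant has not been benchmarked here.

```zig
const std = @import("std");

const lo: u64 = 0x0101010101010101;
const hi: u64 = 0x8080808080808080;

// The test used in the article: cheap, and exact as "is any byte zero?",
// but a 0x01 byte directly above a zero byte is flagged as well.
fn zeroBytesFast(v: u64) u64 {
    return (v -% lo) & ~v & hi;
}

// Exact per-byte variant: the result has 0x80 in a byte only if that byte
// of v is zero. No carry or borrow ever crosses a byte boundary.
fn zeroBytesExact(v: u64) u64 {
    const low7: u64 = 0x7F7F7F7F7F7F7F7F;
    return ~(((v & low7) + low7) | v | low7);
}

// Hypothetical helper: true when all 8 bytes of `word` are JSON whitespace.
fn allWhitespaceExact(word: u64) bool {
    const is_ws = zeroBytesExact(word ^ 0x2020202020202020) |
        zeroBytesExact(word ^ 0x0909090909090909) |
        zeroBytesExact(word ^ 0x0A0A0A0A0A0A0A0A) |
        zeroBytesExact(word ^ 0x0D0D0D0D0D0D0D0D);
    return is_ws == hi;
}

test "seven spaces followed by '!'" {
    // Little-endian word: bytes 0..6 are ' ' (0x20), byte 7 is '!' (0x21).
    const word: u64 = 0x2120202020202020;
    const xor_space = word ^ 0x2020202020202020; // 0x0100000000000000
    // The fast test flags all 8 bytes, including the '!'.
    try std.testing.expect(zeroBytesFast(xor_space) == hi);
    // The exact test leaves byte 7 unflagged.
    try std.testing.expect(zeroBytesExact(xor_space) == 0x0080808080808080);
    try std.testing.expect(!allWhitespaceExact(word));
}
```

The exact variant costs a few more operations per comparison, so it would likely widen the gap to the scalar loop on short inputs. Whether the false positive matters depends on whether the parser must reject input such as a '!' after a run of spaces.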
The report:
# No Whitespace
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :--------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 1.30 ns | 1.00x | 1250000 | 770.3M/s | 6.0 | 19.0 | 3.17 | 0.0 |
| `skipWhitespaceSwitch` | 1.30 ns | 1.00x | 800016 | 771.0M/s | 6.0 | 19.0 | 3.17 | 0.0 |
| `skipWhitespaceSimd` | 1.51 ns | 0.86x | 700000 | 660.1M/s | 7.0 | 35.0 | 5.00 | 0.0 |
| `skipWhitespaceSwar` | 1.73 ns | 0.75x | 600000 | 578.1M/s | 8.0 | 40.0 | 5.00 | 0.0 |
# Single Whitespace
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :--------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 1.51 ns | 1.00x | 700000 | 660.4M/s | 7.0 | 27.0 | 3.85 | 0.0 |
| `skipWhitespaceSwitch` | 1.51 ns | 1.00x | 700000 | 660.5M/s | 7.0 | 27.0 | 3.86 | 0.0 |
| `skipWhitespaceSimd` | 1.52 ns | 1.00x | 666680 | 659.1M/s | 7.0 | 35.0 | 5.00 | 0.0 |
| `skipWhitespaceSwar` | 2.16 ns | 0.70x | 466676 | 462.6M/s | 10.0 | 48.0 | 4.80 | 0.0 |
# Single Indent
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :--------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 2.32 ns | 1.00x | 466676 | 430.2M/s | 10.9 | 51.0 | 4.66 | 0.0 |
| `skipWhitespaceSwitch` | 2.30 ns | 1.01x | 500000 | 434.2M/s | 10.9 | 51.0 | 4.69 | 0.0 |
| `skipWhitespaceSimd` | 1.52 ns | 1.53x | 666680 | 659.3M/s | 7.0 | 35.0 | 5.00 | 0.0 |
| `skipWhitespaceSwar` | 4.11 ns | 0.57x | 300006 | 243.5M/s | 16.0 | 72.0 | 4.50 | 0.0 |
# Two Indent
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :--------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 4.03 ns | 1.00x | 300000 | 248.0M/s | 17.9 | 83.0 | 4.64 | 0.0 |
| `skipWhitespaceSwitch` | 3.89 ns | 1.04x | 300000 | 257.2M/s | 18.3 | 83.0 | 4.54 | 0.0 |
| `skipWhitespaceSimd` | 1.63 ns | 2.48x | 666680 | 614.3M/s | 7.0 | 35.0 | 5.00 | 0.0 |
| `skipWhitespaceSwar` | 2.16 ns | 1.86x | 466676 | 462.2M/s | 10.0 | 57.0 | 5.70 | 0.0 |
# Three Indent
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :--------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 5.57 ns | 1.00x | 225000 | 179.4M/s | 25.5 | 115.0 | 4.52 | 0.0 |
| `skipWhitespaceSwitch` | 5.40 ns | 1.03x | 250000 | 185.0M/s | 25.5 | 115.0 | 4.52 | 0.0 |
| `skipWhitespaceSimd` | 1.52 ns | 3.68x | 666680 | 659.5M/s | 7.0 | 35.0 | 5.00 | 0.0 |
| `skipWhitespaceSwar` | 3.89 ns | 1.43x | 300000 | 257.0M/s | 18.0 | 89.0 | 4.94 | 0.0 |

The Winner: SIMD Inline
After running many benchmarks I found the best approach. It is surprisingly simple.
I call it skipWhitespaceSimdInline. It does two things.
First it checks the very first byte using a simple scalar check. This handles the “minified” case instantly. If the first byte is not whitespace we return immediately.
Second it jumps straight into a 16-byte SIMD loop. I moved from 32 bytes to 16 bytes because a 16-byte vector fits the 128-bit registers that baseline x86-64 (SSE2) and AArch64 (NEON) CPUs all have, and it has less setup cost.
Here is the winning implementation:
inline fn isWhitespace(c: u8) bool {
    return c == ' ' or c == '\t' or c == '\n' or c == '\r';
}

pub fn skipWhitespaceSimdInline(buffer: []const u8, start: usize) usize {
    var i = start;
    const len = buffer.len;
    if (i < len and !isWhitespace(buffer[i])) return i; // handle minified
    const VectorSize = 16; // smaller than the previous 32-byte version
    const V = @Vector(VectorSize, u8);
    while (i + VectorSize <= len) {
        const chunk: V = buffer[i..][0..VectorSize].*;
        // Parallel comparisons (Inline, no LUT)
        const is_space = chunk == @as(V, @splat(' '));
        const is_tab = chunk == @as(V, @splat('\t'));
        const is_nl = chunk == @as(V, @splat('\n'));
        const is_cr = chunk == @as(V, @splat('\r'));
        // Combine
        const is_ws = is_space | is_tab | is_nl | is_cr;
        // Create bitmask (1 = non-whitespace)
        const mask: u16 = @bitCast(~is_ws);
        if (mask != 0) {
            // Found a non-whitespace character.
            // ctz finds the index of the first '1' bit.
            return i + @ctz(mask);
        }
        // All 16 bytes were whitespace. Move to next chunk.
        i += VectorSize;
    }
    // Back to scalar
    while (i < len) {
        if (!isWhitespace(buffer[i])) break;
        i += 1;
    }
    return i;
}
This simple check is very important here.
if (i < len and !isWhitespace(buffer[i])) return i; // handle minified
By checking the first byte we filter out the {"a":1} case in under 2 nanoseconds. If it is whitespace we just fall through to the SIMD loop. We do not try to be clever with SWAR or intermediate steps.
And the results were amazing. The speedup over the scalar baseline grows as the whitespace count grows:
# No Whitespace
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :------------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 1.73 ns | 1.00x | 533344 | 576.7M/s | 7.0 | 19.0 | 2.71 | 0.0 |
| `skipWhitespaceSwitch` | 1.73 ns | 1.00x | 666680 | 577.9M/s | 7.0 | 19.0 | 2.71 | 0.0 |
| `skipWhitespaceSimd` | 1.73 ns | 1.00x | 700000 | 576.4M/s | 7.0 | 35.0 | 5.00 | 0.0 |
| `skipWhitespaceSwar` | 1.95 ns | 0.89x | 600000 | 513.8M/s | 8.0 | 40.0 | 5.00 | 0.0 |
| `skipWhitespaceSimdInline` | 1.73 ns | 1.00x | 666680 | 578.2M/s | 7.0 | 17.0 | 2.43 | 0.0 |
# Single Whitespace
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :------------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 1.95 ns | 1.00x | 733348 | 512.8M/s | 8.0 | 27.0 | 3.37 | 0.0 |
| `skipWhitespaceSwitch` | 1.95 ns | 1.00x | 1096806 | 513.9M/s | 8.0 | 27.0 | 3.37 | 0.0 |
| `skipWhitespaceSimd` | 1.73 ns | 1.13x | 650000 | 577.4M/s | 7.0 | 35.0 | 5.00 | 0.0 |
| `skipWhitespaceSwar` | 2.16 ns | 0.90x | 533344 | 462.1M/s | 9.0 | 48.0 | 5.33 | 0.0 |
| `skipWhitespaceSimdInline` | 1.73 ns | 1.13x | 666680 | 577.6M/s | 7.0 | 38.0 | 5.43 | 0.0 |
# Single Indent
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :------------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 2.60 ns | 1.00x | 433342 | 384.5M/s | 11.8 | 51.0 | 4.33 | 0.0 |
| `skipWhitespaceSwitch` | 2.59 ns | 1.00x | 433342 | 385.4M/s | 11.4 | 51.0 | 4.47 | 0.0 |
| `skipWhitespaceSimd` | 1.73 ns | 1.50x | 666680 | 577.9M/s | 7.0 | 35.0 | 5.00 | 0.0 |
| `skipWhitespaceSwar` | 3.24 ns | 0.80x | 333340 | 308.4M/s | 14.6 | 72.0 | 4.94 | 0.0 |
| `skipWhitespaceSimdInline` | 1.73 ns | 1.50x | 666680 | 576.9M/s | 7.0 | 38.0 | 5.43 | 0.0 |
# Two Indent
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :------------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 4.13 ns | 1.00x | 300000 | 242.0M/s | 18.9 | 83.0 | 4.40 | 0.0 |
| `skipWhitespaceSwitch` | 3.95 ns | 1.04x | 300006 | 252.9M/s | 17.4 | 83.0 | 4.77 | 0.0 |
| `skipWhitespaceSimd` | 1.73 ns | 2.39x | 700000 | 577.5M/s | 7.0 | 35.0 | 5.00 | 0.0 |
| `skipWhitespaceSwar` | 2.38 ns | 1.73x | 466676 | 419.6M/s | 10.0 | 57.0 | 5.70 | 0.0 |
| `skipWhitespaceSimdInline` | 1.74 ns | 2.38x | 666680 | 576.3M/s | 7.0 | 38.0 | 5.43 | 0.0 |
# Three Indent
| Benchmark | Time | Speedup | Iterations | Ops/s | Cycles | Instructions | IPC | Cache Misses |
| :------------------------- | ------: | ------: | ---------: | -------: | -----: | -----------: | ---: | -----------: |
| `skipWhitespaceIf` | 5.58 ns | 1.00x | 219519 | 179.4M/s | 25.1 | 115.0 | 4.59 | 0.0 |
| `skipWhitespaceSwitch` | 5.48 ns | 1.02x | 233338 | 182.6M/s | 24.5 | 115.0 | 4.70 | 0.0 |
| `skipWhitespaceSimd` | 1.74 ns | 3.21x | 1266692 | 576.3M/s | 7.0 | 35.0 | 5.00 | 0.0 |
| `skipWhitespaceSwar` | 3.68 ns | 1.52x | 300006 | 271.8M/s | 16.8 | 89.0 | 5.29 | 0.0 |
| `skipWhitespaceSimdInline` | 1.74 ns | 3.21x | 700000 | 575.7M/s | 7.0 | 38.0 | 5.43 | 0.0 |

I learned that for this specific problem simpler is often better.
- Scalar is fast for small things. You cannot beat a simple if check for 1 byte.
- SIMD is fast. Once you have more than 8 or 16 bytes vector instructions are incredible.
- The middle ground is a trap. Trying to optimize the 4-8 byte range with SWAR or complex branching just added overhead that slowed down everything else.
The final code is robust. It handles minified JSON instantly and chews through pretty-printed JSON with massive throughput. This is going into my parser.
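One way to back up a claim like that is a differential test against the scalar version. The sketch below is a hypothetical harness, not something from pyk/bench or the original benchmark. It assumes the same function signature as the article and compares an implementation with a scalar reference for whitespace runs of every length from 0 to 70, so it crosses the 8, 16 and 32 byte chunk boundaries, and it uses start offsets other than 0, which the benchmark never exercises.

```zig
const std = @import("std");

// Scalar reference with the same behavior as the article's skipWhitespaceSwitch.
fn skipReference(buffer: []const u8, start: usize) usize {
    var i = start;
    while (i < buffer.len) {
        switch (buffer[i]) {
            ' ', '\t', '\n', '\r' => i += 1,
            else => break,
        }
    }
    return i;
}

// Hypothetical harness: compares `impl` with the reference for whitespace runs
// of every length from 0 to 70, from every start offset inside the run.
fn expectMatchesReference(comptime impl: fn ([]const u8, usize) usize) !void {
    const fills = [_]u8{ ' ', '\t', '\n', '\r' };
    // Bytes that end the run. Besides '{', these are near misses that differ
    // from one of the whitespace bytes in a single bit.
    const enders = [_]u8{ '{', '!', 0x08, 0x0B, 0x0C };
    var buffer: [96]u8 = undefined;
    for (fills) |fill| {
        for (enders) |ender| {
            var run: usize = 0;
            while (run <= 70) : (run += 1) {
                @memset(buffer[0..run], fill);
                @memset(buffer[run..], ender);
                var start: usize = 0;
                while (start <= run) : (start += 1) {
                    const expected = skipReference(&buffer, start);
                    try std.testing.expect(impl(&buffer, start) == expected);
                }
            }
        }
    }
}

test "reference agrees with itself" {
    try expectMatchesReference(skipReference);
}
```

To check one of the fast versions, paste it into the same file and call the harness with it, for example try expectMatchesReference(skipWhitespaceSimdInline). Run the file with zig test.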
Tags: zig, bench, simd, parser