Making checkrs 17x faster

checkrs is my personal Rust linter. It is powered by ast-grep, which means I write lint rules as tree-sitter queries in Python and ast-grep handles the parsing.

It has been working well but I noticed it was getting slow. I ran it against a project with 75 Rust files and it took 5.4 seconds. That felt wrong. Parsing 75 files and running some pattern matching should take milliseconds, not seconds. I decided to investigate.

Profiling

I started by running cProfile to see where the time was going.

Plain Text
         609593 function calls (481349 primitive calls) in 1.560 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     4739    0.029    0.000    3.069    0.001 checkrs/lints/lint.py:27(make_config)
     4739    0.017    0.000    3.001    0.002 .../ast.py:50(literal_eval)
   123454    0.109    0.000    2.975    0.001 .../ast.py:80(_convert)

Wait, why is make_config taking 3 seconds? And why is ast.literal_eval being called 4739 times? That number seemed suspicious. I have 64 lint rules and 75 files, and 64 times 75 is roughly 4800. So make_config was being called once per lint per file.

Problem 1: Every lint parsed every file

I looked at how a typical lint was implemented.

Python
def check(self, file_path: Path, source: str) -> list[Violation]:
    root = ast_grep_py.SgRoot(source, "rust")
    node = root.root()
    matches = list(node.find_all(config))
    ...

Every single lint was creating its own SgRoot. That means 75 files times 64 lints equals 4800 redundant parses. The AST was being rebuilt from scratch for every lint. Tree-sitter parsing is fast, but not that fast.

The fix was obvious: parse once and reuse the root node.

Python
# In the runner, parse once per file
root = ast_grep_py.SgRoot(source, "rust")
node = root.root()

# Pass the node to every lint
for lint in get_all_lints():
    violations = lint.check(file_path, node)

I updated the Lint.check signature across all 64 lint files.

Python
# Before
def check(self, file_path: Path, source: str) -> list[Violation]:
    root = ast_grep_py.SgRoot(source, "rust")
    node = root.root()
    ...

# After
def check(self, file_path: Path, node: ast_grep_py.SgNode) -> list[Violation]:
    matches = list(node.find_all(config))
    ...

This alone dropped the runtime from 5.4 seconds to 1.5 seconds.

Problem 2: make_config was rebuilding configs from strings

The profile still showed make_config at the top. I looked at the implementation.

Python
def make_config(**kwargs: object) -> ast_grep_py.Config:
    d = ast.literal_eval(repr(kwargs))
    return ast_grep_py.Config(**d)

This function was turning a dict into a string with repr, then parsing it back into a dict with ast.literal_eval. I have no idea why it was written this way. But it meant every call to make_config was serializing and deserializing the entire config dict.

The configs are mostly static. The same lint uses the same config for every file. So I replaced this with json.dumps and an lru_cache.

Python
@lru_cache(maxsize=256)
def _cached_config(key: str) -> ast_grep_py.Config:
    return json.loads(key)


def make_config(**kwargs: object) -> ast_grep_py.Config:
    return _cached_config(json.dumps(kwargs, sort_keys=True))

Now each unique config is built exactly once. The make_config overhead dropped from about 3 seconds to basically nothing.

After this change the runtime was down to about 1.1 seconds.

Problem 3: ThreadPoolExecutor was making it slower

The code was using ThreadPoolExecutor with 20 threads to lint files in parallel. I ran a quick benchmark with different worker counts.

Plain Text
1 worker:  1.108s
2 workers: 1.163s
4 workers: 1.187s
8 workers: 1.199s
20 workers: 1.210s

More threads made it slower. This is the GIL. Python threads cannot run in parallel when the work is CPU-bound, and ast-grep’s tree-sitter queries are pure CPU work. The threads were just adding context-switching overhead.

I removed the thread pool entirely and went single-threaded. That brought the time down to about 1.0 seconds. But I wanted sub-second.

Problem 4: No real parallelism

The GIL prevents threads from parallelizing CPU work, but processes can. On Linux I can use multiprocessing.Pool with the fork context. This creates new processes by forking the current one, which is fast because it copies the memory pages without actually duplicating the data until it is written.

Python
if sys.platform == "linux":
    ctx = get_context("fork")
    max_workers = min(8, (os.cpu_count() or 1))
    with ctx.Pool(processes=max_workers) as pool:
        results = pool.map(_check_file, files, chunksize=...)
else:
    results = [_check_file(f) for f in files]

I had to verify that fork multiprocessing did not break anything. Since the lints are stateless and _check_file only reads files, it was safe. I compared the single-threaded output against the multiprocessing output for every file and all 143 tests passed.

Results

The final numbers:

Plain Text
$ time checkrs run external/pyk/raptor

real    0m0.319s
user    0m1.259s
sys     0m0.081s

From 5.4 seconds down to 0.32 seconds. That is a 17x speedup.

What I learned

This optimization taught me a few things I will keep in mind.

First, caching matters even for “cheap” operations. Building a dict 4800 times by serializing and deserializing it is not cheap. Caching is not just for database queries or network calls.

Second, the GIL is real. I knew this in theory but it was interesting to see it so clearly in the benchmark. More threads literally made the code slower. Processes are the right tool for CPU-bound Python work, at least on Linux where fork is available.

Third, redundant work is the easiest win. Parsing every file 64 times was the biggest single improvement. It is easy to miss when each individual lint looks reasonable on its own. The cost only becomes obvious when you look at the system as a whole.

Finally, profile before optimizing. I could have guessed that threading was the problem, but the profile showed me that redundant parsing and config building were the real culprits. Without the profile I would have spent time on the wrong things.

The linter is fast enough now that I do not notice it at all. Which is exactly how a linter should feel.

Disclaimer: I used Kimi K2.6 to help with this optimization task, all results verified by me.