KernelBench
gpu kernel benchmarks for autonomous coding agents
Two benchmarks. One question: when you point a frontier model at modern GPU primitives and let it iterate, what does it actually produce? Real CLI harnesses (Claude Code, codex, Kimi, opencode), real workspaces, real correctness checks, real wall-clock budgets. peak_fraction grounded in physical hardware ceilings, not gameable speedup ratios.
Hard 2026-04
9 hand-designed problems · 13 model-harness sweeps · single Blackwell SM120 · forensic audit of every high-peak run · two rubric leaks documented inline · click any cell on the leaderboard to open the full transcript viewer for that run
v3 2026-02
43-58 problems per GPU · 10 models · RTX 3090 + H100 + B200 · 4 difficulty levels · custom KernelBench agent loop
# design principles
- peak_fraction over speedup ratio. speedups are easy to game (slow the baseline, inflate the ratio). peak_fraction is grounded in physical limits: the fraction of the relevant tensor-core throughput or DRAM bandwidth ceiling that the kernel actually achieved. harder to game, more honest.
- real coding-agent CLIs as the harness. no custom benchmark agent loop. each model runs through its native developer-facing CLI: claude code for anthropic, codex for openai, kimi cli for moonshot, opencode for everyone else. this matches how engineers actually use these tools.
- wall-clock budgets. 45 min per (model, problem) run. models with verbose tool-use patterns aren't penalized just for being chatty; they trade exploration against kernel-iteration time within the same budget.
- forensic audit of high-peak runs. every cell where a model scored above ~10% of peak gets its solution.py read by a human. reward hacks, rubric leaks, and exemplary kernels are all annotated in the source repo with a verdict and pull quotes.
- publish the flaws. when the rubric leaks, the leak goes on the leaderboard. five frontier models all taking the same bf16 shortcut on FP8 GEMM is a result, not a bug to quietly fix.
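
To make the first principle concrete, here is a minimal sketch of how a peak_fraction could be computed and why it resists the baseline trick that inflates a speedup ratio. It is not the benchmark's scoring code. The `Ceiling` class, the roofline-style choice of the "relevant" ceiling, the `AUDIT_THRESHOLD` constant, and every number are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical names and placeholder numbers throughout; this is an
# illustrative sketch, not the benchmark's actual scoring code.


@dataclass(frozen=True)
class Ceiling:
    peak_flops: float  # tensor-core ceiling for the dtype, in FLOP/s
    peak_bytes: float  # DRAM bandwidth ceiling, in bytes/s


def peak_fraction(flops: float, bytes_moved: float, seconds: float, c: Ceiling) -> float:
    """Achieved rate divided by the relevant hardware ceiling.

    Assumption: "relevant" is decided roofline-style, by comparing the
    kernel's arithmetic intensity with the machine balance point.
    """
    intensity = flops / bytes_moved      # FLOP per byte moved
    ridge = c.peak_flops / c.peak_bytes  # machine balance (FLOP per byte)
    if intensity >= ridge:               # compute-bound: score against FLOP/s
        return (flops / seconds) / c.peak_flops
    return (bytes_moved / seconds) / c.peak_bytes  # memory-bound: score against bytes/s


def speedup(baseline_seconds: float, kernel_seconds: float) -> float:
    """Classic ratio metric: depends on whichever baseline was timed."""
    return baseline_seconds / kernel_seconds


# Assumed cutoff mirroring the "~10% of peak" audit rule described above.
AUDIT_THRESHOLD = 0.10

gpu = Ceiling(peak_flops=1e14, peak_bytes=1e12)  # placeholder GPU, not a real spec
kernel_s = 0.04                                  # measured kernel time (made up)

pf = peak_fraction(flops=2e12, bytes_moved=1e9, seconds=kernel_s, c=gpu)
honest = speedup(0.08, kernel_s)    # reasonable baseline
inflated = speedup(0.80, kernel_s)  # same kernel, deliberately slow baseline

print(f"peak_fraction={pf:.2f}  speedup={honest:.1f}x  inflated={inflated:.1f}x")
print("flag for human audit:", pf > AUDIT_THRESHOLD)
```

In this toy setup the kernel's peak_fraction is the same whichever baseline is chosen, while the speedup grows tenfold when the baseline is slowed. That asymmetry is the point of the first principle.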
# contact
Open to inquiries — collaborations, model evals, custom benchmark builds, kernel-engineering consulting, anything kernel-adjacent.
Reach out: infatoshi@gmail.com