./kernelbench
 _   __                    _ ____                  _
| | / /__ _ __ _ __   ___| | __ )  ___ _ __   ___| |__
| |/ / _ \ '__| '_ \ / _ \ |  _ \ / _ \ '_ \ / __| '_ \
|   <  __/ |  | | | |  __/ | |_) |  __/ | | | (__| | | |
|_|\_\___|_|  |_| |_|\___|_|____/ \___|_| |_|\___|_| |_|

gpu kernel benchmarks for autonomous coding agents

Two benchmarks. One question: when you point a frontier model at modern GPU primitives and let it iterate, what does it actually produce? Real CLI harnesses (Claude Code, codex, Kimi, opencode), real workspaces, real correctness checks, real wall-clock budgets. peak_fraction grounded in physical hardware ceilings, not gameable speedup ratios.

[ latest ]

Hard 2026-04

9 hand-designed problems · 13 model-harness sweeps · single Blackwell SM120 · forensic audit of every high-peak run · two rubric leaks documented inline · click any cell on the leaderboard to open the full transcript viewer for that run

problems
9
sweeps
13
runs
117
best peak
0.722
[ archive ]

v3 2026-02

43-58 problems per GPU · 10 models · RTX 3090 + H100 + B200 · 4 difficulty levels · custom KernelBench agent loop

GPUs
3
models
10
evaluations
1500+

# design principles

  • >peak_fraction over speedup ratio. speedups are easy to game (slow the baseline, inflate the ratio). peak_fraction is grounded in physical limits — fraction of relevant tensor-core or DRAM bandwidth ceiling the kernel actually achieved. harder to game, more honest.
  • >real coding-agent CLIs as the harness. no custom benchmark agent loop. each model runs through whatever its native developer-facing CLI is — claude code for anthropic, codex for openai, kimi cli for moonshot, opencode for everyone else. matches how engineers actually use these tools.
  • >wall-clock budgets. 45 min per (model, problem) run. models with verbose tool-use patterns aren't penalized just for being chatty; they trade exploration for kernel-iteration time within the budget.
  • >forensic audit of high-peak runs. every cell where a model scored above ~10% peak gets its solution.py read by a human. reward hacks, rubric leaks, and exemplary kernels all annotated in the source repo with verdict + pull quotes.
  • >publish the flaws. when the rubric leaks, the leak goes in the leaderboard. five frontier models all taking the same bf16 shortcut on FP8 GEMM is a result, not a bug to quietly fix.

# contact

Open to inquiries — collaborations, model evals, custom benchmark builds, kernel-engineering consulting, anything kernel-adjacent.

Reach out: infatoshi@gmail.com