./kernelbench

kernelbench v3

10 models · RTX 3090 + H100 + B200 · 4 difficulty levels · 2071 evaluations

The previous-generation benchmark. After METR's “Measuring Automated Kernel Engineering” paper showed the original Stanford KernelBench was riddled with exploits (no-op kernels passing via memory aliasing, models monkey-patching torch.cuda.synchronize, constant functions like mean(softmax(x)) == 1.0), v3 was rebuilt from scratch with adaptive baselines, multi-seed correctness, modern architectures (DeepSeek MLA, MoE, FP8/INT4 GEMM, GatedDeltaNet), and tracked cost per evaluation.

# results

overall pass rates
overall pass rates
per-level heatmap
per-level heatmap
speedup distribution
speedup distribution
per-level breakdown
per-level breakdown
cost vs accuracy
cost vs accuracy
compilation funnel
compilation funnel

# explorer

loading...

modelgpulevelproblemcorrectspeedupturnstotal tokensestimated cost usdcode