kernelbench v3

10 models · RTX 3090 + H100 + B200 · 4 difficulty levels · 2071 evaluations

The previous-generation benchmark. METR's “Measuring Automated Kernel Engineering” paper showed that the original Stanford KernelBench was riddled with exploits: no-op kernels passing via memory aliasing, models monkey-patching torch.cuda.synchronize, and constant functions like mean(softmax(x)) == 1.0. In response, v3 was rebuilt from scratch with adaptive baselines, multi-seed correctness checks, problems drawn from modern architectures (DeepSeek MLA, MoE, FP8/INT4 GEMM, GatedDeltaNet), and per-evaluation cost tracking.
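To make the multi-seed idea concrete, here is a minimal sketch in plain Python. It is not the v3 harness: every name in it (`make_input`, `multi_seed_correct`, `reference_is_degenerate`) is hypothetical, and it uses lists of floats rather than GPU tensors. It shows two things: a candidate that ignores its input fails once outputs are compared across several seeded inputs, and a reference whose output never changes (here the sum over a softmax, which is always 1) has to be detected separately, because any constant "kernel" would pass it.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_input(seed, n=8):
    # Hypothetical input generator: one deterministic input per seed.
    rng = random.Random(seed)
    return [rng.uniform(-3.0, 3.0) for _ in range(n)]


def allclose(a, b, rtol=1e-5, atol=1e-8):
    # Element-wise comparison with relative and absolute tolerances.
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rtol, abs_tol=atol) for x, y in zip(a, b)
    )


def multi_seed_correct(candidate, ref, seeds=(0, 1, 2, 3, 4)):
    # The candidate must match the reference on every seeded input.
    return all(allclose(candidate(make_input(s)), ref(make_input(s))) for s in seeds)


def reference_is_degenerate(ref, seeds=(0, 1, 2, 3, 4)):
    # True when the reference output is the same for every seed, i.e. the
    # problem can be "solved" by returning a constant.
    outs = [ref(make_input(s)) for s in seeds]
    return all(allclose(o, outs[0]) for o in outs[1:])


def constant_candidate(xs):
    # A cheating candidate: returns a uniform distribution, ignoring the input.
    return [1.0 / len(xs)] * len(xs)


def degenerate_reference(xs):
    # The sum over a softmax is always 1, whatever the input.
    return [sum(softmax(xs))]


if __name__ == "__main__":
    print(multi_seed_correct(softmax, softmax))
    print(multi_seed_correct(constant_candidate, softmax))
    print(reference_is_degenerate(softmax))
    print(reference_is_degenerate(degenerate_reference))
```

The design point is that the two checks are complementary: multi-seed comparison catches candidates that ignore their input, while the degeneracy check flags problems where ignoring the input is a valid answer.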

# results

# explorer

filters: model · gpu · level · status · op_type · baseline_type

model	gpu	level	problem	correct	speedup	turns	total tokens	estimated cost (USD)	code