writeups
long-form posts on the design choices, the rubric leaks, and the methodology behind each version.
[ latest ]
KernelBench-Hard: Seven Problems, Twelve Frontier Models, Two Rubric Leaks
A focused successor to KernelBench v3. One Blackwell GPU, seven hand-designed problems, real coding-agent CLIs as the harness. Twelve frontier models swept; only GPT-5.5 xhigh solved every problem. Two of the seven problems leak the rubric: five models took the same bf16 shortcut on FP8 GEMM, and the only model that implemented Kahan compensated summation scored lowest of the seven passes.
[ archive ]
KernelBench v3: Rebuilding a GPU Kernel Benchmark from First Principles
How discovering that the original KernelBench was exploitable led to building a focused, cost-effective benchmark for evaluating LLM kernel engineering on modern architectures.