kernelbench.com: Agentic GPU Kernel Benchmark Results and Run Artifacts

Arledge, Elliot

writeups

long-form posts on the design choices, the rubric leaks, the methodology behind each version.

Hard: Eight Scored Problems, Frontier Models, Two Rubric Leaks

A focused successor to the v3 suite. One Blackwell GPU, eight scored CUDA problems, real coding-agent CLIs as the harness. Frontier models swept; only GPT-5.5 xhigh solved every problem. Two problems leak the rubric: five models all took the same bf16 shortcut on FP8 GEMM, and the only model that implemented Kahan compensated summation scored lowest of the passes.

[ archive ]

v3: Rebuilding a GPU Kernel Benchmark from First Principles

How discovering the original KernelBench was exploitable led to building a focused, cost-effective benchmark for evaluating LLM kernel engineering on modern architectures.