./kernelbench

kernelbench hard

12 models × 7 problems · RTX PRO 6000 Blackwell · sm_120 · 96 GB GDDR7 · 1.8 TB/s

A focused successor to KernelBench v3. One Blackwell GPU, seven hand-designed problems, real coding-agent CLIs as the harness. Twelve frontier models swept; only GPT-5.5 xhigh solved every problem. Two of the seven problems leak the rubric — five models all took the same bf16 shortcut on FP8 GEMM, and the only model that implemented Kahan compensated summation scored lowest of the seven passes.

# leaderboard

cells = peak_fraction (fraction of the relevant hardware ceiling). FAIL = solution written but missed correctness. ERR = no solution produced. = annotation attached. click any cell to open the full transcript viewer — every tool call, every reasoning step, the solution.py, the check.log.

model01 fp802 kda03 paged04 kahan05 topk06 moe07 w4a16PASS
gpt-5.5 [xhigh] [xhigh]0.4230.0320.4970.3630.0420.2510.1597/7
claude-opus-4-7 [max] [max]0.534PASS0.6020.3170.020FAIL0.1846/7
kimi-k2.6FAIL0.0220.4320.1180.0140.1610.2206/7
or/xiaomi/mimo-v2.5-pro0.434FAILERR0.1210.0170.2110.1375/7
or/qwen/qwen3.6-max-preview0.4290.011ERR0.077FAIL0.0040.1105/7
deepseek/deepseek-v4-flashFAIL0.0090.1670.138FAIL0.0830.1345/7
deepseek/deepseek-v4-proFAILFAIL0.0270.1010.0110.1080.1255/7
or/qwen/qwen3.6-plus0.431ERR0.022ERRFAIL0.0400.1254/7
zai/glm-5.1FAIL0.005ERR0.125ERR0.2380.1804/7
or/minimax/minimax-m2.7ERRERRFAIL0.034FAIL0.0760.0303/7
or/qwen/qwen3.6-27bERRFAILFAILERRFAIL0.082ERR1/7
or/qwen/qwen3.6-35b-a3bERRERRERRERRERRERRERR0/7

# per-problem ceilings

eager / compiled = PyTorch reference timings. SOTA = the existing best-known kernel for the problem (vLLM paged attention, fbgemm grouped GEMM, etc.) when one exists on this hardware. best peak = the model that pushed furthest above the reference line.

problemeager mscompiled msSOTA msbest peakbest modeln pass
01_fp8_gemm0.4720.4380.534claude-opus-4-7 [max]5/12
02_kda_cutlass61.8937.4250.032gpt-5.5 [xhigh]6/12
03_paged_attention1.2621.2740.602claude-opus-4-7 [max]6/12
04_kahan_softmax0.0700.1800.0760.363gpt-5.5 [xhigh]9/12
05_topk_bitonic0.0410.0770.0410.042gpt-5.5 [xhigh]5/12
06_sonic_moe_swiglu9.6889.7530.251gpt-5.5 [xhigh]10/12
07_w4a16_gemm0.6050.1440.220kimi-k2.610/12

# rubric leaks

Two cells in the leaderboard promise something the benchmark doesn't actually measure. They're marked for a reason.

01 fp8_gemm — bf16 dressup

claude-opus-4-7 [max]0.534
mimo-v2.5-pro0.434
qwen3.6-plus0.431
qwen3.6-max-preview0.429
gpt-5.5 [xhigh]0.423

Every passing solution at peak ≥ 0.4 casts the fp8 inputs to bf16 inside the kernel and runs a bf16 GEMM. Both Opus 4.7 max and GPT-5.5 xhigh explicitly pin to cutlass::arch::Sm80 — Ampere CUTLASS, not the SM120 Blackwell FP8 tensor cores the problem name implies. Opus's source comment is unusually direct: “follow the codex baseline (BF16 GEMM internally)...” The peak fractions on this row reflect bf16 GEMM optimization quality on fp8-typed inputs, not FP8 tensor core skill.

04 kahan_softmax — Kahan compensation skipped

gpt-5.5 [xhigh]0.363
claude-opus-4-7 [max]0.317
deepseek-v4-flash0.138
glm-5.10.125
mimo-v2.5-pro0.121
kimi-k2.60.118

Six of seven passing solutions skipped the Kahan compensated summation entirely. Only deepseek-v4-pro — the lowest passing peak at 0.101 — actually implemented the algorithm the problem name describes. Compensated summation has real overhead, naive softmax fits within tolerance, and the rubric leaks. The model whose docstring explicitly states “Numerically tight softmax with Kahan compensated summation” is the model that loses the cell.

Both leaks are fixable in a few hours of problem-design work. Publishing without fixing because (a) every iteration surfaces the next leak — diminishing returns, and (b) the leaks ARE the finding. “Five frontier models all took the same bf16 shortcut on FP8 GEMM” is itself a headline.

# what changed from v3

  • >One GPU instead of three. RTX PRO 6000 Blackwell (sm_120, 96 GB GDDR7, 1.8 TB/s).
  • >Seven hand-designed problems instead of 43-58. Per-trial L2 flush, 30-trial median, 10 warmup absorbing torch.compile CUDA-graph capture and Triton autotune.
  • >Real coding-agent CLIs as the harness — Claude Code, codex CLI, Kimi CLI, opencode — not a custom KernelBench agent loop.
  • >Wall-clock budgets, not turn limits. 45 min/run.
  • >peak_fraction grounded in physical hardware ceilings instead of raw speedup ratios.
  • >Per-cell annotations with verdict, pull quotes from solution.py, and an “implication” statement. 13 annotations as of launch.
Source data: github.com/Infatoshi/KernelBench-Hard · leaderboard.json · annotations/ · DEVLOG.md