kernelbench hard

12 models × 7 problems · RTX PRO 6000 Blackwell · sm_120 · 96 GB GDDR7 · 1.8 TB/s

A focused successor to KernelBench v3. One Blackwell GPU, seven hand-designed problems, real coding-agent CLIs as the harness. Twelve frontier models swept; only GPT-5.5 xhigh solved every problem. Two of the seven problems leak the rubric — five models all took the same bf16 shortcut on FP8 GEMM, and the only model that implemented Kahan compensated summation scored lowest of the seven passes.

# leaderboard

cells = peak_fraction (fraction of the relevant hardware ceiling). FAIL = solution written but missed correctness. ERR = no solution produced. ★ = annotation attached. click any cell to open the full transcript viewer — every tool call, every reasoning step, the solution.py, the check.log.

model	01 fp8	02 kda	03 paged	04 kahan	05 topk	06 moe	07 w4a16	PASS
gpt-5.5 [xhigh] [xhigh]	0.423★	0.032	0.497★	0.363★	0.042	0.251★	0.159★	7/7
claude-opus-4-7 [max] [max]	0.534★	PASS	0.602★	0.317★	0.020	FAIL	0.184★	6/7
kimi-k2.6	FAIL	0.022	0.432★	0.118★	0.014	0.161★	0.220★	6/7
or/xiaomi/mimo-v2.5-pro	0.434★	FAIL	ERR	0.121★	0.017	0.211★	0.137★	5/7
or/qwen/qwen3.6-max-preview	0.429★	0.011	ERR	0.077	FAIL	0.004	0.110★	5/7
deepseek/deepseek-v4-flash	FAIL	0.009	0.167★	0.138★	FAIL	0.083	0.134★	5/7
deepseek/deepseek-v4-pro	FAIL	FAIL	0.027	0.101★	0.011	0.108★	0.125★	5/7
or/qwen/qwen3.6-plus	0.431★	ERR	0.022	ERR	FAIL	0.040	0.125★	4/7
zai/glm-5.1	FAIL	0.005	ERR	0.125★	ERR	0.238★	0.180★	4/7
or/minimax/minimax-m2.7	ERR	ERR	FAIL	0.034	FAIL	0.076	0.030	3/7
or/qwen/qwen3.6-27b	ERR	FAIL	FAIL	ERR	FAIL	0.082	ERR	1/7
or/qwen/qwen3.6-35b-a3b	ERR	ERR	ERR	ERR	ERR	ERR	ERR	0/7

# per-problem ceilings

eager / compiled = PyTorch reference timings. SOTA = the existing best-known kernel for the problem (vLLM paged attention, fbgemm grouped GEMM, etc.) when one exists on this hardware. best peak = the model that pushed furthest above the reference line.

problem	eager ms	compiled ms	SOTA ms	best peak	best model	n pass
01_fp8_gemm	0.472	0.438	—	0.534	claude-opus-4-7 [max]	5/12
02_kda_cutlass	61.893	7.425	—	0.032	gpt-5.5 [xhigh]	6/12
03_paged_attention	1.262	1.274	—	0.602	claude-opus-4-7 [max]	6/12
04_kahan_softmax	0.070	0.180	0.076	0.363	gpt-5.5 [xhigh]	9/12
05_topk_bitonic	0.041	0.077	0.041	0.042	gpt-5.5 [xhigh]	5/12
06_sonic_moe_swiglu	9.688	9.753	—	0.251	gpt-5.5 [xhigh]	10/12
07_w4a16_gemm	0.605	0.144	—	0.220	kimi-k2.6	10/12

# rubric leaks

Two cells in the leaderboard promise something the benchmark doesn't actually measure. They're marked ★ for a reason.

★ 01 fp8_gemm — bf16 dressup

claude-opus-4-7 [max]0.534

mimo-v2.5-pro0.434

qwen3.6-plus0.431

qwen3.6-max-preview0.429

gpt-5.5 [xhigh]0.423

Every passing solution at peak ≥ 0.4 casts the fp8 inputs to bf16 inside the kernel and runs a bf16 GEMM. Both Opus 4.7 max and GPT-5.5 xhigh explicitly pin to cutlass::arch::Sm80 — Ampere CUTLASS, not the SM120 Blackwell FP8 tensor cores the problem name implies. Opus's source comment is unusually direct: “follow the codex baseline (BF16 GEMM internally)...” The peak fractions on this row reflect bf16 GEMM optimization quality on fp8-typed inputs, not FP8 tensor core skill.

★ 04 kahan_softmax — Kahan compensation skipped

gpt-5.5 [xhigh]0.363

claude-opus-4-7 [max]0.317

deepseek-v4-flash0.138

glm-5.10.125

mimo-v2.5-pro0.121

kimi-k2.60.118

Six of seven passing solutions skipped the Kahan compensated summation entirely. Only deepseek-v4-pro — the lowest passing peak at 0.101 — actually implemented the algorithm the problem name describes. Compensated summation has real overhead, naive softmax fits within tolerance, and the rubric leaks. The model whose docstring explicitly states “Numerically tight softmax with Kahan compensated summation” is the model that loses the cell.

Both leaks are fixable in a few hours of problem-design work. Publishing without fixing because (a) every iteration surfaces the next leak — diminishing returns, and (b) the leaks ARE the finding. “Five frontier models all took the same bf16 shortcut on FP8 GEMM” is itself a headline.

# what changed from v3

>One GPU instead of three. RTX PRO 6000 Blackwell (sm_120, 96 GB GDDR7, 1.8 TB/s).
>Seven hand-designed problems instead of 43-58. Per-trial L2 flush, 30-trial median, 10 warmup absorbing torch.compile CUDA-graph capture and Triton autotune.
>Real coding-agent CLIs as the harness — Claude Code, codex CLI, Kimi CLI, opencode — not a custom KernelBench agent loop.
>Wall-clock budgets, not turn limits. 45 min/run.
>peak_fraction grounded in physical hardware ceilings instead of raw speedup ratios.
>Per-cell annotations with verdict, pull quotes from solution.py, and an “implication” statement. 13 annotations as of launch.

Source data: github.com/Infatoshi/KernelBench-Hard · leaderboard.json · annotations/ · DEVLOG.md