kernelbench.com

hard benchmark design notes

Background notes for the Hard leaderboard: problem ceilings, known caveats, the FP8 constraint rerun, and what changed from v3. The operational leaderboard lives on /hard.

per-problem ceilings

eager / compiled = PyTorch reference timings. SOTA = the best-known existing kernel for the problem, when one exists on this hardware (a dash means none is listed). best peak = the peak reached by the model that pushed furthest above the reference line; best model names that model.

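| problem          | eager ms | compiled ms | SOTA ms | best peak | best model                  | n scored |
|------------------|----------|-------------|---------|-----------|-----------------------------|----------|
| FP8 GEMM         | 0.472    | 0.438       | -       | 0.386     | claude/claude-opus-4-8      | 7/8      |
| KDA CUTLASS      | 61.893   | 7.425       | -       | 0.055     | claude/claude-opus-4-8      | 7/9      |
| Paged Attention  | 1.262    | 1.274       | -       | 0.677     | zai-claude/glm-5.2          | 9/9      |
| TopK Bitonic     | 0.041    | 0.077       | 0.041   | 0.049     | claude/claude-fable-5 [max] | 8/9      |
| Sonic MoE SwiGLU | 9.688    | 9.753       | -       | 0.108     | claude/claude-fable-5 [max] | 9/9      |
| W4A16 GEMM       | 0.605    | 0.144       | -       | 0.348     | claude/claude-fable-5 [max] | 9/9      |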

rubric caveat

01 fp8_gemm: bf16 dressup

claude-opus-4-7 [max] | 0.534
mimo-v2.5-pro         | 0.434
qwen3.6-plus          | 0.431
qwen3.6-max-preview   | 0.429
gpt-5.5 [xhigh]       | 0.423

Every passing solution at peak >= 0.4 casts the fp8 inputs to bf16 inside the kernel and runs a bf16 GEMM. The peak fractions on this row reflect bf16 GEMM optimization quality on fp8-typed inputs, not FP8 tensor core skill.
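
One reason the shortcut can pass a correctness check is sketched below. It assumes the inputs use the E4M3 "fn" encoding (an assumption: these notes do not name the FP8 format) and checks, in plain Python, that every finite FP8 value survives rounding to bf16 unchanged. The helper names `decode_e4m3fn` and `round_to_bf16` are hypothetical and exist only for this illustration.

```python
import math
import struct

def decode_e4m3fn(byte):
    """Decode one byte as FP8 E4M3 (assumed 'fn' variant: no infinities,
    exponent bias 7, a single NaN bit pattern per sign)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading 1.
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def round_to_bf16(x):
    """Round a float to bf16 (round-to-nearest-even on the top 16 bits
    of its float32 encoding) and return the result as a Python float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

fp8_values = [decode_e4m3fn(b) for b in range(256)]
finite = [v for v in fp8_values if not math.isnan(v)]

# If the upcast is exact for every finite value, casting fp8 -> bf16
# changes no input; only the GEMM arithmetic path differs.
lossless = all(round_to_bf16(v) == v for v in finite)
print(len(finite), max(finite), lossless)  # 254 448.0 True
```

Under that assumption the cast itself introduces no error, so an output-only check cannot tell a bf16 GEMM on upcast inputs from a kernel that uses the FP8 tensor core path.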

fp8 constraint rerun

On June 5, 2026, the FP8 GEMM verifier was tightened to reject the bf16-dressup shortcut and require an FP8-looking execution path. Once the shortcut was blocked, every available model either failed correctness, failed the provider path, or could not run because of credits/key issues.

fixed-tolerance rerun

| model                            | route    | outcome | elapsed | note                                              |
|----------------------------------|----------|---------|---------|---------------------------------------------------|
| Claude Opus 4.6                  | claude   | FAIL    | 18.5m   | large_input stress failed, max_abs_diff=4         |
| Claude Opus 4.7                  | claude   | FAIL    | 21.6m   | check_failed under real FP8 constraint            |
| Claude Opus 4.8                  | claude   | FAIL    | 41.8m   | large_input K=4127 failed, max_abs_diff=4         |
| GPT-5.5                          | codex    | FAIL    | 6.8m    | nominal tolerance failed on first fixed run       |
| DeepSeek V4 Flash                | opencode | FAIL    | 4.1m    | nominal tolerance failed, max_abs_diff around 0.53 |
| DeepSeek V4 Pro                  | opencode | FAIL    | 5.9m    | first run had Triton fp8 load cast error          |
| OpenCode GLM-5.1                 | opencode | EARLY   | 11.5m   | provider early-stop/no solution on opencode route |
| Kimi K2.6                        | kimi     | ERR     | 4s      | invalid or expired API key                        |
| MiniMax/Qwen/MiMo via OpenRouter | opencode | ERR     | 1-2s    | provider_insufficient_credits                     |
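
The max_abs_diff figures in the notes column can be read as the largest elementwise gap between a reference output and a candidate output. The sketch below shows that metric on made-up numbers. The function names, the sample data, and the tolerance `HYPOTHETICAL_ATOL` are all assumptions for illustration; the verifier's actual tolerance and comparison rule are not stated in these notes.

```python
def max_abs_diff(reference, candidate):
    """Largest elementwise absolute difference between two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))

def passes(reference, candidate, atol):
    """True when every element is within the absolute tolerance."""
    return max_abs_diff(reference, candidate) <= atol

# Placeholder value, NOT the benchmark's real tolerance.
HYPOTHETICAL_ATOL = 0.1

# Made-up outputs: one large-magnitude element is off by 4.
reference = [1.0, -2.5, 300.0, 0.25]
candidate = [1.0, -2.0, 304.0, 0.25]

print(max_abs_diff(reference, candidate))               # 4.0
print(passes(reference, candidate, HYPOTHETICAL_ATOL))  # False
```

A single badly wrong element is enough to fail such a check, which matches how one stress case can sink an otherwise plausible kernel.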

recovery smokes

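| model             | route      | outcome | elapsed | note                                                            |
|-------------------|------------|---------|---------|-----------------------------------------------------------------|
| GLM-5.1           | zai-claude | FAIL    | 11.0m   | direct ZAI route worked, nominal max_abs_diff=0.5625            |
| DeepSeek V4 Pro   | opencode   | FAIL    | 9.8m    | second attempt reached verifier, nominal max_abs_diff=0.539     |
| DeepSeek V4 Flash | opencode   | FAIL    | 3.2m    | second attempt reached verifier, nominal max_abs_diff=0.539     |
| GPT-5.5           | codex      | FAIL    | 8.1m    | Triton resource failure: 147456B shared memory > 101376B limit  |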
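
The shared-memory failure in the last row can be reasoned about with a rough budget. The sketch below uses a common back-of-the-envelope model for a software-pipelined GEMM: each pipeline stage holds one A tile and one B tile. The tile shape and stage count are illustrative assumptions chosen because they reproduce the 147456B figure; the configuration the failing kernel actually used is not recorded here, and `pipelined_smem_bytes` is a hypothetical helper, not a Triton API.

```python
def pipelined_smem_bytes(block_m, block_n, block_k, elem_bytes, num_stages):
    """Rough shared-memory estimate: one A tile (block_m x block_k) plus
    one B tile (block_k x block_n) per pipeline stage."""
    per_stage = (block_m * block_k + block_k * block_n) * elem_bytes
    return per_stage * num_stages

SMEM_LIMIT_BYTES = 101376  # limit quoted in the failure note (99 KiB)

# Illustrative config: 128x256x128 tiles, 1-byte (fp8) elements, 3 stages.
three_stage = pipelined_smem_bytes(128, 256, 128, elem_bytes=1, num_stages=3)
two_stage = pipelined_smem_bytes(128, 256, 128, elem_bytes=1, num_stages=2)

print(three_stage, three_stage <= SMEM_LIMIT_BYTES)  # 147456 False
print(two_stage, two_stage <= SMEM_LIMIT_BYTES)      # 98304 True
```

Under this model, dropping one pipeline stage or shrinking a tile dimension would bring the request under the limit, at some cost in latency hiding.
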

[Figure: Stacked token burn for FP8 constraint rerun. Token burn by model on the FP8 constraint run.]

[Figure: Tokens versus effective peak for FP8 constraint rerun. All effective peaks collapse to zero under the strict verifier.]

[Figure: Cost before outcome for FP8 constraint rerun. Spend and wall time before each failing outcome.]

what changed from v3

  • One GPU instead of three: RTX PRO 6000 Blackwell.
  • A smaller hand-designed problem deck instead of 43-58 problems per GPU.
  • Real coding-agent CLIs as the harness: claude code, codex, opencode, droid, kimi, cursor, gemini-cli, grok build.
  • Wall-clock budgets, not turn limits.
  • peak_fraction grounded in physical hardware ceilings instead of raw speedup ratios.
  • Per-cell annotations with verdict, quotes from solution.py, and implication notes.
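
To make the peak_fraction item concrete, here is a minimal sketch of one way such a metric could be computed for a compute-bound GEMM: achieved throughput divided by a hardware ceiling. The exact formula the benchmark uses is not given in these notes, so this is an assumption, and `HYPOTHETICAL_PEAK_TFLOPS` is a placeholder rather than a real specification of any GPU.

```python
def gemm_flops(m, n, k):
    """Floating-point operations in an (m x k) @ (k x n) matmul:
    one multiply and one add per inner-product term."""
    return 2 * m * n * k

def peak_fraction(m, n, k, elapsed_ms, peak_tflops):
    """Achieved TFLOP/s as a fraction of the hardware ceiling."""
    achieved_tflops = gemm_flops(m, n, k) / (elapsed_ms * 1e-3) / 1e12
    return achieved_tflops / peak_tflops

# Placeholder ceiling for illustration only, NOT a real hardware number.
HYPOTHETICAL_PEAK_TFLOPS = 1000.0

frac = peak_fraction(4096, 4096, 4096, elapsed_ms=0.5,
                     peak_tflops=HYPOTHETICAL_PEAK_TFLOPS)
print(round(frac, 3))  # 0.275
```

Unlike a raw speedup ratio, a fraction of this kind does not improve just because the reference implementation happens to be slow.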