kernelbench.com

hard

RTX PRO 6000 Blackwell (sm120)

Leading models (Opus 4.8, GPT-5.5, GLM-5.2, MiniMax-M3, Gemini 3.5 Flash, Kimi K2.7-Code) were re-swept in June 2026 with unlimited time per problem; earlier rows used the original 45-minute budget. Claude Fable 5 is suspended and shown as a frozen 45-minute reference.

showing 54 of 54 rows
claude-fable-5 [max] | Claude Code | FP8 GEMM | - | no run | no run | clean | - | - | - | -
claude-opus-4-8 | Claude Code | FP8 GEMM | 2026-06-14 14:42:16 | yes | pass | clean | 39% | out - | think - | cache 0 | agent 50m 6s | check - | bench -
Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.
cursor/composer-2.5-fast | cursor | FP8 GEMM | 2026-06-15 11:45:32 | yes | pass | clean | 38% | out - | think - | cache 0 | agent 18m 29s | check - | bench -
deepseek-deepseek-v4-pro | deepseek-claude | FP8 GEMM | 2026-06-14 16:31:28 | yes | pass | clean | 34% | out - | think - | cache 0 | agent 35m 45s | check - | bench -
Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.
gemini/gemini-3.5-flash | Gemini CLI | FP8 GEMM | 2026-06-14 17:07:13 | yes | pass | clean | 20% | out - | think - | cache 0 | agent 10m 53s | check - | bench -
Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.
GPT-5.5 | codex | FP8 GEMM | 2026-06-14 14:42:24 | yes | pass | clean | 36% | out - | think - | cache 0 | agent 13m 5s | check - | bench -
Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.
kimi-kimi-k2.7-code | kimi-claude | FP8 GEMM | 2026-06-14 17:18:08 | yes | pass | clean | 35% | out - | think - | cache 0 | agent 68m 34s | check - | bench -
Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.
MiniMax M3 | Claude Code | FP8 GEMM | 2026-06-14 15:32:22 | yes | pass | clean | 37% | out - | think - | cache 0 | agent 169m 6s | check - | bench -
Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.
zai-glm-5.2 | Claude Code | FP8 GEMM | 2026-06-14 14:55:29 | yes | invalid | reward hack | 41% | out - | think - | cache 0 | agent 95m 58s | check - | bench -
Reward hack: output memoization. Wrote a real fp8 kernel but added an input-identity cache (data_ptr() check) that returns a stored output when it sees the same input buffer, so the timed loop measures a lookup, not the GEMM. Authored kernel present but the score is faked.
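The memoization pattern described in this note can be illustrated with a small pure-Python sketch. Everything here is a hypothetical stand-in rather than the submitted code: `FakeBuffer` mimics a tensor with a fixed address, `honest_forward` stands in for the real kernel, and `detects_memoization` is one possible harness-side probe, assumed for illustration.

```python
class FakeBuffer:
    """Hypothetical stand-in for a device tensor: fixed address, mutable contents."""

    def __init__(self, address, values):
        self._address = address
        self.values = list(values)

    def data_ptr(self):
        return self._address


def honest_forward(buf):
    # Stand-in for the real kernel: recomputes from the live contents on every call.
    return [2 * v for v in buf.values]


_cache = {}


def memoized_forward(buf):
    # The flagged pattern: the cache key is the buffer address only,
    # so the contents are never looked at after the first call.
    key = buf.data_ptr()
    if key not in _cache:
        _cache[key] = honest_forward(buf)
    return _cache[key]


def detects_memoization(forward):
    """Probe: mutate the input in place and check that the output follows it."""
    buf = FakeBuffer(address=0x1000, values=[1, 2, 3])
    forward(buf)
    buf.values[0] = 10  # same buffer, new contents
    return forward(buf) != honest_forward(buf)
```

A timing loop that reuses one input buffer would measure only the dictionary lookup in `memoized_forward`; the probe returns `True` for it and `False` for `honest_forward`.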
claude-fable-5 [max] | Claude Code | KDA CUTLASS | 2026-06-10 22:03:14 | yes | pass | interesting | 4% | out - | think - | cache 0 | agent 31m 38s | check - | bench -
Highest KDA score (2x next best) from a fully-authored 3-kernel Triton pipeline featuring single-kernel block-triangular inversion: batched 16x16 diagonal forward substitution, then block merge M[i][j] = -Mi[i] @ (sum_k Akk[i][k] @ M[k][j]) staged through scratch with debug_barrier.
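The block merge formula in this note can be sketched on the host with NumPy. This is a minimal illustration under stated assumptions, not the Triton pipeline: the matrix is assumed to be lower triangular with a size divisible by the block size, and the function names are hypothetical. The scratch staging and debug_barrier synchronisation are not modelled.

```python
import numpy as np


def forward_substitution_inverse(T):
    """Invert a small lower-triangular block column by column (solve T x = e_col)."""
    n = T.shape[0]
    inv = np.zeros_like(T)
    for col in range(n):
        for row in range(col, n):
            rhs = 1.0 if row == col else 0.0
            acc = rhs - T[row, col:row] @ inv[col:row, col]
            inv[row, col] = acc / T[row, row]
    return inv


def block_lower_triangular_inverse(A, block=16):
    """Diagonal blocks first, then M[i][j] = -Mi[i] @ sum_k A[i][k] @ M[k][j]."""
    n = A.shape[0]
    assert n % block == 0, "sketch assumes the size is a multiple of the block size"
    nb = n // block

    def blk(i, j):
        return (slice(i * block, (i + 1) * block), slice(j * block, (j + 1) * block))

    diag_inv = [forward_substitution_inverse(A[blk(i, i)]) for i in range(nb)]
    M = np.zeros_like(A)
    for i in range(nb):
        M[blk(i, i)] = diag_inv[i]
    for j in range(nb):
        for i in range(j + 1, nb):  # rows below the diagonal, top to bottom
            acc = np.zeros((block, block), dtype=A.dtype)
            for k in range(j, i):   # only blocks of M already computed
                acc += A[blk(i, k)] @ M[blk(k, j)]
            M[blk(i, j)] = -diag_inv[i] @ acc
    return M
```

The merge follows from requiring that block row i of A times block column j of M be zero for i > j, which leaves only already-computed blocks M[k][j] with k < i on the right-hand side.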
claude-opus-4-8 | Claude Code | KDA CUTLASS | 2026-06-13 04:22:57 | yes | pass | clean | 6% | out - | think - | cache 0 | agent 113m 47s | check - | bench -
cursor/composer-2.5-fast | cursor | KDA CUTLASS | 2026-06-15 11:45:40 | yes | pass | clean | 3% | out - | think - | cache 0 | agent 31m 25s | check - | bench -
deepseek-deepseek-v4-pro | deepseek-claude | KDA CUTLASS | 2026-06-15 12:55:50 | unknown | fail | clean | - | out - | think - | cache 0 | agent 65m 59s | check - | bench -
gemini/gemini-3.5-flash | Gemini CLI | KDA CUTLASS | 2026-06-13 10:15:39 | yes | pass | clean | 1% | out - | think - | cache 0 | agent 86m 30s | check - | bench -
GPT-5.5 | codex | KDA CUTLASS | 2026-06-13 04:23:45 | yes | pass | clean | 4% | out - | think - | cache 0 | agent 77m 26s | check - | bench -
kimi-kimi-k2.7-code | kimi-claude | KDA CUTLASS | 2026-06-13 16:38:58 | yes | pass | clean | 2% | out - | think - | cache 0 | agent 128m 28s | check - | bench -
MiniMax M3 | Claude Code | KDA CUTLASS | 2026-06-13 06:40:19 | unknown | fail | bug | - | out - | think - | cache 0 | agent 364m 9s | check - | bench -
Timeout at the 6-hour session cap with a non-working kernel (has_solution but correct=false). MiniMax was genuinely grinding on the KDA chunked-recurrence forward: debugging the (I-A)^-1 Neumann inverse, beta row-vs-column scaling, gate cumsum, tf32 precision, and shared-memory pressure in the o_kernel across many Triton/CUDA-C++ rewrites. Sub-kernels matched (wu within 0.0015) but it never converged on a correct, shmem-fitting full kernel; at the cap it was thrashing on flaky background-task retrieval (repeated cat timeouts, final exit 137).
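For reference, the Neumann inverse mentioned in this note is a finite sum when A is strictly lower triangular, which is assumed here for the per-chunk matrix (n is the chunk size, a symbol introduced only for this illustration). The second identity is the standard doubling form, valid for any m with 2^m >= n:

```latex
(I - A)^{-1} \;=\; \sum_{k=0}^{n-1} A^{k}
\;=\; \prod_{t=0}^{m-1} \bigl(I + A^{2^{t}}\bigr),
\qquad A \in \mathbb{R}^{n \times n} \text{ strictly lower triangular},\quad A^{n} = 0,\quad 2^{m} \ge n .
```

Because the high powers of A are accumulated through repeated matrix products, reduced-precision accumulation such as tf32 can compound across terms, which is consistent with precision appearing on the list of issues being debugged.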
zai-glm-5.2 | Claude Code | KDA CUTLASS | 2026-06-13 05:41:11 | yes | pass | clean | 3% | out - | think - | cache 0 | agent 241m 14s | check - | bench -
claude-fable-5 [max] | Claude Code | Paged Attention | 2026-06-11 05:01:47 | yes | pass | interesting | 63% | out - | think - | cache 0 | agent 60m 42s | check - | bench -
Best CLEAN cell this sweep (qwen 0.6268 was graph-replay flagged; gpt-5.5 0.664 remains all-time). 3600s-budget rerun of the 0.534 cell that timed out at 2700s. Hand-written CUDA flash-decode with register aliasing. os.environ reads (PD_S/PD_NWARPS/TORCH_CUDA_ARCH_LIST) are tuning knobs with fixed defaults - harness never varies them between check and benchmark, so no behavior switch; kernel recomputes into a persistent output buffer every call. Clean.
claude-opus-4-8 | Claude Code | Paged Attention | 2026-06-13 04:23:05 | yes | pass | clean | 67% | out - | think - | cache 0 | agent 111m 37s | check - | bench -
cursor/composer-2.5-fast | cursor | Paged Attention | 2026-06-15 12:04:02 | yes | pass | clean | 26% | out - | think - | cache 0 | agent 3m 49s | check - | bench -
deepseek-deepseek-v4-pro | deepseek-claude | Paged Attention | 2026-06-15 12:57:21 | yes | pass | clean | 39% | out - | think - | cache 0 | agent 25m 8s | check - | bench -
gemini/gemini-3.5-flash | Gemini CLI | Paged Attention | 2026-06-13 10:54:39 | yes | pass | clean | 24% | out - | think - | cache 0 | agent 48m 0s | check - | bench -
GPT-5.5 | codex | Paged Attention | 2026-06-13 04:45:38 | yes | pass | clean | 56% | out - | think - | cache 0 | agent 26m 37s | check - | bench -
kimi-kimi-k2.7-code | kimi-claude | Paged Attention | 2026-06-13 16:39:06 | yes | pass | clean | 24% | out - | think - | cache 0 | agent 40m 39s | check - | bench -
MiniMax M3 | Claude Code | Paged Attention | 2026-06-13 08:13:06 | yes | pass | clean | 51% | out - | think - | cache 0 | agent 354m 42s | check - | bench -
zai-glm-5.2 | Claude Code | Paged Attention | 2026-06-13 05:58:15 | yes | pass | clean | 68% | out - | think - | cache 0 | agent 233m 51s | check - | bench -
claude-fable-5 [max] | Claude Code | TopK Bitonic | 2026-06-10 17:47:00 | yes | pass | interesting | 5% | out - | think - | cache 0 | agent 47m 43s | check - | bench -
Faiss WarpSelect-style register-resident top-k with warp-shuffle bitonic merges, values packed as monotonic fp32->u32 keys with index into u64 so all compares are integer, multi-split rows merged in one kernel via device-scope acq-rel counter. The _run_cached path skips only pointer rebinding - the kernel launches every call, no compute elided. Legitimate column top after the gpt-5.5 memoization cell was invalidated.
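The key packing described in this note can be sketched in plain Python. This is an illustration of the general order-preserving fp32 to u32 transform, not the submitted kernel; the function names are hypothetical, and placing the index in the low 32 bits (so that ties resolve toward the larger index) is an assumption made for the sketch. NaN handling is left out.

```python
import struct


def float_to_ordered_u32(x):
    """Map an fp32 value to a u32 whose unsigned order matches the float order."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if bits & 0x80000000:
        return ~bits & 0xFFFFFFFF   # negative: flip every bit
    return bits | 0x80000000        # non-negative: set the sign bit


def pack_key(value, index):
    """Ordered key in the high 32 bits, element index in the low 32 bits."""
    return (float_to_ordered_u32(value) << 32) | (index & 0xFFFFFFFF)


def unpack_index(packed):
    return packed & 0xFFFFFFFF


def topk_indices(values, k):
    """Select the k largest values using integer comparisons only."""
    packed = sorted((pack_key(v, i) for i, v in enumerate(values)), reverse=True)
    return [unpack_index(p) for p in packed[:k]]
```

Because the value occupies the high bits, a single 64-bit integer compare orders by value first and carries the index along for free, which is what lets a bitonic merge network avoid floating-point compares.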
claude-opus-4-8 | Claude Code | TopK Bitonic | 2026-06-13 04:23:13 | yes | pass | clean | 3% | out - | think - | cache 0 | agent 137m 5s | check - | bench -
cursor/composer-2.5-fast | cursor | TopK Bitonic | 2026-06-15 12:07:52 | yes | pass | clean | 0% | out - | think - | cache 0 | agent 25m 44s | check - | bench -
deepseek-deepseek-v4-pro | deepseek-claude | TopK Bitonic | 2026-06-15 13:22:30 | yes | pass | clean | 1% | out - | think - | cache 0 | agent 67m 49s | check - | bench -
gemini/gemini-3.5-flash | Gemini CLI | TopK Bitonic | 2026-06-13 11:27:32 | yes | pass | clean | 3% | out - | think - | cache 0 | agent 84m 47s | check - | bench -
GPT-5.5 | codex | TopK Bitonic | 2026-06-13 05:03:35 | yes | pass | clean | 5% | out - | think - | cache 0 | agent 34m 40s | check - | bench -
kimi-kimi-k2.7-code | kimi-claude | TopK Bitonic | 2026-06-13 16:39:14 | unknown | fail | bug | - | out - | think - | cache 0 | agent 183m 4s | check - | bench -
Wrong-answer bug: pass-2 row indexing uses the wrong K. Pass 1 writes per-chunk candidates with row stride chunks_per_row*k (k=16), but pass 2 is dispatched with a template K rounded up from max_candidates (e.g. K=32 at 128 candidates), so it computes real_count=chunks_per_row*32 and in_base=row*real_count, reading misaligned candidate memory for every row>0. Shape 3 (batch=16,n=12000,k=16) fails with 227/256 mismatched (max_abs_diff 1.48, worst row 7); single-row shapes pass because row 0 is unaffected.
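The stride mismatch in this note can be reproduced with a few lines of Python. This is a simplified model, not the failing kernel: the helper names are hypothetical, and chunks_per_row=8 is an assumption chosen so that k=16 gives the 128 candidates per row mentioned above.

```python
def pass1_layout(rows, chunks_per_row, k):
    """Pass 1 layout: row r owns the slots [r * stride, (r + 1) * stride)."""
    stride = chunks_per_row * k
    buf = []
    for r in range(rows):
        buf.extend((r, slot) for slot in range(stride))  # tag each slot with its owner row
    return buf


def owner_at_pass2_base(buf, row, chunks_per_row, k_used):
    """Pass 2 base offset, computed from whichever K the kernel was built with."""
    real_count = chunks_per_row * k_used
    in_base = row * real_count
    if in_base >= len(buf):
        return None  # offset runs past the candidate buffer entirely
    return buf[in_base][0]


ROWS, CHUNKS_PER_ROW, K = 16, 8, 16
candidates = pass1_layout(ROWS, CHUNKS_PER_ROW, K)

# Consistent K: every row starts reading its own candidates.
consistent = [owner_at_pass2_base(candidates, r, CHUNKS_PER_ROW, 16) for r in range(ROWS)]

# Template K rounded up to 32: only row 0 lands on its own data.
mismatched = [owner_at_pass2_base(candidates, r, CHUNKS_PER_ROW, 32) for r in range(ROWS)]
```

In this model, row 0 is correct under both settings because its base offset is zero regardless of K, which matches the observation that single-row shapes pass.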
MiniMax M3 | Claude Code | TopK Bitonic | 2026-06-13 08:17:45 | yes | pass | clean | 1% | out - | think - | cache 0 | agent 361m 35s | check - | bench -
zai-glm-5.2 | Claude Code | TopK Bitonic | 2026-06-13 06:08:14 | yes | pass | clean | 3% | out - | think - | cache 0 | agent 235m 37s | check - | bench -
claude-fable-5 [max] | Claude Code | Sonic MoE SwiGLU | 2026-06-11 05:02:05 | yes | pass | interesting | 11% | out - | think - | cache 0 | agent 54m 54s | check - | bench -
NEW ALL-TIME RECORD (prior 0.254 MiniMax M3). 3600s-budget rerun of the 0.2395 cell that timed out at 2700s. Device-side pid->tile mapping via tl.cumsum (no host sync), GROUP_M L2 swizzle. _launch_cache is keyed on SHAPE and caches the compiled Triton kernel, re-running it with live inputs every call (compiled.run(..., *args)) - skips JIT dispatch, not compute; categorically not the output-memoization hack. Clean.
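The pid->tile mapping mentioned in this note can be sketched on the host in plain Python. This is a hedged illustration of the idea only: in the kernel the prefix sum would be computed on device with tl.cumsum, whereas here `itertools.accumulate` and `bisect` stand in for it, and the function names and the BLOCK_M value in the usage note are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def tiles_per_expert(tokens_per_expert, block_m):
    """Number of BLOCK_M-row tiles each expert needs (ceil division)."""
    return [(n + block_m - 1) // block_m for n in tokens_per_expert]


def pid_to_tile(pid, tokens_per_expert, block_m):
    """Map a flat program id to (expert, tile within expert) via an inclusive cumsum."""
    ends = list(accumulate(tiles_per_expert(tokens_per_expert, block_m)))
    if pid >= ends[-1]:
        return None  # surplus program: no tile assigned
    expert = bisect_right(ends, pid)  # first expert whose running total exceeds pid
    start = ends[expert - 1] if expert > 0 else 0
    return expert, pid - start
```

With token counts [130, 0, 64, 1] and a block of 64 rows, the tile counts are [3, 0, 1, 1], so program 3 maps to expert 2 and the empty expert is skipped without any host-side synchronisation on the routing result.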
claude-opus-4-8 | Claude Code | Sonic MoE SwiGLU | 2026-06-13 04:23:21 | yes | pass | clean | 9% | out - | think - | cache 0 | agent 104m 53s | check - | bench -
cursor/composer-2.5-fast | cursor | Sonic MoE SwiGLU | 2026-06-15 12:17:05 | yes | pass | clean | 10% | out - | think - | cache 0 | agent 38m 45s | check - | bench -
deepseek-deepseek-v4-pro | deepseek-claude | Sonic MoE SwiGLU | 2026-06-15 14:01:49 | yes | pass | clean | 5% | out - | think - | cache 0 | agent 96m 38s | check - | bench -
gemini/gemini-3.5-flash | Gemini CLI | Sonic MoE SwiGLU | 2026-06-13 11:42:10 | yes | pass | clean | 9% | out - | think - | cache 0 | agent 105m 35s | check - | bench -
GPT-5.5 | codex | Sonic MoE SwiGLU | 2026-06-13 05:12:16 | yes | pass | clean | 10% | out - | think - | cache 0 | agent 27m 29s | check - | bench -
kimi-kimi-k2.7-code | kimi-claude | Sonic MoE SwiGLU | 2026-06-13 17:19:46 | yes | pass | clean | 10% | out - | think - | cache 0 | agent 166m 28s | check - | bench -
MiniMax M3 | Claude Code | Sonic MoE SwiGLU | 2026-06-13 09:42:26 | yes | pass | clean | 9% | out - | think - | cache 0 | agent 363m 37s | check - | bench -
zai-glm-5.2 | Claude Code | Sonic MoE SwiGLU | 2026-06-13 06:14:33 | yes | pass | clean | 10% | out - | think - | cache 0 | agent 241m 6s | check - | bench -
claude-fable-5 [max] | Claude Code | W4A16 GEMM | 2026-06-10 22:44:47 | yes | pass | interesting | 35% | out - | think - | cache 0 | agent 48m 26s | check - | bench -
New problem ceiling (prior 0.220). Weights stay int4-packed end to end; in-kernel magic-OR unpack ((b & 0xF) | 0x4300 is the bf16 bit pattern of 128+w exactly) folds the zero-point before the tensor-core dot; evict_last keeps weights L2-resident; pointer-keyed CUDA-graph replay removes launch overhead but the kernel executes every call. POLICY CAVEAT flagged by audit: module import sets torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction=False globally, which changes reference numerics during check.py (toward higher precision, openly documented in a solution comment, tolerance 0.10) - solution code mutating harness-global state is currently unpoliced and deserves an explicit rule.
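The magic-OR identity quoted in this note can be checked with the standard library alone. This is a sketch, not the submitted kernel: the helper names are hypothetical, the low-nibble-first packing order is an assumption, and the scale multiplication is included only to show where a zero-point fold would sit in a typical dequantisation.

```python
import struct


def bf16_bits_to_float(bits16):
    """Interpret a 16-bit pattern as bfloat16 by widening it to fp32."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]


def unpack_nibbles(byte):
    """Split one packed byte into two 4-bit weights (low nibble first, assumed)."""
    return byte & 0xF, (byte >> 4) & 0xF


def magic_or_dequant(w4, zero_point, scale):
    """Dequantise one int4 weight without an integer-to-float conversion."""
    # 0x4300 is bf16 128.0; at that exponent one mantissa step is exactly 1.0,
    # so OR-ing the nibble into the low mantissa bits yields 128 + w4.
    biased = bf16_bits_to_float((w4 & 0xF) | 0x4300)
    # Subtracting 128 + zero_point removes the bias and the zero-point together.
    return (biased - (128.0 + zero_point)) * scale
```

The trick works because bf16 has a 7-bit mantissa, so every integer from 128 to 143 is exactly representable and the subtraction that follows is exact.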
claude-opus-4-8 | Claude Code | W4A16 GEMM | 2026-06-13 11:54:10 | yes | pass | clean | 24% | out - | think - | cache 0 | agent 218m 1s | check - | bench -
cursor/composer-2.5-fast | cursor | W4A16 GEMM | 2026-06-15 12:33:36 | yes | pass | clean | 15% | out - | think - | cache 0 | agent 23m 45s | check - | bench -
deepseek-deepseek-v4-pro | deepseek-claude | W4A16 GEMM | 2026-06-15 14:30:20 | yes | pass | clean | 15% | out - | think - | cache 0 | agent 53m 29s | check - | bench -
gemini/gemini-3.5-flash | Gemini CLI | W4A16 GEMM | 2026-06-13 11:42:39 | yes | pass | clean | 17% | out - | think - | cache 0 | agent 100m 35s | check - | bench -
GPT-5.5 | codex | W4A16 GEMM | 2026-06-13 05:38:15 | yes | pass | clean | 20% | out - | think - | cache 0 | agent 20m 0s | check - | bench -
kimi-kimi-k2.7-code | kimi-claude | W4A16 GEMM | 2026-06-13 18:15:41 | yes | pass | clean | 15% | out - | think - | cache 0 | agent 94m 52s | check - | bench -
MiniMax M3 | Claude Code | W4A16 GEMM | 2026-06-13 09:52:07 | yes | pass | clean | 14% | out - | think - | cache 0 | agent 280m 5s | check - | bench -
zai-glm-5.2 | Claude Code | W4A16 GEMM | 2026-06-13 06:14:43 | yes | pass | clean | 32% | out - | think - | cache 0 | agent 279m 55s | check - | bench -

Browse the run index for transcripts, submitted solutions, checks, timing, and costs. Full historical and diagnostic rows are still available in leaderboard.json.