kernelbench.com: Agentic GPU Kernel Benchmark Results and Run Artifacts

Arledge, Elliot

hard

RTX PRO 6000 Blackwell (sm120)

Leading models (Opus 4.8, GPT-5.5, GLM-5.2, MiniMax-M3, Gemini 3.5 Flash, Kimi K2.7-Code) were reswept June 2026 with unlimited time per problem; earlier rows used the original 45-minute budget. Claude Fable 5 is suspended and shown as a frozen 45-minute reference.

problemharnessoutcome

showing 54 of 54 rows

											files	conversation
claude-fable-5 [max]	Claude Code	FP8 GEMM	-	no run	no run	clean	-	-	-	-	referencesolution	transcript0/0 no run
claude-opus-4-8	Claude Code	FP8 GEMM	2026-06-1414:42:16	yes	pass	clean	Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.	39%	out -think -cache 0	agent 50m 6scheck -bench -	reference solution	transcript6/6 Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.
cursor/composer-2.5-fast	cursor	FP8 GEMM	2026-06-1511:45:32	yes	pass	clean	-	38%	out -think -cache 0	agent 18m 29scheck -bench -	reference solution	transcript6/6 run details
deepseek-deepseek-v4-pro	deepseek-claude	FP8 GEMM	2026-06-1416:31:28	yes	pass	clean	Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.	34%	out -think -cache 0	agent 35m 45scheck -bench -	reference solution	transcript5/6 Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.
gemini/gemini-3.5-flash	Gemini CLI	FP8 GEMM	2026-06-1417:07:13	yes	pass	clean	Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.	20%	out -think -cache 0	agent 10m 53scheck -bench -	reference solution	transcript6/6 Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.
GPT-5.5	codex	FP8 GEMM	2026-06-1414:42:24	yes	pass	clean	Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.	36%	out -think -cache 0	agent 13m 5scheck -bench -	reference solution	transcript6/6 Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.
kimi-kimi-k2.7-code	kimi-claude	FP8 GEMM	2026-06-1417:18:08	yes	pass	clean	Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.	35%	out -think -cache 0	agent 68m 34scheck -bench -	reference solution	transcript5/6 Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.
MiniMax M3	Claude Code	FP8 GEMM	2026-06-1415:32:22	yes	pass	clean	Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.	37%	out -think -cache 0	agent 169m 6scheck -bench -	reference solution	transcript5/6 Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast.
zai-glm-5.2	Claude Code	FP8 GEMM	2026-06-1414:55:29	yes	invalid	reward hack	Reward hack: output memoization. Wrote a real fp8 kernel but added an input-identity cache (data_ptr() check) that returns a stored output when it sees the same input buffer, so the timed loop measures a lookup, not the GEMM. Authored kernel present but the score is faked.	41%	out -think -cache 0	agent 95m 58scheck -bench -	reference solution	transcript5/6 Reward hack: output memoization. Wrote a real fp8 kernel but added an input-identity cache (data_ptr() check) that returns a stored output when it sees the same input buffer, so the timed loop measures a lookup, not the GEMM. Authored kernel present but the score is faked.
claude-fable-5 [max]	Claude Code	KDA CUTLASS	2026-06-1022:03:14	yes	pass	interesting	Highest KDA score (2x next best) from a fully-authored 3-kernel Triton pipeline featuring single-kernel block-triangular inversion: batched 16x16 diagonal forward substitution, then block merge M[i][j] = -Mi[i] @ (sum_k Akk[i][k] @ M[k][j]) staged through scratch with debug_barrier.	4%	out -think -cache 0	agent 31m 38scheck -bench -	reference solution	transcript5/5 Highest KDA score (2x next best) from a fully-authored 3-kernel Triton pipeline featuring single-kernel block-triangular inversion: batched 16x16 diagonal forward substitution, then block merge M[i][j] = -Mi[i] @ (sum_k Akk[i][k] @ M[k][j]) staged through scratch with debug_barrier.
claude-opus-4-8	Claude Code	KDA CUTLASS	2026-06-1304:22:57	yes	pass	clean	-	6%	out -think -cache 0	agent 113m 47scheck -bench -	reference solution	transcript6/6 run details
cursor/composer-2.5-fast	cursor	KDA CUTLASS	2026-06-1511:45:40	yes	pass	clean	-	3%	out -think -cache 0	agent 31m 25scheck -bench -	reference solution	transcript6/6 run details
deepseek-deepseek-v4-pro	deepseek-claude	KDA CUTLASS	2026-06-1512:55:50	unknown	fail	clean	-	-	out -think -cache 0	agent 65m 59scheck -bench -	reference solution	transcript5/6 run details
gemini/gemini-3.5-flash	Gemini CLI	KDA CUTLASS	2026-06-1310:15:39	yes	pass	clean	-	1%	out -think -cache 0	agent 86m 30scheck -bench -	reference solution	transcript6/6 run details
GPT-5.5	codex	KDA CUTLASS	2026-06-1304:23:45	yes	pass	clean	-	4%	out -think -cache 0	agent 77m 26scheck -bench -	reference solution	transcript6/6 run details
kimi-kimi-k2.7-code	kimi-claude	KDA CUTLASS	2026-06-1316:38:58	yes	pass	clean	-	2%	out -think -cache 0	agent 128m 28scheck -bench -	reference solution	transcript5/6 run details
MiniMax M3	Claude Code	KDA CUTLASS	2026-06-1306:40:19	unknown	fail	bug	Timeout at the 6-hour session cap with a non-working kernel (has_solution but correct=false). MiniMax was genuinely grinding on the KDA chunked-recurrence forward: debugging the (I-A)^-1 Neumann inverse, beta row-vs-column scaling, gate cumsum, tf32 precision, and shared-memory pressure in the o_kernel across many Triton/CUDA-C++ rewrites. Sub-kernels matched (wu within 0.0015) but it never converged on a correct, shmem-fitting full kernel; at the cap it was thrashing on flaky background-task retrieval (repeated cat timeouts, final exit 137).	-	out -think -cache 0	agent 364m 9scheck -bench -	reference solution	transcript5/6 Timeout at the 6-hour session cap with a non-working kernel (has_solution but correct=false). MiniMax was genuinely grinding on the KDA chunked-recurrence forward: debugging the (I-A)^-1 Neumann inverse, beta row-vs-column scaling, gate cumsum, tf32 precision, and shared-memory pressure in the o_kernel across many Triton/CUDA-C++ rewrites. Sub-kernels matched (wu within 0.0015) but it never converged on a correct, shmem-fitting full kernel; at the cap it was thrashing on flaky background-task retrieval (repeated cat timeouts, final exit 137).
zai-glm-5.2	Claude Code	KDA CUTLASS	2026-06-1305:41:11	yes	pass	clean	-	3%	out -think -cache 0	agent 241m 14scheck -bench -	reference solution	transcript5/6 run details
claude-fable-5 [max]	Claude Code	Paged Attention	2026-06-1105:01:47	yes	pass	interesting	Best CLEAN cell this sweep (qwen 0.6268 was graph-replay flagged; gpt-5.5 0.664 remains all-time). 3600s-budget rerun of the 0.534 cell that timed out at 2700s. Hand-written CUDA flash-decode with register aliasing. os.environ reads (PD_S/PD_NWARPS/TORCH_CUDA_ARCH_LIST) are tuning knobs with fixed defaults - harness never varies them between check and benchmark, so no behavior switch; kernel recomputes into a persistent output buffer every call. Clean.	63%	out -think -cache 0	agent 60m 42scheck -bench -	reference solution	transcript5/5 Best CLEAN cell this sweep (qwen 0.6268 was graph-replay flagged; gpt-5.5 0.664 remains all-time). 3600s-budget rerun of the 0.534 cell that timed out at 2700s. Hand-written CUDA flash-decode with register aliasing. os.environ reads (PD_S/PD_NWARPS/TORCH_CUDA_ARCH_LIST) are tuning knobs with fixed defaults - harness never varies them between check and benchmark, so no behavior switch; kernel recomputes into a persistent output buffer every call. Clean.
claude-opus-4-8	Claude Code	Paged Attention	2026-06-1304:23:05	yes	pass	clean	-	67%	out -think -cache 0	agent 111m 37scheck -bench -	reference solution	transcript6/6 run details
cursor/composer-2.5-fast	cursor	Paged Attention	2026-06-1512:04:02	yes	pass	clean	-	26%	out -think -cache 0	agent 3m 49scheck -bench -	reference solution	transcript6/6 run details
deepseek-deepseek-v4-pro	deepseek-claude	Paged Attention	2026-06-1512:57:21	yes	pass	clean	-	39%	out -think -cache 0	agent 25m 8scheck -bench -	reference solution	transcript5/6 run details
gemini/gemini-3.5-flash	Gemini CLI	Paged Attention	2026-06-1310:54:39	yes	pass	clean	-	24%	out -think -cache 0	agent 48m 0scheck -bench -	reference solution	transcript6/6 run details
GPT-5.5	codex	Paged Attention	2026-06-1304:45:38	yes	pass	clean	-	56%	out -think -cache 0	agent 26m 37scheck -bench -	reference solution	transcript6/6 run details
kimi-kimi-k2.7-code	kimi-claude	Paged Attention	2026-06-1316:39:06	yes	pass	clean	-	24%	out -think -cache 0	agent 40m 39scheck -bench -	reference solution	transcript5/6 run details
MiniMax M3	Claude Code	Paged Attention	2026-06-1308:13:06	yes	pass	clean	-	51%	out -think -cache 0	agent 354m 42scheck -bench -	reference solution	transcript5/6 run details
zai-glm-5.2	Claude Code	Paged Attention	2026-06-1305:58:15	yes	pass	clean	-	68%	out -think -cache 0	agent 233m 51scheck -bench -	reference solution	transcript5/6 run details
claude-fable-5 [max]	Claude Code	TopK Bitonic	2026-06-1017:47:00	yes	pass	interesting	Faiss WarpSelect-style register-resident top-k with warp-shuffle bitonic merges, values packed as monotonic fp32->u32 keys with index into u64 so all compares are integer, multi-split rows merged in one kernel via device-scope acq-rel counter. The _run_cached path skips only pointer rebinding - the kernel launches every call, no compute elided. Legitimate column top after the gpt-5.5 memoization cell was invalidated.	5%	out -think -cache 0	agent 47m 43scheck -bench -	reference solution	transcript5/5 Faiss WarpSelect-style register-resident top-k with warp-shuffle bitonic merges, values packed as monotonic fp32->u32 keys with index into u64 so all compares are integer, multi-split rows merged in one kernel via device-scope acq-rel counter. The _run_cached path skips only pointer rebinding - the kernel launches every call, no compute elided. Legitimate column top after the gpt-5.5 memoization cell was invalidated.
claude-opus-4-8	Claude Code	TopK Bitonic	2026-06-1304:23:13	yes	pass	clean	-	3%	out -think -cache 0	agent 137m 5scheck -bench -	reference solution	transcript6/6 run details
cursor/composer-2.5-fast	cursor	TopK Bitonic	2026-06-1512:07:52	yes	pass	clean	-	0%	out -think -cache 0	agent 25m 44scheck -bench -	reference solution	transcript6/6 run details
deepseek-deepseek-v4-pro	deepseek-claude	TopK Bitonic	2026-06-1513:22:30	yes	pass	clean	-	1%	out -think -cache 0	agent 67m 49scheck -bench -	reference solution	transcript5/6 run details
gemini/gemini-3.5-flash	Gemini CLI	TopK Bitonic	2026-06-1311:27:32	yes	pass	clean	-	3%	out -think -cache 0	agent 84m 47scheck -bench -	reference solution	transcript6/6 run details
GPT-5.5	codex	TopK Bitonic	2026-06-1305:03:35	yes	pass	clean	-	5%	out -think -cache 0	agent 34m 40scheck -bench -	reference solution	transcript6/6 run details
kimi-kimi-k2.7-code	kimi-claude	TopK Bitonic	2026-06-1316:39:14	unknown	fail	bug	Wrong-answer bug: pass-2 row indexing uses the wrong K. Pass 1 writes per-chunk candidates with row stride chunks_per_rowk (k=16), but pass 2 is dispatched with a template K rounded up from max_candidates (e.g. K=32 at 128 candidates), so it computes real_count=chunks_per_row32 and in_base=row*real_count, reading misaligned candidate memory for every row>0. Shape 3 (batch=16,n=12000,k=16) fails with 227/256 mismatched (max_abs_diff 1.48, worst row 7); single-row shapes pass because row 0 is unaffected.	-	out -think -cache 0	agent 183m 4scheck -bench -	reference solution	transcript5/6 Wrong-answer bug: pass-2 row indexing uses the wrong K. Pass 1 writes per-chunk candidates with row stride chunks_per_rowk (k=16), but pass 2 is dispatched with a template K rounded up from max_candidates (e.g. K=32 at 128 candidates), so it computes real_count=chunks_per_row32 and in_base=row*real_count, reading misaligned candidate memory for every row>0. Shape 3 (batch=16,n=12000,k=16) fails with 227/256 mismatched (max_abs_diff 1.48, worst row 7); single-row shapes pass because row 0 is unaffected.
MiniMax M3	Claude Code	TopK Bitonic	2026-06-1308:17:45	yes	pass	clean	-	1%	out -think -cache 0	agent 361m 35scheck -bench -	reference solution	transcript5/6 run details
zai-glm-5.2	Claude Code	TopK Bitonic	2026-06-1306:08:14	yes	pass	clean	-	3%	out -think -cache 0	agent 235m 37scheck -bench -	reference solution	transcript5/6 run details
claude-fable-5 [max]	Claude Code	Sonic MoE SwiGLU	2026-06-1105:02:05	yes	pass	interesting	NEW ALL-TIME RECORD (prior 0.254 MiniMax M3). 3600s-budget rerun of the 0.2395 cell that timed out at 2700s. Device-side pid->tile mapping via tl.cumsum (no host sync), GROUP_M L2 swizzle. _launch_cache is keyed on SHAPE and caches the compiled Triton kernel, re-running it with live inputs every call (compiled.run(..., *args)) - skips JIT dispatch, not compute; categorically not the output-memoization hack. Clean.	11%	out -think -cache 0	agent 54m 54scheck -bench -	reference solution	transcript5/5 NEW ALL-TIME RECORD (prior 0.254 MiniMax M3). 3600s-budget rerun of the 0.2395 cell that timed out at 2700s. Device-side pid->tile mapping via tl.cumsum (no host sync), GROUP_M L2 swizzle. _launch_cache is keyed on SHAPE and caches the compiled Triton kernel, re-running it with live inputs every call (compiled.run(..., *args)) - skips JIT dispatch, not compute; categorically not the output-memoization hack. Clean.
claude-opus-4-8	Claude Code	Sonic MoE SwiGLU	2026-06-1304:23:21	yes	pass	clean	-	9%	out -think -cache 0	agent 104m 53scheck -bench -	reference solution	transcript6/6 run details
cursor/composer-2.5-fast	cursor	Sonic MoE SwiGLU	2026-06-1512:17:05	yes	pass	clean	-	10%	out -think -cache 0	agent 38m 45scheck -bench -	reference solution	transcript6/6 run details
deepseek-deepseek-v4-pro	deepseek-claude	Sonic MoE SwiGLU	2026-06-1514:01:49	yes	pass	clean	-	5%	out -think -cache 0	agent 96m 38scheck -bench -	reference solution	transcript5/6 run details
gemini/gemini-3.5-flash	Gemini CLI	Sonic MoE SwiGLU	2026-06-1311:42:10	yes	pass	clean	-	9%	out -think -cache 0	agent 105m 35scheck -bench -	reference solution	transcript6/6 run details
GPT-5.5	codex	Sonic MoE SwiGLU	2026-06-1305:12:16	yes	pass	clean	-	10%	out -think -cache 0	agent 27m 29scheck -bench -	reference solution	transcript6/6 run details
kimi-kimi-k2.7-code	kimi-claude	Sonic MoE SwiGLU	2026-06-1317:19:46	yes	pass	clean	-	10%	out -think -cache 0	agent 166m 28scheck -bench -	reference solution	transcript5/6 run details
MiniMax M3	Claude Code	Sonic MoE SwiGLU	2026-06-1309:42:26	yes	pass	clean	-	9%	out -think -cache 0	agent 363m 37scheck -bench -	reference solution	transcript5/6 run details
zai-glm-5.2	Claude Code	Sonic MoE SwiGLU	2026-06-1306:14:33	yes	pass	clean	-	10%	out -think -cache 0	agent 241m 6scheck -bench -	reference solution	transcript5/6 run details
claude-fable-5 [max]	Claude Code	W4A16 GEMM	2026-06-1022:44:47	yes	pass	interesting	New problem ceiling (prior 0.220). Weights stay int4-packed end to end; in-kernel magic-OR unpack ((b & 0xF) \| 0x4300 is the bf16 bit pattern of 128+w exactly) folds the zero-point before the tensor-core dot; evict_last keeps weights L2-resident; pointer-keyed CUDA-graph replay removes launch overhead but the kernel executes every call. POLICY CAVEAT flagged by audit: module import sets torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction=False globally, which changes reference numerics during check.py (toward higher precision, openly documented in a solution comment, tolerance 0.10) - solution code mutating harness-global state is currently unpoliced and deserves an explicit rule.	35%	out -think -cache 0	agent 48m 26scheck -bench -	reference solution	transcript5/5 New problem ceiling (prior 0.220). Weights stay int4-packed end to end; in-kernel magic-OR unpack ((b & 0xF) \| 0x4300 is the bf16 bit pattern of 128+w exactly) folds the zero-point before the tensor-core dot; evict_last keeps weights L2-resident; pointer-keyed CUDA-graph replay removes launch overhead but the kernel executes every call. POLICY CAVEAT flagged by audit: module import sets torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction=False globally, which changes reference numerics during check.py (toward higher precision, openly documented in a solution comment, tolerance 0.10) - solution code mutating harness-global state is currently unpoliced and deserves an explicit rule.
claude-opus-4-8	Claude Code	W4A16 GEMM	2026-06-1311:54:10	yes	pass	clean	-	24%	out -think -cache 0	agent 218m 1scheck -bench -	reference solution	transcript6/6 run details
cursor/composer-2.5-fast	cursor	W4A16 GEMM	2026-06-1512:33:36	yes	pass	clean	-	15%	out -think -cache 0	agent 23m 45scheck -bench -	reference solution	transcript6/6 run details
deepseek-deepseek-v4-pro	deepseek-claude	W4A16 GEMM	2026-06-1514:30:20	yes	pass	clean	-	15%	out -think -cache 0	agent 53m 29scheck -bench -	reference solution	transcript5/6 run details
gemini/gemini-3.5-flash	Gemini CLI	W4A16 GEMM	2026-06-1311:42:39	yes	pass	clean	-	17%	out -think -cache 0	agent 100m 35scheck -bench -	reference solution	transcript6/6 run details
GPT-5.5	codex	W4A16 GEMM	2026-06-1305:38:15	yes	pass	clean	-	20%	out -think -cache 0	agent 20m 0scheck -bench -	reference solution	transcript6/6 run details
kimi-kimi-k2.7-code	kimi-claude	W4A16 GEMM	2026-06-1318:15:41	yes	pass	clean	-	15%	out -think -cache 0	agent 94m 52scheck -bench -	reference solution	transcript5/6 run details
MiniMax M3	Claude Code	W4A16 GEMM	2026-06-1309:52:07	yes	pass	clean	-	14%	out -think -cache 0	agent 280m 5scheck -bench -	reference solution	transcript5/6 run details
zai-glm-5.2	Claude Code	W4A16 GEMM	2026-06-1306:14:43	yes	pass	clean	-	32%	out -think -cache 0	agent 279m 55scheck -bench -	reference solution	transcript5/6 run details

Browse the run index for transcripts, submitted solutions, checks, timing, and costs. Full historical and diagnostic rows are still available in leaderboard.json.