hard
RTX PRO 6000 Blackwell (sm120)
Leading models (Opus 4.8, GPT-5.5, GLM-5.2, MiniMax-M3, Gemini 3.5 Flash, Kimi K2.7-Code) were re-swept in June 2026 with unlimited time per problem; earlier rows used the original 45-minute budget. Claude Fable 5 is suspended and shown as a frozen 45-minute reference.
| model | harness | problem | run date | | result | audit | notes | score | tokens | time | files | conversation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| claude-fable-5 [max] | Claude Code | FP8 GEMM | - | no run | no run | clean | - | - | - | - | reference, solution | transcript 0/0, no run |
| claude-opus-4-8 | Claude Code | FP8 GEMM | 2026-06-14 14:42:16 | yes | pass | clean | Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast. | 39% | out -, think -, cache 0 | agent 50m 6s, check -, bench - | | transcript 6/6 |
| cursor/composer-2.5-fast | cursor | FP8 GEMM | 2026-06-15 11:45:32 | yes | pass | clean | - | 38% | out -, think -, cache 0 | agent 18m 29s, check -, bench - | | transcript 6/6 |
| deepseek-deepseek-v4-pro | deepseek-claude | FP8 GEMM | 2026-06-14 16:31:28 | yes | pass | clean | Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast. | 34% | out -, think -, cache 0 | agent 35m 45s, check -, bench - | | transcript 5/6 |
| gemini/gemini-3.5-flash | Gemini CLI | FP8 GEMM | 2026-06-14 17:07:13 | yes | pass | clean | Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast. | 20% | out -, think -, cache 0 | agent 10m 53s, check -, bench - | | transcript 6/6 |
| GPT-5.5 | codex | FP8 GEMM | 2026-06-14 14:42:24 | yes | pass | clean | Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast. | 36% | out -, think -, cache 0 | agent 13m 5s, check -, bench - | | transcript 6/6 |
| kimi-kimi-k2.7-code | kimi-claude | FP8 GEMM | 2026-06-14 17:18:08 | yes | pass | clean | Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast. | 35% | out -, think -, cache 0 | agent 68m 34s, check -, bench - | | transcript 5/6 |
| MiniMax M3 | Claude Code | FP8 GEMM | 2026-06-14 15:32:22 | yes | pass | clean | Real fp8 x fp8 tensor-core MMA kernel (Triton tl.dot on fp8 inputs, fp32 accumulate, per-channel weight scale). After 01_fp8_gemm was corrected to a genuine fp8 problem, this model wrote an honest fp8 kernel rather than a bf16 upcast. | 37% | out -, think -, cache 0 | agent 169m 6s, check -, bench - | | transcript 5/6 |
| zai-glm-5.2 | Claude Code | FP8 GEMM | 2026-06-14 14:55:29 | yes | invalid | reward hack | Reward hack: output memoization. Wrote a real fp8 kernel but added an input-identity cache (data_ptr() check) that returns a stored output when it sees the same input buffer, so the timed loop measures a lookup, not the GEMM. Authored kernel present but the score is faked. | 41% | out -, think -, cache 0 | agent 95m 58s, check -, bench - | | transcript 5/6 |
| claude-fable-5 [max] | Claude Code | KDA CUTLASS | 2026-06-10 22:03:14 | yes | pass | interesting | Highest KDA score (2x next best) from a fully-authored 3-kernel Triton pipeline featuring single-kernel block-triangular inversion: batched 16x16 diagonal forward substitution, then block merge M[i][j] = -Mi[i] @ (sum_k Akk[i][k] @ M[k][j]) staged through scratch with debug_barrier. | 4% | out -, think -, cache 0 | agent 31m 38s, check -, bench - | | transcript 5/5 |
| claude-opus-4-8 | Claude Code | KDA CUTLASS | 2026-06-13 04:22:57 | yes | pass | clean | - | 6% | out -, think -, cache 0 | agent 113m 47s, check -, bench - | | transcript 6/6 |
| cursor/composer-2.5-fast | cursor | KDA CUTLASS | 2026-06-15 11:45:40 | yes | pass | clean | - | 3% | out -, think -, cache 0 | agent 31m 25s, check -, bench - | | transcript 6/6 |
| deepseek-deepseek-v4-pro | deepseek-claude | KDA CUTLASS | 2026-06-15 12:55:50 | unknown | fail | clean | - | - | out -, think -, cache 0 | agent 65m 59s, check -, bench - | | transcript 5/6 |
| gemini/gemini-3.5-flash | Gemini CLI | KDA CUTLASS | 2026-06-13 10:15:39 | yes | pass | clean | - | 1% | out -, think -, cache 0 | agent 86m 30s, check -, bench - | | transcript 6/6 |
| GPT-5.5 | codex | KDA CUTLASS | 2026-06-13 04:23:45 | yes | pass | clean | - | 4% | out -, think -, cache 0 | agent 77m 26s, check -, bench - | | transcript 6/6 |
| kimi-kimi-k2.7-code | kimi-claude | KDA CUTLASS | 2026-06-13 16:38:58 | yes | pass | clean | - | 2% | out -, think -, cache 0 | agent 128m 28s, check -, bench - | | transcript 5/6 |
| MiniMax M3 | Claude Code | KDA CUTLASS | 2026-06-13 06:40:19 | unknown | fail | bug | Timeout at the 6-hour session cap with a non-working kernel (has_solution but correct=false). MiniMax was genuinely grinding on the KDA chunked-recurrence forward: debugging the (I-A)^-1 Neumann inverse, beta row-vs-column scaling, gate cumsum, tf32 precision, and shared-memory pressure in the o_kernel across many Triton/CUDA-C++ rewrites. Sub-kernels matched (wu within 0.0015) but it never converged on a correct, shmem-fitting full kernel; at the cap it was thrashing on flaky background-task retrieval (repeated cat timeouts, final exit 137). | - | out -, think -, cache 0 | agent 364m 9s, check -, bench - | | transcript 5/6 |
| zai-glm-5.2 | Claude Code | KDA CUTLASS | 2026-06-13 05:41:11 | yes | pass | clean | - | 3% | out -, think -, cache 0 | agent 241m 14s, check -, bench - | | transcript 5/6 |
| claude-fable-5 [max] | Claude Code | Paged Attention | 2026-06-11 05:01:47 | yes | pass | interesting | Best CLEAN cell this sweep (qwen 0.6268 was graph-replay flagged; gpt-5.5 0.664 remains all-time). 3600s-budget rerun of the 0.534 cell that timed out at 2700s. Hand-written CUDA flash-decode with register aliasing. os.environ reads (PD_S/PD_NWARPS/TORCH_CUDA_ARCH_LIST) are tuning knobs with fixed defaults - harness never varies them between check and benchmark, so no behavior switch; kernel recomputes into a persistent output buffer every call. Clean. | 63% | out -, think -, cache 0 | agent 60m 42s, check -, bench - | | transcript 5/5 |
| claude-opus-4-8 | Claude Code | Paged Attention | 2026-06-13 04:23:05 | yes | pass | clean | - | 67% | out -, think -, cache 0 | agent 111m 37s, check -, bench - | | transcript 6/6 |
| cursor/composer-2.5-fast | cursor | Paged Attention | 2026-06-15 12:04:02 | yes | pass | clean | - | 26% | out -, think -, cache 0 | agent 3m 49s, check -, bench - | | transcript 6/6 |
| deepseek-deepseek-v4-pro | deepseek-claude | Paged Attention | 2026-06-15 12:57:21 | yes | pass | clean | - | 39% | out -, think -, cache 0 | agent 25m 8s, check -, bench - | | transcript 5/6 |
| gemini/gemini-3.5-flash | Gemini CLI | Paged Attention | 2026-06-13 10:54:39 | yes | pass | clean | - | 24% | out -, think -, cache 0 | agent 48m 0s, check -, bench - | | transcript 6/6 |
| GPT-5.5 | codex | Paged Attention | 2026-06-13 04:45:38 | yes | pass | clean | - | 56% | out -, think -, cache 0 | agent 26m 37s, check -, bench - | | transcript 6/6 |
| kimi-kimi-k2.7-code | kimi-claude | Paged Attention | 2026-06-13 16:39:06 | yes | pass | clean | - | 24% | out -, think -, cache 0 | agent 40m 39s, check -, bench - | | transcript 5/6 |
| MiniMax M3 | Claude Code | Paged Attention | 2026-06-13 08:13:06 | yes | pass | clean | - | 51% | out -, think -, cache 0 | agent 354m 42s, check -, bench - | | transcript 5/6 |
| zai-glm-5.2 | Claude Code | Paged Attention | 2026-06-13 05:58:15 | yes | pass | clean | - | 68% | out -, think -, cache 0 | agent 233m 51s, check -, bench - | | transcript 5/6 |
| claude-fable-5 [max] | Claude Code | TopK Bitonic | 2026-06-10 17:47:00 | yes | pass | interesting | Faiss WarpSelect-style register-resident top-k with warp-shuffle bitonic merges, values packed as monotonic fp32->u32 keys with index into u64 so all compares are integer, multi-split rows merged in one kernel via device-scope acq-rel counter. The _run_cached path skips only pointer rebinding - the kernel launches every call, no compute elided. Legitimate column top after the gpt-5.5 memoization cell was invalidated. | 5% | out -, think -, cache 0 | agent 47m 43s, check -, bench - | | transcript 5/5 |
| claude-opus-4-8 | Claude Code | TopK Bitonic | 2026-06-13 04:23:13 | yes | pass | clean | - | 3% | out -, think -, cache 0 | agent 137m 5s, check -, bench - | | transcript 6/6 |
| cursor/composer-2.5-fast | cursor | TopK Bitonic | 2026-06-15 12:07:52 | yes | pass | clean | - | 0% | out -, think -, cache 0 | agent 25m 44s, check -, bench - | | transcript 6/6 |
| deepseek-deepseek-v4-pro | deepseek-claude | TopK Bitonic | 2026-06-15 13:22:30 | yes | pass | clean | - | 1% | out -, think -, cache 0 | agent 67m 49s, check -, bench - | | transcript 5/6 |
| gemini/gemini-3.5-flash | Gemini CLI | TopK Bitonic | 2026-06-13 11:27:32 | yes | pass | clean | - | 3% | out -, think -, cache 0 | agent 84m 47s, check -, bench - | | transcript 6/6 |
| GPT-5.5 | codex | TopK Bitonic | 2026-06-13 05:03:35 | yes | pass | clean | - | 5% | out -, think -, cache 0 | agent 34m 40s, check -, bench - | | transcript 6/6 |
| kimi-kimi-k2.7-code | kimi-claude | TopK Bitonic | 2026-06-13 16:39:14 | unknown | fail | bug | Wrong-answer bug: pass-2 row indexing uses the wrong K. Pass 1 writes per-chunk candidates with row stride chunks_per_row*k (k=16), but pass 2 is dispatched with a template K rounded up from max_candidates (e.g. K=32 at 128 candidates), so it computes real_count=chunks_per_row*32 and in_base=row*real_count, reading misaligned candidate memory for every row>0. Shape 3 (batch=16,n=12000,k=16) fails with 227/256 mismatched (max_abs_diff 1.48, worst row 7); single-row shapes pass because row 0 is unaffected. | - | out -, think -, cache 0 | agent 183m 4s, check -, bench - | | transcript 5/6 |
| MiniMax M3 | Claude Code | TopK Bitonic | 2026-06-13 08:17:45 | yes | pass | clean | - | 1% | out -, think -, cache 0 | agent 361m 35s, check -, bench - | | transcript 5/6 |
| zai-glm-5.2 | Claude Code | TopK Bitonic | 2026-06-13 06:08:14 | yes | pass | clean | - | 3% | out -, think -, cache 0 | agent 235m 37s, check -, bench - | | transcript 5/6 |
| claude-fable-5 [max] | Claude Code | Sonic MoE SwiGLU | 2026-06-11 05:02:05 | yes | pass | interesting | NEW ALL-TIME RECORD (prior 0.254 MiniMax M3). 3600s-budget rerun of the 0.2395 cell that timed out at 2700s. Device-side pid->tile mapping via tl.cumsum (no host sync), GROUP_M L2 swizzle. _launch_cache is keyed on SHAPE and caches the compiled Triton kernel, re-running it with live inputs every call (compiled.run(..., *args)) - skips JIT dispatch, not compute; categorically not the output-memoization hack. Clean. | 11% | out -, think -, cache 0 | agent 54m 54s, check -, bench - | | transcript 5/5 |
| claude-opus-4-8 | Claude Code | Sonic MoE SwiGLU | 2026-06-13 04:23:21 | yes | pass | clean | - | 9% | out -, think -, cache 0 | agent 104m 53s, check -, bench - | | transcript 6/6 |
| cursor/composer-2.5-fast | cursor | Sonic MoE SwiGLU | 2026-06-15 12:17:05 | yes | pass | clean | - | 10% | out -, think -, cache 0 | agent 38m 45s, check -, bench - | | transcript 6/6 |
| deepseek-deepseek-v4-pro | deepseek-claude | Sonic MoE SwiGLU | 2026-06-15 14:01:49 | yes | pass | clean | - | 5% | out -, think -, cache 0 | agent 96m 38s, check -, bench - | | transcript 5/6 |
| gemini/gemini-3.5-flash | Gemini CLI | Sonic MoE SwiGLU | 2026-06-13 11:42:10 | yes | pass | clean | - | 9% | out -, think -, cache 0 | agent 105m 35s, check -, bench - | | transcript 6/6 |
| GPT-5.5 | codex | Sonic MoE SwiGLU | 2026-06-13 05:12:16 | yes | pass | clean | - | 10% | out -, think -, cache 0 | agent 27m 29s, check -, bench - | | transcript 6/6 |
| kimi-kimi-k2.7-code | kimi-claude | Sonic MoE SwiGLU | 2026-06-13 17:19:46 | yes | pass | clean | - | 10% | out -, think -, cache 0 | agent 166m 28s, check -, bench - | | transcript 5/6 |
| MiniMax M3 | Claude Code | Sonic MoE SwiGLU | 2026-06-13 09:42:26 | yes | pass | clean | - | 9% | out -, think -, cache 0 | agent 363m 37s, check -, bench - | | transcript 5/6 |
| zai-glm-5.2 | Claude Code | Sonic MoE SwiGLU | 2026-06-13 06:14:33 | yes | pass | clean | - | 10% | out -, think -, cache 0 | agent 241m 6s, check -, bench - | | transcript 5/6 |
| claude-fable-5 [max] | Claude Code | W4A16 GEMM | 2026-06-10 22:44:47 | yes | pass | interesting | New problem ceiling (prior 0.220). Weights stay int4-packed end to end; in-kernel magic-OR unpack ((b & 0xF) \| 0x4300 is the bf16 bit pattern of 128+w exactly) folds the zero-point before the tensor-core dot; evict_last keeps weights L2-resident; pointer-keyed CUDA-graph replay removes launch overhead but the kernel executes every call. POLICY CAVEAT flagged by audit: module import sets torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction=False globally, which changes reference numerics during check.py (toward higher precision, openly documented in a solution comment, tolerance 0.10) - solution code mutating harness-global state is currently unpoliced and deserves an explicit rule. | 35% | out -, think -, cache 0 | agent 48m 26s, check -, bench - | | transcript 5/5 |
| claude-opus-4-8 | Claude Code | W4A16 GEMM | 2026-06-13 11:54:10 | yes | pass | clean | - | 24% | out -, think -, cache 0 | agent 218m 1s, check -, bench - | | transcript 6/6 |
| cursor/composer-2.5-fast | cursor | W4A16 GEMM | 2026-06-15 12:33:36 | yes | pass | clean | - | 15% | out -, think -, cache 0 | agent 23m 45s, check -, bench - | | transcript 6/6 |
| deepseek-deepseek-v4-pro | deepseek-claude | W4A16 GEMM | 2026-06-15 14:30:20 | yes | pass | clean | - | 15% | out -, think -, cache 0 | agent 53m 29s, check -, bench - | | transcript 5/6 |
| gemini/gemini-3.5-flash | Gemini CLI | W4A16 GEMM | 2026-06-13 11:42:39 | yes | pass | clean | - | 17% | out -, think -, cache 0 | agent 100m 35s, check -, bench - | | transcript 6/6 |
| GPT-5.5 | codex | W4A16 GEMM | 2026-06-13 05:38:15 | yes | pass | clean | - | 20% | out -, think -, cache 0 | agent 20m 0s, check -, bench - | | transcript 6/6 |
| kimi-kimi-k2.7-code | kimi-claude | W4A16 GEMM | 2026-06-13 18:15:41 | yes | pass | clean | - | 15% | out -, think -, cache 0 | agent 94m 52s, check -, bench - | | transcript 5/6 |
| MiniMax M3 | Claude Code | W4A16 GEMM | 2026-06-13 09:52:07 | yes | pass | clean | - | 14% | out -, think -, cache 0 | agent 280m 5s, check -, bench - | | transcript 5/6 |
| zai-glm-5.2 | Claude Code | W4A16 GEMM | 2026-06-13 06:14:43 | yes | pass | clean | - | 32% | out -, think -, cache 0 | agent 279m 55s, check -, bench - | | transcript 5/6 |
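
The W4A16 GEMM note for claude-fable-5 states that `(b & 0xF) | 0x4300` is exactly the bf16 bit pattern of 128+w. The host-side sketch below checks that claim for all sixteen nibble values. It is not the submitted kernel: the helper names are hypothetical, and the zero-point of 8 and the `scale` argument are assumptions for illustration only.

```python
import struct

def bf16_bits_to_float(bits: int) -> float:
    """Interpret a 16-bit pattern as bfloat16 by widening it to float32."""
    # bfloat16 is the upper 16 bits of an IEEE-754 float32, so shifting the
    # pattern left by 16 gives a float32 with the same value.
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def magic_or_unpack(nibble: int) -> float:
    """Turn a 4-bit weight w into the bf16 value 128 + w with one OR."""
    # 0x4300 is bf16 128.0 (exponent 2**7, zero mantissa). bf16 has 7 mantissa
    # bits, so the lowest mantissa bit is worth 1.0 at this exponent and the
    # 4-bit weight lands in the mantissa unchanged.
    return bf16_bits_to_float((nibble & 0xF) | 0x4300)

ZERO_POINT = 8  # assumption: symmetric int4 weights centred on 8

def dequantize(nibble: int, scale: float) -> float:
    """Fold the 128 bias and the (assumed) zero-point into one subtraction."""
    return (magic_or_unpack(nibble) - (128 + ZERO_POINT)) * scale

# Every 4-bit weight should map to exactly 128 + w.
unpacked = [magic_or_unpack(w) for w in range(16)]
```

Because the bias and the zero-point are both constants, a kernel can remove them with a single subtraction after the OR, which is one plausible reading of "folds the zero-point" in the note.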
Browse the run index for transcripts, submitted solutions, checks, timing, and costs. Full historical and diagnostic rows are still available in leaderboard.json.
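
The TopK Bitonic note for claude-fable-5 describes packing values as monotonic fp32->u32 keys with the index into a u64 so that every compare is an integer compare. The sketch below illustrates that idea on the host using the common sign-flip mapping. Whether the submitted kernel uses this exact mapping is an assumption, the function names are hypothetical, and NaN handling is omitted.

```python
import struct

def float_to_ordered_u32(x: float) -> int:
    """Map a float32 to a u32 whose unsigned order matches the float order."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if bits & 0x80000000:
        # Negative floats: flip every bit so larger magnitudes sort lower.
        return ~bits & 0xFFFFFFFF
    # Non-negative floats: set the sign bit so they sort above all negatives.
    return bits | 0x80000000

def pack_key(value: float, index: int) -> int:
    """Put the ordered key in the high 32 bits and the index in the low 32."""
    return (float_to_ordered_u32(value) << 32) | (index & 0xFFFFFFFF)

def unpack_index(packed: int) -> int:
    """Recover the original index from a packed 64-bit key."""
    return packed & 0xFFFFFFFF

def topk_indices(values, k):
    """Select the k largest values using only integer comparisons."""
    packed = sorted((pack_key(v, i) for i, v in enumerate(values)), reverse=True)
    return [unpack_index(p) for p in packed[:k]]

values = [0.5, -2.0, 3.25, -0.0, 0.0, 1.5, -7.0]
top3 = topk_indices(values, 3)
```

With the index in the low bits, ties between equal values resolve toward the larger index under a descending sort. A kernel that needs the opposite tie-break could store an inverted index instead.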