{
"problem": "02_kda_cutlass",
"harness": "droid",
"model": "custom:GLM-5.1-[Z.AI-Coding-Plan]-0",
"reasoning_effort": "",
"has_solution": false,
"correct": false,
"peak_fraction": null,
"elapsed_seconds": 2700,
"harness_exit_code": 124,
"session_complete": false,
"usage": {"input_tokens": 29347, "output_tokens": 28472, "cache_read_tokens": 82816, "cache_creation_tokens": 0, "reasoning_tokens": 0, "total_cost_usd": null}
}
user
I need you to implement Kimi Delta Attention forward (chunk form) for the RTX PRO 6000 (SM120 Blackwell, GDDR7, 1.8 TB/s). The reference implementation is in reference.py and your kernel goes in solution.py. You can make whatever mess you want in this directory — scratch files, intermediate kernels, build artifacts, profiling traces — but the final answer has to be in solution.py with the same Model, get_inputs, and get_init_inputs interface as reference.py.
The op is the chunk-parallel KDA forward from the FLA library: q and k of shape (B, T, H, K) in bf16, v of shape (B, T, H, V) in bf16, g of shape (B, T, H, K) in fp32 (per-channel log-decay with in-chunk cumsum already applied), beta of shape (B, T, H) in bf16, scale a python float, chunk_size 64, no initial state, no final state. You return o of shape (B, T, H, V) in bf16. Correctness tolerance is 0.05 abs/rel — the long recurrence accumulates more error than a single GEMM so the bar's a bit looser than default bf16. The shapes you have to handle are B=2 T=1024 H=8 K=128 V=128 (short-context training step), B=2 T=2048 H=8 K=128 V=128 (the headline shape from the Kimi Linear paper), B=1 T=4096 H=8 K=128 V=128 (long context that stresses the inter-chunk recurrence), and B=1 T=2048 H=4 K=128 V=128 (thin batch decode).
This needs to be a real custom kernel — the whole point of the problem is to write the chunk-parallel attention yourself, not call FLA's existing implementation. Don't import or call fla.ops.kda, fla.ops.chunk_kda, chunk_kda, fused_recurrent_kda, naive_chunk_kda, or naive_recurrent_kda. The intended path is CUTLASS CuTe on SM120 but Triton, CUDA C++ via load_inline, or inline PTX are also fine if you prefer. Anything you're uncertain about, look up PTX docs, clone CUTLASS or FLA or other reference repos, read library headers, and investigate.
Your flywheel is implement, profile (ncu, nsys, torch.profiler — whatever's useful) and time it with benchmark.py, verify correctness by running `python check.py` and reading the output, then iterate. Don't substitute your own one-off correctness snippets for check.py — it iterates over every shape, your spot-check almost certainly won't. If `python check.py` hasn't printed PASS, you're not done. Take as long as you need to actually push the number up.