{
"problem": "05_topk_bitonic",
"harness": "opencode",
"model": "zai/glm-5.1",
"reasoning_effort": "",
"has_solution": false,
"correct": false,
"peak_fraction": null,
"elapsed_seconds": 754,
"harness_exit_code": 0,
"session_complete": true,
"usage": {"input_tokens": 11993, "output_tokens": 259, "cache_read_tokens": 43648, "cache_creation_tokens": 0, "reasoning_tokens": 32083, "total_cost_usd": null}
}
{
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic"
}
<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic</path>
<type>directory</type>
<entries>
benchmark.py
check.py
problem.yaml
PROMPT.txt
reference.py
shapes.py
sota.py
(7 entries)
</entries>
{
"name": "kernel-optimization"
}
<skill_content name="kernel-optimization">
# Kernel Optimization
Unified reference for GPU/TPU kernel work across all platforms. This skill routes to specialized sub-references based on what you need.
## When this skill fires
- Writing or optimizing CUDA, Metal, Triton, HIP, or Pallas kernels
- Porting kernels between platforms (CUDA <-> Metal <-> Triton <-> HIP <-> Pallas)
- Debugging GPU code (OOM, NaN, race conditions, wrong results, hangs)
- Performance profiling and optimization
- Numerical precision issues (FP formats, mixed precision, tolerance)
- AMD MI300/MI350 or TPU v4/v5/v6/v7 specific work
## Sub-references (read on demand)
The following files contain deep platform-specific knowledge. Read the relevant one(s) based on the task:
| File | When to read |
|------|-------------|
| `~/.claude/skills/amd-gpu/SKILL.md` | AMD CDNA/RDNA, ROCm, HIP, MFMA, LDS, chiplet scheduling, HipKittens, AITER |
| `~/.claude/skills/tpu-opt/SKILL.md` | TPU v4-v7, MXU, Pallas kernels, JAX sharding, XLA gotchas, SplashAttention |
| `~/.claude/skills/port-kernel/SKILL.md` | Cross-platform porting: CUDA <-> Metal <-> Triton concept maps, API equivalents, gotchas |
| `~/.claude/skills/debug-gpu/SKILL.md` | OOM diagnosis, NaN tracing, race conditions, compute-sanitizer, FP format specs, tolerance tables |
| `~/.claude/skills/perf-optim/SKILL.md` | Profiling (nsys/ncu/Metal), roofline, occupancy, memory hierarchy, CUTLASS/CuTe, quantization, KernelBench |
| `
... (1462 more chars)
{
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/reference.py"
}
<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/reference.py</path>
<type>file</type>
<content>
1: """Naive top-k reference: torch.topk over the last dim.
2:
3: This is the correctness oracle. The agent's solution must produce the same
4: top-k values (and equivalent indices modulo ties) within the tolerance
5: declared in problem.yaml. Note that solution.py is FORBIDDEN from calling
6: torch.topk / torch.sort / torch.kthvalue (see problem.yaml).
7: """
8: import torch
9: import torch.nn as nn
10:
11: OP_TYPE = "topk"
12: SUPPORTED_PRECISIONS = ["fp32"]
13: HARDWARE_REQUIRED = ["RTX_PRO_6000", "H100", "B200"]
14:
15:
16: class Model(nn.Module):
17: """Top-k over the last dim of a 2D tensor.
18:
19: Input:
20: x: (batch, n) fp32
21: Output:
22: values: (batch, k) fp32, sorted descending
23: indices: (batch, k) int64, into the last dim of x
24: """
25:
26: def __init__(self, batch: int, n: int, k: int):
27: super().__init__()
28: self.batch, self.n, self.k = batch, n, k
29: # No learned parameters, but declare a dummy buffer so state_dict
30: # is non-empty and load_state_dict(strict=True) is meaningful.
31: self.register_buffer("_dummy", torch.zeros(1))
32:
33: def forward(self, x: torch.Tensor):
34: values, indices = torch.topk(x, k=self.k, dim=-1, largest=True, sorted=True)
35: return values, indices
36:
37:
38: # Module-level shims rebuilt b
... (419 more chars)
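To make the contract above concrete, here is a toy-shape usage sketch (illustrative only, assuming reference.Model is importable; these are not the benchmark shapes):

```python
import torch
from reference import Model

# Toy-shape sanity check of the Model contract: values come back sorted
# descending, indices are int64, and gathering x at the returned indices
# reproduces the values exactly (torch.topk copies entries of x).
m = Model(batch=2, n=8, k=3)
x = torch.randn(2, 8)
values, indices = m(x)
assert values.shape == (2, 3) and indices.dtype == torch.int64
assert torch.equal(values, torch.gather(x, -1, indices))
assert torch.all(values[:, :-1] >= values[:, 1:])  # descending per row
```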
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/check.py"
}
<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/check.py</path>
<type>file</type>
<content>
1: """Correctness runner for TopK.
2:
3: Runs solution.Model vs reference.Model across all shapes in shapes.py, 3 seeds
4: each. Top-k correctness has two parts:
5:
6: 1. VALUES: sol_values must match ref_values within fp32 tol. Both are
7: returned sorted descending, so positional comparison is well-defined.
8: 2. INDICES: lenient — we do NOT require sol_indices == ref_indices because
9: ties in x can yield multiple valid index sets. Instead we gather x at
10: sol_indices and check those values match ref_values within tol. This
11: catches "wrong indices" without false-failing on legitimate tie-breaks.
12:
13: Also rejects forbidden ops by grep.
14: """
15: import re
16: import sys
17: from pathlib import Path
18:
19: import torch
20: import yaml
21:
22: REPO_ROOT = Path(__file__).resolve().parents[2]
23: sys.path.insert(0, str(REPO_ROOT))
24:
25: from src.eval.correctness import check_correctness # noqa: E402
26:
27:
28: def main():
29: try:
30: import reference
31: import shapes
32: import solution
33: except Exception as e:
34: print(f"FAIL: import error: {e}")
35: sys.exit(1)
36:
37: problem_yaml = Path("problem.yaml")
38: meta = yaml.safe_load(problem_yaml.read_text()) if problem_yaml.exists() else {}
39:
40: # --- Forbidden-op check --------------------------------
... (4902 more chars)
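The gather-based index check described in the docstring can be sketched as follows (a minimal illustration; the real logic lives in src.eval.correctness.check_correctness, and the helper name and keyword defaults here are assumptions):

```python
import torch

def lenient_index_check(x, sol_indices, ref_values, atol=1e-4, rtol=1e-4):
    # Do NOT compare sol_indices == ref_indices directly: ties in x can
    # yield multiple valid index sets. Instead, gather x at the
    # solution's indices and compare those values positionally against
    # the reference values within tolerance.
    gathered = torch.gather(x, dim=-1, index=sol_indices)  # (batch, k)
    return torch.allclose(gathered, ref_values, atol=atol, rtol=rtol)
```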
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/shapes.py"
}<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/shapes.py</path>
<type>file</type>
<content>
1: """Canonical shape sweep for TopK.
2:
3: Mix of:
4: - decoder vocab top-k (single sequence, very large n, moderate k) — pure
5: bandwidth test; the input read dominates everything.
6: - prefill / batched attention top-k (many rows, moderate n, small k) — tests
7: per-row parallelism and shared-memory bitonic networks.
8: - non-power-of-2 n stress case — bitonic sort networks naturally want
9: powers of two; this forces the agent to handle padding or partial sorts.
10: - small-k limit — k=1 (argmax) is a degenerate but useful sanity case.
11: """
12:
13: SHAPES = [
14: {"batch": 1, "n": 131072, "k": 64}, # decoder vocab top-k (Llama vocab ~128k)
15: {"batch": 64, "n": 8192, "k": 8}, # prefill / attention top-k
16: {"batch": 32, "n": 16384, "k": 32}, # mid-size batched
17: {"batch": 16, "n": 12000, "k": 16}, # non-power-of-2 n stress
18: {"batch": 128, "n": 4096, "k": 1}, # batched argmax (k=1 corner case)
19: ]
(End of file - total 19 lines)
</content>
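The n=12000 entry is the one that forces a padding decision. A minimal sketch of the standard fix, assuming a pad-with-negative-infinity strategy (the helper name is hypothetical):

```python
import torch

def pad_to_pow2(x: torch.Tensor) -> torch.Tensor:
    # Pad the last dim up to the next power of two with -inf. A bitonic
    # network can then sort the full padded row, and because every pad
    # element is -inf, none of them can displace a real top-k candidate.
    n = x.shape[-1]
    n_pad = 1 << (n - 1).bit_length()  # next power of two >= n
    if n_pad == n:
        return x
    pad = x.new_full((*x.shape[:-1], n_pad - n), float("-inf"))
    return torch.cat([x, pad], dim=-1)  # e.g. 12000 -> 16384
```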
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/benchmark.py"
}<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/benchmark.py</path>
<type>file</type>
<content>
1: """Roofline benchmark for TopK.
2:
3: For each shape: times eager reference (torch.topk), compiled reference, SOTA
4: (also torch.topk — see sota.py), and the agent's solution. Reports achieved
5: TFLOPS, GB/s, and peak_fraction (vs DRAM bandwidth, since this is memory-bound).
6:
7: Output lines the harness picks up:
8: shape=<idx> variant=<name> tflops=<N> gbps=<N> ms=<N>
9: peak_fraction: <N> (geomean over shapes of solution's peak_fraction)
10: """
11: import sys
12: from math import exp, log
13: from pathlib import Path
14:
15: import torch
16: import yaml
17:
18: REPO_ROOT = Path(__file__).resolve().parents[2]
19: sys.path.insert(0, str(REPO_ROOT))
20:
21: from src.eval.roofline import compute_gbps, compute_tflops, peak_fraction # noqa: E402
22: from src.eval.timing import time_fn # noqa: E402
23: from src.hardware import get as get_hw # noqa: E402
24:
25:
26: def _eval_formula(expr: str, vars: dict) -> float:
27: return float(eval(expr, {"__builtins__": {}}, vars))
28:
29:
30: def main():
31: import reference
32: import shapes
33: import solution
34:
35: meta = yaml.safe_load(Path("problem.yaml").read_text())
36: hw = get_hw(meta["hardware"][0])
37: peak_tflops = hw.peak_tflops_dense.get(meta["peak_tflops_key"], 0.0)
38: peak_gbps = hw.peak_bandwidth_gb_s
39: regime = meta.get("regime", "memory")
40
... (3250 more chars)
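The final peak_fraction line is a geometric mean over shapes. A minimal sketch consistent with the exp/log imports at the top of the file (the function name and aggregation details are assumptions):

```python
from math import exp, log

def geomean(fractions):
    # Aggregate per-shape peak_fractions into one score via the
    # log-space mean; a single bad shape drags the score down
    # multiplicatively rather than being averaged away.
    return exp(sum(log(f) for f in fractions) / len(fractions))

print(round(geomean([0.5, 0.8, 0.2]), 2))  # 0.43
```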
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/problem.yaml"
}<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/problem.yaml</path>
<type>file</type>
<content>
1: name: 05_topk_bitonic
2: display_name: "TopK via Bitonic Sort"
3: precision: fp32
4: regime: memory
5:
6: # Top-k is dominated by the input read (small output, no reduction over k).
7: # Comparator-network FLOPs are not the bottleneck on real hardware, so we
8: # track them but score on bandwidth.
9: flops_formula: "batch * n * 4" # ~O(n log^2 n) compares total, but the
10: # bitonic network is so cheap relative
11: # to memory that a coarse 4*n estimate
12: # is fine for telemetry only.
13: bytes_formula: "batch * n * 4 + batch * k * (4 + 8)" # fp32 input read + (fp32 value + int64 idx) output
14:
15: hardware: [RTX_PRO_6000]
16: peak_tflops_key: fp32
17: peak_bandwidth_key: dram
18:
19: # Top-k correctness:
20: # - VALUES must match within fp32 atol/rtol (the kth largest value is
21: # well-defined modulo float-equal ties, so we use a loose-ish tol).
22: # - INDICES are checked leniently: for each row, the multiset of returned
23: # indices must select values that match ref values within tol. Direct
24: # index equality is NOT required (ties in x can yield different valid
25: # index sets).
26: tolerance:
27: float32: 1.0e-4
28:
29: # Forbidden ops — using any of these in solution.py fails corre
... (1071 more chars)
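For concreteness, the bytes_formula worked out on the first benchmark shape (arithmetic only; the formula is taken verbatim from the YAML above):

```python
# bytes_formula for shape 0 (batch=1, n=131072, k=64):
batch, n, k = 1, 131072, 64
bytes_moved = batch * n * 4 + batch * k * (4 + 8)
# = 524288 + 768 = 525056 bytes. The fp32 input read is ~680x the
# output traffic, which is why the problem scores on DRAM bandwidth
# rather than FLOPS.
```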
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/PROMPT.txt"
}
<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/PROMPT.txt</path>
<type>file</type>
<content>
1: I need you to optimize a top-k kernel for the RTX PRO 6000 (SM120 Blackwell, GDDR7, 1.8 TB/s). The reference implementation is in reference.py and your kernel goes in solution.py. You can make whatever mess you want in this directory — scratch files, intermediate kernels, build artifacts, profiling traces — but the final answer has to be in solution.py with the same Model, get_inputs, and get_init_inputs interface as reference.py.
2:
3: The input is a 2D fp32 tensor and you return the top-k values and their int64 indices along the last dim, sorted descending, same contract as torch.topk(largest=True, sorted=True). Correctness tolerance is 1e-4 abs/rel on the values; index ties are handled leniently — what matters is the returned values match the reference to tolerance, not that the index sets are bitwise identical. The shapes you have to handle are batch=1 n=131072 k=64 (decoder vocab top-k over a Llama-size vocabulary), batch=64 n=8192 k=8 (prefill / attention top-k), batch=32 n=16384 k=32 (mid-size batched), batch=16 n=12000 k=16 (non-power-of-two n, which bitonic networks don't naturally want), and batch=128 n=4096 k=1 (batched argmax).
4:
5: This needs to be a real custom kernel — CUDA C++ via torch.utils.cpp_extension.load_inline, Triton, inline PTX, or CUTLASS, whatever you think fits. Don't reach for torch.topk, torch.kthvalue, torch.sort, or torch.ar
... (749 more chars)
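For orientation, here is the bitonic comparator network the problem is named after, expressed with tensor ops for readability. This is a sketch, not a submission: a real solution.py would run the network in registers or shared memory inside a custom CUDA or Triton kernel, and would track indices alongside values.

```python
import torch

def bitonic_sort_desc(x: torch.Tensor) -> torch.Tensor:
    # Classic bitonic sorting network over the last dim, vectorized for
    # clarity. Assumes a power-of-two length (see the n=12000 shape for
    # why real kernels must pad). Sorts each row descending.
    n = x.shape[-1]
    assert n & (n - 1) == 0, "bitonic networks want power-of-two lengths"
    i = torch.arange(n, device=x.device)
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:        # comparator stride within this merge stage
            partner = i ^ j                # XOR pairing of lanes
            x_p = x[..., partner]
            hi = torch.maximum(x, x_p)
            lo = torch.minimum(x, x_p)
            desc = (i & k) == 0            # per-block direction; on the
                                           # final pass k == n, so every
                                           # block sorts descending
            first = i < partner            # lane is first in its pair
            x = torch.where(first == desc, hi, lo)
            j //= 2
        k *= 2
    return x
```

Top-k then reads the first k lanes of each sorted row; the usual optimization is to sort only small per-block candidate lists rather than the full row, but the comparator pattern is the same.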
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/sota.py"
}
<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/sota.py</path>
<type>file</type>
<content>
1: """SOTA reference for TopK: torch.topk itself.
2:
3: torch.topk dispatches to a CUB-backed kernel that uses radix-select for
4: moderate k and a tuned bitonic sort for small n. It is the bar the agent's
5: hand-rolled bitonic kernel must beat. There is no obvious vendor library that
6: does better on the (batch, n, k) shape mix we evaluate — Faiss BlockSelect is
7: specialized for k>=32 with much larger n, and CUB's DeviceSegmentedRadixSort
8: sorts the full row (overkill for top-k).
9:
10: This file is INTENTIONALLY allowed to call torch.topk because it is the SOTA
11: oracle, not the agent's submission. The agent's solution.py is forbidden from
12: using torch.topk (see problem.yaml.forbidden).
13: """
14: from __future__ import annotations
15:
16: import torch
17:
18:
19: def sota_forward(x: torch.Tensor, k: int):
20: """Best-available top-k reference. x: (batch, n) fp32."""
21: return torch.topk(x, k=k, dim=-1, largest=True, sorted=True)
22:
23:
24: def is_available() -> bool:
25: return True
(End of file - total 25 lines)
</content>