{
"problem": "05_topk_bitonic",
"harness": "opencode",
"model": "zai/glm-5.1",
"reasoning_effort": "",
"has_solution": false,
"correct": false,
"peak_fraction": null,
"elapsed_seconds": 754,
"harness_exit_code": 0,
"session_complete": true,
"usage": {"input_tokens": 11993, "output_tokens": 259, "cache_read_tokens": 43648, "cache_creation_tokens": 0, "reasoning_tokens": 32083, "total_cost_usd": null}
}
{
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic"
}
<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic</path>
<type>directory</type>
<entries>
benchmark.py
check.py
problem.yaml
PROMPT.txt
reference.py
shapes.py
sota.py
(7 entries)
</entries>
{
"name": "kernel-optimization"
}
<skill_content name="kernel-optimization">
# Kernel Optimization
Unified reference for GPU/TPU kernel work across all platforms. This skill routes to specialized sub-references based on what you need.
## When this skill fires
- Writing or optimizing CUDA, Metal, Triton, HIP, or Pallas kernels
- Porting kernels between platforms (CUDA <-> Metal <-> Triton <-> HIP <-> Pallas)
- Debugging GPU code (OOM, NaN, race conditions, wrong results, hangs)
- Performance profiling and optimization
- Numerical precision issues (FP formats, mixed precision, tolerance)
- AMD MI300/MI350 or TPU v4/v5/v6/v7 specific work
## Sub-references (read on demand)
The following files contain deep platform-specific knowledge. Read the relevant one(s) based on the task:
| File | When to read |
|------|-------------|
| `~/.claude/skills/amd-gpu/SKILL.md` | AMD CDNA/RDNA, ROCm, HIP, MFMA, LDS, chiplet scheduling, HipKittens, AITER |
| `~/.claude/skills/tpu-opt/SKILL.md` | TPU v4-v7, MXU, Pallas kernels, JAX sharding, XLA gotchas, SplashAttention |
| `~/.claude/skills/port-kernel/SKILL.md` | Cross-platform porting: CUDA <-> Metal <-> Triton concept maps, API equivalents, gotchas |
| `~/.claude/skills/debug-gpu/SKILL.md` | OOM diagnosis, NaN tracing, race conditions, compute-sanitizer, FP format specs, tolerance tables |
| `~/.claude/skills/perf-optim/SKILL.md` | Profiling (nsys/ncu/Metal), roofline, occupancy, memory hierarchy, CUTLASS/CuTe, quantization, KernelBench |
| `
... (1462 more chars)
{
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/reference.py"
}
<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/reference.py</path>
<type>file</type>
<content>
1: """Naive top-k reference: torch.topk over the last dim.
2:
3: This is the correctness oracle. The agent's solution must produce the same
4: top-k values (and equivalent indices modulo ties) within the tolerance
5: declared in problem.yaml. Note that solution.py is FORBIDDEN from calling
6: torch.topk / torch.sort / torch.kthvalue (see problem.yaml).
7: """
8: import torch
9: import torch.nn as nn
10:
11: OP_TYPE = "topk"
12: SUPPORTED_PRECISIONS = ["fp32"]
13: HARDWARE_REQUIRED = ["RTX_PRO_6000", "H100", "B200"]
14:
15:
16: class Model(nn.Module):
17: """Top-k over the last dim of a 2D tensor.
18:
19: Input:
20: x: (batch, n) fp32
21: Output:
22: values: (batch, k) fp32, sorted descending
23: indices: (batch, k) int64, into the last dim of x
24: """
25:
26: def __init__(self, batch: int, n: int, k: int):
27: super().__init__()
28: self.batch, self.n, self.k = batch, n, k
29: # No learned parameters, but declare a dummy buffer so state_dict
30: # is non-empty and load_state_dict(strict=True) is meaningful.
31: self.register_buffer("_dummy", torch.zeros(1))
32:
33: def forward(self, x: torch.Tensor):
34: values, indices = torch.topk(x, k=self.k, dim=-1, largest=True, sorted=True)
35: return values, indices
36:
37:
38: # Module-level shims rebuilt b
... (419 more chars)
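To make the contract above concrete, here is a toy-shape usage sketch (illustrative only, assuming reference.Model is importable; these are not the benchmark shapes):

```python
import torch
from reference import Model

# Toy-shape sanity check of the Model contract: values come back sorted
# descending, indices are int64, and gathering x at the returned indices
# reproduces the values exactly (torch.topk copies entries of x).
m = Model(batch=2, n=8, k=3)
x = torch.randn(2, 8)
values, indices = m(x)
assert values.shape == (2, 3) and indices.dtype == torch.int64
assert torch.equal(values, torch.gather(x, -1, indices))
assert torch.all(values[:, :-1] >= values[:, 1:])  # descending per row
```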
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/check.py"
}
<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/check.py</path>
<type>file</type>
<content>
1: """Correctness runner for TopK.
2:
3: Runs solution.Model vs reference.Model across all shapes in shapes.py, 3 seeds
4: each. Top-k correctness has two parts:
5:
6: 1. VALUES: sol_values must match ref_values within fp32 tol. Both are
7: returned sorted descending, so positional comparison is well-defined.
8: 2. INDICES: lenient — we do NOT require sol_indices == ref_indices because
9: ties in x can yield multiple valid index sets. Instead we gather x at
10: sol_indices and check those values match ref_values within tol. This
11: catches "wrong indices" without false-failing on legitimate tie-breaks.
12:
13: Also rejects forbidden ops by grep.
14: """
15: import re
16: import sys
17: from pathlib import Path
18:
19: import torch
20: import yaml
21:
22: REPO_ROOT = Path(__file__).resolve().parents[2]
23: sys.path.insert(0, str(REPO_ROOT))
24:
25: from src.eval.correctness import check_correctness # noqa: E402
26:
27:
28: def main():
29: try:
30: import reference
31: import shapes
32: import solution
33: except Exception as e:
34: print(f"FAIL: import error: {e}")
35: sys.exit(1)
36:
37: problem_yaml = Path("problem.yaml")
38: meta = yaml.safe_load(problem_yaml.read_text()) if problem_yaml.exists() else {}
39:
40: # --- Forbidden-op check --------------------------------
... (4902 more chars)
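The gather-based index check described in the docstring can be sketched as follows (a minimal illustration; the real logic lives in src.eval.correctness.check_correctness, and the helper name and keyword defaults here are assumptions):

```python
import torch

def lenient_index_check(x, sol_indices, ref_values, atol=1e-4, rtol=1e-4):
    # Do NOT compare sol_indices == ref_indices directly: ties in x can
    # yield multiple valid index sets. Instead, gather x at the
    # solution's indices and compare those values positionally against
    # the reference values within tolerance.
    gathered = torch.gather(x, dim=-1, index=sol_indices)  # (batch, k)
    return torch.allclose(gathered, ref_values, atol=atol, rtol=rtol)
```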
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/shapes.py"
}<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/shapes.py</path>
<type>file</type>
<content>
1: """Canonical shape sweep for TopK.
2:
3: Mix of:
4: - decoder vocab top-k (single sequence, very large n, moderate k) — pure
5: bandwidth test; the input read dominates everything.
6: - prefill / batched attention top-k (many rows, moderate n, small k) — tests
7: per-row parallelism and shared-memory bitonic networks.
8: - non-power-of-2 n stress case — bitonic sort networks naturally want
9: powers of two; this forces the agent to handle padding or partial sorts.
10: - small-k limit — k=1 (argmax) is a degenerate but useful sanity case.
11: """
12:
13: SHAPES = [
14: {"batch": 1, "n": 131072, "k": 64}, # decoder vocab top-k (Llama vocab ~128k)
15: {"batch": 64, "n": 8192, "k": 8}, # prefill / attention top-k
16: {"batch": 32, "n": 16384, "k": 32}, # mid-size batched
17: {"batch": 16, "n": 12000, "k": 16}, # non-power-of-2 n stress
18: {"batch": 128, "n": 4096, "k": 1}, # batched argmax (k=1 corner case)
19: ]
(End of file - total 19 lines)
</content>
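The n=12000 entry is the one that forces a padding decision. A minimal sketch of the standard fix, assuming a pad-with-negative-infinity strategy (the helper name is hypothetical):

```python
import torch

def pad_to_pow2(x: torch.Tensor) -> torch.Tensor:
    # Pad the last dim up to the next power of two with -inf. A bitonic
    # network can then sort the full padded row, and because every pad
    # element is -inf, none of them can displace a real top-k candidate.
    n = x.shape[-1]
    n_pad = 1 << (n - 1).bit_length()  # next power of two >= n
    if n_pad == n:
        return x
    pad = x.new_full((*x.shape[:-1], n_pad - n), float("-inf"))
    return torch.cat([x, pad], dim=-1)  # e.g. 12000 -> 16384
```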
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/benchmark.py"
}<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/benchmark.py</path>
<type>file</type>
<content>
1: """Roofline benchmark for TopK.
2:
3: For each shape: times eager reference (torch.topk), compiled reference, SOTA
4: (also torch.topk — see sota.py), and the agent's solution. Reports achieved
5: TFLOPS, GB/s, and peak_fraction (vs DRAM bandwidth, since this is memory-bound).
6:
7: Output lines the harness picks up:
8: shape=<idx> variant=<name> tflops=<N> gbps=<N> ms=<N>
9: peak_fraction: <N> (geomean over shapes of solution's peak_fraction)
10: """
11: import sys
12: from math import exp, log
13: from pathlib import Path
14:
15: import torch
16: import yaml
17:
18: REPO_ROOT = Path(__file__).resolve().parents[2]
19: sys.path.insert(0, str(REPO_ROOT))
20:
21: from src.eval.roofline import compute_gbps, compute_tflops, peak_fraction # noqa: E402
22: from src.eval.timing import time_fn # noqa: E402
23: from src.hardware import get as get_hw # noqa: E402
24:
25:
26: def _eval_formula(expr: str, vars: dict) -> float:
27: return float(eval(expr, {"__builtins__": {}}, vars))
28:
29:
30: def main():
31: import reference
32: import shapes
33: import solution
34:
35: meta = yaml.safe_load(Path("problem.yaml").read_text())
36: hw = get_hw(meta["hardware"][0])
37: peak_tflops = hw.peak_tflops_dense.get(meta["peak_tflops_key"], 0.0)
38: peak_gbps = hw.peak_bandwidth_gb_s
39: regime = meta.get("regime", "memory")
40
... (3250 more chars)
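The final peak_fraction line is a geometric mean over shapes. A minimal sketch consistent with the exp/log imports at the top of the file (the function name and aggregation details are assumptions):

```python
from math import exp, log

def geomean(fractions):
    # Aggregate per-shape peak_fractions into one score via the
    # log-space mean; a single bad shape drags the score down
    # multiplicatively rather than being averaged away.
    return exp(sum(log(f) for f in fractions) / len(fractions))

print(round(geomean([0.5, 0.8, 0.2]), 2))  # 0.43
```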
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/problem.yaml"
}<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/problem.yaml</path>
<type>file</type>
<content>
1: name: 05_topk_bitonic
2: display_name: "TopK via Bitonic Sort"
3: precision: fp32
4: regime: memory
5:
6: # Top-k is dominated by the input read (small output, no reduction over k).
7: # Comparator-network FLOPs are not the bottleneck on real hardware, so we
8: # track them but score on bandwidth.
9: flops_formula: "batch * n * 4" # ~O(n log^2 n) compares total, but the
10: # bitonic network is so cheap relative
11: # to memory that a coarse 4*n estimate
12: # is fine for telemetry only.
13: bytes_formula: "batch * n * 4 + batch * k * (4 + 8)" # fp32 input read + (fp32 value + int64 idx) output
14:
15: hardware: [RTX_PRO_6000]
16: peak_tflops_key: fp32
17: peak_bandwidth_key: dram
18:
19: # Top-k correctness:
20: # - VALUES must match within fp32 atol/rtol (the kth largest value is
21: # well-defined modulo float-equal ties, so we use a loose-ish tol).
22: # - INDICES are checked leniently: for each row, the multiset of returned
23: # indices must select values that match ref values within tol. Direct
24: # index equality is NOT required (ties in x can yield different valid
25: # index sets).
26: tolerance:
27: float32: 1.0e-4
28:
29: # Forbidden ops — using any of these in solution.py fails corre
... (1071 more chars)
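For concreteness, the bytes_formula worked out on the first benchmark shape (arithmetic only; the formula is taken verbatim from the YAML above):

```python
# bytes_formula for shape 0 (batch=1, n=131072, k=64):
batch, n, k = 1, 131072, 64
bytes_moved = batch * n * 4 + batch * k * (4 + 8)
# = 524288 + 768 = 525056 bytes. The fp32 input read is ~680x the
# output traffic, which is why the problem scores on DRAM bandwidth
# rather than FLOPS.
```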
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/PROMPT.txt"
}
<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/PROMPT.txt</path>
<type>file</type>
<content>
1: I need you to optimize a top-k kernel for the RTX PRO 6000 (SM120 Blackwell, GDDR7, 1.8 TB/s). The reference implementation is in reference.py and your kernel goes in solution.py. You can make whatever mess you want in this directory — scratch files, intermediate kernels, build artifacts, profiling traces — but the final answer has to be in solution.py with the same Model, get_inputs, and get_init_inputs interface as reference.py.
2:
3: The input is a 2D fp32 tensor and you return the top-k values and their int64 indices along the last dim, sorted descending, same contract as torch.topk(largest=True, sorted=True). Correctness tolerance is 1e-4 abs/rel on the values; index ties are handled leniently — what matters is the returned values match the reference to tolerance, not that the index sets are bitwise identical. The shapes you have to handle are batch=1 n=131072 k=64 (decoder vocab top-k over a Llama-size vocabulary), batch=64 n=8192 k=8 (prefill / attention top-k), batch=32 n=16384 k=32 (mid-size batched), batch=16 n=12000 k=16 (non-power-of-two n, which bitonic networks don't naturally want), and batch=128 n=4096 k=1 (batched argmax).
4:
5: This needs to be a real custom kernel — CUDA C++ via torch.utils.cpp_extension.load_inline, Triton, inline PTX, or CUTLASS, whatever you think fits. Don't reach for torch.topk, torch.kthvalue, torch.sort, or torch.ar
... (749 more chars)
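For orientation, here is the bitonic comparator network the problem is named after, expressed with tensor ops for readability. This is a sketch, not a submission: a real solution.py would run the network in registers or shared memory inside a custom CUDA or Triton kernel, and would track indices alongside values.

```python
import torch

def bitonic_sort_desc(x: torch.Tensor) -> torch.Tensor:
    # Classic bitonic sorting network over the last dim, vectorized for
    # clarity. Assumes a power-of-two length (see the n=12000 shape for
    # why real kernels must pad). Sorts each row descending.
    n = x.shape[-1]
    assert n & (n - 1) == 0, "bitonic networks want power-of-two lengths"
    i = torch.arange(n, device=x.device)
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:        # comparator stride within this merge stage
            partner = i ^ j                # XOR pairing of lanes
            x_p = x[..., partner]
            hi = torch.maximum(x, x_p)
            lo = torch.minimum(x, x_p)
            desc = (i & k) == 0            # per-block direction; on the
                                           # final pass k == n, so every
                                           # block sorts descending
            first = i < partner            # lane is first in its pair
            x = torch.where(first == desc, hi, lo)
            j //= 2
        k *= 2
    return x
```

Top-k then reads the first k lanes of each sorted row; the usual optimization is to sort only small per-block candidate lists rather than the full row, but the comparator pattern is the same.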
"filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/sota.py"
}
<path>/home/infatoshi/cuda/KernelBench-Hard/problems/05_topk_bitonic/sota.py</path>
<type>file</type>
<content>
1: """SOTA reference for TopK: torch.topk itself.
2:
3: torch.topk dispatches to a CUB-backed kernel that uses radix-select for
4: moderate k and a tuned bitonic sort for small n. It is the bar the agent's
5: hand-rolled bitonic kernel must beat. There is no obvious vendor library that
6: does better on the (batch, n, k) shape mix we evaluate — Faiss BlockSelect is
7: specialized for k>=32 with much larger n, and CUB's DeviceSegmentedRadixSort
8: sorts the full row (overkill for top-k).
9:
10: This file is INTENTIONALLY allowed to call torch.topk because it is the SOTA
11: oracle, not the agent's submission. The agent's solution.py is forbidden from
12: using torch.topk (see problem.yaml.forbidden).
13: """
14: from __future__ import annotations
15:
16: import torch
17:
18:
19: def sota_forward(x: torch.Tensor, k: int):
20: """Best-available top-k reference. x: (batch, n) fp32."""
21: return torch.topk(x, k=k, dim=-1, largest=True, sorted=True)
22:
23:
24: def is_available() -> bool:
25: return True
(End of file - total 25 lines)
</content>