opencode / ? — KernelBench-Hard

harness

opencode

model

turns

tools called

events

input toks

166,365

output toks

638

cache hit

final answer

result.json

Now let me understand the eval infrastructure and hardware specs:

{
    "problem": "03_paged_attention",
    "harness": "opencode",
    "model": "openrouter-pinned/qwen/qwen3.6-max-preview",
    "reasoning_effort": "",
    "has_solution": false,
    "correct": false,
    "peak_fraction": null,
    "elapsed_seconds": 48,
    "harness_exit_code": 0,
    "session_complete": true,
    "usage": {"input_tokens": 166365, "output_tokens": 638, "cache_read_tokens": 0, "cache_creation_tokens": 0, "reasoning_tokens": 441, "total_cost_usd": null}
}

timeline (33 events)

assistant

skill

args

{
  "name": "kernel-optimization"
}

tool

stdout (2962 chars) — TRUNCATED

<skill_content name="kernel-optimization">
# Skill: kernel-optimization

# Kernel Optimization

Unified reference for GPU/TPU kernel work across all platforms. This skill routes to specialized sub-references based on what you need.

## When this skill fires

- Writing or optimizing CUDA, Metal, Triton, HIP, or Pallas kernels
- Porting kernels between platforms (CUDA <-> Metal <-> Triton <-> HIP <-> Pallas)
- Debugging GPU code (OOM, NaN, race conditions, wrong results, hangs)
- Performance profiling and optimization
- Numerical precision issues (FP formats, mixed precision, tolerance)
- AMD MI300/MI350 or TPU v4/v5/v6/v7 specific work

## Sub-references (read on demand)

The following files contain deep platform-specific knowledge. Read the relevant one(s) based on the task:

| File | When to read |
|------|-------------|
| `~/.claude/skills/amd-gpu/SKILL.md` | AMD CDNA/RDNA, ROCm, HIP, MFMA, LDS, chiplet scheduling, HipKittens, AITER |
| `~/.claude/skills/tpu-opt/SKILL.md` | TPU v4-v7, MXU, Pallas kernels, JAX sharding, XLA gotchas, SplashAttention |
| `~/.claude/skills/port-kernel/SKILL.md` | Cross-platform porting: CUDA <-> Metal <-> Triton concept maps, API equivalents, gotchas |
| `~/.claude/skills/debug-gpu/SKILL.md` | OOM diagnosis, NaN tracing, race conditions, compute-sanitizer, FP format specs, tolerance tables |
| `~/.claude/skills/perf-optim/SKILL.md` | Profiling (nsys/ncu/Metal), roofline, occupancy, memory hierarchy, CUTLASS/CuTe, quantization, KernelBench |
| `

... (1462 more chars)

assistant

in 16685out 32cache_r 0cache_w 0

assistant

read /home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention

args

{
  "filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention"
}

tool

stdout (212 chars)

<path>/home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention</path>
<type>directory</type>
<entries>
benchmark.py
check.py
problem.yaml
PROMPT.txt
reference.py
shapes.py
sota.py

(7 entries)
</entries>

assistant

read /home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/reference.py

args

{
  "filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/reference.py"
}

tool

stdout (5926 chars) — TRUNCATED

<path>/home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/reference.py</path>
<type>file</type>
<content>
1: """Naive PyTorch paged-attention decode reference (correctness oracle, not SOTA).
2: 
3: Single-query decode: each batch element has a query of shape (num_heads, head_dim)
4: and attends over a KV cache of `seq_len[b]` tokens stored as fixed-size pages in
5: a global pool. Pages for batch element b are listed in `block_table[b]`.
6: 
7: The reference performs the slow path:
8:   1. Gather pages -> contiguous (seq_len, num_kv_heads, head_dim) per batch element.
9:   2. Repeat KV heads for grouped-query (broadcast num_kv_heads -> num_heads).
10:   3. Manual softmax(QK^T / sqrt(d)) @ V in fp32, cast back to bf16.
11: 
12: This avoids torch.nn.functional.scaled_dot_product_attention (which is on the
13: forbidden list) so the agent cannot dispatch through SDPA either.
14: """
15: import math
16: 
17: import torch
18: import torch.nn as nn
19: 
20: OP_TYPE = "attention"
21: SUPPORTED_PRECISIONS = ["bf16"]
22: HARDWARE_REQUIRED = ["RTX_PRO_6000", "H100", "B200"]
23: 
24: 
25: # --- Shape knobs (overridden by check.py / benchmark.py from shapes.py) ----
26: BATCH = 8
27: NUM_HEADS = 32
28: NUM_KV_HEADS = 8
29: HEAD_DIM = 128
30: SEQ_LEN = 1024
31: PAGE_SIZE = 16
32: 
33: 
34: class Model(nn.Module):
35:     """Single-query paged attention decode.
36: 
37:     Forward inputs (all on device):
38:       query:       (batch, num_heads, head_dim)               bf16
3

... (4426 more chars)

assistant

read /home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/check.py

args

{
  "filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/check.py"
}

tool

stdout (4037 chars) — TRUNCATED

<path>/home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/check.py</path>
<type>file</type>
<content>
1: """Correctness runner for paged-attention decode.
2: 
3: Runs solution.Model vs reference.Model across all shapes in shapes.py, 3 seeds
4: each, with per-dtype atol/rtol. Also rejects forbidden ops by grep.
5: """
6: import re
7: import sys
8: from pathlib import Path
9: 
10: import torch
11: import yaml
12: 
13: REPO_ROOT = Path(__file__).resolve().parents[2]
14: sys.path.insert(0, str(REPO_ROOT))
15: 
16: from src.eval.correctness import check_correctness  # noqa: E402
17: 
18: 
19: def _apply_shape(reference, shape: dict) -> None:
20:     reference.BATCH = shape["batch"]
21:     reference.NUM_HEADS = shape["num_heads"]
22:     reference.NUM_KV_HEADS = shape["num_kv_heads"]
23:     reference.HEAD_DIM = shape["head_dim"]
24:     reference.SEQ_LEN = shape["seq_len"]
25:     reference.PAGE_SIZE = shape["page_size"]
26: 
27: 
28: def main():
29:     try:
30:         import reference
31:         import shapes
32:         import solution
33:     except Exception as e:
34:         print(f"FAIL: import error: {e}")
35:         sys.exit(1)
36: 
37:     problem_yaml = Path("problem.yaml")
38:     meta = yaml.safe_load(problem_yaml.read_text()) if problem_yaml.exists() else {}
39: 
40:     sol_src = Path("solution.py").read_text() if Path("solution.py").exists() else ""
41:     for forbidden in meta.get("forbidden", []):
42:         pat = re.escape(forbidden)
43:     

... (2537 more chars)

assistant

read /home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/benchmark.py

args

{
  "filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/benchmark.py"
}

tool

stdout (5063 chars) — TRUNCATED

<path>/home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/benchmark.py</path>
<type>file</type>
<content>
1: """Roofline benchmark for paged-attention decode.
2: 
3: For each shape: times eager reference, compiled reference, SOTA (if available),
4: and the agent's solution. Reports achieved TFLOPS, GB/s, and peak_fraction.
5: 
6: Decode is memory-bound, so peak_fraction is computed from achieved GB/s vs
7: the GPU's peak DRAM bandwidth.
8: """
9: import sys
10: from math import exp, log
11: from pathlib import Path
12: 
13: import torch
14: import yaml
15: 
16: REPO_ROOT = Path(__file__).resolve().parents[2]
17: sys.path.insert(0, str(REPO_ROOT))
18: 
19: from src.eval.roofline import compute_gbps, compute_tflops, peak_fraction  # noqa: E402
20: from src.eval.timing import time_fn  # noqa: E402
21: from src.hardware import get as get_hw  # noqa: E402
22: 
23: 
24: def _eval_formula(expr: str, vars: dict) -> float:
25:     return float(eval(expr, {"__builtins__": {}}, vars))
26: 
27: 
28: def _apply_shape(reference, shape: dict) -> None:
29:     reference.BATCH = shape["batch"]
30:     reference.NUM_HEADS = shape["num_heads"]
31:     reference.NUM_KV_HEADS = shape["num_kv_heads"]
32:     reference.HEAD_DIM = shape["head_dim"]
33:     reference.SEQ_LEN = shape["seq_len"]
34:     reference.PAGE_SIZE = shape["page_size"]
35: 
36: 
37: def main():
38:     import reference
39:     import shapes
40:     import solution
41: 
42:     meta = yaml.safe_load(Path("problem.y

... (3563 more chars)

assistant

in 17637out 187cache_r 0cache_w 0

assistant

read /home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/shapes.py

args

{
  "filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/shapes.py"
}

tool

stdout (1155 chars)

<path>/home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/shapes.py</path>
<type>file</type>
<content>
1: """Shape sweep for paged attention decode.
2: 
3: Mix targets:
4:   - small batch / long context (Llama-3 8B-style decode)
5:   - large batch / medium context (server batched decode)
6:   - GQA wide ratio (Llama-3 70B: 64 heads / 8 kv-heads)
7:   - non-power-of-2 seq_len (forces predicated tail handling)
8:   - head_dim=64 small-head case
9: """
10: 
11: SHAPES = [
12:     # (B, H, Hkv, D, L, P)
13:     {"batch": 8,  "num_heads": 32, "num_kv_heads": 8,  "head_dim": 128, "seq_len": 1024, "page_size": 16},
14:     {"batch": 32, "num_heads": 32, "num_kv_heads": 8,  "head_dim": 128, "seq_len": 2048, "page_size": 16},
15:     {"batch": 4,  "num_heads": 64, "num_kv_heads": 8,  "head_dim": 128, "seq_len": 4096, "page_size": 16},
16:     {"batch": 16, "num_heads": 32, "num_kv_heads": 8,  "head_dim": 128, "seq_len": 1535, "page_size": 16},  # non-pow2
17:     {"batch": 8,  "num_heads": 16, "num_kv_heads": 4,  "head_dim": 64,  "seq_len": 2000, "page_size": 16},  # small-D, non-pow2
18: ]

(End of file - total 18 lines)
</content>

assistant

read /home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/problem.yaml

args

{
  "filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/problem.yaml"
}

tool

stdout (2375 chars) — TRUNCATED

<path>/home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/problem.yaml</path>
<type>file</type>
<content>
1: name: 03_paged_attention
2: display_name: "Paged Attention Decode"
3: precision: bf16
4: regime: memory  # decode is bandwidth-bound (KV cache streamed once per token)
5: 
6: # Effective FLOPs per call: QK^T + softmax*V across all batches/heads/seq.
7: # Sum-formula evaluated per-shape; the benchmark substitutes batch/num_heads/etc.
8: # 2 * B * H * L * D for QK^T plus 2 * B * H * L * D for AV  =>  4 * B * H * L * D
9: flops_formula: "4 * batch * num_heads * seq_len * head_dim"
10: 
11: # Bytes moved (the real bottleneck): the KV cache must be streamed end-to-end.
12: #   K and V each:  batch * seq_len * num_kv_heads * head_dim * 2 bytes (bf16)
13: #   Q: batch * num_heads * head_dim * 2  (negligible vs KV)
14: #   Out: same as Q
15: # So total ~ 2 * (B * L * Hkv * D * 2) + small.
16: bytes_formula: "2 * batch * seq_len * num_kv_heads * head_dim * 2 + batch * num_heads * head_dim * 2 * 2"
17: 
18: hardware: [RTX_PRO_6000]
19: peak_tflops_key: bf16          # informational; regime=memory uses bandwidth
20: peak_bandwidth_key: dram        # 1.8 TB/s GDDR7 on RTX PRO 6000 Blackwell
21: 
22: tolerance:
23:   bfloat16: 0.02
24:   float16: 0.02
25: 
26: # Forbidden ops -- using any of these in solution.py fails correctness post-hoc.
27: # vllm.attention / flashinfer.batch_decode_with_paged_kv_cache: dispatching to
28: # the SOTA library is the trivial cheat. SD

... (875 more chars)

assistant

read /home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/sota.py

args

{
  "filePath": "/home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/sota.py"
}

tool

stdout (3303 chars) — TRUNCATED

<path>/home/infatoshi/cuda/KernelBench-Hard/problems/03_paged_attention/sota.py</path>
<type>file</type>
<content>
1: """SOTA reference for paged-attention decode.
2: 
3: Tries, in order:
4:   1. FlashInfer's BatchDecodeWithPagedKVCacheWrapper (preferred -- portable,
5:      supports SM120, GQA, arbitrary head_dim).
6:   2. vLLM's paged_attention_v2 CUDA op (requires its KV-cache layout, more
7:      finicky; we adapt the layout on the fly when possible).
8: 
9: If neither is importable, is_available() returns False and the benchmark just
10: reports eager + compiled + solution.
11: 
12: Agents are FORBIDDEN from importing these in solution.py (see problem.yaml).
13: This file is only for the benchmark's reference line.
14: """
15: from __future__ import annotations
16: 
17: import torch
18: 
19: 
20: def _try_flashinfer(
21:     query: torch.Tensor,
22:     kv_cache: torch.Tensor,
23:     block_table: torch.Tensor,
24:     seq_lens: torch.Tensor,
25:     num_kv_heads: int,
26:     head_dim: int,
27:     page_size: int,
28: ) -> torch.Tensor | None:
29:     try:
30:         import flashinfer  # noqa: F401
31:         from flashinfer.decode import BatchDecodeWithPagedKVCacheWrapper
32:     except Exception:
33:         return None
34: 
35:     B, H, D = query.shape
36:     # FlashInfer expects K and V as separate (num_blocks, page_size, num_kv_heads, head_dim) tensors.
37:     # Our reference packs [K|V] on the last dim -- split here.
38:     k_cache = kv_cache[..., :D].contig

... (1803 more chars)

assistant

in 23031out 144cache_r 0cache_w 0

assistant

read /home/infatoshi/cuda/KernelBench-Hard/src/eval/correctness.py

args

{
  "filePath": "/home/infatoshi/cuda/KernelBench-Hard/src/eval/correctness.py"
}

tool

stdout (4080 chars) — TRUNCATED

<path>/home/infatoshi/cuda/KernelBench-Hard/src/eval/correctness.py</path>
<type>file</type>
<content>
1: """Per-dtype correctness tolerance.
2: 
3: Stanford's KernelBench uses 1e-4 atol/rtol for fp32 and 1e-2 for fp16/bf16.
4: We extend to fp8 (0.1, permissive) and int (exact).
5: """
6: from __future__ import annotations
7: 
8: import torch
9: 
10: DEFAULT_TOLERANCE = {
11:     torch.float32: {"atol": 1e-4, "rtol": 1e-4},
12:     torch.float16: {"atol": 1e-2, "rtol": 1e-2},
13:     torch.bfloat16: {"atol": 1e-2, "rtol": 1e-2},
14:     torch.float8_e4m3fn: {"atol": 1e-1, "rtol": 1e-1},
15:     torch.float8_e5m2: {"atol": 1e-1, "rtol": 1e-1},
16:     torch.int8: {"atol": 0, "rtol": 0},
17:     torch.int32: {"atol": 0, "rtol": 0},
18:     torch.int64: {"atol": 0, "rtol": 0},
19: }
20: 
21: 
22: def _coerce_float(x):
23:     """Accept int/float/str scientific-notation as float; pass through dicts."""
24:     if isinstance(x, (int, float)):
25:         return float(x)
26:     if isinstance(x, str):
27:         try:
28:             return float(x)
29:         except ValueError:
30:             return x
31:     return x
32: 
33: 
34: def tolerance_for_dtype(dtype: torch.dtype, override: dict | None = None) -> dict:
35:     """Lookup atol/rtol for a given dtype, with optional per-problem override.
36: 
37:     The override dict is keyed by str(dtype). Values may be:
38:       - a single number (or a string parseable as float, e.g. "5e-2") -> used
39:         as both atol and rtol
4

... (2580 more chars)

assistant

read /home/infatoshi/cuda/KernelBench-Hard/src/hardware.py

args

{
  "filePath": "/home/infatoshi/cuda/KernelBench-Hard/src/hardware.py"
}

assistant

Now let me understand the eval infrastructure and hardware specs:

assistant

bash

args

{
  "command": "nvidia-smi --query-gpu=name,memory.total,compute_cap,clocks.sm --format=csv,noheader 2>/dev/null | head -5",
  "description": "Check GPU availability and specs"
}

tool

stdout (76 chars)

NVIDIA RTX PRO 6000 Blackwell Workstation Edition, 97887 MiB, 12.0, 180 MHz

assistant

in 25708out 175cache_r 0cache_w 0

assistant

read /home/infatoshi/cuda/KernelBench-Hard/src/hardware

args

{
  "filePath": "/home/infatoshi/cuda/KernelBench-Hard/src/hardware"
}

tool

stdout (171 chars)

<path>/home/infatoshi/cuda/KernelBench-Hard/src/hardware</path>
<type>directory</type>
<entries>
__init__.py
__pycache__/
m4_max.py
rtx_pro_6000.py

(4 entries)
</entries>

assistant

in 27464out 44cache_r 0cache_w 0

assistant

read /home/infatoshi/cuda/KernelBench-Hard/src/hardware/rtx_pro_6000.py

args

{
  "filePath": "/home/infatoshi/cuda/KernelBench-Hard/src/hardware/rtx_pro_6000.py"
}

tool

stdout (1170 chars)

<path>/home/infatoshi/cuda/KernelBench-Hard/src/hardware/rtx_pro_6000.py</path>
<type>file</type>
<content>
1: """RTX PRO 6000 Blackwell Workstation — SM120, consumer-lineage Blackwell.
2: 
3: Peak tensor-core throughputs are dense-matrix advertised peaks. Actual kernels
4: will see 60-85% of peak on well-tuned code.
5: """
6: from dataclasses import dataclass
7: 
8: 
9: @dataclass(frozen=True)
10: class HardwareTarget:
11:     name: str
12:     sm: str
13:     vram_gb: int
14:     peak_bandwidth_gb_s: float  # DRAM
15:     peak_tflops_dense: dict[str, float]  # dtype -> TFLOPS
16: 
17: 
18: RTX_PRO_6000 = HardwareTarget(
19:     name="RTX PRO 6000 Blackwell Workstation",
20:     sm="sm_120a",
21:     vram_gb=96,
22:     peak_bandwidth_gb_s=1800.0,
23:     peak_tflops_dense={
24:         "fp4": 800.0,
25:         "nvfp4": 800.0,
26:         "mxfp4": 800.0,
27:         "fp6": 800.0,
28:         "fp8": 400.0,
29:         "bf16": 200.0,
30:         "fp16": 200.0,
31:         "tf32": 100.0,
32:         "fp32": 12.0,  # non-tensor-core SIMT fp32
33:         "int8": 400.0,
34:         "int4": 800.0,
35:     },
36: )

(End of file - total 36 lines)
</content>

assistant

in 27617out 54cache_r 0cache_w 0

assistant

in 28223out 2cache_r 0cache_w 0