kernelbench.com

KernelBench-Mega · whole-block megakernels ● 3-hour ceiling · RTX PRO 6000 Blackwell + H100 + B200

KernelBench-Mega tests whole-block megakernels: instead of grading a single isolated op, the agent fuses an entire model block into one kernel. Problem 03_kimi_linear_decode is a Kimi-Linear W4A16 hybrid decode (4-bit weights, bf16 activations). The headline metric is the decode speedup over an optimized-PyTorch baseline, not a 0-1 roofline fraction: a speedup of 19.35x means decode runs roughly 19 times as fast as the reference. tok/s is decode tokens per second. Higher is better for both, and results are reported per GPU. The transcript is the headline artifact: it shows the model's full optimization journey from baseline to the final megakernel.
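As a rough illustration of how the two reported numbers relate, the sketch below assumes that speedup is the ratio of baseline to megakernel per-token decode latency, and that tok/s is the reciprocal of per-token latency. The function names and the timings are hypothetical; they are not taken from the benchmark harness.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Decode throughput, assuming one token is produced every `ms_per_token` ms."""
    return 1000.0 / ms_per_token


def decode_speedup(baseline_ms_per_token: float, kernel_ms_per_token: float) -> float:
    """Speedup of the megakernel over the baseline (assumed: ratio of latencies)."""
    return baseline_ms_per_token / kernel_ms_per_token


# Hypothetical timings, chosen only to reproduce the 19.35x example figure.
baseline_ms = 38.7   # optimized-PyTorch reference, ms per decoded token
kernel_ms = 2.0      # fused megakernel, ms per decoded token

speedup = decode_speedup(baseline_ms, kernel_ms)
throughput = tokens_per_second(kernel_ms)
print(f"speedup = {speedup:.2f}x, throughput = {throughput:.0f} tok/s")
```

Under these assumptions the two metrics move together on a fixed workload: halving per-token latency doubles both the speedup and the tok/s figure.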

Each run gets a single autonomous session under a 3-hour wall-clock ceiling; so far models have self-terminated well under it (the longest run is ~2.5h). All cells use the same ceiling, so results across the board are comparable. An empty speedup cell marks a DNF: the run hit the 3-hour timeout without finishing.

gpu · model · harness · correct · framework · files · conversation

speedup = decode speedup over an optimized-PyTorch baseline (bar width normalized to the fastest run on the board); the per-ctx breakdown shows speedup at 2k / 8k / 16k decode context. Top speedup per GPU is highlighted. Browse the run index for transcripts and solutions, or the mega benchmark source.
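The bar-width normalization and per-GPU highlighting described above can be sketched as follows. This is a minimal illustration, not the site's rendering code: the row layout, the run names, the speedup values other than 19.35, and the use of `None` for a DNF cell are all assumptions.

```python
from typing import Optional

# Hypothetical board rows: (gpu, run name, speedup or None for a timeout DNF).
Row = tuple[str, str, Optional[float]]


def bar_widths(rows: list[Row]) -> dict[str, float]:
    """Map each run to a bar width in [0, 1], normalized to the fastest run on the board."""
    finished = [s for _, _, s in rows if s is not None]
    fastest = max(finished) if finished else None
    return {
        run: (s / fastest if s is not None and fastest else 0.0)
        for _, run, s in rows
    }


def top_per_gpu(rows: list[Row]) -> dict[str, str]:
    """Return the run with the highest speedup on each GPU, skipping DNF rows."""
    best: dict[str, tuple[str, float]] = {}
    for gpu, run, s in rows:
        if s is None:
            continue
        if gpu not in best or s > best[gpu][1]:
            best[gpu] = (run, s)
    return {gpu: run for gpu, (run, _) in best.items()}


rows: list[Row] = [
    ("H100", "run-a", 19.35),
    ("H100", "run-b", 7.74),
    ("B200", "run-c", 9.675),
    ("B200", "run-d", None),  # DNF: empty speedup cell
]

widths = bar_widths(rows)
highlighted = top_per_gpu(rows)
```

Because widths are normalized to the single fastest run across all GPUs, bars are comparable across the whole board, while the highlight is computed separately within each GPU.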