TriMul CUDA Kernel Optimization

Correctness per step

Best-kernel correctness over test-time RL, read as exploration vs. exploitation. After a warm-up in which no kernel yet passes, correctness climbs as the policy learns to emit valid Triton, then plateaus near its peak. It dips sharply mid-run: the entropic, max-reward objective and PUCT initial-state selection push the search to branch off its best kernels into faster but riskier rewrites, so many variants temporarily fail. Exploitation then re-anchors on the validated high-reward kernels, and correctness climbs back to its earlier highs, now at much lower latency. The dip is exploration buying the later speedup, not a regression. See the TTT-Discover paper, “Learning to Discover at Test Time.”
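To make the exploration/exploitation reading concrete, here is a minimal sketch of PUCT-style initial-state selection over a buffer of kernels. It uses the generic AlphaZero-style PUCT score (value plus a visit-discounted prior bonus); the exact scoring used in this run is not given here, so the formula, the `BufferEntry` fields, and the `c_puct` constant are all illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class BufferEntry:
    """One candidate kernel in the buffer (all fields are hypothetical)."""
    name: str
    reward: float    # best reward observed when starting from this kernel
    prior: float     # prior weight for picking this kernel
    visits: int = 0  # how often it has been used as an initial state


def puct_score(entry: BufferEntry, total_visits: int, c_puct: float = 1.0) -> float:
    """Generic PUCT score: exploitation term plus an exploration bonus."""
    exploit = entry.reward
    # The bonus shrinks as this entry is visited more often.
    explore = c_puct * entry.prior * math.sqrt(total_visits + 1) / (1 + entry.visits)
    return exploit + explore


def select_initial_state(buffer: list[BufferEntry], c_puct: float = 1.0) -> BufferEntry:
    """Pick the buffer entry with the highest PUCT score."""
    total = sum(e.visits for e in buffer)
    return max(buffer, key=lambda e: puct_score(e, total, c_puct))


if __name__ == "__main__":
    buffer = [
        BufferEntry("validated_fast_kernel", reward=1.0, prior=0.5, visits=50),
        BufferEntry("untried_rewrite", reward=0.6, prior=0.5, visits=0),
    ]
    print(select_initial_state(buffer, c_puct=1.0).name)  # exploration-heavy
    print(select_initial_state(buffer, c_puct=0.0).name)  # pure exploitation
```

With a nonzero `c_puct`, a rarely visited entry can outrank the current best kernel, which is one way a search could branch into riskier rewrites; as visit counts grow, the bonus decays and selection falls back to the high-reward entries.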

Model: gpt-oss-120b (Tinker)
Learning rate: 4e-5 (constant)
PUCT buffer: 82 → 216
Episodes / step: 32 (16 failed)
Reference latency: 1500 µs (reward = 1.0)
Hardware: H100 SXM5, CUDA 12.4
Checkpoint store: Tinker / HF Hub
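The reference-latency setting pins a reward of 1.0 to a 1500 µs kernel. One plausible way to read that is as an inverse-latency ratio gated on correctness, sketched below. Only the 1500 µs anchor comes from the settings above; the ratio form, the zero reward for failing kernels, and the name `kernel_reward` are assumptions made for illustration.

```python
REFERENCE_LATENCY_US = 1500.0  # from the run settings: 1500 µs maps to reward 1.0


def kernel_reward(latency_us: float, is_correct: bool) -> float:
    """Hypothetical reward: reference latency divided by measured latency.

    Assumes incorrect kernels (or invalid timings) receive zero reward.
    """
    if not is_correct or latency_us <= 0:
        return 0.0
    return REFERENCE_LATENCY_US / latency_us


if __name__ == "__main__":
    print(kernel_reward(1500.0, True))   # kernel at the reference latency
    print(kernel_reward(750.0, True))    # kernel twice as fast as the reference
    print(kernel_reward(750.0, False))   # fast but incorrect kernel
```

Under this reading, a faster kernel only pays off once it passes correctness checks, which fits the caption's picture of risky rewrites failing first and then being validated.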

TriMul Outgoing: CUDA Kernel Optimization (RUN_002)