TriMul Outgoing — CUDA Kernel Optimization
RUN_002
Steps 0–62 · H100 SXM5
lr = 4e-5 · PUCT buffer
2025-05-25
↗ W&B (original)
↗ W&B (fork)
Best kernel latency: 1185 µs
Peak reward (step 48): 1.265
Peak correctness: 90.6%
Steps: 63
[Figure: Kernel latency and reward over training steps]
[Figure: Correctness per step]
Model: gpt-oss-120b (Tinker)
Learning rate: 4e-5 (constant)
PUCT buffer: 82 → 216
Episodes / step: 32 (16 failed)
Reference latency: 1500 µs (reward = 1.0)
Hardware: H100 SXM5, CUDA 12.4
Checkpoint store: Tinker / HF Hub
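
The reference latency of 1500 µs is pinned to reward = 1.0, and the reported peak reward of 1.265 is close to 1500 / 1185 ≈ 1.266. That is consistent with, but does not confirm, a reward defined as reference latency divided by measured latency. A minimal sketch under that assumption follows; the function name `latency_reward` and the zero reward for failed kernels are hypothetical and not taken from the run's code.

```python
# Assumed reward shape: reference latency / measured latency.
# Only the 1500 µs reference value comes from the run summary; the rest is illustrative.
REFERENCE_LATENCY_US = 1500.0


def latency_reward(latency_us: float, correct: bool = True) -> float:
    """Return the assumed speedup-style reward for one kernel evaluation."""
    if not correct or latency_us <= 0:
        # Assumption: incorrect or failed kernels earn no reward.
        return 0.0
    return REFERENCE_LATENCY_US / latency_us


print(round(latency_reward(1500.0), 3))  # reference latency
print(round(latency_reward(1185.0), 3))  # best reported kernel latency
```

Under this assumption the best kernel latency maps to roughly 1.266, slightly above the reported 1.265, so the run's actual reward may round differently or may have been computed from a different latency at step 48.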