Yilong Li
Email / GitHub / Google Scholar / LinkedIn |
|
Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning |
|
Yilong Li, Suman Banerjee, Tong Che
Repeated sampling with a verifier is a standard way to allocate test-time compute for code generation, but drawing K independent samples from one answer distribution often wastes the pass@K budget on near-duplicate reasoning paths. This paper introduces Coordinated Pass@K Policy Optimization (CPPO), which turns pass@K generation into joint exploration over strategies. A planner emits a tuple of alternative high-level methods, and a shared solver attempts one solution per method. CPPO trains this joint planner-solver policy with a multiplicative planner reward that gives credit only when valid strategy tuples lead to verifier-confirmed pass@K success. Across APPS, CodeContests, and LiveCodeBench-v6, CPPO improves pass@4 over direct sampling, planning baselines, planner-only SFT, and pass@K-oriented reinforcement learning under the same solver-attempt budget. |
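To make the multiplicative planner reward concrete, here is a minimal Python sketch, not the paper's implementation. It assumes the reward is the product of two binary terms: a validity indicator for the strategy tuple and a pass@K indicator that is 1 when at least one solver attempt is accepted by the verifier. The function names (`tuple_is_valid`, `planner_reward`), the validity rule (exactly K non-empty, pairwise distinct strategies), and the toy verifier are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from typing import Callable, Sequence


def tuple_is_valid(strategies: Sequence[str], k: int) -> bool:
    """Hypothetical validity rule: exactly k non-empty, pairwise distinct strategies."""
    cleaned = [s.strip().lower() for s in strategies]
    return len(cleaned) == k and all(cleaned) and len(set(cleaned)) == k


def planner_reward(
    strategies: Sequence[str],
    solutions: Sequence[str],
    verifier: Callable[[str], bool],
    k: int,
) -> float:
    """Assumed multiplicative reward: validity indicator times pass@K indicator.

    The planner earns credit only when the tuple is valid AND at least one
    of the solver's attempts (one per strategy) passes the verifier.
    """
    valid = 1.0 if tuple_is_valid(strategies, k) else 0.0
    passed = 1.0 if any(verifier(sol) for sol in solutions) else 0.0
    return valid * passed


# Toy verifier standing in for unit-test execution (an assumption for this sketch).
def toy_verifier(solution: str) -> bool:
    return solution == "ok"


plan = ["greedy", "dynamic programming", "binary search", "graph BFS"]
reward_hit = planner_reward(plan, ["bad", "ok", "bad", "bad"], toy_verifier, k=4)
reward_miss = planner_reward(plan, ["bad"] * 4, toy_verifier, k=4)
reward_dup = planner_reward(["greedy"] * 4, ["ok"] * 4, toy_verifier, k=4)
```

Under these assumptions, a degenerate plan that repeats one strategy earns no reward even when its solutions pass, which is the property the product form is meant to enforce: the planner cannot collect credit from solver success alone.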