Yilong Li

Email  /  GitHub  /  Google Scholar  /  LinkedIn


Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning

Yilong Li, Suman Banerjee, Tong Che
arXiv preprint, 2026
arxiv / pdf / link

Repeated sampling with a verifier is a standard way to allocate test-time compute for code generation, but drawing K independent samples from one answer distribution often wastes the pass@K budget on near-duplicate reasoning paths. This paper introduces Coordinated Pass@K Policy Optimization (CPPO), which turns pass@K generation into joint exploration over strategies. A planner emits a tuple of alternative high-level methods, and a shared solver attempts one solution per method. CPPO trains this joint planner-solver policy with a multiplicative planner reward that gives credit only when valid strategy tuples lead to verifier-confirmed pass@K success. Across APPS, CodeContests, and LiveCodeBench-v6, CPPO improves pass@4 over direct sampling, planning baselines, planner-only SFT, and pass@K-oriented reinforcement learning under the same solver-attempt budget.
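As a rough illustration of the coordinated setup described above, the following is a minimal sketch, not the paper's implementation. It assumes the planner, solver, and verifier are plain callables, that the "multiplicative" reward is the product of a validity indicator and a pass@K success indicator, and that a tuple is "valid" when it holds exactly K distinct, non-empty strategies. The names `is_valid_tuple`, `planner_reward`, and `coordinated_attempts` are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def is_valid_tuple(strategies: Sequence[str], k: int) -> bool:
    """Hypothetical validity check (an assumption, not the paper's rule):
    exactly k non-empty, pairwise distinct strategies."""
    cleaned = [s.strip().lower() for s in strategies]
    return len(cleaned) == k and all(cleaned) and len(set(cleaned)) == k


def planner_reward(
    strategies: Sequence[str],
    solutions: Sequence[str],
    verifier: Callable[[str], bool],
    k: int,
) -> float:
    """Product of two indicators, so the planner gets credit only when the
    tuple is valid AND at least one of the K attempts passes the verifier."""
    validity = 1.0 if is_valid_tuple(strategies, k) else 0.0
    solved = 1.0 if any(verifier(code) for code in solutions) else 0.0
    return validity * solved


def coordinated_attempts(
    problem: str,
    planner: Callable[[str, int], Sequence[str]],
    solver: Callable[[str, str], str],
    k: int,
) -> Tuple[List[str], List[str]]:
    """One planner call yields k strategies; the shared solver makes one
    attempt per strategy, so the solver-attempt budget stays at k."""
    strategies = list(planner(problem, k))
    solutions = [solver(problem, strategy) for strategy in strategies]
    return strategies, solutions
```

Under these assumptions, a planner that repeats the same strategy K times earns zero reward even if the solver succeeds, which is one simple way to discourage near-duplicate reasoning paths.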
