RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

Developments

RLEF trains LLMs to use inference-time execution feedback through large-scale RL. With it, even the 8B Llama 3.1 beats GPT-4 on CodeContests, and the 70B model reaches SOTA.

Author summary:

LLMs for code should do much better if they can iterate on tests, but they don't. Our new work (RLEF) addresses this by providing execution feedback during RL training, so that models learn to make use of execution feedback at inference time.

Notably, RLEF models are very sample-efficient at inference. Competitive programming questions are often approached by sampling a large number of candidate programs; we can reach SOTA with at most 3 samples.
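
To make "iterating on tests" concrete, here is a minimal sketch of an inference-time feedback loop in Python. It is not the authors' implementation: the `generate` callable, the `(stdin, expected_stdout)` test format, the feedback message wording, and the `max_attempts=3` budget are all assumptions chosen for illustration, and how the paper counts "samples" is not stated in this summary.

```python
import subprocess
import sys
from typing import Callable, List, Tuple

# Hypothetical test format: (stdin text, expected stdout text).
Test = Tuple[str, str]


def run_tests(code: str, tests: List[Test], timeout: float = 5.0) -> List[str]:
    """Run `code` as a Python program on each test; return failure messages."""
    failures = []
    for stdin, expected in tests:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin, capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            failures.append(f"input {stdin!r}: timed out")
            continue
        if proc.returncode != 0:
            failures.append(f"input {stdin!r}: error: {proc.stderr.strip()[-200:]}")
        elif proc.stdout.strip() != expected.strip():
            failures.append(
                f"input {stdin!r}: expected {expected.strip()!r}, "
                f"got {proc.stdout.strip()!r}"
            )
    return failures


def solve_with_feedback(
    generate: Callable[[List[str]], str],  # hypothetical model call
    problem: str,
    tests: List[Test],
    max_attempts: int = 3,  # assumed budget, for illustration only
) -> Tuple[str, bool, int]:
    """Generate code, execute it, and feed failures back until tests pass."""
    history = [problem]
    code = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(history)
        failures = run_tests(code, tests)
        if not failures:
            return code, True, attempt
        # Append the failed attempt and its execution feedback to the dialog.
        history += [code, "Tests failed:\n" + "\n".join(failures)]
    return code, False, max_attempts
```

Here `generate` stands in for any model call that maps the conversation so far to a candidate program. The loop only covers inference; the point of RLEF, per the summary above, is that the model is trained with RL to actually benefit from the feedback turns.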

CGPO - Constrained Generative Policy Optimization
