
Feedback

In generative models, higher quality is generally achieved through feedback methods. Because token generation is greedy, maximizing the likelihood of the immediate token rather than of all subsequent tokens, the complete generation can easily be biased by tokens that do not lead to globally optimal responses. Feedback methods are designed to guide the generation of the entire sequence of next tokens to more successfully fulfill the intention of the calling prompts.
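To make this concrete, here is a minimal sketch of greedy decoding, using the Hugging Face transformers library and the public gpt2 checkpoint purely as an illustration: each step commits to the single most likely next token, so an early suboptimal choice constrains everything that follows.

```python
# Minimal sketch of greedy decoding: each step picks the single most likely
# next token, with no regard for how that choice affects later tokens.
# Assumes the Hugging Face `transformers` library and the public "gpt2" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The maze of tokens", return_tensors="pt").input_ids
for _ in range(20):
    logits = model(input_ids).logits           # (1, seq_len, vocab_size)
    next_id = logits[:, -1, :].argmax(dim=-1)  # locally optimal choice only
    input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)
    if next_id.item() == tokenizer.eos_token_id:
        break                                  # reached the 'destination'

print(tokenizer.decode(input_ids[0]))
```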

Navigating through a maze of tokens

The process of generating responses can be likened to navigating a maze of tokens. The final generation token, EOS (end of sequence), signifies the end of the output and the completion of a path through the maze: the 'destination'. The quality of this path depends on the individual steps taken while navigating the maze. It is possible to take wrong 'turns', resulting in a suboptimal path by the time the generation arrives at its destination. This is where feedback comes into play, guiding the path through the maze toward a better destination.

Feedback can be provided by humans, referred to as human-feedback (HF), or by AI, known as AI-feedback (AIF), or a combination of both.

Note: This is different from recursive_training, where a model is used to generate training examples to improve the training of a subsequent model.

Feedback-based model updates can be categorized into two types: those that use reinforcement learning (RL) and those that use RL-free feedback.

Reinforcement Learning from Human Feedback (RLHF) has enabled some of the most powerful and prominent models, such as GPT-4.

Key Takeaway

Feedback is a technique that trains a model to predict a more optimal sequence of token outputs conditioned on a given input.

Feedback

Feedback is generated from evaluations by people or AI of two or more outputs conditioned on an input prompt. These evaluations can be applied to the entirety of an output or specific portions of it. The evaluation results are then used to optimize the complete path.


Reinforcement learning based feedback

Reinforcement Learning (RL) uses the outcomes of a game, also known as a roll-out, to determine how to improve the choices or moves made during the game. In the context of Language Models, these moves are discrete and correspond to the next tokens that are produced.

A policy decides what action to take based on the current state. In a language model, the policy predicts a probability distribution over the possible next tokens given the tokens generated so far, and those choices shape the entire path of the output.
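As a rough illustration, again using a generic gpt2 checkpoint as a stand-in for the policy model, the policy at each step is simply the softmax distribution over the model's next-token logits; the log-probability of the sampled action is what later enters the policy-gradient estimate.

```python
# Sketch of a language model acting as an RL policy: the "state" is the tokens
# generated so far, the "action" is the next token, and the policy is the
# softmax distribution the model places over the vocabulary.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")

state = tokenizer("Summarize: the cat sat on the mat.", return_tensors="pt").input_ids
logits = policy(state).logits[:, -1, :]          # scores for every possible action
log_probs = F.log_softmax(logits, dim=-1)        # log pi_theta(a | s)
action = torch.multinomial(log_probs.exp(), 1)   # sample the next token
print(tokenizer.decode(action[0]), log_probs[0, action[0, 0]].item())
```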

The policy model generates a path of tokens intended to end with a reward as close as possible to the preferred outcome. Feedback, generally from humans or other models, is used to update the policy model. However, it is not practical to collect feedback for every variation of input data, given the volume of feedback that would be required.

A reward model is created to estimate how humans would evaluate the output. This model allows general human-informed guidance to help improve the policy model iteratively.
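A common implementation pattern, sketched below under the assumption of a gpt2-sized backbone, is a pretrained transformer with a scalar value head that maps the final hidden state to a single reward score; the names RewardModel and value_head are illustrative.

```python
# Sketch of a reward model: a pretrained transformer with a scalar "value head"
# that maps the final hidden state to a single score approximating human judgment.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids).last_hidden_state   # (B, T, H)
        return self.value_head(hidden[:, -1, :]).squeeze(-1)  # one score per sequence

tokenizer = AutoTokenizer.from_pretrained("gpt2")
rm = RewardModel()
score = rm(tokenizer("Prompt ... candidate response", return_tensors="pt").input_ids)
print(score)
```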

One of the most successful examples of this is InstructGPT, which follows the process outlined above. This method underlies ChatGPT and GPT-4.

Many RL methods use 'outcome' evaluations, but process reward models can be better

In Let's Verify Step By Step, the authors use human labelers to provide feedback on intermediate steps and demonstrate that this yields a reward model that performs considerably better on various math tests than outcome-based reward models do.

(Anthropic) Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback


RLHF

Training language models to follow instructions with human feedback

InstructGPT enables language models to follow instructions and established a powerful paradigm for improving LLM performance.

Learning to summarize from human feedback provides early successful examples of using PPO and human feedback to improve summaries.

RLHF Diagram

Policy

Proximal Policy Optimization

There are several policy gradient methods for optimizing the policy; a common one is Proximal Policy Optimization (PPO).

\[ \hat{g} = \hat{\mathbb{E}}_t \left[ \nabla_\theta \log \pi_\theta(a_t | s_t) \hat{A}_t \right] \]

TODO: Expand this based on Proximal Policy Optimization Algorithms
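Until that expansion, here is a minimal sketch of PPO's clipped surrogate objective, which keeps the updated policy close to the policy that generated the roll-out; the tensor names and the toy values are illustrative only.

```python
# Sketch of PPO's clipped surrogate objective for a batch of token-level actions.
# new_log_probs come from the current policy, old_log_probs from the policy that
# generated the roll-out, and advantages estimate how much better each action was
# than expected. All tensors share the shape (batch_size,).
import torch

def ppo_clip_loss(new_log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(new_log_probs - old_log_probs)               # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # maximize the surrogate

# Toy usage with random values, purely to show the shapes involved.
new_lp = torch.randn(8, requires_grad=True)
loss = ppo_clip_loss(new_lp, torch.randn(8), torch.randn(8))
loss.backward()
```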

Reward Models

A reward model is used to approximate the quality, or reward, that a labeler (a person) might assign to an example output.

While multiple examples may be ranked and used simultaneously, the reward model may be trained by considering only a winning and a losing example. The reward model produces scores \(s_w\) and \(s_l\) for the winning and losing examples, respectively.

The reward model is trained with the objective of incentivizing the winning response to have a higher score than the losing response. More specifically, it minimizes

\[ -\mathbb{E}_x\left[\log \sigma(s_w - s_l)\right] \]
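A minimal sketch of this pairwise objective in PyTorch; the scores are toy values, and a real reward model would produce them from winning and losing responses.

```python
# Sketch of the pairwise reward-model objective above: push the winning score
# s_w above the losing score s_l by minimizing -log(sigmoid(s_w - s_l)).
import torch
import torch.nn.functional as F

def reward_pair_loss(s_w: torch.Tensor, s_l: torch.Tensor) -> torch.Tensor:
    # Equivalent to -log(sigmoid(s_w - s_l)), written in a numerically stable form.
    return -F.logsigmoid(s_w - s_l).mean()

# Toy scores for a batch of (winning, losing) response pairs.
s_w = torch.tensor([1.5, 0.2, 0.9], requires_grad=True)
s_l = torch.tensor([0.3, 0.4, -1.0])
loss = reward_pair_loss(s_w, s_l)
loss.backward()
print(loss.item())
```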

TODO: Expand this to include more mathematics.

Process reward models

Much as the score partway through a ball game indicates which team is likely to win, a process reward model approximates the quality of the intermediate steps that lead to a final outcome.

Intermediate rewards provide better guidance on how token generation proceeds before the sequence terminates.
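A small illustration of the difference: a process reward model assigns a score to every intermediate step, and a solution-level score can then be aggregated from them, for example as the product or the minimum of the per-step probabilities. The numbers below are made up.

```python
# Sketch of how a process reward model (PRM) scores a multi-step solution:
# each intermediate step gets its own correctness probability, and a
# solution-level score is aggregated from them (product or minimum are
# common choices). The step probabilities here are made up for illustration.
import math

step_correct_probs = [0.98, 0.95, 0.60, 0.99]   # one score per reasoning step

product_score = math.prod(step_correct_probs)    # penalizes any weak step
min_score = min(step_correct_probs)              # score of the weakest step

print(f"product: {product_score:.3f}, min: {min_score:.3f}")
# An outcome reward model would instead assign a single score to the final
# answer, giving no signal about which step went wrong.
```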

Let's reward step by step; Step-Level Reward Model as the Navigators for Reasoning


RLAIF

Because it can greatly reduce the costs associated with collecting feedback, Reinforcement Learning from AI Feedback (RLAIF) has proved additionally valuable.
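A hedged sketch of the basic RLAIF data-collection step is shown below; the prompt template and the call_judge stub are illustrative placeholders, not the recipe from any particular paper.

```python
# Sketch of AI feedback: an LLM judge compares two candidate responses to the
# same prompt and emits a preference label, which can then replace a human
# label in reward-model or DPO training. `call_judge` is a placeholder for
# whatever judge model is available (an API call, a local model, etc.).
JUDGE_TEMPLATE = """You are grading two answers to the same question.
Question: {prompt}
Answer A: {a}
Answer B: {b}
Reply with exactly one letter, A or B, for the better answer."""

def call_judge(judge_prompt: str) -> str:
    # Placeholder: swap in a real judge model here.
    return "A"

def ai_preference(prompt: str, response_a: str, response_b: str) -> dict:
    verdict = call_judge(JUDGE_TEMPLATE.format(prompt=prompt, a=response_a, b=response_b))
    prefer_a = verdict.strip().upper().startswith("A")
    chosen, rejected = (response_a, response_b) if prefer_a else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

print(ai_preference("What is 2 + 2?", "4", "22"))
```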

📋
Self-Taught Evaluators

The authors show a method of using only synthetically generated preference data, with no human annotations, to iteratively train an LLM-as-a-Judge evaluator.

Abstract: Model-based evaluation is at the heart of successful model development – as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.


Starling-7B: Increasing LLM Helpfulness & Harmlessness with RLAIF provides a solid example of using RLAIF, with rankings generated by GPT-4, to create a 7B model that is almost as good as GPT-4.

They also released a dataset called Nectar with over 180k GPT-4-ranked outputs.

Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

Developments: The authors develop both Contrastive Learning from AI Revisions (CLAIR), a data-creation method that produces contrastive preference pairs, and Anchored Preference Optimization (APO), a stable alignment objective.

RLEF

📋
RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

Developments

Training LLMs to use inference-time feedback via large-scale RL makes even the 8B Llama 3.1 beat GPT-4 on CodeContests, and sets SOTA with the 70B model.

Author summary:

LLMs for code should do much better if they can iterate on tests -- but they don't. Our new work (RLEF) addresses this with execution feedback at RL training time to use execution feedback at inference time.

Notably, RLEF models are very sample efficient for inference. Competitive programming questions are often approached by sampling a large number of candidate programs; we can reach SOTA with just up to 3 samples.
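The loop the model is trained to exploit looks roughly like the sketch below; generate_code is a placeholder for the policy model and the tests are toy examples, so this illustrates the feedback structure rather than the paper's training setup.

```python
# Sketch of an execution-feedback loop of the kind RLEF trains models to exploit:
# run a candidate program against public tests, and if it fails, feed the error
# back to the model for another attempt. `generate_code` is a placeholder for
# the policy model; the loop structure is the point.
import subprocess
import sys
import tempfile
from typing import Optional

def generate_code(prompt: str, feedback: Optional[str] = None) -> str:
    # Placeholder: a real system would call the code LLM here, including the
    # feedback string from the previous failed attempt in the prompt.
    return "print(int(input()) * 2)"

def run_with_input(code: str, stdin: str) -> tuple:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    proc = subprocess.run([sys.executable, f.name], input=stdin,
                          capture_output=True, text=True, timeout=5)
    return proc.stdout.strip(), proc.stderr.strip()

prompt = "Read an integer and print its double."
tests = [("3", "6"), ("10", "20")]   # (stdin, expected stdout)
feedback = None
for attempt in range(3):
    code = generate_code(prompt, feedback)
    results = [(i, o, *run_with_input(code, i)) for i, o in tests]
    failures = [r for r in results if r[2] != r[1] or r[3]]
    if not failures:
        print(f"solved on attempt {attempt + 1}")
        break
    i, expected, got, err = failures[0]
    feedback = f"Test failed: input={i!r} expected={expected!r} got={got!r} err={err!r}"
```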


📋
V-STaR: Training Verifiers for Self-Taught Reasoners

Developments: The authors show that using both correct and incorrect solutions during training improves training a verifier with DPO to judge the correctness of model-generated solutions. This can result in a 4-17% improvement in test accuracy.

CGPO - Constrained Generative Policy Optimization

📋
The Perfect Blend: Redefining RLHF with Mixture of Judges

Developments: The authors show a novel method of post-training feedback optimization using three new scalable RLHF optimizers to deal with reward hacking in multi-task LLMs. Using two types of judges, rule-based and LLM-based, the system is able to evaluate LLM generations and detect violations across NLP tasks. For multi-task optimization, each task is managed individually with different optimization settings, reward models, judge mixes, and optimizer hyperparameters. The resulting system is able to reach SOTA in math, coding, engagement, and safety.


RL-free feedback

It is possible to provide feedback without using reinforcement learning. Using a technique called Direct Preference Optimization (DPO), models can be optimized directly on preference data without explicitly training a reward model over different output prompts. This method avoids several challenges associated with RL, including the need to iteratively train reward models and the stability issues often associated with reinforcement learning.
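A minimal sketch of the DPO objective from the paper linked below (arXiv:2305.18290): the tensors hold summed sequence log-probabilities under the policy and a frozen reference model, and the toy values are only there to show the call signature.

```python
# Sketch of the DPO objective: the policy is pushed to prefer the chosen
# response over the rejected one, measured relative to a frozen reference
# model, with no explicit reward model. Each tensor holds summed sequence
# log-probabilities with shape (batch_size,).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    chosen_rewards = policy_chosen_logps - ref_chosen_logps        # implicit reward of winner
    rejected_rewards = policy_rejected_logps - ref_rejected_logps  # implicit reward of loser
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()

# Toy values, purely to show the call signature.
loss = dpo_loss(torch.randn(4, requires_grad=True), torch.randn(4),
                torch.randn(4), torch.randn(4))
loss.backward()
```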

TODO: Integrate this: https://arxiv.org/pdf/2305.18290.pdf

TODO

Literature to read and integrate: https://arxiv.org/pdf/2211.14275.pdf https://arxiv.org/pdf/2308.01825.pdf