V-STaR: Training Verifiers for Self-Taught Reasoners

Development: The authors show that using both the correct and the incorrect model-generated solutions during training makes it possible to train a verifier with DPO that judges the correctness of those solutions. This can improve test accuracy by 4-17%.
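
To make the verifier-training idea concrete, here is a minimal sketch of how correct and incorrect solutions could be turned into preference pairs and scored with the standard DPO objective. It is an illustration under stated assumptions, not the authors' implementation: the function names `build_preference_pairs` and `dpo_loss`, the toy problem, the log-probability values, and the choice of `beta=0.1` are all hypothetical, and a real setup would obtain the log-probabilities from a policy model and a frozen reference model.

```python
import math
from itertools import product


def build_preference_pairs(problem, solutions):
    """Pair every correct solution (chosen) with every incorrect one (rejected).

    `solutions` is a list of (solution_text, is_correct) tuples. Pairing all
    combinations is an assumption made for this sketch.
    """
    correct = [text for text, ok in solutions if ok]
    incorrect = [text for text, ok in solutions if not ok]
    return [
        {"prompt": problem, "chosen": c, "rejected": r}
        for c, r in product(correct, incorrect)
    ]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    margin = beta * [(policy - reference) log-ratio of the chosen solution
                     minus the same log-ratio of the rejected solution].
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical toy data: two correct solutions and one incorrect solution.
solutions = [("2 + 2 = 4", True), ("2 + 2 = 5", False), ("4", True)]
pairs = build_preference_pairs("What is 2 + 2?", solutions)

# Made-up log-probabilities; the policy prefers the chosen solution
# slightly more than the reference does, so the loss falls below log(2).
loss = dpo_loss(-3.0, -6.0, -4.0, -5.0)
```

With identical policy and reference log-probabilities the margin is zero and the loss equals log 2; the loss decreases as the policy assigns relatively more probability to correct solutions than to incorrect ones. At test time, a verifier trained this way can be used to rank candidate solutions.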