V-STaR: Training Verifiers for Self-Taught Reasoners

The authors show that using both the correct and the incorrect solutions generated during self-improvement helps when training a verifier with DPO to judge the correctness of model-generated solutions. This can improve test accuracy by 4-17%.
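To make the idea concrete, here is a minimal sketch of how correct and incorrect solutions to the same problem could be paired up as preference data for DPO. It is an illustration, not the authors' code: the function `build_preference_pairs`, the input fields `problem`, `solution`, and `is_correct`, and the output fields `prompt`, `chosen`, and `rejected` are all assumptions made for this example.

```python
from itertools import product


def build_preference_pairs(samples):
    """Pair every correct solution with every incorrect one, per problem.

    `samples` is a list of dicts with the (assumed) keys
    'problem', 'solution', and 'is_correct'.
    """
    by_problem = {}
    for s in samples:
        bucket = by_problem.setdefault(
            s["problem"], {"correct": [], "incorrect": []}
        )
        key = "correct" if s["is_correct"] else "incorrect"
        bucket[key].append(s["solution"])

    pairs = []
    for problem, bucket in by_problem.items():
        # A problem contributes pairs only if it has both kinds of solution.
        for chosen, rejected in product(bucket["correct"], bucket["incorrect"]):
            pairs.append(
                {"prompt": problem, "chosen": chosen, "rejected": rejected}
            )
    return pairs


# Toy data: hypothetical model-generated solutions with correctness labels.
samples = [
    {"problem": "2 + 3 = ?", "solution": "5", "is_correct": True},
    {"problem": "2 + 3 = ?", "solution": "five, i.e. 5", "is_correct": True},
    {"problem": "2 + 3 = ?", "solution": "6", "is_correct": False},
    {"problem": "5 * 2 = ?", "solution": "10", "is_correct": True},
]

pairs = build_preference_pairs(samples)
for pair in pairs:
    print(pair)
```

In this sketch the second problem yields no pairs because it has no incorrect solution. A verifier trained on such pairs could then be used to rank several candidate solutions at test time.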
