HELM Instruct: A Multidimensional Instruction Following Evaluation Framework with Absolute Ratings

Developments The authors create HELM Instruct, a framework that uses multiple LLMs as evaluators to rate the responses of multiple models to given input instructions. Each response is rated on five criteria: Helpfulness, Understandability, Completeness, Conciseness, and Harmlessness.
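To make the setup concrete, here is a minimal sketch of how absolute ratings from several evaluators could be averaged per model and criterion. The tuple layout, the evaluator and model names, the numeric scores, and the use of a simple mean are all illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
from statistics import mean

# The five criteria named in the summary above.
CRITERIA = [
    "helpfulness",
    "understandability",
    "completeness",
    "conciseness",
    "harmlessness",
]


def aggregate_ratings(ratings):
    """Average absolute ratings per (model, criterion) over all evaluators.

    `ratings` is an iterable of (evaluator, model, criterion, score) tuples.
    This layout is a hypothetical one chosen for illustration.
    """
    buckets = defaultdict(list)
    for evaluator, model, criterion, score in ratings:
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion}")
        buckets[(model, criterion)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Made-up ratings (hypothetical evaluators, models, and scores).
example = [
    ("evaluator_a", "model_x", "helpfulness", 4),
    ("evaluator_b", "model_x", "helpfulness", 5),
    ("evaluator_a", "model_y", "helpfulness", 3),
    ("evaluator_b", "model_y", "helpfulness", 2),
]
summary = aggregate_ratings(example)
```

Because each rating is absolute rather than a pairwise preference, scores for different models can be averaged and compared directly on each criterion.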

The evaluation rubric is as follows: [image: evaluation rubric]

Results They find that GPT-4 generally performs best across all metrics. Interestingly, however, they do not find high consistency among the evaluators. [image]
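One way to quantify consistency between two evaluators is to correlate their ratings on the same set of responses. The summary does not say which agreement measure the authors used, so the Pearson correlation below is an assumed stand-in, and the ratings are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings from three evaluators on the same five responses.
evaluator_a = [5, 4, 3, 2, 1]
evaluator_b = [5, 4, 3, 2, 1]  # agrees perfectly with evaluator_a
evaluator_c = [1, 2, 3, 4, 5]  # ranks the responses in the opposite order

agreement_ab = pearson(evaluator_a, evaluator_b)
agreement_ac = pearson(evaluator_a, evaluator_c)
```

Values near 1 indicate that two evaluators rate responses similarly, while values near 0 or below indicate the kind of low consistency reported here.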

Generalization ability
