HELM Instruct: A Multidimensional Instruction Following Evaluation Framework with Absolute Ratings

Developments The authors create HELM Instruct, which uses multiple LLMs as evaluators to rate the responses of multiple models to given input instructions. Responses are rated on five criteria: Helpfulness, Understandability, Completeness, Conciseness, and Harmlessness.
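To make the setup concrete, here is a minimal sketch of how ratings from several evaluators might be collected and averaged per model and criterion. The tuple layout, the function name `aggregate_ratings`, the model and evaluator names, and the 1-to-5 scale are all assumptions made for illustration; none of them is taken from the paper's code.

```python
from collections import defaultdict
from statistics import mean

# The five criteria named in the summary above.
CRITERIA = (
    "helpfulness",
    "understandability",
    "completeness",
    "conciseness",
    "harmlessness",
)


def aggregate_ratings(ratings):
    """Average absolute ratings per (model, criterion) across evaluators.

    `ratings` is an iterable of (model, evaluator, criterion, score) tuples.
    This layout is a hypothetical one chosen for the sketch.
    """
    buckets = defaultdict(list)
    for model, evaluator, criterion, score in ratings:
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion!r}")
        buckets[(model, criterion)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Toy, made-up ratings (assumed 1-5 scale), not results from the paper.
toy_ratings = [
    ("model_a", "evaluator_1", "helpfulness", 4),
    ("model_a", "evaluator_2", "helpfulness", 5),
    ("model_b", "evaluator_1", "helpfulness", 3),
]
averages = aggregate_ratings(toy_ratings)
```

Because the ratings are absolute rather than pairwise, each model gets a score per criterion that can be averaged directly, without needing a head-to-head comparison against another model.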

The evaluation rubric is as follows

Results They find that GPT-4 generally performs best across all metrics. Interestingly, however, agreement among the evaluators is not high.
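One simple way to quantify consistency among evaluators is the pairwise Pearson correlation between their scores on the same set of responses. The sketch below is an illustration only: the summary does not say which agreement statistic the authors use, and the function names and evaluator labels are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def pairwise_agreement(scores_by_evaluator):
    """Correlation for every evaluator pair.

    `scores_by_evaluator` maps an evaluator name to its list of scores,
    where position i refers to the same response for every evaluator.
    """
    names = sorted(scores_by_evaluator)
    return {
        (a, b): pearson(scores_by_evaluator[a], scores_by_evaluator[b])
        for a, b in combinations(names, 2)
    }
```

Values near 1 indicate that two evaluators rank responses similarly; values near 0 indicate little agreement. Other statistics, such as Cohen's kappa or Krippendorff's alpha, would be equally reasonable choices for ordinal ratings.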

Generalization ability