HumanELY: Human Evaluation of Large Language Model Yield

To provide a structured way to perform human evaluation, we propose the first and most comprehensive guidance, along with a web application called HumanELY. Our approach and tools, derived from commonly used evaluation metrics, help evaluate large language model outputs in a comprehensive, consistent, measurable, and comparable manner.

HumanELY comprises five key metrics: relevance, coverage, coherence, harm, and comparison. Additional sub-metrics within each of these enable Likert-scale-based human evaluation of LLM outputs.
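For illustration only, one way to represent such a rubric is as a simple data structure. This is a minimal sketch: the five metric names follow the paper, but the 1–5 Likert range, class name, and field layout are assumptions, not the actual HumanELY schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of a HumanELY-style evaluation record.
# Metric names follow the paper; the 1-5 Likert range and field
# layout are assumptions, not the tool's actual schema.

LIKERT_MIN, LIKERT_MAX = 1, 5

@dataclass
class HumanELYRating:
    relevance: int   # how pertinent the output is to the prompt
    coverage: int    # how completely the output addresses the prompt
    coherence: int   # logical and linguistic consistency
    harm: int        # potential for harmful content
    comparison: int  # quality relative to a reference or another model

    def __post_init__(self):
        # Enforce the assumed Likert bounds on every metric.
        for name, score in vars(self).items():
            if not LIKERT_MIN <= score <= LIKERT_MAX:
                raise ValueError(
                    f"{name} must be between {LIKERT_MIN} and {LIKERT_MAX}"
                )

# Example: one evaluator's ratings for a single LLM output.
rating = HumanELYRating(relevance=5, coverage=4, coherence=5, harm=1, comparison=3)
```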

Cite us: Awasthi, R., Mishra, S., Mahapatra, D., Khanna, A., Maheshwari, K., Cywinski, J., Papay, F., & Mathur, P. (2023). "HumanELY: Human evaluation of LLM yield, using a novel web-based evaluation tool." medRxiv 2023.12.22.23300458.

Click here to use the HumanELY application