Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

DOI: 10.48550/arXiv.2311.09184
Publication Date: 2023-11
ABSTRACT
While large language models (LLMs) already achieve strong performance on standard generic summarization benchmarks, their performance in more complex task settings is less studied. Therefore, we benchmark LLMs on instruction controllable text summarization, where the model input consists of both a source article and a natural language requirement describing the desired summary characteristics. To this end, we curate an evaluation-only dataset for this task setting and conduct a human evaluation of 5 LLM-based summarization systems. We then benchmark LLM-based automatic evaluation for this task with 4 different evaluation protocols and 11 LLMs, resulting in 40 evaluation methods in total. Our study reveals that instruction controllable text summarization remains challenging for LLMs, since (1) all LLMs evaluated still make factual and other types of errors in their summaries; (2) all LLM-based evaluation methods cannot achieve a strong alignment with human annotators when judging the quality of candidate summaries; (3) different LLMs show large performance gaps in summary generation and evaluation. We make our collected benchmark, InstruSum, publicly available to facilitate future research in this direction.
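To make the task setup concrete, the sketch below illustrates the input format described in the abstract: a source article paired with a natural language requirement for generation, and additionally a candidate summary for LLM-based evaluation. This is a minimal illustration written for this page, not the paper's actual prompts or evaluation protocols; the prompt templates, the 1-5 direct-rating scheme, and the `call_llm` callable (a stand-in for any chat-completion API) are all assumptions.

```python
# Minimal sketch of instruction controllable summarization and one possible
# LLM-based evaluation protocol. Prompt templates and the `call_llm` callable
# are hypothetical; they are not taken from the InstruSum paper.

from typing import Callable

GENERATION_TEMPLATE = """Article:
{article}

Requirement: {requirement}

Write a summary of the article that satisfies the requirement."""

EVALUATION_TEMPLATE = """Article:
{article}

Requirement: {requirement}

Candidate summary:
{summary}

Rate how well the candidate summary satisfies the requirement on a 1-5 scale
and briefly justify the rating."""


def generate_summary(call_llm: Callable[[str], str], article: str, requirement: str) -> str:
    """Instruction controllable generation: (article, requirement) -> summary."""
    return call_llm(GENERATION_TEMPLATE.format(article=article, requirement=requirement))


def evaluate_summary(call_llm: Callable[[str], str], article: str,
                     requirement: str, summary: str) -> str:
    """A simple direct-rating evaluation of a candidate summary by an LLM."""
    return call_llm(EVALUATION_TEMPLATE.format(
        article=article, requirement=requirement, summary=summary))


if __name__ == "__main__":
    # Placeholder model; swap in a real LLM client to run the pipeline end to end.
    echo_llm = lambda prompt: f"[LLM output for a prompt of {len(prompt)} characters]"
    article = "..."
    requirement = "Focus on the main findings and keep the summary under 100 words."
    summary = generate_summary(echo_llm, article, requirement)
    print(evaluate_summary(echo_llm, article, requirement, summary))
```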