Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

DOI: 10.48550/arXiv.2311.09184
Publication Date: 2023-11
ABSTRACT
While large language models (LLMs) already achieve strong performance on standard generic summarization benchmarks, their performance in more complex task settings is less studied. Therefore, we benchmark LLMs on instruction controllable text summarization, where the model input consists of both a source article and a natural language requirement describing the desired summary characteristics. To this end, we curate an evaluation-only dataset for this task setting and conduct a human evaluation of 5 LLM-based summarization systems. We then benchmark LLM-based automatic evaluation for this task with 4 different evaluation protocols and 11 LLMs, resulting in 40 evaluation methods in total. Our study reveals that instruction controllable text summarization remains challenging for LLMs, since (1) all LLMs evaluated still make factual and other types of errors in their summaries; (2) all LLM-based evaluation methods cannot achieve a strong alignment with human annotators when judging the quality of candidate summaries; (3) different LLMs show large performance gaps in summary generation and evaluation. We make our collected benchmark, InstruSum, publicly available to facilitate future research in this direction.
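To make the task setup concrete, the sketch below illustrates the input format described in the abstract: a source article paired with a natural language requirement for generation, and additionally a candidate summary for LLM-based evaluation. This is a minimal illustration written for this page, not the paper's actual prompts or evaluation protocols; the prompt templates, the 1-5 direct-rating scheme, and the `call_llm` callable (a stand-in for any chat-completion API) are all assumptions.

```python
# Minimal sketch of instruction controllable summarization and one possible
# LLM-based evaluation protocol. Prompt templates and the `call_llm` callable
# are hypothetical; they are not taken from the InstruSum paper.

from typing import Callable

GENERATION_TEMPLATE = """Article:
{article}

Requirement: {requirement}

Write a summary of the article that satisfies the requirement."""

EVALUATION_TEMPLATE = """Article:
{article}

Requirement: {requirement}

Candidate summary:
{summary}

Rate how well the candidate summary satisfies the requirement on a 1-5 scale
and briefly justify the rating."""


def generate_summary(call_llm: Callable[[str], str], article: str, requirement: str) -> str:
    """Instruction controllable generation: (article, requirement) -> summary."""
    return call_llm(GENERATION_TEMPLATE.format(article=article, requirement=requirement))


def evaluate_summary(call_llm: Callable[[str], str], article: str,
                     requirement: str, summary: str) -> str:
    """A simple direct-rating evaluation of a candidate summary by an LLM."""
    return call_llm(EVALUATION_TEMPLATE.format(
        article=article, requirement=requirement, summary=summary))


if __name__ == "__main__":
    # Placeholder model; swap in a real LLM client to run the pipeline end to end.
    echo_llm = lambda prompt: f"[LLM output for a prompt of {len(prompt)} characters]"
    article = "..."
    requirement = "Focus on the main findings and keep the summary under 100 words."
    summary = generate_summary(echo_llm, article, requirement)
    print(evaluate_summary(echo_llm, article, requirement, summary))
```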