Benchmarking Large Language Models for News Summarization
DOI: 10.48550/arxiv.2301.13848
Publication Date: 2023-01-01
AUTHORS (6)
ABSTRACT
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their success are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find that instruction tuning, not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human-written summaries.
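For illustration only (this is not code from the paper), the minimal Python sketch below shows what a zero-shot summarization prompt of the kind evaluated here might look like for an instruction-tuned LLM; the article text, the prompt wording, and the query_llm stub are all assumptions.

    # Illustrative sketch only: building a zero-shot news summarization prompt
    # for an instruction-tuned LLM. No in-context examples are included, which
    # is what distinguishes the zero-shot setting from few-shot prompting.

    ARTICLE = (
        "The city council voted on Tuesday to approve a new transit plan that "
        "adds three bus routes and extends subway hours on weekends."
    )

    def build_zero_shot_prompt(article: str) -> str:
        """Wrap a news article in a bare instruction, with no demonstrations."""
        return f"Article: {article}\n\nSummarize the above article in two or three sentences."

    def query_llm(prompt: str) -> str:
        """Hypothetical stand-in for a call to an instruction-tuned LLM API."""
        raise NotImplementedError("Replace with the model or API of your choice.")

    if __name__ == "__main__":
        print(build_zero_shot_prompt(ARTICLE))
        # summary = query_llm(build_zero_shot_prompt(ARTICLE))

A few-shot variant of this setup would prepend one or more article/summary pairs ahead of the final instruction, which is the contrast the abstract draws when comparing zero-shot, few-shot, and finetuned performance.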