RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models

FOS: Computer and information sciences; Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
DOI: 10.48550/arxiv.2402.13463 Publication Date: 2024-02-20
ABSTRACT
The application scope of large language models (LLMs) is increasingly expanding. In practical use, users might provide feedback based on the model's output, hoping for a responsive model that can complete responses according to their feedback. Whether the model can appropriately respond to users' refuting feedback and consistently follow through with execution has not been thoroughly analyzed. In light of this, this paper proposes a comprehensive benchmark, RefuteBench, covering tasks such as question answering, machine translation, and email writing. The evaluation aims to assess whether models can positively accept feedback in the form of refuting instructions and whether they can consistently adhere to user demands throughout the conversation. We conduct evaluations on numerous LLMs and find that they are stubborn, i.e., they exhibit an inclination toward their internal knowledge and often fail to comply with user feedback. Additionally, as the length of the conversation increases, models gradually forget the user's previously stated feedback and roll back to their own responses. We further propose recall-and-repeat prompts as a simple and effective way to enhance the model's responsiveness to feedback.
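
The sketch below illustrates one plausible reading of a "recall-and-repeat" prompt: before each new user turn, the previously issued refuting instructions are recalled and the model is asked to repeat and follow them, countering the tendency to roll back to its own answers in long conversations. The function name, reminder wording, and example dialogue are illustrative assumptions, not the prompt used in the paper.

```python
# Minimal sketch of a recall-and-repeat style prompt (assumed wording).
from typing import Dict, List

Message = Dict[str, str]  # {"role": "user" | "assistant", "content": ...}


def apply_recall_and_repeat(history: List[Message],
                            feedback_instructions: List[str],
                            new_user_turn: str) -> List[Message]:
    """Append the next user turn, prefixed with a reminder of earlier feedback.

    The reminder recalls every refuting instruction the user has issued so far
    and asks the model to repeat and obey them before answering.
    """
    if feedback_instructions:
        recall = (
            "Before answering, recall and repeat the constraints I gave you "
            "earlier, then follow all of them:\n"
            + "\n".join(f"- {inst}" for inst in feedback_instructions)
        )
        content = f"{recall}\n\n{new_user_turn}"
    else:
        content = new_user_turn
    return history + [{"role": "user", "content": content}]


if __name__ == "__main__":
    # Example: a translation task where the user previously refuted a word choice.
    history = [
        {"role": "user", "content": "Translate to German: 'The cat sat on the mat.'"},
        {"role": "assistant", "content": "Die Katze sass auf der Matte."},
        {"role": "user", "content": "Please translate 'mat' as 'Teppich' from now on."},
        {"role": "assistant", "content": "Understood: Die Katze sass auf dem Teppich."},
    ]
    turn = apply_recall_and_repeat(
        history,
        feedback_instructions=["Translate 'mat' as 'Teppich'."],
        new_user_turn="Now translate: 'The dog slept on the mat.'",
    )
    print(turn[-1]["content"])
```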