Evaluating Cognitive Maps and Planning in Large Language Models with CogEval

DOI: 10.48550/arxiv.2309.15129 Publication Date: 2023-01-01
ABSTRACT
Recently an influx of studies claims emergent cognitive abilities in large language models (LLMs). Yet, most rely on anecdotes, overlook contamination of training sets, or lack systematic evaluation involving multiple tasks, control conditions, multiple iterations, and statistical robustness tests. Here we make two major contributions. First, we propose CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive capacities in Large Language Models. The CogEval protocol can be followed for the evaluation of various abilities. Second, here we follow CogEval to systematically evaluate cognitive maps and planning ability across eight LLMs (OpenAI GPT-4, GPT-3.5-turbo-175B, davinci-003-175B, Google Bard, Cohere-xlarge-52.4B, Anthropic Claude-1-52B, LLaMA-13B, Alpaca-7B). We base our task prompts on human experiments, which offer both established construct validity for evaluating planning and are absent from LLM training sets. We find that, while LLMs show apparent competence in a few planning tasks with simpler structures, systematic evaluation reveals striking failure modes, including hallucinations of invalid trajectories and getting trapped in loops. These findings do not support the idea of emergent out-of-the-box planning ability in LLMs. This could be because LLMs do not understand the latent relational structures underlying planning problems, known as cognitive maps, and fail at unrolling goal-directed trajectories based on the underlying structure. Implications for application and future directions are discussed.
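To make the described failure modes concrete, the sketch below shows one way a proposed route could be checked against the latent graph of a planning task, flagging hallucinated (invalid) transitions, loops, and missed goals. This is a minimal illustration, not the authors' released evaluation code: the example graph, the room names, and the already-parsed path are all illustrative assumptions.

```python
# Minimal sketch (assumed, not from the paper's codebase): score an
# LLM-proposed trajectory against a known task graph, reporting the
# failure modes mentioned in the abstract.

from collections import Counter

# Hypothetical task graph: node -> set of directly reachable nodes.
GRAPH = {
    "room1": {"room2", "room3"},
    "room2": {"room1", "room4"},
    "room3": {"room1", "room4"},
    "room4": {"room2", "room3", "room5"},
    "room5": {"room4"},
}

def evaluate_trajectory(path, start, goal, graph):
    """Return whether a proposed path is valid, plus the reasons it fails."""
    errors = []
    if not path or path[0] != start:
        errors.append("does not begin at the start state")
    # Hallucinated transitions: consecutive steps that are not edges in the graph.
    for a, b in zip(path, path[1:]):
        if b not in graph.get(a, set()):
            errors.append(f"invalid transition {a} -> {b}")
    # Loop detection: any state visited more than once.
    repeats = [node for node, count in Counter(path).items() if count > 1]
    if repeats:
        errors.append(f"revisits states (possible loop): {repeats}")
    if not path or path[-1] != goal:
        errors.append("does not reach the goal state")
    return {"valid": not errors, "errors": errors}

# Example: an LLM answer already parsed into a list of states (parsing assumed).
proposed = ["room1", "room2", "room4", "room2", "room4", "room5"]
print(evaluate_trajectory(proposed, start="room1", goal="room5", graph=GRAPH))
```

Run over many prompts, task structures, and repetitions, a check of this kind yields the per-task success rates that a protocol like CogEval would then subject to statistical robustness tests.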