Evaluating the Performance of Large Language Models in Identifying Human Facial Emotions: GPT-4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet

DOI: 10.31234/osf.io/pxq5h_v1 Publication Date: 2025-04-01T21:01:54Z
ABSTRACT
Background. Evaluating the social and emotional capabilities of large language models (LLMs), such as their ability to recognize human facial emotions, is critical as their role in human-computer interactions (HCIs) expands, particularly in healthcare applications. Facial expressions convey affective and clinical information useful for detecting emotions, contextualizing language, understanding interpersonal dynamics, and identifying potential mental health and neurocognitive disorders. However, whether LLMs can accurately interpret facial expressions remains unclear.

Methods. We evaluated the agreement and accuracy of three leading LLMs, GPT-4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet, using the NimStim dataset, a benchmark of 672 facial expression photographs (calm, angry, happy, fear, sad, neutral, surprise, disgust) from 43 diverse actors, yielding 2,016 model-based estimates.

Results. All models demonstrated substantial to almost perfect agreement with ground truth labels. Happy expressions had the highest agreement, while fear had the lowest, due to high misclassification as surprise. GPT-4o achieved the highest accuracy, with a 95% CI lower bound exceeding 0.80; the other models performed more poorly. There were no significant differences in performance as a function of actor sex or race. The models reached human performance levels in overall recognition, surpassing human calm/neutral and surprise recognition, while humans surpassed the models on some expressions.

Conclusion. As GenAI systems increasingly mediate HCI and expand into healthcare, evaluating LLMs' socioemotional comprehension is crucial. This study found that LLMs perform strongly relative to, and comparably with, human judges in recognizing prototypical facial expressions, with GPT-4o showing especially strong performance. This work lays the groundwork for such evaluations and highlights the need to address existing gaps before safe application in future healthcare settings.
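To make the evaluation protocol concrete, the following is a minimal sketch, not the authors' code, of how such a study could be run: each NimStim image is shown to a vision-capable LLM with a forced-choice prompt over the eight emotion labels, and the model's answers are scored against ground truth with accuracy, a 95% confidence interval, and Cohen's kappa. The API wrapper `classify_image` and the dataset loader are hypothetical placeholders; the authors' exact prompt, agreement statistic, and CI method are not specified in the abstract, so the Wilson interval and unweighted kappa here are illustrative choices.

```python
import math
from collections import Counter

EMOTIONS = ["calm", "angry", "happy", "fear", "sad", "neutral", "surprise", "disgust"]

def classify_image(model_name: str, image_bytes: bytes) -> str:
    """Hypothetical wrapper around a vision-LLM API (e.g., GPT-4o).

    In practice this would send the image with a forced-choice prompt such as:
    'Which one of these emotions does this face express: calm, angry, happy,
    fear, sad, neutral, surprise, disgust? Answer with one word.'
    """
    raise NotImplementedError("replace with a real API call")

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (accuracy)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def cohens_kappa(truth: list[str], pred: list[str]) -> float:
    """Chance-corrected agreement between model labels and ground truth."""
    n = len(truth)
    p_obs = sum(t == p for t, p in zip(truth, pred)) / n
    t_counts, p_counts = Counter(truth), Counter(pred)
    p_exp = sum(t_counts[c] * p_counts[c] for c in EMOTIONS) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

def evaluate(model_name: str, dataset: list[tuple[bytes, str]]) -> dict:
    """dataset: (image_bytes, ground_truth_label) pairs, e.g., 672 NimStim images."""
    truth, pred = [], []
    for image, label in dataset:
        truth.append(label)
        pred.append(classify_image(model_name, image))
    correct = sum(t == p for t, p in zip(truth, pred))
    lo, hi = wilson_ci(correct, len(truth))
    return {
        "model": model_name,
        "accuracy": correct / len(truth),
        "accuracy_95ci": (lo, hi),
        "kappa": cohens_kappa(truth, pred),
    }
```

Running `evaluate` once per model over the 672 images would produce the 2,016 model-based estimates the abstract reports. The phrases "substantial" and "almost perfect" agreement echo the conventional Landis and Koch benchmarks for kappa (0.61–0.80 and 0.81–1.00, respectively), though the abstract does not state which benchmark the authors applied.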
SUPPLEMENTAL MATERIAL
Coming soon ....