Evaluating the Performance of Large Language Models in Identifying Human Facial Emotions: GPT-4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet

DOI: 10.31234/osf.io/pxq5h_v1 Publication Date: 2025-04-01T21:01:54Z
ABSTRACT
Background. Evaluating the social and emotional capabilities of large language models (LLMs), such as their ability to recognize human facial emotions, is critical as their role in human-computer interactions (HCIs) expands, particularly in healthcare applications. Facial expressions convey affective and clinical information useful for detecting emotions, contextualizing language, understanding interpersonal dynamics, and identifying potential mental health and neurocognitive disorders. However, whether LLMs can accurately interpret facial expressions remains unclear.

Methods. We evaluated the agreement and accuracy of three leading LLMs, GPT-4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet, using the NimStim dataset, a benchmark of 672 facial expression photographs (calm, angry, happy, fear, sad, neutral, surprise, disgust) from 43 diverse actors, yielding 2,016 model-based estimates.

Results. All models demonstrated substantial to almost perfect agreement with ground truth labels. Happy expressions had the highest agreement, while fear had the lowest, due to high misclassification as surprise. GPT-4o achieved the highest accuracy, with a 95% CI lower bound exceeding 0.80; the other models performed more poorly. There were no significant differences in performance as a function of actor sex or race. The models reached human performance levels in overall recognition, surpassing human calm/neutral and surprise recognition, while humans surpassed the models on some expressions.

Conclusion. As GenAI systems increasingly mediate HCI and expand into healthcare, evaluating LLMs' socioemotional comprehension is crucial. This study found that LLMs perform strongly relative to, and comparably with, human judges in recognizing prototypical facial expressions, with GPT-4o showing especially strong performance. This work lays the groundwork for such evaluations and highlights the need to address existing gaps before safe application in future healthcare settings.
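To make the evaluation protocol concrete, the following is a minimal sketch, not the authors' code, of how such a study could be run: each NimStim image is shown to a vision-capable LLM with a forced-choice prompt over the eight emotion labels, and the model's answers are scored against ground truth with accuracy, a 95% confidence interval, and Cohen's kappa. The API wrapper `classify_image` and the dataset loader are hypothetical placeholders; the authors' exact prompt, agreement statistic, and CI method are not specified in the abstract, so the Wilson interval and unweighted kappa here are illustrative choices.

```python
import math
from collections import Counter

EMOTIONS = ["calm", "angry", "happy", "fear", "sad", "neutral", "surprise", "disgust"]

def classify_image(model_name: str, image_bytes: bytes) -> str:
    """Hypothetical wrapper around a vision-LLM API (e.g., GPT-4o).

    In practice this would send the image with a forced-choice prompt such as:
    'Which one of these emotions does this face express: calm, angry, happy,
    fear, sad, neutral, surprise, disgust? Answer with one word.'
    """
    raise NotImplementedError("replace with a real API call")

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (accuracy)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def cohens_kappa(truth: list[str], pred: list[str]) -> float:
    """Chance-corrected agreement between model labels and ground truth."""
    n = len(truth)
    p_obs = sum(t == p for t, p in zip(truth, pred)) / n
    t_counts, p_counts = Counter(truth), Counter(pred)
    p_exp = sum(t_counts[c] * p_counts[c] for c in EMOTIONS) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

def evaluate(model_name: str, dataset: list[tuple[bytes, str]]) -> dict:
    """dataset: (image_bytes, ground_truth_label) pairs, e.g., 672 NimStim images."""
    truth, pred = [], []
    for image, label in dataset:
        truth.append(label)
        pred.append(classify_image(model_name, image))
    correct = sum(t == p for t, p in zip(truth, pred))
    lo, hi = wilson_ci(correct, len(truth))
    return {
        "model": model_name,
        "accuracy": correct / len(truth),
        "accuracy_95ci": (lo, hi),
        "kappa": cohens_kappa(truth, pred),
    }
```

Running `evaluate` once per model over the 672 images would produce the 2,016 model-based estimates the abstract reports. The phrases "substantial" and "almost perfect" agreement echo the conventional Landis and Koch benchmarks for kappa (0.61–0.80 and 0.81–1.00, respectively), though the abstract does not state which benchmark the authors applied.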
SUPPLEMENTAL MATERIAL
Coming soon ....