Can Vision Language Models Follow Human Gaze?
DOI:
10.31219/osf.io/c9xvn_v1
Publication Date:
2025-04-22T18:14:10Z
AUTHORS (10)
ABSTRACT
Gaze understanding is suggested as a precursor to inferring intentions and engaging in joint attention, core capacities for a theory of mind, social learning, and language acquisition. As Vision Language Models (VLMs) become increasingly promising in interactive applications, assessing whether they master this foundational socio-cognitive skill becomes vital. Rather than creating a benchmark, we aim to probe the behavioral features of the underlying gaze understanding. We curated a set of images with systematically controlled difficulty and variability, evaluated 111 VLMs' abilities to infer gaze referents, and analyzed their performance using mixed-effects models. Only 20 VLMs performed above chance, and even their overall accuracy remained low. We further analyzed 4 of these top-tier VLMs and found that their performance declined with increasing task difficulty but varied only slightly with the specific prompt and gazer. While their gaze understanding remains far from mature, the patterns suggest that their inferences are far from mere stochastic parroting. This early progress highlights the need for mechanistic investigations of the underlying emergent inference.
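The kind of analysis the abstract describes (per-model above-chance testing plus modeling accuracy against task difficulty) could be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the column names (model, gazer, difficulty, n_objects, correct) and the file gaze_trials.csv are hypothetical, and a fixed-effects logistic regression is used as a simplified stand-in for the paper's mixed-effects models.

```python
# Hypothetical sketch of an above-chance test and a difficulty analysis
# for per-trial VLM gaze-referent judgments.
import pandas as pd
from scipy.stats import binomtest
import statsmodels.formula.api as smf

# Assumed per-trial table: one row per (VLM, image), with a binary outcome
# 'correct' and the number of candidate referents 'n_objects' in the image.
df = pd.read_csv("gaze_trials.csv")

# 1) Is each VLM above chance? With k candidate referents, chance accuracy
#    is 1/k; averaging n_objects is a simplification when k varies by image.
for model_name, grp in df.groupby("model"):
    chance = 1.0 / grp["n_objects"].mean()
    test = binomtest(k=int(grp["correct"].sum()), n=len(grp),
                     p=chance, alternative="greater")
    print(f"{model_name}: acc={grp['correct'].mean():.2f}, p={test.pvalue:.4f}")

# 2) Does accuracy decline with task difficulty? A plain logistic regression
#    with a gazer covariate; a full mixed-effects model would instead treat
#    gazer (and image) as random effects.
fit = smf.logit("correct ~ difficulty + C(gazer)", data=df).fit()
print(fit.summary())
```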
SUPPLEMENTAL MATERIAL
Coming soon.