NFDI4DS | UHH-SEMS - Publication Details

Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning

FOS: Computer and information sciences Computer Science - Computation and Language Artificial Intelligence (cs.AI) Computer Science - Artificial Intelligence Computer Vision and Pattern Recognition (cs.CV) Computer Science - Computer Vision and Pattern Recognition Computation and Language (cs.CL)

DOI: 10.48550/arxiv.2410.05928 Publication Date: 2024-01-01

Abstract Supplemental Material References Cited by

AUTHORS (5)

Singh, Ayush

Gupta, Mansi

Garg, Shivank

Kumar, Abhinav

Agrawal, Vansh

ABSTRACT

Vision-Language Models (VLMs) have transformed tasks requiring visual and reasoning abilities, such as image retrieval and Visual Question Answering (VQA). Despite their success, VLMs face significant challenges with tasks involving geometric reasoning, algebraic problem-solving, and counting. These limitations stem from difficulties effectively integrating multiple modalities and accurately interpreting geometry-related tasks. Various works claim that introducing a captioning pipeline before VQA tasks enhances performance. We incorporated this pipeline for tasks involving geometry, algebra, and counting. We found that captioning results are not generalizable, specifically with larger VLMs primarily trained on downstream QnA tasks showing random performance on math-related challenges. However, we present a promising alternative: task-based prompting, enriching the prompt with task-specific guidance. This approach shows promise and proves more effective than direct captioning methods for math-heavy problems.

SUPPLEMENTAL MATERIAL

Coming soon ....

REFERENCES ()

CITATIONS ()

EXTERNAL LINKS

OPENAIRE - Products

PlumX Metrics

Beyond Captioning: Task-Specific Prompting for Improved VLM Performance in Mathematical Reasoning

RECOMMENDATIONS

FAIR ASSESSMENT

Coming soon ....

JUPYTER LAB

Coming soon ....