The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

FOS: Computer and information sciences; Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
DOI: 10.48550/arxiv.2309.17421
Publication Date: 2023-01-01
ABSTRACT
Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, containing test samples to probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and the effective ways to prompt the model. In our approach to exploring GPT-4V, we curate and organize a collection of carefully designed qualitative samples spanning a variety of domains and tasks. Observations from these samples demonstrate that GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system. Furthermore, GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods such as visual referring prompting. We conclude the report with in-depth discussions on the emerging application scenarios and the future research directions for GPT-4V-based systems. We hope that this preliminary exploration will inspire future research on the next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and gaining better understanding of multimodal foundation models. Finally, we acknowledge that the model under study is solely the product of OpenAI's innovative work, and they should be fully credited for its development. Please see the GPT-4V contributions paper for the authorship and credit attribution: https://cdn.openai.com/contributions/gpt-4v.pdf