Vision-Language Consistency Guided Multi-modal Prompt Learning for Blind AI Generated Image Quality Assessment

FOS: Computer and information sciences; Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
DOI: 10.48550/arxiv.2406.16641 Publication Date: 2024-06-24
ABSTRACT
Recently, textual prompt tuning has shown inspirational performance in adapting Contrastive Language-Image Pre-training (CLIP) models to natural image quality assessment. However, such a uni-modal prompt learning method only tunes the language branch of CLIP models. This is not enough for AI-generated image quality assessment (AGIQA), since AI-generated images (AGIs) visually differ from natural images. In addition, the consistency between AGIs and user input text prompts, which correlates with the perceptual quality of AGIs, has not been investigated to guide AGIQA. In this letter, we propose vision-language consistency guided multi-modal prompt learning for blind AGIQA, dubbed CLIP-AGIQA. Specifically, we introduce learnable textual and visual prompts in the language and vision branches of CLIP models, respectively. Moreover, we design a text-to-image alignment quality prediction task, whose learned vision-language consistency knowledge is used to guide the optimization of the above multi-modal prompts. Experimental results on two public AGIQA datasets demonstrate that the proposed method outperforms state-of-the-art quality assessment models. The source code is available at https://github.com/JunFu1995/CLIP-AGIQA.
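As a rough illustration of the multi-modal prompt-tuning idea summarized above (not the authors' released CLIP-AGIQA implementation), the sketch below prepends learnable prompt tokens to both the text-token and image-patch sequences of a frozen CLIP-style backbone and scores an image from the image-text cosine similarity. The backbone interface (`encode_image_tokens`, `encode_text_tokens`), the toy encoder, and all dimensions are assumptions made for the example.

```python
# Minimal sketch of multi-modal (textual + visual) prompt tuning on a
# frozen CLIP-style backbone. Illustrative only; encoder interfaces and
# dimensions are assumed, not taken from the CLIP-AGIQA code base.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyBackbone(nn.Module):
    """Stand-in for a frozen CLIP-like backbone (hypothetical interface)."""
    def __init__(self, text_dim=512, vis_dim=768, embed_dim=256):
        super().__init__()
        self.txt_proj = nn.Linear(text_dim, embed_dim)
        self.img_proj = nn.Linear(vis_dim, embed_dim)

    def encode_text_tokens(self, tok):   # (B, L, text_dim) -> (B, embed_dim)
        return self.txt_proj(tok.mean(dim=1))

    def encode_image_tokens(self, tok):  # (B, N, vis_dim) -> (B, embed_dim)
        return self.img_proj(tok.mean(dim=1))


class MultiModalPromptLearner(nn.Module):
    def __init__(self, backbone, n_text_ctx=8, n_vis_ctx=8,
                 text_dim=512, vis_dim=768):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)       # backbone stays frozen; only prompts are tuned
        # Learnable context tokens for the language branch.
        self.text_ctx = nn.Parameter(torch.randn(n_text_ctx, text_dim) * 0.02)
        # Learnable prompt tokens for the vision branch.
        self.vis_ctx = nn.Parameter(torch.randn(n_vis_ctx, vis_dim) * 0.02)

    def forward(self, patch_tokens, word_tokens):
        # patch_tokens: (B, N_patches, vis_dim) image patch embeddings
        # word_tokens:  (B, N_words, text_dim) embedded quality-prompt text
        b = patch_tokens.size(0)
        vis_in = torch.cat(
            [self.vis_ctx.unsqueeze(0).expand(b, -1, -1), patch_tokens], dim=1)
        txt_in = torch.cat(
            [self.text_ctx.unsqueeze(0).expand(b, -1, -1), word_tokens], dim=1)
        img_feat = F.normalize(self.backbone.encode_image_tokens(vis_in), dim=-1)
        txt_feat = F.normalize(self.backbone.encode_text_tokens(txt_in), dim=-1)
        # Image-text cosine similarity serves as the (unnormalized) quality score.
        return (img_feat * txt_feat).sum(dim=-1)


if __name__ == "__main__":
    model = MultiModalPromptLearner(ToyBackbone())
    scores = model(torch.randn(2, 196, 768), torch.randn(2, 16, 512))
    print(scores.shape)  # torch.Size([2])
```

In such a setup only the prompt parameters would be optimized (e.g., against mean opinion scores, with an auxiliary text-to-image alignment objective as the abstract describes), while the pre-trained backbone weights remain untouched.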