NFDI4DS | UHH-SEMS - Publication Details

RankCLIP: Ranking-Consistent Language-Image Pretraining

FOS: Computer and information sciences; Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

DOI: 10.48550/arxiv.2404.09387

Publication Date: 2024-01-01

AUTHORS (6)

Zhang, Yiming

Zhao, Zhuokai

Chen, Zhaorun

Feng, Zhili

Ding, Zenghui

Sun, Yining

ABSTRACT

Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RankCLIP, a novel pre-training method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to a list-wise loss, and by leveraging both in-modal and cross-modal ranking consistency, RankCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RankCLIP in various downstream tasks, notably achieving significant gains in zero-shot classification over state-of-the-art methods, underscoring the importance of this enhanced learning process.

Code and model checkpoints are available at https://github.com/Jam1ezhang/RankCLIP
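The contrast between a pair-wise and a list-wise objective can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked in the abstract for that). It assumes a Plackett-Luce style list-wise likelihood as the ranking-consistency term. All function names, the temperature value, and the choice of which similarity rows act as scores and which as the reference ranking are hypothetical and chosen only for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sim_matrix(xs, ys):
    """Similarity of every x to every y (inputs assumed L2-normalized)."""
    return [[dot(x, y) for y in ys] for x in xs]

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def pairwise_clip_loss(img, txt, temperature=0.07):
    """Symmetric contrastive loss: only the matched (i, i) pairs are positives.

    The temperature default is illustrative, not taken from the paper.
    """
    logits = [[s / temperature for s in row] for row in sim_matrix(img, txt)]
    n = len(logits)
    i2t = sum(log_sum_exp(logits[i]) - logits[i][i] for i in range(n))
    t2i = sum(
        log_sum_exp([logits[j][i] for j in range(n)]) - logits[i][i]
        for i in range(n)
    )
    return (i2t + t2i) / (2 * n)

def plackett_luce_nll(scores, reference):
    """List-wise term: negative log-likelihood, under a Plackett-Luce model
    parameterized by `scores`, of the ordering obtained by sorting
    `reference` in descending order."""
    order = sorted(range(len(reference)), key=lambda k: reference[k], reverse=True)
    ranked = [scores[k] for k in order]
    return sum(log_sum_exp(ranked[k:]) - ranked[k] for k in range(len(ranked)))

def ranking_consistency_loss(score_rows, reference_rows):
    """Average list-wise loss over all rows of two similarity matrices."""
    total = sum(plackett_luce_nll(s, r) for s, r in zip(score_rows, reference_rows))
    return total / len(score_rows)

if __name__ == "__main__":
    # Toy, already-normalized 2-D embeddings (made-up numbers).
    img = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
    txt = [[0.96, 0.28], [0.6, 0.8], [0.28, 0.96]]

    # Assumed pairing for in-modal consistency: image-image rankings
    # should agree with text-text rankings.
    in_modal = ranking_consistency_loss(sim_matrix(img, img), sim_matrix(txt, txt))
    # Assumed pairing for cross-modal consistency: image-to-text rankings
    # should agree with text-to-image rankings.
    cross_modal = ranking_consistency_loss(sim_matrix(img, txt), sim_matrix(txt, img))

    print(pairwise_clip_loss(img, txt), in_modal, cross_modal)
```

The pair-wise loss only rewards the diagonal matches and treats every other sample in the batch as an equally wrong negative. The list-wise term instead penalizes any disagreement in the full ordering of similarities, which is one way to express the many-to-many relationships the abstract refers to.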
