VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

FOS: Computer and information sciences; Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
DOI: 10.48550/arxiv.1707.05612 Publication Date: 2017-01-01
ABSTRACT
Accepted as a spotlight presentation at the British Machine Vision Conference (BMVC) 2018. Code: https://github.com/fartashf/vsepp

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. This change, combined with fine-tuning and the use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on the MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO, our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
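The "simple change" the abstract refers to is replacing the sum over all in-batch negatives with the single hardest negative in the triplet ranking loss. A minimal NumPy sketch of that max-of-hinges loss, assuming L2-normalized image and caption embeddings where row i of each matrix is a matching pair (function and parameter names are illustrative, not from the paper's code):

```python
import numpy as np

def max_hinge_loss(im, cap, margin=0.2):
    """Triplet ranking loss using only the hardest in-batch negative.

    im, cap: L2-normalized embeddings of shape (batch, dim);
    row i of `im` matches row i of `cap`.
    """
    scores = im @ cap.T                 # cosine similarity for every pair
    pos = np.diag(scores)               # scores of the matching pairs
    # hinge cost of each negative, for both retrieval directions
    cost_cap = np.maximum(0.0, margin + scores - pos[:, None])  # image -> caption
    cost_im = np.maximum(0.0, margin + scores - pos[None, :])   # caption -> image
    # exclude the positive pair itself from the negatives
    mask = np.eye(scores.shape[0], dtype=bool)
    cost_cap[mask] = 0.0
    cost_im[mask] = 0.0
    # the VSE++ change: keep only the hardest negative per query
    # (a sum over all negatives here recovers the baseline loss)
    return cost_cap.max(axis=1).sum() + cost_im.max(axis=0).sum()
```

With perfectly aligned embeddings the loss is zero; negatives within the margin of a positive contribute only their single largest violation per query.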