Understanding Multimodal Contrastive Learning Through Pointwise Mutual Information

Pointwise mutual information
DOI: 10.48550/arXiv.2404.19228 | Publication Date: 2024-04-29
ABSTRACT
Multimodal representation learning, which integrates different modalities such as text, vision, and audio, is important for real-world applications. The symmetric InfoNCE loss proposed in CLIP is a key concept in multimodal learning. In this work, we provide a theoretical understanding of the symmetric InfoNCE loss through the lens of pointwise mutual information, and we show that, under mild assumptions, encoders achieving the optimal similarity in pretraining perform well on downstream classification tasks. Based on our theoretical results, we also propose a new similarity metric for multimodal contrastive learning that utilizes a nonlinear kernel to enrich its capability. To verify the effectiveness of the proposed method, we pretrain models on the Conceptual Captions datasets and evaluate zero-shot classification and linear probing on common benchmark datasets.
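For reference, the sketch below shows the symmetric InfoNCE loss from CLIP that the abstract builds on, which the paper analyzes through pointwise mutual information, pmi(x, y) = log p(x, y) / (p(x) p(y)). This is a minimal NumPy illustration, not the paper's code: the function name symmetric_info_nce and the temperature default of 0.07 are illustrative assumptions.

```python
import numpy as np

def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss as used in CLIP.

    image_emb, text_emb: (n, d) arrays of paired embeddings,
    where row i of each array comes from the same image-text pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature  # (n, n) similarity matrix
    n = logits.shape[0]

    def cross_entropy(l):
        # Row-wise log-softmax; the matched pairs sit on the diagonal.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))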
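```

Minimizing this loss over batches of paired embeddings pushes matched (diagonal) pairs to score higher than all mismatched pairs in both retrieval directions, which is the pretraining objective whose optimal similarity the paper characterizes.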