Multimodal Pre-Training of Vision Models Yields Better Embeddings for Visual Art

DOI: 10.5617/dhnbpub.12286 | Publication Date: 2025-03-03
ABSTRACT
Deep pre-trained vision models provide automated ways of analyzing large digitized corpora of visual art. Central to the success of these models is their ability to extract rich embeddings for downstream tasks such as style classification or painting retrieval. Recent results suggest that multimodal models trained on a combination of visual and linguistic input yield semantically enhanced and higher-quality representations of images compared to unimodal vision models. While these multimodal models seem extremely promising for the computational study of visual art, where semantic knowledge may be necessary to produce informative representations of artworks, research benchmarking multimodal models for feature extraction in the art domain is limited, and their potential for representing visual art remains to be explored. This paper aims to fill this gap by comparing the representational abilities of seven unimodal and multimodal state-of-the-art pre-trained vision models, employing their embeddings in three domain-specific downstream tasks: genre classification, style classification, and artist classification on the WikiArt dataset. Results reveal that multimodal models outperform unimodal models as feature extractors for artworks. We hypothesize that pre-training on natural language descriptions provides multimodal models with an enhanced ability to infer global semantic representations of an image, which is beneficial for identifying key characteristics of artworks such as their genre, author, and style.
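To illustrate the evaluation protocol described in the abstract, the sketch below shows one common way to use a frozen pre-trained model as a feature extractor and probe its embeddings on a downstream classification task. The abstract does not name the seven models or the exact probing setup, so this is a minimal sketch under assumptions: CLIP (via the Hugging Face `transformers` library) stands in for a multimodal encoder, and a logistic-regression linear probe stands in for the downstream classifier; the WikiArt data loading is omitted.

```python
# Hypothetical sketch: linear-probe evaluation of a frozen multimodal image encoder.
# CLIP is used purely as an illustrative multimodal model; the paper's abstract
# does not specify which models or probing protocol were used.
import torch
from transformers import CLIPModel, CLIPProcessor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed(images, batch_size=32):
    """Return L2-normalized image embeddings for a list of PIL images."""
    feats = []
    for i in range(0, len(images), batch_size):
        inputs = processor(images=images[i:i + batch_size], return_tensors="pt").to(device)
        f = model.get_image_features(**inputs)
        feats.append(torch.nn.functional.normalize(f, dim=-1).cpu())
    return torch.cat(feats).numpy()

def linear_probe(train_imgs, train_labels, test_imgs, test_labels):
    """Fit a logistic-regression probe on frozen embeddings and report test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embed(train_imgs), train_labels)
    return accuracy_score(test_labels, clf.predict(embed(test_imgs)))

# Usage (WikiArt loading not shown; any lists of PIL images and integer labels work):
# acc = linear_probe(train_imgs, train_genre_labels, test_imgs, test_genre_labels)
```

The same probe can be rerun with genre, style, or artist labels, and with different frozen encoders, to compare embedding quality across models as the paper does.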