CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model
DOI:
10.48550/arxiv.2305.14014
Publication Date:
2023-01-01
AUTHORS (5)
ABSTRACT
Pre-trained vision-language models (VLMs) are the de-facto foundation models for various downstream tasks. However, scene text recognition (STR) methods still prefer backbones pre-trained on a single modality, namely, the visual modality, despite the potential of VLMs to serve as powerful scene text readers. For example, CLIP can robustly identify regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in images. With such merits, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon the image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully leverage the capabilities of both branches, we design a dual predict-and-refine decoding scheme for inference. We scale CLIP4STR in terms of model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks. Additionally, a comprehensive empirical study is provided to enhance the understanding of the adaptation of CLIP to STR. Our method establishes a strong baseline for future STR research with VLMs.
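To make the dual predict-and-refine decoding scheme described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch. All class and function names (DummyBranch, predict_and_refine) and the dimensions are illustrative assumptions and do not mirror the authors' actual implementation; the point is only the control flow: the visual branch produces an initial prediction, which the cross-modal branch then refines.

```python
# Hypothetical sketch of the dual predict-and-refine decoding flow.
# Names and shapes are illustrative, not the authors' implementation.
from typing import Optional

import torch
import torch.nn as nn


class DummyBranch(nn.Module):
    """Stand-in for an encoder-decoder branch (visual or cross-modal)."""

    def __init__(self, vocab_size: int = 97, d_model: int = 512):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_model)     # placeholder for a CLIP encoder
        self.decoder = nn.Linear(d_model, vocab_size)  # placeholder for a decoder head

    def forward(self, feat: torch.Tensor,
                context: Optional[torch.Tensor] = None) -> torch.Tensor:
        # The cross-modal branch additionally conditions on the visual branch's
        # prediction; here the context is simply added to the encoded feature.
        h = self.encoder(feat)
        if context is not None:
            h = h + context
        return self.decoder(h)  # (B, T, vocab) character logits


def predict_and_refine(image_feat: torch.Tensor,
                       visual_branch: DummyBranch,
                       cross_modal_branch: DummyBranch) -> torch.Tensor:
    """Visual branch makes an initial prediction; cross-modal branch refines it."""
    visual_logits = visual_branch(image_feat)                 # initial prediction
    # Embed the initial prediction so it can condition the refinement step.
    context = visual_logits.softmax(-1) @ cross_modal_branch.decoder.weight
    refined_logits = cross_modal_branch(image_feat, context)  # refined prediction
    return refined_logits


if __name__ == "__main__":
    B, T, D = 2, 25, 512
    feats = torch.randn(B, T, D)  # stand-in for CLIP image-encoder features
    out = predict_and_refine(feats, DummyBranch(), DummyBranch())
    print(out.shape)  # torch.Size([2, 25, 97])
```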