CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model
DOI:
10.48550/arxiv.2305.14014
Publication Date:
2023-01-01
AUTHORS (5)
ABSTRACT
Pre-trained vision-language models (VLMs) are the de-facto foundation models for various downstream tasks. However, scene text recognition (STR) methods still prefer backbones pre-trained on a single modality, namely, the visual modality, despite the potential of VLMs to serve as powerful scene text readers. For example, CLIP can robustly identify regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in images. With such merits, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon the image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully leverage the capabilities of both branches, we design a dual predict-and-refine decoding scheme for inference. We scale CLIP4STR in terms of model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks. Additionally, a comprehensive empirical study is provided to enhance the understanding of the adaptation of CLIP to STR. Our method establishes a strong baseline for future STR research with VLMs.
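To make the dual predict-and-refine decoding scheme described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch. All class and function names (DummyBranch, predict_and_refine) and the dimensions are illustrative assumptions and do not mirror the authors' actual implementation; the point is only the control flow: the visual branch produces an initial prediction, which the cross-modal branch then refines.

```python
# Hypothetical sketch of the dual predict-and-refine decoding flow.
# Names and shapes are illustrative, not the authors' implementation.
from typing import Optional

import torch
import torch.nn as nn


class DummyBranch(nn.Module):
    """Stand-in for an encoder-decoder branch (visual or cross-modal)."""

    def __init__(self, vocab_size: int = 97, d_model: int = 512):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_model)     # placeholder for a CLIP encoder
        self.decoder = nn.Linear(d_model, vocab_size)  # placeholder for a decoder head

    def forward(self, feat: torch.Tensor,
                context: Optional[torch.Tensor] = None) -> torch.Tensor:
        # The cross-modal branch additionally conditions on the visual branch's
        # prediction; here the context is simply added to the encoded feature.
        h = self.encoder(feat)
        if context is not None:
            h = h + context
        return self.decoder(h)  # (B, T, vocab) character logits


def predict_and_refine(image_feat: torch.Tensor,
                       visual_branch: DummyBranch,
                       cross_modal_branch: DummyBranch) -> torch.Tensor:
    """Visual branch makes an initial prediction; cross-modal branch refines it."""
    visual_logits = visual_branch(image_feat)                 # initial prediction
    # Embed the initial prediction so it can condition the refinement step.
    context = visual_logits.softmax(-1) @ cross_modal_branch.decoder.weight
    refined_logits = cross_modal_branch(image_feat, context)  # refined prediction
    return refined_logits


if __name__ == "__main__":
    B, T, D = 2, 25, 512
    feats = torch.randn(B, T, D)  # stand-in for CLIP image-encoder features
    out = predict_and_refine(feats, DummyBranch(), DummyBranch())
    print(out.shape)  # torch.Size([2, 25, 97])
```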