TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model

DOI: 10.48550/arxiv.2403.10047
Publication Date: 2024-03-15
ABSTRACT
Existing scene text spotters are designed to locate and transcribe texts from images. However, it is challenging for a spotter to achieve precise detection and recognition of scene texts simultaneously. Inspired by the glimpse-focus spotting pipeline of human beings and the impressive performance of Pre-trained Language Models (PLMs) on visual tasks, we ask: 1) "Can machines spot texts without precise detection, just like human beings?", and if yes, 2) "Is the text block another alternative spotting unit other than the word or character?" To this end, our proposed spotter leverages advanced PLMs to enhance performance without fine-grained detection. Specifically, we first use a simple detector for block-level text detection to obtain rough positional information. Then, we finetune a PLM using a large-scale OCR dataset to achieve accurate recognition. Benefiting from the comprehensive language knowledge gained during the pre-training phase, the PLM-based recognition module effectively handles complex scenarios, including multi-line, reversed, occluded, and incomplete-detection texts. Taking advantage of the fine-tuned language model and the paradigm of block-level detection, our method demonstrates superior performance across multiple public benchmarks in extensive experiments. Additionally, we attempt to spot texts directly from an entire image to explore the potential of PLMs, even Large Language Models (LLMs).
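The two-stage "glimpse-focus" pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `detect_blocks` and `plm_recognize` are hypothetical stand-ins for the simple block-level detector and the fine-tuned PLM recognizer, respectively.

```python
# Hypothetical sketch of the glimpse-focus spotting pipeline:
# (1) glimpse: a simple detector yields rough block-level boxes;
# (2) focus: a fine-tuned PLM transcribes each block.
# The detector and recognizer below are stubs, not the paper's models.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TextBlock:
    bbox: Tuple[int, int, int, int]  # coarse block box (x1, y1, x2, y2)
    crop: str                        # stand-in for the cropped image region

def detect_blocks(image: str) -> List[TextBlock]:
    """Glimpse step: return coarse block-level regions, with no precise
    word- or character-level localization."""
    # Stub: treat the whole input as a single text block.
    return [TextBlock(bbox=(0, 0, 100, 40), crop=image)]

def plm_recognize(block: TextBlock) -> str:
    """Focus step: a PLM fine-tuned on large-scale OCR data transcribes
    the block, tolerating multi-line, reversed, occluded, or loosely
    cropped text."""
    # Stub: normalize whitespace; a real system would run the PLM here.
    return " ".join(block.crop.split())

def spot(image: str) -> List[str]:
    """End-to-end spotting without fine-grained detection."""
    return [plm_recognize(b) for b in detect_blocks(image)]

print(spot("  HELLO\n  WORLD  "))  # → ['HELLO WORLD']
```

The key design point mirrored here is that recognition quality does not hinge on precise box geometry: the recognizer is expected to absorb detection imprecision, which is what the language prior of the PLM is claimed to enable.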