- Handwritten Text Recognition Techniques
- Natural Language Processing Techniques
- Image Processing and 3D Reconstruction
- Speech Recognition and Synthesis
- Multimodal Machine Learning Applications
- Hand Gesture Recognition Systems
- Advanced Neural Network Applications
- Image Retrieval and Classification Techniques
- Video Analysis and Summarization
- Web Data Mining and Analysis
- Topic Modeling
- Vehicle License Plate Recognition
- Fiscal Policy and Economic Growth
- Higher Education Governance and Development
- Domain Adaptation and Few-Shot Learning
- Soil and Environmental Studies
- Agricultural Risk and Resilience
- Allelopathy and Phytotoxic Interactions
- Environmental Sustainability and Technology
- Interactive and Immersive Displays
- Machine Learning and ELM
- Legal Education and Practice Innovations
- Remote Sensing in Agriculture
- Advanced Image and Video Retrieval Techniques
- Public Health and Occupational Medicine
South China University of Technology
2022-2025
Jilin University
2024
Hunan University
2008-2012
Wuhan University of Technology
2007-2008
Chongqing Technology and Business University
2004
End-to-end scene text spotting has attracted great attention in recent years due to the success of excavating the intrinsic synergy between detection and recognition. However, state-of-the-art methods usually incorporate recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end framework termed SwinTextSpotter. Using a transformer encoder with a dynamic head as the detector, we unify the two tasks with a novel Recognition...
Existing scene text spotting (i.e., end-to-end detection and recognition) methods rely on costly bounding box annotations (e.g., text-line, word-level, or character-level boxes). For the first time, we demonstrate that training such models can be achieved with an extremely low-cost annotation of a single point for each instance. We propose a method that tackles text spotting as a sequence prediction task. Given an image as input, we formulate the desired recognition results as discrete tokens and use an auto-regressive Transformer to predict...
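The sequence-prediction formulation above can be illustrated with a minimal sketch of how a single-point annotation and its transcription might be serialized into discrete tokens. The bin count, character set, and token layout here are illustrative assumptions, not the paper's actual vocabulary.

```python
# Hypothetical serialization of point-based spotting targets into discrete
# tokens, in the spirit of sequence-prediction spotters. NUM_BINS, CHARSET,
# and the token offsets are illustrative assumptions only.

NUM_BINS = 1000                      # coordinate quantization bins
CHARSET = "abcdefghijklmnopqrstuvwxyz"
CHAR_OFFSET = NUM_BINS               # character tokens follow coordinate bins
EOS = CHAR_OFFSET + len(CHARSET)     # end-of-sequence token

def serialize_instance(x, y, text, img_w, img_h):
    """Quantize a single-point location, then append transcription tokens."""
    tokens = [
        int(x / img_w * (NUM_BINS - 1)),   # x bin in [0, NUM_BINS)
        int(y / img_h * (NUM_BINS - 1)),   # y bin in [0, NUM_BINS)
    ]
    tokens += [CHAR_OFFSET + CHARSET.index(c) for c in text.lower()]
    return tokens

# One text instance centered at (320, 240) reading "stop" in a 640x480 image:
seq = serialize_instance(320, 240, "stop", 640, 480) + [EOS]
```

An auto-regressive Transformer would then be trained to emit such a sequence token by token, with EOS terminating decoding.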
End-to-end scene text spotting has made significant progress due to its intrinsic synergy between detection and recognition. Previous methods commonly regard manual annotations such as horizontal rectangles, rotated quadrangles, and polygons as a prerequisite, which are much more expensive than using a single point. Our new framework, SPTS v2, allows us to train high-performing text-spotting models using a single-point annotation. SPTS v2 reserves the advantage of the auto-regressive Transformer with an Instance...
Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness on text-related visual tasks remains relatively unexplored. In this paper, we conduct a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, on various tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression...
In recent years, end-to-end scene text spotting approaches have been evolving towards the Transformer-based framework. While previous studies have shown the crucial importance of the intrinsic synergy between detection and recognition, recent advances usually adopt an implicit synergy strategy with a shared query, which cannot fully realize the potential of these two interactive tasks. In this paper, we argue that explicitly considering the distinct characteristics of detection and recognition can significantly improve the performance of text spotting. To this end,...
Text spotting, a task involving the extraction of textual information from image or video sequences, faces challenges in cross-domain adaption, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Typically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for...
End-to-end scene text spotting, which aims to read the text in natural images, has garnered significant attention in recent years. However, state-of-the-art methods usually incorporate detection and recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end spotting framework termed SwinTextSpotter v2, which seeks to find a better synergy between detection and recognition. Specifically, we enhance the relationship between the two tasks using a novel...
Fine-tuning pre-trained Vision Transformers (ViT) has consistently demonstrated promising performance in the realm of visual recognition. However, adapting large pre-trained models to various tasks poses a significant challenge. This challenge arises from the need for each model to undergo an independent and comprehensive fine-tuning process, leading to substantial computational and memory demands. While recent advancements in Parameter-efficient Transfer Learning (PETL) have demonstrated their ability to achieve performance superior to full...
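The parameter-efficiency argument above can be made concrete with a minimal NumPy sketch of one widely used PETL technique, a LoRA-style low-rank update (named here as a generic illustration, not as this paper's specific method): the frozen weight W is augmented by a trainable rank-r product, so only r·(d_in + d_out) parameters are tuned instead of d_in·d_out. The dimensions and rank are illustrative assumptions.

```python
# Minimal sketch of a LoRA-style low-rank update, a common PETL technique.
# Dimensions and rank are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 768, 768, 8

W = rng.standard_normal((d_out, d_in))     # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # zero-init: training starts at W

def forward(x):
    # Frozen path plus trainable low-rank correction B @ (A @ x).
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
assert np.allclose(forward(x), W @ x)      # identical to frozen model at init

trainable = r * (d_in + d_out)             # 12,288 vs. 589,824 full weights
```

Only A and B receive gradients, which is what keeps per-task memory and storage small relative to full fine-tuning.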
Modularity plays a crucial role in the development and maintenance of complex systems. While end-to-end text spotting efficiently mitigates the issues of error accumulation and sub-optimal performance seen in traditional two-step methodologies, two-step methods continue to be favored in many competitions and practical settings due to their superior modularity. In this paper, we introduce Bridging Text Spotting, a novel approach that resolves the error accumulation and suboptimal performance issues while retaining modularity. To achieve this, we adopt a well-trained detector and recognizer...
Recently, scaling images to high resolution has received much attention in multimodal large language models (MLLMs). Most existing practices adopt a sliding-window-style cropping strategy to adapt to the resolution increase. Such a cropping strategy, however, can easily cut off objects and connected regions, which introduces semantic discontinuity and therefore impedes MLLMs from recognizing small or irregularly shaped text, leading to a phenomenon we call the sawtooth effect. This effect is particularly evident in lightweight MLLMs...
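A minimal sketch makes the failure mode above easy to see: a fixed-tile sliding-window crop places hard boundaries across the image, and any text spanning a boundary is severed between tiles. The tile size below is an illustrative assumption, not the paper's configuration.

```python
# Illustrative sliding-window cropping of a high-resolution image into
# fixed-size tiles; tile boundaries can sever text and connected regions,
# producing the semantic discontinuity described above. TILE is an assumption.

TILE = 448  # hypothetical vision-encoder input resolution

def crop_boxes(width, height, tile=TILE):
    """Return (left, top, right, bottom) tile boxes covering the image."""
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top,
                          min(left + tile, width),
                          min(top + tile, height)))
    return boxes

# A word occupying x in [400, 500) of an 896x448 image is split across
# the boundary at x = 448, landing partially in each of the two tiles:
boxes = crop_boxes(896, 448)
```

Strategies that merge tiles adaptively or overlap them aim to avoid exactly these hard cuts.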
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most...
The advancement of text shape representations towards compactness has enhanced detection and spotting performance, but at a high annotation cost. Current models use single-point annotations to reduce costs, yet they lack sufficient localization information for downstream applications. To overcome this limitation, we introduce Point2Polygon, which can efficiently transform single points into compact polygons. Our method uses a coarse-to-fine process, starting with creating and selecting anchor...