- Handwritten Text Recognition Techniques
- Natural Language Processing Techniques
- Image Processing and 3D Reconstruction
- Speech Recognition and Synthesis
- Multimodal Machine Learning Applications
- Hand Gesture Recognition Systems
- Advanced Neural Network Applications
- Image Retrieval and Classification Techniques
- Video Analysis and Summarization
- Web Data Mining and Analysis
- Topic Modeling
- Vehicle License Plate Recognition
- Fiscal Policy and Economic Growth
- Higher Education Governance and Development
- Domain Adaptation and Few-Shot Learning
- Soil and Environmental Studies
- Agricultural Risk and Resilience
- Allelopathy and Phytotoxic Interactions
- Environmental Sustainability and Technology
- Interactive and Immersive Displays
- Machine Learning and ELM
- Legal Education and Practice Innovations
- Remote Sensing in Agriculture
- Advanced Image and Video Retrieval Techniques
- Public Health and Occupational Medicine
South China University of Technology
2022-2025
Jilin University
2024
Hunan University
2008-2012
Wuhan University of Technology
2007-2008
Chongqing Technology and Business University
2004
End-to-end scene text spotting has attracted great attention in recent years due to the success of excavating the intrinsic synergy between detection and recognition. However, state-of-the-art methods usually incorporate recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end framework termed SwinTextSpotter. Using a transformer encoder with a dynamic head as the detector, we unify the two tasks with a novel Recognition...
Existing scene text spotting (i.e., end-to-end detection and recognition) methods rely on costly bounding box annotations (e.g., text-line, word-level, or character-level boxes). For the first time, we demonstrate that training such models can be achieved with an extremely low-cost annotation of a single point for each instance. We propose a method that tackles text spotting as a sequence prediction task. Given an image as input, we formulate the desired recognition results as discrete tokens and use an auto-regressive Transformer to predict...
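The sequence-prediction formulation above can be illustrated with a minimal sketch of how a single-point annotation and its transcription might be serialized into discrete tokens. The bin count, character set, and token layout here are illustrative assumptions, not the paper's actual vocabulary.

```python
# Hypothetical serialization of point-based spotting targets into discrete
# tokens, in the spirit of sequence-prediction spotters. NUM_BINS, CHARSET,
# and the token offsets are illustrative assumptions only.

NUM_BINS = 1000                      # coordinate quantization bins
CHARSET = "abcdefghijklmnopqrstuvwxyz"
CHAR_OFFSET = NUM_BINS               # character tokens follow coordinate bins
EOS = CHAR_OFFSET + len(CHARSET)     # end-of-sequence token

def serialize_instance(x, y, text, img_w, img_h):
    """Quantize a single-point location, then append transcription tokens."""
    tokens = [
        int(x / img_w * (NUM_BINS - 1)),   # x bin in [0, NUM_BINS)
        int(y / img_h * (NUM_BINS - 1)),   # y bin in [0, NUM_BINS)
    ]
    tokens += [CHAR_OFFSET + CHARSET.index(c) for c in text.lower()]
    return tokens

# One text instance centered at (320, 240) reading "stop" in a 640x480 image:
seq = serialize_instance(320, 240, "stop", 640, 480) + [EOS]
```

An auto-regressive Transformer would then be trained to emit such a sequence token by token, with EOS terminating decoding.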
End-to-end scene text spotting has made significant progress due to its intrinsic synergy between detection and recognition. Previous methods commonly regard manual annotations such as horizontal rectangles, rotated quadrangles, and polygons as a prerequisite, which are much more expensive than using a single point. Our new framework, SPTS v2, allows us to train high-performing text-spotting models using a single-point annotation. SPTS v2 reserves the advantage of the auto-regressive Transformer with an Instance...
Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness on text-related visual tasks remains relatively unexplored. In this paper, we conduct a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, on various tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression...
In recent years, end-to-end scene text spotting approaches have been evolving towards the Transformer-based framework. While previous studies have shown the crucial importance of the intrinsic synergy between detection and recognition, recent advances usually adopt an implicit synergy strategy with a shared query, which cannot fully realize the potential of these two interactive tasks. In this paper, we argue that explicitly considering the distinct characteristics of detection and recognition can significantly improve the performance of text spotting. To this end,...
Text spotting, a task involving the extraction of textual information from image or video sequences, faces challenges in cross-domain adaption, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Typically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for...
End-to-end scene text spotting, which aims to read the text in natural images, has garnered significant attention in recent years. However, state-of-the-art methods usually incorporate detection and recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end spotting framework termed SwinTextSpotter v2, which seeks to find a better synergy between detection and recognition. Specifically, we enhance the relationship between the two tasks using a novel...
Fine-tuning pre-trained Vision Transformers (ViT) has consistently demonstrated promising performance in the realm of visual recognition. However, adapting large pre-trained models to various tasks poses a significant challenge. This challenge arises from the need for each model to undergo an independent and comprehensive fine-tuning process, leading to substantial computational and memory demands. While recent advancements in Parameter-efficient Transfer Learning (PETL) have demonstrated their ability to achieve performance superior to full...
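The parameter-efficiency argument above can be made concrete with a minimal NumPy sketch of one widely used PETL technique, a LoRA-style low-rank update (named here as a generic illustration, not as this paper's specific method): the frozen weight W is augmented by a trainable rank-r product, so only r·(d_in + d_out) parameters are tuned instead of d_in·d_out. The dimensions and rank are illustrative assumptions.

```python
# Minimal sketch of a LoRA-style low-rank update, a common PETL technique.
# Dimensions and rank are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 768, 768, 8

W = rng.standard_normal((d_out, d_in))     # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # zero-init: training starts at W

def forward(x):
    # Frozen path plus trainable low-rank correction B @ (A @ x).
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
assert np.allclose(forward(x), W @ x)      # identical to frozen model at init

trainable = r * (d_in + d_out)             # 12,288 vs. 589,824 full weights
```

Only A and B receive gradients, which is what keeps per-task memory and storage small relative to full fine-tuning.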
Modularity plays a crucial role in the development and maintenance of complex systems. While end-to-end text spotting efficiently mitigates the issues of error accumulation and sub-optimal performance seen in traditional two-step methodologies, two-step methods continue to be favored in many competitions and practical settings due to their superior modularity. In this paper, we introduce Bridging Text Spotting, a novel approach that resolves the error accumulation and suboptimal performance issues while retaining modularity. To achieve this, we adopt a well-trained detector and recognizer...
Recently, scaling images to high resolution has received much attention in multimodal large language models (MLLMs). Most existing practices adopt a sliding-window-style cropping strategy to adapt to the resolution increase. Such a cropping strategy, however, can easily cut off objects and connected regions, which introduces semantic discontinuity and therefore impedes MLLMs from recognizing small or irregularly shaped text, leading to a phenomenon we call the sawtooth effect. This effect is particularly evident in lightweight MLLMs...
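A minimal sketch makes the failure mode above easy to see: a fixed-tile sliding-window crop places hard boundaries across the image, and any text spanning a boundary is severed between tiles. The tile size below is an illustrative assumption, not the paper's configuration.

```python
# Illustrative sliding-window cropping of a high-resolution image into
# fixed-size tiles; tile boundaries can sever text and connected regions,
# producing the semantic discontinuity described above. TILE is an assumption.

TILE = 448  # hypothetical vision-encoder input resolution

def crop_boxes(width, height, tile=TILE):
    """Return (left, top, right, bottom) tile boxes covering the image."""
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top,
                          min(left + tile, width),
                          min(top + tile, height)))
    return boxes

# A word occupying x in [400, 500) of an 896x448 image is split across
# the boundary at x = 448, landing partially in each of the two tiles:
boxes = crop_boxes(896, 448)
```

Strategies that merge tiles adaptively or overlap them aim to avoid exactly these hard cuts.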
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most...
The advancement of text shape representations towards compactness has enhanced detection and spotting performance, but at a high annotation cost. Current models use single-point annotations to reduce costs, yet they lack sufficient localization information for downstream applications. To overcome this limitation, we introduce Point2Polygon, which can efficiently transform single points into compact polygons. Our method uses a coarse-to-fine process, starting with creating and selecting anchor...