- Handwritten Text Recognition Techniques
- Natural Language Processing Techniques
- Topic Modeling
- Multimodal Machine Learning Applications
- Image Processing and 3D Reconstruction
- Advanced Image and Video Retrieval Techniques
- Vehicle License Plate Recognition
- Domain Adaptation and Few-Shot Learning
- Speech Recognition and Synthesis
- Religious Tourism and Spaces
- Historical and Linguistic Studies
- Biblical Studies and Interpretation
- Music and Audio Processing
- Simulation and Modeling Applications
- Intelligent Tutoring Systems and Adaptive Learning
- Machine Learning and ELM
- Video Analysis and Summarization
- Advanced Neural Network Applications
- Neural Networks and Applications
- Historical and Architectural Studies
- Time Series Analysis and Forecasting
- Online and Blended Learning
- Speech and Dialogue Systems
- Data Visualization and Analytics
NetEase (China)
2022
Recently, transformer-based methods have achieved promising progress in object detection, as they can eliminate post-processing steps such as NMS and enrich deep representations. However, these methods cannot cope well with scene text due to its extreme variance of scales and aspect ratios. In this paper, we present a simple yet effective transformer-based architecture for scene text detection. Different from previous approaches that learn robust representations in a holistic manner, our method performs detection based on a few representative...
End-to-end scene text spotting has made significant progress due to the intrinsic synergy between detection and recognition. Previous methods commonly regard manual annotations such as horizontal rectangles, rotated quadrangles, and polygons as a prerequisite, which are much more expensive than single-point annotations. Our new framework, SPTS v2, allows us to train high-performing text-spotting models using only single-point annotations. SPTS v2 retains the advantage of the auto-regressive Transformer with an Instance...
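To make the single-point formulation above concrete, here is a minimal sketch of how one text instance could be serialized for an auto-regressive Transformer. The token layout (coordinate bins followed by character tokens) is an illustrative assumption, not SPTS v2's exact vocabulary.

```python
def serialize_instance(x, y, text, n_bins=1000,
                       charset="abcdefghijklmnopqrstuvwxyz"):
    """Serialize one text instance as [quantized point, transcription tokens].

    x, y are normalized to [0, 1]; coordinates are quantized into n_bins
    discrete tokens, and character tokens are offset past the coordinate
    vocabulary so the two ranges never collide.
    """
    coord_tokens = [int(x * (n_bins - 1)), int(y * (n_bins - 1))]
    char_tokens = [n_bins + charset.index(c) for c in text.lower() if c in charset]
    return coord_tokens + char_tokens

print(serialize_instance(0.42, 0.17, "Cafe"))  # [419, 169, 1002, 1000, 1005, 1004]
```

In this framing, a decoder predicts the point first and the characters after it, which is how a single point can stand in for a full polygon annotation.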
We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for visual encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM's ability to handle flexible context, VoRA can process...
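The merge step that makes this design attractive is ordinary LoRA algebra. Below is a minimal sketch, assuming standard LoRA factors `A` and `B` with scaling `alpha / rank`; it is not the official VoRA implementation.

```python
# Minimal sketch of folding a LoRA update into a frozen base weight:
# W' = W + (alpha / rank) * B @ A, so no extra modules remain at inference.
import torch

def merge_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               alpha: float, rank: int) -> torch.Tensor:
    """W: (out_features, in_features) frozen base weight.
    A: (rank, in_features), B: (out_features, rank) trained adapter factors."""
    return W + (alpha / rank) * (B @ A)

# Example: a hypothetical 4096x4096 projection with rank-16 vision adapters.
W = torch.randn(4096, 4096)
A = torch.randn(16, 4096) * 0.01
B = torch.zeros(4096, 16)  # B starts at zero, so the merge is a no-op before training
W_merged = merge_lora(W, A, B, alpha=32.0, rank=16)
assert torch.allclose(W, W_merged)  # holds here because B is zero
```

Because the update folds into `W`, the inference-time architecture is exactly the base LLM, which is the structural simplification the abstract highlights.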
Recently, diffusion models have emerged as promising newcomers in the field of generative models, shining brightly in image generation. However, when employed for object removal tasks, they still encounter issues such as generating random artifacts and an incapacity to repaint foreground areas with appropriate content after removal. To tackle these problems, we propose Attentive Eraser, a tuning-free method that empowers pre-trained diffusion models to perform stable and effective object removal. Firstly, in light of the observation that self-attention maps...
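As a rough illustration of steering self-attention for removal (the masking scheme here is an assumption for exposition, not the paper's exact mechanism), one can suppress pre-softmax attention logits at key positions inside the object mask so the region gets repainted from background context:

```python
import torch

def mask_object_attention(scores: torch.Tensor, obj_mask: torch.Tensor) -> torch.Tensor:
    """scores: (num_queries, num_keys) pre-softmax self-attention logits.
    obj_mask: (num_keys,) boolean, True where the key token lies in the object."""
    scores = scores.masked_fill(obj_mask[None, :], float("-inf"))
    return scores.softmax(dim=-1)

scores = torch.randn(64, 64)                # 8x8 latent grid, flattened
obj_mask = torch.zeros(64, dtype=torch.bool)
obj_mask[27:37] = True                      # tokens covering the object to erase
attn = mask_object_attention(scores, obj_mask)
print(attn[:, 27:37].sum().item())          # ~0: no probability mass on the object
```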
This work presents ParGo, a novel Partial-Global projector designed to connect the vision and language modalities for Multimodal Large Language Models (MLLMs). Unlike previous works that rely on global attention-based projectors, our ParGo bridges the representation gap between separately pre-trained vision encoders and LLMs by integrating partial and global views, which alleviates the overemphasis on prominent regions. To facilitate effective training of ParGo, we collect a large-scale detail-captioned image-text dataset named...
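A hedged sketch of the partial-plus-global idea follows: learnable global queries cross-attend to all vision tokens, partial queries each see only a local window, and the concatenated outputs are projected to the LLM width. All module names and dimensions are illustrative assumptions, not ParGo's released code.

```python
import torch, torch.nn as nn

class PartialGlobalProjector(nn.Module):
    def __init__(self, dim=1024, llm_dim=4096, n_global=16, n_windows=16, heads=8):
        super().__init__()
        self.global_q = nn.Parameter(torch.randn(n_global, dim))
        self.partial_q = nn.Parameter(torch.randn(n_windows, dim))  # one query per window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, llm_dim)
        self.n_windows = n_windows

    def forward(self, v):                                          # v: (B, N, dim) vision tokens
        B, N, D = v.shape
        g, _ = self.attn(self.global_q.expand(B, -1, -1), v, v)   # global view over all tokens
        win = v.view(B * self.n_windows, N // self.n_windows, D)  # split tokens into windows
        pq = self.partial_q.view(-1, 1, D).repeat(B, 1, 1)
        p, _ = self.attn(pq, win, win)                             # each query sees one window
        p = p.view(B, self.n_windows, D)
        return self.proj(torch.cat([g, p], dim=1))                 # (B, n_global + n_windows, llm_dim)

tokens = torch.randn(2, 256, 1024)               # e.g., 16x16 ViT patch tokens
print(PartialGlobalProjector()(tokens).shape)    # torch.Size([2, 32, 4096])
```

The design point this sketch captures is that the LLM receives a fixed, small number of tokens mixing window-local detail with a global summary, rather than only globally pooled features.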
In the era of Large Language Models (LLMs), tremendous strides have been made in the field of multimodal understanding. However, existing advanced algorithms are limited in effectively utilizing the immense representation capabilities and rich world knowledge inherent in these large pre-trained models, and the beneficial connections among tasks within the context of text-rich scenarios have not been sufficiently explored. In this work, we introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities, which...
End-to-end scene text spotting has recently gained great attention in the research community. The majority of existing methods rely heavily on location annotations of text instances (e.g., word-level boxes, word-level masks, and char-level boxes). We demonstrate that text spotting can be accomplished solely via transcription annotations, significantly reducing the need for costly location annotations. We propose a query-based paradigm to learn implicit location features through the interaction of text queries and image embeddings. These features are then made explicit during the recognition stage...
Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction-tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps:...
Text-Centric Visual Question Answering (TEC-VQA), in its proper format, not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. However, most TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering works that expand multilingual QA pairs in non-text-centric VQA datasets using translation engines, the translation-based protocol encounters...
Tables contain factual and quantitative data accompanied by various structures and contents that pose challenges for machine comprehension. Previous methods generally design task-specific architectures and objectives for individual tasks, resulting in modal isolation and intricate workflows. In this paper, we present TabPedia, a novel large vision-language model equipped with a concept synergy mechanism. In this mechanism, all the involved diverse visual table understanding (VTU) tasks and multi-source embeddings are...
Recently, many studies have demonstrated that exclusively incorporating OCR-derived text and spatial layouts with large language models (LLMs) can be highly effective for document understanding tasks. However, existing methods that integrate spatial layouts with text have limitations, such as producing overly long sequences or failing to fully leverage the autoregressive traits of LLMs. In this work, we introduce Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding. In particular, LayTextLLM projects...
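The interleaving idea can be sketched as follows, with every interface assumed for illustration: each OCR box is projected to a single layout embedding and placed directly before the embeddings of the words it encloses, keeping the sequence short and autoregressive.

```python
import torch, torch.nn as nn

embed_dim = 4096
box_proj = nn.Linear(4, embed_dim)          # one token per box: (x1, y1, x2, y2) -> embedding
tok_embed = nn.Embedding(32000, embed_dim)  # stand-in for the LLM's token embedding table

def interleave(boxes, token_ids_per_box):
    """boxes: (num_boxes, 4) normalized coords; token_ids_per_box: list of 1D id tensors."""
    pieces = []
    for box, ids in zip(boxes, token_ids_per_box):
        pieces.append(box_proj(box).unsqueeze(0))  # 1 layout token for the box
        pieces.append(tok_embed(ids))              # the word tokens inside that box
    return torch.cat(pieces, dim=0)                # fed to the LLM as inputs_embeds

boxes = torch.tensor([[0.1, 0.1, 0.4, 0.15], [0.5, 0.1, 0.9, 0.15]])
words = [torch.tensor([101, 2023]), torch.tensor([4521])]
print(interleave(boxes, words).shape)  # torch.Size([5, 4096]): 2 layout + 3 text tokens
```

One layout token per box is what keeps the sequence far shorter than spelling coordinates out as text.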
The comprehension of text-rich visual scenes has become a focal point for evaluating Multi-modal Large Language Models (MLLMs) due to their widespread applications. Current benchmarks tailored to the scenario emphasize perceptual capabilities while overlooking the assessment of cognitive abilities. To address this limitation, we introduce a Multimodal benchmark towards Text-rich scenes that evaluates the Cognitive capabilities of MLLMs through reasoning and content-creation tasks (MCTBench). To mitigate potential...
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most...
This paper presents the inaugural character recognition competition for street view shop signs, covering the associated tasks, datasets, participating teams, the winning team's solution, and the justification for the award.
Scene text recognition (STR) in the wild frequently encounters challenges when coping with domain variations, font diversity, shape deformations, etc. A straightforward solution is performing model fine-tuning tailored to a specific scenario, but this is computationally intensive and requires multiple model copies for various scenarios. Recent studies indicate that large language models (LLMs) can learn from a few demonstration examples in a training-free manner, termed "In-Context Learning" (ICL)....
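A minimal sketch of the training-free ICL recipe described above, under assumed interfaces: retrieve the few most similar labeled crops by feature similarity and prepend them as (image, transcription) demonstrations before the query. The prompt layout and feature source are assumptions, not this paper's exact pipeline.

```python
import numpy as np

def select_demos(query_feat, pool_feats, k=3):
    """Pick the k nearest labeled examples by cosine similarity."""
    sims = pool_feats @ query_feat / (
        np.linalg.norm(pool_feats, axis=1) * np.linalg.norm(query_feat) + 1e-8)
    return np.argsort(-sims)[:k]

def build_icl_prompt(demo_images, demo_texts, query_image):
    """Interleave (image, transcription) demonstrations before the query crop."""
    prompt = []
    for img, txt in zip(demo_images, demo_texts):
        prompt += [("image", img), ("text", f"Text: {txt}")]
    prompt += [("image", query_image), ("text", "Text:")]
    return prompt  # handed to a multimodal model's generate() in practice

pool = np.random.rand(100, 512)       # features of labeled demonstration crops
q = np.random.rand(512)               # feature of the query crop
idx = select_demos(q, pool)           # indices of the 3 most similar crops
```

No weights change at any point, which is what makes the adaptation training-free.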
Text detection and recognition are essential components of a modern OCR system. Most approaches attempt to obtain accurate bounding boxes for text at the detection stage, which are then used as input to the recognition stage. We observe that when using tight bounding boxes as input, the recognizer frequently fails to achieve optimal performance due to inconsistency between the boxes and the deep representations used for recognition. In this paper, we propose Box Adjuster, a reinforcement learning-based method for adjusting the shape of each box to make it more compatible with recognition models....
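The reinforcement-learning loop can be sketched as below. The action space (nudging box edges) and the reward (improvement in recognizer confidence) follow the general idea only; `recognizer_score` is a hypothetical stand-in for a real recognition model, and the random policy stands in for a learned agent.

```python
import random

ACTIONS = {0: (0, 0, 0, 0),                       # stop
           1: (-2, 0, 0, 0), 2: (2, 0, 0, 0),     # move left edge
           3: (0, -2, 0, 0), 4: (0, 2, 0, 0),     # move top edge
           5: (0, 0, 2, 0), 6: (0, 0, -2, 0),     # move right edge
           7: (0, 0, 0, 2), 8: (0, 0, 0, -2)}     # move bottom edge

def recognizer_score(box):
    """Hypothetical recognizer confidence; a real system would crop the
    image at `box` and run a text recognizer here."""
    x1, y1, x2, y2 = box
    return -abs((x2 - x1) - 100) - abs((y2 - y1) - 32)  # toy: prefers 100x32 crops

def step(box, action):
    dx1, dy1, dx2, dy2 = ACTIONS[action]
    new_box = (box[0] + dx1, box[1] + dy1, box[2] + dx2, box[3] + dy2)
    reward = recognizer_score(new_box) - recognizer_score(box)  # improvement as reward
    return new_box, reward

box = (10, 10, 90, 50)
for _ in range(20):                    # random policy stand-in for the trained agent
    box, r = step(box, random.choice(list(ACTIONS)))
print(box)
```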