- Handwritten Text Recognition Techniques
- Natural Language Processing Techniques
- Topic Modeling
- Multimodal Machine Learning Applications
- Image Processing and 3D Reconstruction
- Advanced Image and Video Retrieval Techniques
- Vehicle License Plate Recognition
- Domain Adaptation and Few-Shot Learning
- Speech Recognition and Synthesis
- Religious Tourism and Spaces
- Historical and Linguistic Studies
- Biblical Studies and Interpretation
- Music and Audio Processing
- Simulation and Modeling Applications
- Intelligent Tutoring Systems and Adaptive Learning
- Machine Learning and ELM
- Video Analysis and Summarization
- Advanced Neural Network Applications
- Neural Networks and Applications
- Historical and Architectural Studies
- Time Series Analysis and Forecasting
- Online and Blended Learning
- Speech and Dialogue Systems
- Data Visualization and Analytics
NetEase (China)
2022
Recently, transformer-based methods have achieved promising progress in object detection, as they can eliminate post-processing steps such as NMS and enrich deep representations. However, these methods cannot cope well with scene text due to its extreme variance of scales and aspect ratios. In this paper, we present a simple yet effective transformer-based architecture for scene text detection. Different from previous approaches that learn robust representations in a holistic manner, our method performs detection based on a few representative...
End-to-end scene text spotting has made significant progress due to the intrinsic synergy between detection and recognition. Previous methods commonly regard manual annotations such as horizontal rectangles, rotated quadrangles, and polygons as a prerequisite, which are much more expensive than single-point annotations. Our new framework, SPTS v2, allows us to train high-performing text-spotting models using only single-point annotations. SPTS v2 retains the advantage of the auto-regressive Transformer with an Instance...
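To make the single-point formulation above concrete, here is a minimal sketch of how one text instance could be serialized for an auto-regressive Transformer. The token layout (coordinate bins followed by character tokens) is an illustrative assumption, not SPTS v2's exact vocabulary.

```python
def serialize_instance(x, y, text, n_bins=1000,
                       charset="abcdefghijklmnopqrstuvwxyz"):
    """Serialize one text instance as [quantized point, transcription tokens].

    x, y are normalized to [0, 1]; coordinates are quantized into n_bins
    discrete tokens, and character tokens are offset past the coordinate
    vocabulary so the two ranges never collide.
    """
    coord_tokens = [int(x * (n_bins - 1)), int(y * (n_bins - 1))]
    char_tokens = [n_bins + charset.index(c) for c in text.lower() if c in charset]
    return coord_tokens + char_tokens

print(serialize_instance(0.42, 0.17, "Cafe"))  # [419, 169, 1002, 1000, 1005, 1004]
```

In this framing, a decoder predicts the point first and the characters after it, which is how a single point can stand in for a full polygon annotation.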
We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for visual encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM's ability to handle flexible context, VoRA can process...
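The merge step that makes this design attractive is ordinary LoRA algebra. Below is a minimal sketch, assuming standard LoRA factors `A` and `B` with scaling `alpha / rank`; it is not the official VoRA implementation.

```python
# Minimal sketch of folding a LoRA update into a frozen base weight:
# W' = W + (alpha / rank) * B @ A, so no extra modules remain at inference.
import torch

def merge_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               alpha: float, rank: int) -> torch.Tensor:
    """W: (out_features, in_features) frozen base weight.
    A: (rank, in_features), B: (out_features, rank) trained adapter factors."""
    return W + (alpha / rank) * (B @ A)

# Example: a hypothetical 4096x4096 projection with rank-16 vision adapters.
W = torch.randn(4096, 4096)
A = torch.randn(16, 4096) * 0.01
B = torch.zeros(4096, 16)  # B starts at zero, so the merge is a no-op before training
W_merged = merge_lora(W, A, B, alpha=32.0, rank=16)
assert torch.allclose(W, W_merged)  # holds here because B is zero
```

Because the update folds into `W`, the inference-time architecture is exactly the base LLM, which is the structural simplification the abstract highlights.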
Recently, diffusion models have emerged as promising newcomers in the field of generative models, shining brightly in image generation. However, when employed for object removal tasks, they still encounter issues such as generating random artifacts and an incapacity to repaint foreground areas with appropriate content after removal. To tackle these problems, we propose Attentive Eraser, a tuning-free method that empowers pre-trained diffusion models to perform stable and effective object removal. Firstly, in light of the observation that self-attention maps...
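As a rough illustration of steering self-attention for removal (the masking scheme here is an assumption for exposition, not the paper's exact mechanism), one can suppress pre-softmax attention logits at key positions inside the object mask so the region gets repainted from background context:

```python
import torch

def mask_object_attention(scores: torch.Tensor, obj_mask: torch.Tensor) -> torch.Tensor:
    """scores: (num_queries, num_keys) pre-softmax self-attention logits.
    obj_mask: (num_keys,) boolean, True where the key token lies in the object."""
    scores = scores.masked_fill(obj_mask[None, :], float("-inf"))
    return scores.softmax(dim=-1)

scores = torch.randn(64, 64)                # 8x8 latent grid, flattened
obj_mask = torch.zeros(64, dtype=torch.bool)
obj_mask[27:37] = True                      # tokens covering the object to erase
attn = mask_object_attention(scores, obj_mask)
print(attn[:, 27:37].sum().item())          # ~0: no probability mass on the object
```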
This work presents ParGo, a novel Partial-Global projector designed to connect the vision and language modalities for Multimodal Large Language Models (MLLMs). Unlike previous works that rely on global attention-based projectors, our ParGo bridges the representation gap between separately pre-trained vision encoders and LLMs by integrating partial and global views, which alleviates the overemphasis on prominent regions. To facilitate effective training of ParGo, we collect a large-scale detail-captioned image-text dataset named...
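A hedged sketch of the partial-plus-global idea follows: learnable global queries cross-attend to all vision tokens, partial queries each see only a local window, and the concatenated outputs are projected to the LLM width. All module names and dimensions are illustrative assumptions, not ParGo's released code.

```python
import torch, torch.nn as nn

class PartialGlobalProjector(nn.Module):
    def __init__(self, dim=1024, llm_dim=4096, n_global=16, n_windows=16, heads=8):
        super().__init__()
        self.global_q = nn.Parameter(torch.randn(n_global, dim))
        self.partial_q = nn.Parameter(torch.randn(n_windows, dim))  # one query per window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, llm_dim)
        self.n_windows = n_windows

    def forward(self, v):                                          # v: (B, N, dim) vision tokens
        B, N, D = v.shape
        g, _ = self.attn(self.global_q.expand(B, -1, -1), v, v)   # global view over all tokens
        win = v.view(B * self.n_windows, N // self.n_windows, D)  # split tokens into windows
        pq = self.partial_q.view(-1, 1, D).repeat(B, 1, 1)
        p, _ = self.attn(pq, win, win)                             # each query sees one window
        p = p.view(B, self.n_windows, D)
        return self.proj(torch.cat([g, p], dim=1))                 # (B, n_global + n_windows, llm_dim)

tokens = torch.randn(2, 256, 1024)               # e.g., 16x16 ViT patch tokens
print(PartialGlobalProjector()(tokens).shape)    # torch.Size([2, 32, 4096])
```

The design point this sketch captures is that the LLM receives a fixed, small number of tokens mixing window-local detail with a global summary, rather than only globally pooled features.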
In the era of Large Language Models (LLMs), tremendous strides have been made in the field of multimodal understanding. However, existing advanced algorithms are limited in effectively utilizing the immense representation capabilities and rich world knowledge inherent in these large pre-trained models, and the beneficial connections among tasks within the context of text-rich scenarios have not been sufficiently explored. In this work, we introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities, which...
End-to-end scene text spotting has recently gained great attention in the research community. The majority of existing methods rely heavily on location annotations of text instances (e.g., word-level boxes, word-level masks, and char-level boxes). We demonstrate that text spotting can be accomplished solely via transcription annotations, significantly reducing the need for costly location annotations. We propose a query-based paradigm to learn implicit location features through the interaction of text queries and image embeddings. These features are then made explicit during the recognition stage...
Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction-tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps:...
Text-Centric Visual Question Answering (TEC-VQA), in its proper format, not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. However, most TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering works that expand multilingual QA pairs in non-text-centric VQA datasets using translation engines, the translation-based protocol encounters...
Tables contain factual and quantitative data accompanied by various structures and contents that pose challenges for machine comprehension. Previous methods generally design task-specific architectures and objectives for individual tasks, resulting in modal isolation and intricate workflows. In this paper, we present TabPedia, a novel large vision-language model equipped with a concept synergy mechanism. In this mechanism, all the involved diverse visual table understanding (VTU) tasks and multi-source embeddings are...
Recently, many studies have demonstrated that exclusively incorporating OCR-derived text and spatial layouts with large language models (LLMs) can be highly effective for document understanding tasks. However, existing methods that integrate spatial layouts with text have limitations, such as producing overly long sequences or failing to fully leverage the autoregressive traits of LLMs. In this work, we introduce Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding. In particular, LayTextLLM projects...
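The interleaving idea can be sketched as follows, with every interface assumed for illustration: each OCR box is projected to a single layout embedding and placed directly before the embeddings of the words it encloses, keeping the sequence short and autoregressive.

```python
import torch, torch.nn as nn

embed_dim = 4096
box_proj = nn.Linear(4, embed_dim)          # one token per box: (x1, y1, x2, y2) -> embedding
tok_embed = nn.Embedding(32000, embed_dim)  # stand-in for the LLM's token embedding table

def interleave(boxes, token_ids_per_box):
    """boxes: (num_boxes, 4) normalized coords; token_ids_per_box: list of 1D id tensors."""
    pieces = []
    for box, ids in zip(boxes, token_ids_per_box):
        pieces.append(box_proj(box).unsqueeze(0))  # 1 layout token for the box
        pieces.append(tok_embed(ids))              # the word tokens inside that box
    return torch.cat(pieces, dim=0)                # fed to the LLM as inputs_embeds

boxes = torch.tensor([[0.1, 0.1, 0.4, 0.15], [0.5, 0.1, 0.9, 0.15]])
words = [torch.tensor([101, 2023]), torch.tensor([4521])]
print(interleave(boxes, words).shape)  # torch.Size([5, 4096]): 2 layout + 3 text tokens
```

One layout token per box is what keeps the sequence far shorter than spelling coordinates out as text.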
The comprehension of text-rich visual scenes has become a focal point for evaluating Multi-modal Large Language Models (MLLMs) due to their widespread applications. Current benchmarks tailored to the scenario emphasize perceptual capabilities while overlooking the assessment of cognitive abilities. To address this limitation, we introduce a Multimodal benchmark towards Text-rich scenes that evaluates the Cognitive capabilities of MLLMs through reasoning and content-creation tasks (MCTBench). To mitigate potential...
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most...
This paper presents the inaugural character recognition competition for street view shop signs, covering the associated tasks, datasets, participating teams, the winning team's solution, and the justification for the award.
Scene text recognition (STR) in the wild frequently encounters challenges when coping with domain variations, font diversity, shape deformations, etc. A straightforward solution is performing model fine-tuning tailored to a specific scenario, but this is computationally intensive and requires multiple model copies for various scenarios. Recent studies indicate that large language models (LLMs) can learn from a few demonstration examples in a training-free manner, termed "In-Context Learning" (ICL)....
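A minimal sketch of the training-free ICL recipe described above, under assumed interfaces: retrieve the few most similar labeled crops by feature similarity and prepend them as (image, transcription) demonstrations before the query. The prompt layout and feature source are assumptions, not this paper's exact pipeline.

```python
import numpy as np

def select_demos(query_feat, pool_feats, k=3):
    """Pick the k nearest labeled examples by cosine similarity."""
    sims = pool_feats @ query_feat / (
        np.linalg.norm(pool_feats, axis=1) * np.linalg.norm(query_feat) + 1e-8)
    return np.argsort(-sims)[:k]

def build_icl_prompt(demo_images, demo_texts, query_image):
    """Interleave (image, transcription) demonstrations before the query crop."""
    prompt = []
    for img, txt in zip(demo_images, demo_texts):
        prompt += [("image", img), ("text", f"Text: {txt}")]
    prompt += [("image", query_image), ("text", "Text:")]
    return prompt  # handed to a multimodal model's generate() in practice

pool = np.random.rand(100, 512)       # features of labeled demonstration crops
q = np.random.rand(512)               # feature of the query crop
idx = select_demos(q, pool)           # indices of the 3 most similar crops
```

No weights change at any point, which is what makes the adaptation training-free.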
Text detection and recognition are essential components of a modern OCR system. Most approaches attempt to obtain accurate bounding boxes for text at the detection stage, which are then used as input to the recognition stage. We observe that when using tight bounding boxes as input, the recognizer frequently fails to achieve optimal performance due to inconsistency between the boxes and the deep representations used for recognition. In this paper, we propose Box Adjuster, a reinforcement learning-based method for adjusting the shape of each box to make it more compatible with recognition models....
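The reinforcement-learning loop can be sketched as below. The action space (nudging box edges) and the reward (improvement in recognizer confidence) follow the general idea only; `recognizer_score` is a hypothetical stand-in for a real recognition model, and the random policy stands in for a learned agent.

```python
import random

ACTIONS = {0: (0, 0, 0, 0),                       # stop
           1: (-2, 0, 0, 0), 2: (2, 0, 0, 0),     # move left edge
           3: (0, -2, 0, 0), 4: (0, 2, 0, 0),     # move top edge
           5: (0, 0, 2, 0), 6: (0, 0, -2, 0),     # move right edge
           7: (0, 0, 0, 2), 8: (0, 0, 0, -2)}     # move bottom edge

def recognizer_score(box):
    """Hypothetical recognizer confidence; a real system would crop the
    image at `box` and run a text recognizer here."""
    x1, y1, x2, y2 = box
    return -abs((x2 - x1) - 100) - abs((y2 - y1) - 32)  # toy: prefers 100x32 crops

def step(box, action):
    dx1, dy1, dx2, dy2 = ACTIONS[action]
    new_box = (box[0] + dx1, box[1] + dy1, box[2] + dx2, box[3] + dy2)
    reward = recognizer_score(new_box) - recognizer_score(box)  # improvement as reward
    return new_box, reward

box = (10, 10, 90, 50)
for _ in range(20):                    # random policy stand-in for the trained agent
    box, r = step(box, random.choice(list(ACTIONS)))
print(box)
```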