- Multimodal Machine Learning Applications
- Natural Language Processing Techniques
- Advanced Image and Video Retrieval Techniques
- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Topic Modeling
- Video Analysis and Summarization
- Handwritten Text Recognition Techniques
- Advanced Vision and Imaging
- Vehicle License Plate Recognition
- Image Retrieval and Classification Techniques
- Human Motion and Animation
- Machine Learning and Data Classification
- Speech Recognition and Synthesis
- Face and Expression Recognition
- Face Recognition and Analysis
- Infrared Target Detection Methodologies
- Semantic Web and Ontologies
- Multimedia Communication and Technology
- Machine Learning in Healthcare
- Data Stream Mining Techniques
- Subtitles and Audiovisual Media
- Intelligent Tutoring Systems and Adaptive Learning
- Text and Document Classification Technologies
- Digital Humanities and Scholarship
ShangHai JiAi Genetics & IVF Institute (2024)
Shanghai Artificial Intelligence Laboratory (2024)
Nanyang Technological University (2021-2024)
Group Sense (China) (2020)
Southerners on New Ground (2020)
University of Electronic Science and Technology of China (2018-2019)
North China Electric Power University (2017)
Scene text detection, an important step of scene text reading systems, has witnessed rapid development with convolutional neural networks. Nonetheless, two main challenges still exist and hamper its deployment to real-world applications. The first problem is the trade-off between speed and accuracy. The second is how to model arbitrary-shaped text instances. Recently, some methods have been proposed to tackle arbitrary-shaped text detection, but they rarely take the speed of the entire pipeline into consideration, which may fall short in practical applications. In this paper, we...
Scene text detection methods based on deep learning have achieved remarkable results over the past years. However, due to the high diversity and complexity of natural scenes, previous state-of-the-art text detection methods may still produce a considerable amount of false positives when applied to images captured in real-world environments. To tackle this issue, mainly inspired by Mask R-CNN, we propose in this paper an effective model for scene text detection, which is based on Feature Pyramid Network (FPN) and instance segmentation. We propose a supervised...
Instance segmentation has witnessed remarkable progress on class-balanced benchmarks. However, such models fail to perform as accurately in real-world scenarios, where the category distribution of objects naturally comes with a long tail. Instances of head classes dominate a long-tailed dataset and serve as negative samples of tail categories. The overwhelming gradients of these negative samples lead to a biased learning process for the classifiers. Consequently, objects of tail categories are more likely to be misclassified as backgrounds or as head categories. To tackle this...
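To make the imbalance concrete, here is a tiny numpy simulation (not the paper's method) of how many suppressing updates each classifier receives under a hypothetical long-tailed label distribution:

```python
# Every sample of a head class acts as a negative for each tail-class
# classifier, so tail classifiers see far more suppressing updates than
# positive ones. Class frequencies below are illustrative.
import numpy as np

num_classes = 10
counts = (1000 * 0.5 ** np.arange(num_classes)).astype(int)  # long tail

pos = counts.astype(float)        # positive updates per classifier
neg = counts.sum() - pos          # every other sample is a negative

ratio = neg / np.maximum(pos, 1.0)
for c in range(num_classes):
    print(f"class {c}: {counts[c]:4d} samples, neg/pos ratio {ratio[c]:7.1f}")
# The head class sees roughly one negative per positive, while the rarest
# tail class is suppressed ~2000 times for every positive update.
```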
Recent methods for long-tailed instance segmentation still struggle on rare object classes with few training data. We propose a simple yet effective method, Feature Augmentation and Sampling Adaptation (FASA), that addresses the data scarcity issue by augmenting the feature space, especially for rare classes. Both the Feature Augmentation (FA) and feature sampling components are adaptive to the actual training status — FA is informed by the feature mean and variance of observed real samples from past iterations, and we sample the generated virtual features in a loss-adapted manner...
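The adaptive statistics lend themselves to a compact implementation. Below is a minimal PyTorch sketch in the spirit of FASA; the class name, momentum value, and shapes are illustrative, not the released code:

```python
# Track per-class running feature statistics and sample "virtual" features
# for rare classes from N(mean, var).
import torch

class FeatureAugmenter:
    def __init__(self, num_classes: int, feat_dim: int, momentum: float = 0.1):
        self.momentum = momentum
        self.mean = torch.zeros(num_classes, feat_dim)
        self.var = torch.ones(num_classes, feat_dim)

    def update(self, feats: torch.Tensor, labels: torch.Tensor) -> None:
        # Update running mean/variance from observed real features.
        m = self.momentum
        for c in labels.unique():
            f = feats[labels == c]
            self.mean[c] = (1 - m) * self.mean[c] + m * f.mean(dim=0)
            if f.shape[0] > 1:
                self.var[c] = (1 - m) * self.var[c] + m * f.var(dim=0)

    def sample(self, cls: int, n: int) -> torch.Tensor:
        # Draw virtual features for class `cls` from the tracked Gaussian.
        std = self.var[cls].clamp_min(1e-6).sqrt()
        return self.mean[cls] + std * torch.randn(n, std.shape[0])

aug = FeatureAugmenter(num_classes=5, feat_dim=16)
feats, labels = torch.randn(32, 16), torch.randint(0, 5, (32,))
aug.update(feats, labels)
virtual = aug.sample(cls=4, n=8)   # extra samples for a rare class
print(virtual.shape)               # torch.Size([8, 16])
```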
Prompt tuning, a parameter- and data-efficient transfer learning paradigm that tunes only a small number of parameters in a model's input space, has become a trend in the vision community since the emergence of large vision-language models like CLIP. We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning. A major finding is that none of the unimodal prompt tuning methods performs consistently well: text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances. To...
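Both schemes share one mechanism: freeze the backbone and learn a handful of vectors prepended to the input sequence. A minimal PyTorch sketch follows (illustrative shapes and module names, not CLIP's actual interface):

```python
# Prompt tuning: the backbone is frozen; only a few prompt tokens train.
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    def __init__(self, encoder: nn.Module, num_prompts: int, dim: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # backbone stays frozen
            p.requires_grad_(False)
        # The only trainable parameters: learnable prompt tokens.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim) embeddings -- word tokens for text
        # prompt tuning, patch tokens for visual prompt tuning.
        batch = tokens.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return self.encoder(torch.cat([prompts, tokens], dim=1))

dim = 32
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
model = PromptedEncoder(backbone, num_prompts=4, dim=dim)
out = model(torch.randn(2, 10, dim))
print(out.shape)  # torch.Size([2, 14, 32])
```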
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations, through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously...
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of...
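The PLoRA idea can be sketched in a few lines of PyTorch; the names, rank, and shapes below are illustrative, not the released implementation. The low-rank update fires only at image-token positions, leaving text tokens on the frozen pre-trained path:

```python
# Partial LoRA sketch: apply the low-rank delta B @ A only where the token
# came from the image encoder.
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # frozen pre-trained weight
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_features); image_mask: (batch, seq) bool,
        # True where the token comes from the image encoder.
        out = self.base(x)
        delta = (x @ self.lora_a.T) @ self.lora_b.T
        return out + delta * image_mask.unsqueeze(-1)

layer = PartialLoRALinear(nn.Linear(32, 32), rank=4)
x = torch.randn(2, 6, 32)
mask = torch.tensor([[1, 1, 0, 0, 0, 0]] * 2, dtype=torch.bool)  # 2 image tokens
print(layer(x, mask).shape)  # torch.Size([2, 6, 32])
```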
The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow resolution range. This paper presents InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K...
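High-resolution LVLM inputs are typically handled by padding the image and cutting it into encoder-sized tiles, so the token budget grows with resolution. A minimal sketch of such tiling follows; the 336-pixel tile size and the function are assumptions for illustration, not the paper's exact scheme:

```python
# Pad a large image up to tile multiples, then cut fixed-size tiles that a
# ViT-style encoder can consume one by one.
import torch

def tile_image(img: torch.Tensor, tile: int = 336) -> torch.Tensor:
    # img: (3, H, W) -> (num_tiles, 3, tile, tile)
    _, h, w = img.shape
    pad_h, pad_w = (-h) % tile, (-w) % tile
    img = torch.nn.functional.pad(img, (0, pad_w, 0, pad_h))
    tiles = img.unfold(1, tile, tile).unfold(2, tile, tile)  # (3, nH, nW, t, t)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, 3, tile, tile)

img_4k = torch.rand(3, 2160, 3840)
print(tile_image(img_4k).shape)  # torch.Size([84, 3, 336, 336])
```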
Temporal awareness, the ability to reason dynamically based on the timestamp at which a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and adapt their responses at the moment the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel benchmark that emphasizes...
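The protocol this targets is easy to state in code: at query time the model may only see frames up to the question's timestamp, so the same question can have different correct answers at different times. A minimal sketch with a hypothetical harness and a stub model (not OVO-Bench's code):

```python
def answer_online(model, frames, fps, question, ask_time_s):
    """Answer using only the frames visible at ask_time_s."""
    visible = frames[: int(ask_time_s * fps)]   # future frames are off-limits
    return model(visible, question)

# Stub model: "counts" how many marked events have happened so far.
frames = ["-", "event", "-", "-", "event", "-", "-", "-", "event", "-"]
counter = lambda fs, q: sum(f == "event" for f in fs)
print(answer_online(counter, frames, fps=1,
                    question="How many events so far?", ask_time_s=5))   # -> 2
print(answer_online(counter, frames, fps=1,
                    question="How many events so far?", ask_time_s=10))  # -> 3
```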
Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary ones are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that...
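Reward models of this kind are commonly trained with a pairwise preference objective. Below is a minimal PyTorch sketch of a scalar scoring head under a Bradley-Terry loss; this is the generic recipe, not necessarily IXC-2.5-Reward's exact setup:

```python
# A scalar head scores a response; the pairwise loss pushes the chosen
# response's score above the rejected one's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, hidden) final-token state from a (multi-modal)
        # backbone; returns one scalar reward per sequence.
        return self.score(last_hidden).squeeze(-1)

head = RewardHead(hidden=64)
chosen = head(torch.randn(8, 64))     # features of preferred responses
rejected = head(torch.randn(8, 64))   # features of dispreferred responses
loss = -F.logsigmoid(chosen - rejected).mean()   # Bradley-Terry pairwise loss
loss.backward()
print(float(loss))
```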
While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce the challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task,...
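For intuition, the sketch below implements plain 1D RoPE in numpy plus a naive factorized spatio-temporal variant that assigns each token a (t, x, y) index with a separate frequency group per axis; the axis split is an illustrative baseline, not the paper's final design:

```python
import numpy as np

def rope_1d(x: np.ndarray, pos: np.ndarray) -> np.ndarray:
    # x: (n, d) with even d; pos: (n,) integer positions.
    d = x.shape[1]
    freqs = 1.0 / (10000 ** (np.arange(d // 2) / (d // 2)))
    ang = pos[:, None] * freqs[None, :]                  # (n, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                   # rotate each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x: np.ndarray, txy: np.ndarray) -> np.ndarray:
    # Factorized variant: split channels into three groups and apply 1D RoPE
    # with the temporal, horizontal, and vertical index respectively.
    d = x.shape[1] // 3 // 2 * 2   # even per-axis chunk
    out = x.copy()
    for axis in range(3):
        sl = slice(axis * d, (axis + 1) * d)
        out[:, sl] = rope_1d(x[:, sl], txy[:, axis])
    return out

tokens = np.random.randn(4, 12)
txy = np.array([[0, 0, 0], [0, 1, 0], [1, 0, 1], [2, 1, 1]])  # (t, x, y)
print(rope_3d(tokens, txy).shape)  # (4, 12)
```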
Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to excessive training costs and the scarcity of diverse, high-quality video relighting datasets. A simple application of image relighting models on a frame-by-frame basis leads to several issues: lighting source inconsistency and relighted appearance inconsistency, resulting in flickers in the generated videos. In this work, we propose Light-A-Video,...
Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical...
Cutting-edge large language models (LLMs) demonstrate promising performance in solving complex math problems with a divide-and-conquer pipeline and the assistance of in-context learning (ICL) examples. However, their potential for improvement is limited by two critical problems within their ICL examples: granularity-mismatch and the ensuing negative-effect noise problem. Specifically, the LLMs are capable of the dividing process yet mostly fail due to inaccurate reasoning within a few conquer steps, while the ICL examples retrieved...
Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing; 2) Decision: raising proactive interactions in proper situations; 3) Reaction: continuous interaction with users. However, inherent conflicts exist...
We present MVSGaussian, a new generalizable 3D Gaussian representation approach derived from Multi-View Stereo (MVS) that can efficiently reconstruct unseen scenes. Specifically, 1) we leverage MVS to encode geometry-aware Gaussian representations and decode them into Gaussian parameters. 2) To further enhance performance, we propose a hybrid Gaussian rendering that integrates an efficient volume rendering design for novel view synthesis. 3) To support fast fine-tuning for specific scenes, we introduce a multi-view geometric consistent aggregation...
This article introduces the solutions of the two champion teams, 'MMfruit' for the detection track and 'MMfruitSeg' for the segmentation track, in the OpenImage Challenge 2019. It is commonly known that for an object detector, the shared feature at the end of the backbone is not appropriate for both classification and regression, which greatly limits the performance of both the single-stage detector and the Faster R-CNN (Ren et al., 2015) based detector. In this competition, we observe that even with a shared feature, different locations within one object have completely inconsistent...
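The remedy this observation points toward is giving classification and regression their own branches. A minimal PyTorch sketch of such a decoupled head follows (a generic construction, not the competition model):

```python
# Instead of predicting class scores and box offsets from one shared feature,
# each task gets its own branch so the two objectives stop competing over
# the same representation.
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_ch: int, num_classes: int):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, num_classes, 1),
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, 4, 1),   # (dx, dy, dw, dh) per location
        )

    def forward(self, feat: torch.Tensor):
        return self.cls_branch(feat), self.reg_branch(feat)

head = DecoupledHead(in_ch=256, num_classes=80)
cls_map, box_map = head(torch.randn(1, 256, 32, 32))
print(cls_map.shape, box_map.shape)  # (1, 80, 32, 32) (1, 4, 32, 32)
```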
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or from the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any...
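The text-only probe implied here is simple to express; the harness below is hypothetical, with a stub standing in for an LLM:

```python
def visual_needed(model, sample) -> bool:
    # Ask with the image withheld; if the model still answers correctly,
    # the sample does not actually test visual understanding.
    blind = model(sample["question"], sample["options"], image=None)
    return blind != sample["answer"]

# Stub "LLM" that always picks the longest option -- a text-only shortcut.
longest = lambda q, opts, image: max(opts, key=len)
sample = {"question": "What is on the table?",
          "options": ["cat", "a bowl of fruit"], "answer": "a bowl of fruit"}
print(visual_needed(longest, sample))  # False: solvable without the image
```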
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for video understanding that capably understands arbitrary-length videos with a constant number of video tokens, streamingly encoded and adaptively selected. The challenge of video understanding in the vision-language area mainly lies in the significant computational burden caused by the great number of tokens extracted from long videos. Previous works rely on sparse sampling or frame compression to reduce tokens. However, such approaches either disregard...
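The fixed-budget streaming idea can be sketched with a learned-query condenser; all modules and sizes below are stand-ins rather than the paper's architecture:

```python
# Each incoming clip is compressed into a few summary tokens by
# cross-attending from learned queries, so the memory stays a constant size
# no matter how long the video grows.
import torch
import torch.nn as nn

class ClipCondenser(nn.Module):
    def __init__(self, dim: int, num_summary: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_summary, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, clip_tokens: torch.Tensor) -> torch.Tensor:
        # clip_tokens: (1, n_frames * patches, dim) -> (1, num_summary, dim)
        q = self.queries.unsqueeze(0)
        out, _ = self.attn(q, clip_tokens, clip_tokens)
        return out

dim, budget = 64, 16
condenser = ClipCondenser(dim, num_summary=budget)
memory = torch.zeros(1, 0, dim)
for _ in range(10):                     # ten incoming clips of 128 tokens
    clip = torch.randn(1, 128, dim)
    summary = condenser(clip)
    memory = torch.cat([memory, summary], dim=1)[:, -4 * budget:]  # cap size
print(memory.shape)  # torch.Size([1, 64, 64]) -- capped regardless of length
```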