- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- Visual Attention and Saliency Detection
- Domain Adaptation and Few-Shot Learning
- Advanced Image and Video Retrieval Techniques
- Advanced Vision and Imaging
- Human Pose and Action Recognition
- Advanced Image Fusion Techniques
- Computer Graphics and Visualization Techniques
- Machine Learning and ELM
- Image and Video Quality Assessment
- Face Recognition and Perception
- Industrial Vision Systems and Defect Detection
- Music and Audio Processing
- Handwritten Text Recognition Techniques
- Advanced Image Processing Techniques
- Speech and Audio Processing
- COVID-19 Diagnosis Using AI
- Generative Adversarial Networks and Image Synthesis
- Subtitles and Audiovisual Media
- Time Series Analysis and Forecasting
- Video Surveillance and Tracking Methods
- Semantic Web and Ontologies
- Target Tracking and Data Fusion in Sensor Networks
- Video Analysis and Summarization
Dalian University of Technology (2018-2025)
Australian Centre for Robotic Vision (2023)
The University of Adelaide (2021-2023)
Shandong University of Science and Technology (2015)
The high cost of pixel-level annotations makes it appealing to train saliency detection models with weak supervision. However, a single weak supervision source usually does not contain enough information to train a well-performing model. To this end, we propose a unified framework to train saliency detection models with diverse weak supervision sources. In this paper, we use category labels, captions, and unlabelled data for training, yet other supervision sources can also be plugged into this flexible framework. We design a classification network (CNet) and a caption generation network (PNet), which...
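Since the abstract is truncated, the following is only a minimal sketch of the multi-source weak-supervision idea: a shared encoder whose coarse saliency map is shaped by gradients from both a classification head and a captioning head. The layer sizes, the bag-of-words stand-in for the captioner, and the saliency-weighted pooling are all illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch: simplified stand-ins for CNet/PNet sharing one encoder.
import torch
import torch.nn as nn

class WeakSaliencyNet(nn.Module):
    def __init__(self, num_classes=20, vocab_size=1000):
        super().__init__()
        # Shared convolutional encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # CNet-like head: image-level category prediction.
        self.cls_head = nn.Linear(128, num_classes)
        # PNet-like head: bag-of-words caption distribution (a crude stand-in
        # for an autoregressive captioner).
        self.cap_head = nn.Linear(128, vocab_size)
        # 1x1 conv producing a coarse saliency map from shared features.
        self.sal_head = nn.Conv2d(128, 1, 1)

    def forward(self, x):
        feat = self.encoder(x)                    # (B, 128, H/4, W/4)
        sal = torch.sigmoid(self.sal_head(feat))  # coarse saliency map
        # Saliency-weighted pooling: both weak tasks look "through" the map,
        # so label/caption gradients shape the saliency estimate.
        pooled = (feat * sal).flatten(2).mean(-1) # (B, 128)
        return self.cls_head(pooled), self.cap_head(pooled), sal

logits_cls, logits_cap, sal = WeakSaliencyNet()(torch.randn(2, 3, 64, 64))
print(logits_cls.shape, logits_cap.shape, sal.shape)
```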
Existing weakly supervised semantic segmentation (WSSS) methods usually utilize the results of pre-trained saliency detection (SD) models without explicitly modelling the connections between the two tasks, which is not the most efficient configuration. Here we propose a unified multi-task learning framework to jointly solve WSSS and SD using a single network, i.e. a saliency and segmentation network (SSNet). SSNet consists of a segmentation network (SN) and a saliency aggregation module (SAM). For an input image, SN generates the segmentation result and SAM predicts the saliency of each category...
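A minimal sketch of the SN + SAM coupling described here: the segmentation network produces per-category masks, and the aggregation module predicts a per-category saliency weight and sums the weighted masks into one saliency map. The tiny trunk and the pooling-based SAM are illustrative assumptions.

```python
# Hedged sketch of jointly producing segmentation masks and a saliency map.
import torch
import torch.nn as nn

class SSNetSketch(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        # SN: a tiny fully convolutional segmentation network.
        self.trunk = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.seg_head = nn.Conv2d(32, num_classes, 1)
        # SAM: predicts a per-category saliency weight from pooled features.
        self.sam = nn.Linear(32, num_classes)

    def forward(self, x):
        f = self.trunk(x)
        seg = self.seg_head(f).softmax(dim=1)            # (B, C, H, W) masks
        w = torch.sigmoid(self.sam(f.mean(dim=(2, 3))))  # (B, C) saliency
        # Aggregate the masks of all categories into a single saliency map.
        sal = (seg * w[:, :, None, None]).sum(dim=1, keepdim=True)
        return seg, sal

seg, sal = SSNetSketch()(torch.randn(2, 3, 32, 32))
print(seg.shape, sal.shape)  # (2, 21, 32, 32) (2, 1, 32, 32)
```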
The discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS). Instance embedding learning is directly supervised by the contrastive loss computed upon contrastive items (CIs), which are sets of anchor/positive/negative embeddings. Recent online VIS methods leverage CIs sourced from one reference frame only, which we argue is insufficient for learning highly discriminative embeddings. Intuitively, a possible strategy to enhance CIs is replicating the inference phase during training. To...
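To make the CI notion concrete, here is a hedged sketch of a multi-positive InfoNCE-style contrastive loss over anchor/positive/negative embeddings, where the positives and negatives are pooled from several reference frames rather than one. The InfoNCE form, temperature, and shapes are my assumptions for illustration.

```python
# Hedged sketch: contrastive loss over enriched contrastive items (CIs).
import torch
import torch.nn.functional as F

def ci_contrastive_loss(anchor, positives, negatives, tau=0.1):
    """anchor: (D,); positives: (P, D); negatives: (N, D)."""
    anchor = F.normalize(anchor, dim=0)
    pos = F.normalize(positives, dim=1) @ anchor / tau   # (P,) similarities
    neg = F.normalize(negatives, dim=1) @ anchor / tau   # (N,) similarities
    # Multi-positive InfoNCE: each positive competes against all candidates.
    denom = torch.logsumexp(torch.cat([pos, neg]), dim=0)
    return (denom - pos).mean()

# Enriching CIs: pool positives/negatives from several reference frames.
frames_pos = [torch.randn(3, 16) for _ in range(4)]   # 4 reference frames
frames_neg = [torch.randn(8, 16) for _ in range(4)]
loss = ci_contrastive_loss(torch.randn(16),
                           torch.cat(frames_pos), torch.cat(frames_neg))
print(loss.item())
```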
The essence of audio-visual segmentation (AVS) lies in locating and delineating sound-emitting objects within a video stream. While Transformer-based methods have shown promise, their handling of long-range dependencies struggles due to quadratic computational costs, presenting a bottleneck in complex scenarios. To overcome this limitation and facilitate multi-modal comprehension with linear complexity, we introduce AVS-Mamba, a selective state space model to address the AVS task. Our framework incorporates...
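The linear-complexity claim rests on the selective state space recurrence. The toy scan below illustrates that idea only: input-dependent gates modulate a per-step linear recurrence, giving O(T) cost in sequence length. It is a gross simplification of Mamba-style S6 (no discretization, no structured state matrix), and every parameterization here is an assumption.

```python
# Toy "selective scan" with linear cost in the sequence length.
import torch
import torch.nn as nn

class SelectiveScan(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Input-dependent ("selective") gates for decay and input injection.
        self.to_a = nn.Linear(dim, dim)
        self.to_b = nn.Linear(dim, dim)

    def forward(self, x):                 # x: (B, T, D)
        a = torch.sigmoid(self.to_a(x))   # per-step state decay in (0, 1)
        b = self.to_b(x)                  # per-step input injection
        h = torch.zeros_like(x[:, 0])
        out = []
        for t in range(x.shape[1]):       # O(T): linear in sequence length
            h = a[:, t] * h + b[:, t] * x[:, t]
            out.append(h)
        return torch.stack(out, dim=1)

y = SelectiveScan(32)(torch.randn(2, 128, 32))
print(y.shape)  # (2, 128, 32)
```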
Multi-modal Large Language Models (MLLMs) exhibit impressive capabilities in 2D tasks, yet encounter challenges in discerning the spatial positions, interrelations, and causal logic of scenes when transitioning from 2D to 3D representations. We find that the limitations mainly lie in: i) the high annotation cost restricting the scale-up of volumes of 3D scene data, and ii) the lack of a straightforward and effective way to perceive 3D information, which results in prolonged training durations and complicates a streamlined framework. To this end, we...
Existing methods for Video Reasoning Segmentation rely heavily on a single special token to represent the object in the keyframe or the entire video, inadequately capturing spatial complexity and inter-frame motion. To overcome these challenges, we propose VRS-HQ, an end-to-end video reasoning segmentation approach that leverages Multimodal Large Language Models (MLLMs) to inject rich spatiotemporal features into hierarchical tokens. Our key innovations include a Temporal Dynamic Aggregation (TDA)...
Recent advances in Large Language Models (LLMs) have enabled the development of Video-LLMs, advancing multimodal learning by bridging video data with language tasks. However, current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. To address these issues, we propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. StreamChat leverages a novel hierarchical memory...
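As a rough illustration of a hierarchical streaming memory, the sketch below keeps a short-term buffer of recent frame features and, when the buffer overflows, compresses the oldest chunk into a long-term summary token. The two-tier layout, capacities, and mean-pool compression are all my assumptions; the abstract only names the hierarchical memory.

```python
# Hedged sketch of a two-tier streaming memory for long video inputs.
import torch

class HierarchicalMemory:
    def __init__(self, short_cap=16, chunk=4):
        self.short, self.long = [], []
        self.short_cap, self.chunk = short_cap, chunk

    def add(self, frame_feat):                 # frame_feat: (D,)
        self.short.append(frame_feat)
        if len(self.short) > self.short_cap:
            # Compress the oldest chunk into one long-term summary token.
            old = torch.stack(self.short[:self.chunk])
            self.long.append(old.mean(dim=0))
            self.short = self.short[self.chunk:]

    def context(self):
        # Coarse long-term summaries first, then fine-grained recent frames.
        return torch.stack(self.long + self.short)

mem = HierarchicalMemory()
for _ in range(40):                            # simulate a 40-frame stream
    mem.add(torch.randn(64))
print(mem.context().shape)
```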
Fully convolutional networks (FCNs) have significantly improved the performance of many pixel-labeling tasks, such as semantic segmentation and depth estimation. However, it still remains non-trivial to thoroughly utilize multi-level feature maps and boundary information for salient object detection. In this paper, we propose a novel FCN framework to integrate multi-level features recurrently with boundary guidance information. First, a deep network is used to extract multi-level features and separately aggregate them into multiple resolutions, which...
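A minimal sketch of the "aggregate multi-level features into multiple resolutions" step: every encoder level is projected to a common width, resized to each reference resolution, and summed into one fused prediction per resolution. The backbone, fusion rule, and absence of the recurrent/boundary components are simplifying assumptions.

```python
# Hedged sketch of multi-level feature aggregation at multiple resolutions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelAggregator(nn.Module):
    def __init__(self, chs=(16, 32, 64)):
        super().__init__()
        c_in, self.stages, self.proj = 3, nn.ModuleList(), nn.ModuleList()
        for c in chs:
            self.stages.append(nn.Sequential(
                nn.Conv2d(c_in, c, 3, stride=2, padding=1), nn.ReLU()))
            self.proj.append(nn.Conv2d(c, 16, 1))  # common width for fusion
            c_in = c
        self.predict = nn.Conv2d(16, 1, 1)

    def forward(self, x):
        feats = []
        for stage in self.stages:      # extract multi-level features
            x = stage(x)
            feats.append(x)
        preds = []
        for ref in feats:              # aggregate into each resolution
            fused = sum(F.interpolate(p(f), size=ref.shape[-2:],
                                      mode='bilinear', align_corners=False)
                        for p, f in zip(self.proj, feats))
            preds.append(torch.sigmoid(self.predict(fused)))
        return preds                   # one saliency map per resolution

preds = MultiLevelAggregator()(torch.randn(1, 3, 64, 64))
print([tuple(p.shape) for p in preds])
```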
Benefiting from the rapid development of Convolutional Neural Networks (CNNs), some salient object detection methods have achieved remarkable results by utilizing multi-level convolutional features. However, saliency training datasets are limited in scale due to the high cost of pixel-level labeling, which leads to weak generalization of the trained model on new scenarios during testing. Besides, FCN-based methods directly integrate multi-level features, ignoring the fact that noise in the features is harmful to saliency detection. In this paper, we...
Few-shot Semantic Segmentation (FSS) is a challenging problem in computer vision. It aims at segmenting objects of unseen categories given only one or several annotated samples. The essence of FSS is to disseminate information from the support images to the query images for their mutual object categories. In this paper, we propose a Dynamic Reasoning Network (DRNet) to adaptively generate the parameters of the predicting layers and infer the segmentation mask for each category. More specifically, an Attentional Feature Integration...
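To ground "adaptively generate the parameters of the predicting layers", here is a hedged sketch: a category prototype is pooled from the masked support features and used directly as dynamically generated 1x1 convolution weights over the query features. The masked-average-pooling weight generator is my assumption, not DRNet's exact mechanism.

```python
# Hedged sketch: support-conditioned dynamic weights for query prediction.
import torch
import torch.nn.functional as F

def dynamic_segment(query_feat, support_feat, support_mask):
    """query_feat, support_feat: (B, C, H, W);
    support_mask: (B, 1, H, W) binary mask of the target category."""
    # Masked average pooling: a category prototype from the support image.
    proto = (support_feat * support_mask).sum(dim=(2, 3)) / \
            support_mask.sum(dim=(2, 3)).clamp(min=1e-6)       # (B, C)
    # Use the prototype as dynamically generated 1x1 conv weights.
    weight = proto[:, :, None, None]                           # (B, C, 1, 1)
    logits = torch.stack([
        F.conv2d(q[None], w[None]) for q, w in zip(query_feat, weight)
    ]).squeeze(1)                                              # (B, 1, H, W)
    return torch.sigmoid(logits)

mask = dynamic_segment(torch.randn(2, 32, 16, 16),
                       torch.randn(2, 32, 16, 16),
                       torch.randint(0, 2, (2, 1, 16, 16)).float())
print(mask.shape)  # (2, 1, 16, 16)
```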
This paper aims to design monocular depth estimation models with better generalization abilities. To this end, we have conducted a quantitative analysis and discovered two important insights. First, the Simulation Correlation phenomenon, commonly seen in long-tailed classification problems, also exists in depth estimation, indicating that the imbalanced distribution of training data may be the cause of limited generalization ability. Second, the long-tail distribution of depth values extends beyond the dataset scale, and manifests within each individual image,...
Continual learning can empower vision-language models to continuously acquire new knowledge, without the need for access to the entire historical dataset. However, mitigating performance degradation in large-scale models is non-trivial due to (i) parameter shifts throughout lifelong learning and (ii) significant computational burdens associated with full-model tuning. In this work, we present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models. Our approach involves dynamic...
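A minimal sketch of parameter-efficient continual adaptation: the backbone stays frozen and each new task adds a small low-rank adapter. The LoRA-style form is my assumption based only on "parameter-efficient"; the paper's dynamic mechanism is truncated out of the abstract.

```python
# Hedged sketch: frozen backbone + per-task low-rank adapters.
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    def __init__(self, dim, rank=4):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an identity-preserving delta

    def forward(self, x):
        return x + self.up(self.down(x))

frozen = nn.Linear(64, 64)              # stand-in for the pre-trained model
for p in frozen.parameters():
    p.requires_grad_(False)

adapters = nn.ModuleList()              # one adapter appended per task
for task in range(3):
    adapters.append(LowRankAdapter(64)) # only the new adapter is trained
    y = adapters[task](frozen(torch.randn(2, 64)))
    print(task, y.shape)
```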
In this paper, we address the challenges in unsupervised video object segmentation (UVOS) by proposing an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. Unlike previous methods that focus solely on integrating appearance with motion or on modeling temporal relations, our method combines both aspects by integrating them within a unified framework. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within the encoders, promoting a more complementary representation. To...
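The sketch below illustrates the "merge appearance and motion features inside the encoder" idea with a two-stream encoder that injects flow features into the appearance stream at every level. Treating optical flow as a 3-channel rendered image and fusing by concatenation plus a 1x1 conv are illustrative assumptions.

```python
# Hedged sketch of in-encoder appearance/motion feature merging.
import torch
import torch.nn as nn

class TwoStreamEncoder(nn.Module):
    def __init__(self, chs=(16, 32, 64)):
        super().__init__()
        def stage(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU())
        c = 3  # RGB frame and flow rendered as a 3-channel image
        self.app, self.mot, self.fuse = (nn.ModuleList() for _ in range(3))
        for c_out in chs:
            self.app.append(stage(c, c_out))
            self.mot.append(stage(c, c_out))
            self.fuse.append(nn.Conv2d(2 * c_out, c_out, 1))
            c = c_out

    def forward(self, frame, flow):
        feats, a, m = [], frame, flow
        for app, mot, fuse in zip(self.app, self.mot, self.fuse):
            a, m = app(a), mot(m)
            a = fuse(torch.cat([a, m], dim=1))  # inject motion into appearance
            feats.append(a)
        return feats

feats = TwoStreamEncoder()(torch.randn(1, 3, 64, 64),
                           torch.randn(1, 3, 64, 64))
print([tuple(f.shape) for f in feats])
```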
Parameter-efficient transfer learning (PETL) has emerged as a flourishing research field for adapting large pre-trained models to downstream tasks, greatly reducing trainable parameters while grappling with memory challenges during fine-tuning. To address them, memory-efficient series (METL) methods avoid backpropagating gradients through the large backbone. However, they compromise by exclusively relying on frozen intermediate outputs and limiting the exhaustive exploration of prior knowledge from pre-trained models....
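The memory saving in METL-style methods comes from never backpropagating through the backbone. A minimal sketch, with illustrative layer sizes: the frozen backbone runs under torch.no_grad(), so no backbone activations are kept for the backward pass, and only a small side network that consumes the frozen intermediate outputs is trained.

```python
# Hedged sketch: gradients flow only through a side network, not the backbone.
import torch
import torch.nn as nn

backbone = nn.ModuleList([nn.Linear(32, 32) for _ in range(4)])
for p in backbone.parameters():
    p.requires_grad_(False)

side = nn.ModuleList([nn.Linear(32, 32) for _ in range(4)])  # trainable
head = nn.Linear(32, 10)

def forward(x):
    s = torch.zeros_like(x)
    for blk, adapter in zip(backbone, side):
        with torch.no_grad():           # no gradient through the backbone
            x = torch.relu(blk(x))
        s = torch.relu(adapter(x + s))  # side path reuses frozen intermediates
    return head(s)

logits = forward(torch.randn(8, 32))
logits.sum().backward()                 # grads land only in `side` and `head`
print(backbone[0].weight.grad is None)  # True
```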
To achieve content-consistent results in text-conditioned image editing, existing methods typically employ a reconstruction branch to capture the source details via diffusion inversion and a generation branch to synthesize the target based on the given textual prompt and the masked details. However, accurately segmenting the editing region is challenging with the current fixed-threshold mask strategy. Additionally, inadequacies in the inversion process can lead to insufficient retention of source details. In this paper, we propose a method called SAMControl (Soft Attention Mask...
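To contrast with the fixed-threshold strategy criticized above, here is a hedged sketch of soft-mask blending: an attention map is normalized into continuous weights and used directly to mix reconstruction and generation latents, instead of being binarized. The temperature-sigmoid normalization is my assumption for illustration, not the paper's formulation.

```python
# Hedged sketch: soft attention weights replace a fixed 0/1 edit mask.
import torch

def soft_blend(attn, recon_latent, gen_latent, temperature=0.1):
    """attn: (H, W) cross-attention for the edited concept;
    recon_latent, gen_latent: (C, H, W) branch latents."""
    # Soft normalization in place of a fixed threshold.
    m = torch.sigmoid((attn - attn.mean()) / temperature)  # (H, W) in (0, 1)
    return m * gen_latent + (1 - m) * recon_latent

out = soft_blend(torch.rand(16, 16), torch.randn(4, 16, 16),
                 torch.randn(4, 16, 16))
print(out.shape)  # (4, 16, 16)
```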
Multimodal Large Language Models (MLLMs) have gained significant attention due to their impressive capabilities in multimodal understanding. However, existing methods rely heavily on extensive modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. In this paper, we propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities that enables MLLMs to continually EVolve on modalities for X-modal...
Subject-driven image inpainting has emerged as a popular task in image editing alongside recent advancements in diffusion models. Previous methods primarily focus on identity preservation but struggle to maintain the editability of inserted objects. In response, this paper introduces DreamMix, a diffusion-based generative model adept at inserting target objects into given scenes at user-specified locations while concurrently enabling arbitrary text-driven modifications to their attributes. In particular, we...