- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- Visual Attention and Saliency Detection
- Domain Adaptation and Few-Shot Learning
- Advanced Image and Video Retrieval Techniques
- Advanced Vision and Imaging
- Human Pose and Action Recognition
- Advanced Image Fusion Techniques
- Computer Graphics and Visualization Techniques
- Machine Learning and ELM
- Image and Video Quality Assessment
- Face Recognition and Perception
- Industrial Vision Systems and Defect Detection
- Music and Audio Processing
- Handwritten Text Recognition Techniques
- Advanced Image Processing Techniques
- Speech and Audio Processing
- COVID-19 Diagnosis Using AI
- Generative Adversarial Networks and Image Synthesis
- Subtitles and Audiovisual Media
- Time Series Analysis and Forecasting
- Video Surveillance and Tracking Methods
- Semantic Web and Ontologies
- Target Tracking and Data Fusion in Sensor Networks
- Video Analysis and Summarization
Dalian University of Technology (2018-2025)
Australian Centre for Robotic Vision (2023)
The University of Adelaide (2021-2023)
Shandong University of Science and Technology (2015)
The high cost of pixel-level annotations makes it appealing to train saliency detection models with weak supervision. However, a single weak supervision source usually does not contain enough information to train a well-performing model. To this end, we propose a unified framework to train saliency detection models with diverse weak supervision sources. In this paper, we use category labels, captions, and unlabelled data for training, yet other supervision sources can also be plugged into this flexible framework. We design a classification network (CNet) and a caption generation network (PNet), which...
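Since the abstract is truncated, the following is only a minimal sketch of the multi-source weak-supervision idea: a shared encoder whose coarse saliency map is shaped by gradients from both a classification head and a captioning head. The layer sizes, the bag-of-words stand-in for the captioner, and the saliency-weighted pooling are all illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch: simplified stand-ins for CNet/PNet sharing one encoder.
import torch
import torch.nn as nn

class WeakSaliencyNet(nn.Module):
    def __init__(self, num_classes=20, vocab_size=1000):
        super().__init__()
        # Shared convolutional encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # CNet-like head: image-level category prediction.
        self.cls_head = nn.Linear(128, num_classes)
        # PNet-like head: bag-of-words caption distribution (a crude stand-in
        # for an autoregressive captioner).
        self.cap_head = nn.Linear(128, vocab_size)
        # 1x1 conv producing a coarse saliency map from shared features.
        self.sal_head = nn.Conv2d(128, 1, 1)

    def forward(self, x):
        feat = self.encoder(x)                    # (B, 128, H/4, W/4)
        sal = torch.sigmoid(self.sal_head(feat))  # coarse saliency map
        # Saliency-weighted pooling: both weak tasks look "through" the map,
        # so label/caption gradients shape the saliency estimate.
        pooled = (feat * sal).flatten(2).mean(-1) # (B, 128)
        return self.cls_head(pooled), self.cap_head(pooled), sal

logits_cls, logits_cap, sal = WeakSaliencyNet()(torch.randn(2, 3, 64, 64))
print(logits_cls.shape, logits_cap.shape, sal.shape)
```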
Existing weakly supervised semantic segmentation (WSSS) methods usually utilize the results of pre-trained saliency detection (SD) models without explicitly modelling the connections between the two tasks, which is not the most efficient configuration. Here we propose a unified multi-task learning framework to jointly solve WSSS and SD using a single network, i.e. a saliency and segmentation network (SSNet). SSNet consists of a segmentation network (SN) and a saliency aggregation module (SAM). For an input image, SN generates the segmentation result and SAM predicts the saliency of each category...
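A minimal sketch of the SN + SAM coupling described here: the segmentation network produces per-category masks, and the aggregation module predicts a per-category saliency weight and sums the weighted masks into one saliency map. The tiny trunk and the pooling-based SAM are illustrative assumptions.

```python
# Hedged sketch of jointly producing segmentation masks and a saliency map.
import torch
import torch.nn as nn

class SSNetSketch(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        # SN: a tiny fully convolutional segmentation network.
        self.trunk = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.seg_head = nn.Conv2d(32, num_classes, 1)
        # SAM: predicts a per-category saliency weight from pooled features.
        self.sam = nn.Linear(32, num_classes)

    def forward(self, x):
        f = self.trunk(x)
        seg = self.seg_head(f).softmax(dim=1)            # (B, C, H, W) masks
        w = torch.sigmoid(self.sam(f.mean(dim=(2, 3))))  # (B, C) saliency
        # Aggregate the masks of all categories into a single saliency map.
        sal = (seg * w[:, :, None, None]).sum(dim=1, keepdim=True)
        return seg, sal

seg, sal = SSNetSketch()(torch.randn(2, 3, 32, 32))
print(seg.shape, sal.shape)  # (2, 21, 32, 32) (2, 1, 32, 32)
```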
The discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS). Instance embedding learning is directly supervised by the contrastive loss computed upon contrastive items (CIs), which are sets of anchor/positive/negative embeddings. Recent online VIS methods leverage CIs sourced from one reference frame only, which we argue is insufficient for learning highly discriminative embeddings. Intuitively, a possible strategy to enhance CIs is replicating the inference phase during training. To...
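To make the CI notion concrete, here is a hedged sketch of a multi-positive InfoNCE-style contrastive loss over anchor/positive/negative embeddings, where the positives and negatives are pooled from several reference frames rather than one. The InfoNCE form, temperature, and shapes are my assumptions for illustration.

```python
# Hedged sketch: contrastive loss over enriched contrastive items (CIs).
import torch
import torch.nn.functional as F

def ci_contrastive_loss(anchor, positives, negatives, tau=0.1):
    """anchor: (D,); positives: (P, D); negatives: (N, D)."""
    anchor = F.normalize(anchor, dim=0)
    pos = F.normalize(positives, dim=1) @ anchor / tau   # (P,) similarities
    neg = F.normalize(negatives, dim=1) @ anchor / tau   # (N,) similarities
    # Multi-positive InfoNCE: each positive competes against all candidates.
    denom = torch.logsumexp(torch.cat([pos, neg]), dim=0)
    return (denom - pos).mean()

# Enriching CIs: pool positives/negatives from several reference frames.
frames_pos = [torch.randn(3, 16) for _ in range(4)]   # 4 reference frames
frames_neg = [torch.randn(8, 16) for _ in range(4)]
loss = ci_contrastive_loss(torch.randn(16),
                           torch.cat(frames_pos), torch.cat(frames_neg))
print(loss.item())
```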
The essence of audio-visual segmentation (AVS) lies in locating and delineating sound-emitting objects within a video stream. While Transformer-based methods have shown promise, their handling of long-range dependencies struggles due to quadratic computational costs, presenting a bottleneck in complex scenarios. To overcome this limitation and facilitate multi-modal comprehension with linear complexity, we introduce AVS-Mamba, a selective state space model to address the AVS task. Our framework incorporates...
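The linear-complexity claim rests on the selective state space recurrence. The toy scan below illustrates that idea only: input-dependent gates modulate a per-step linear recurrence, giving O(T) cost in sequence length. It is a gross simplification of Mamba-style S6 (no discretization, no structured state matrix), and every parameterization here is an assumption.

```python
# Toy "selective scan" with linear cost in the sequence length.
import torch
import torch.nn as nn

class SelectiveScan(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Input-dependent ("selective") gates for decay and input injection.
        self.to_a = nn.Linear(dim, dim)
        self.to_b = nn.Linear(dim, dim)

    def forward(self, x):                 # x: (B, T, D)
        a = torch.sigmoid(self.to_a(x))   # per-step state decay in (0, 1)
        b = self.to_b(x)                  # per-step input injection
        h = torch.zeros_like(x[:, 0])
        out = []
        for t in range(x.shape[1]):       # O(T): linear in sequence length
            h = a[:, t] * h + b[:, t] * x[:, t]
            out.append(h)
        return torch.stack(out, dim=1)

y = SelectiveScan(32)(torch.randn(2, 128, 32))
print(y.shape)  # (2, 128, 32)
```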
Multi-modal Large Language Models (MLLMs) exhibit impressive capabilities in 2D tasks, yet encounter challenges in discerning the spatial positions, interrelations, and causal logic of scenes when transitioning from 2D to 3D representations. We find that the limitations mainly lie in: i) the high annotation cost restricting the scale-up of volumes of 3D scene data, and ii) the lack of a straightforward and effective way to perceive 3D information, which results in prolonged training durations and complicates a streamlined framework. To this end, we...
Existing methods for Video Reasoning Segmentation rely heavily on a single special token to represent the object in the keyframe or the entire video, inadequately capturing spatial complexity and inter-frame motion. To overcome these challenges, we propose VRS-HQ, an end-to-end video reasoning segmentation approach that leverages Multimodal Large Language Models (MLLMs) to inject rich spatiotemporal features into hierarchical tokens. Our key innovations include a Temporal Dynamic Aggregation (TDA)...
Recent advances in Large Language Models (LLMs) have enabled the development of Video-LLMs, advancing multimodal learning by bridging video data with language tasks. However, current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. To address these issues, we propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. StreamChat leverages a novel hierarchical memory...
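As a rough illustration of a hierarchical streaming memory, the sketch below keeps a short-term buffer of recent frame features and, when the buffer overflows, compresses the oldest chunk into a long-term summary token. The two-tier layout, capacities, and mean-pool compression are all my assumptions; the abstract only names the hierarchical memory.

```python
# Hedged sketch of a two-tier streaming memory for long video inputs.
import torch

class HierarchicalMemory:
    def __init__(self, short_cap=16, chunk=4):
        self.short, self.long = [], []
        self.short_cap, self.chunk = short_cap, chunk

    def add(self, frame_feat):                 # frame_feat: (D,)
        self.short.append(frame_feat)
        if len(self.short) > self.short_cap:
            # Compress the oldest chunk into one long-term summary token.
            old = torch.stack(self.short[:self.chunk])
            self.long.append(old.mean(dim=0))
            self.short = self.short[self.chunk:]

    def context(self):
        # Coarse long-term summaries first, then fine-grained recent frames.
        return torch.stack(self.long + self.short)

mem = HierarchicalMemory()
for _ in range(40):                            # simulate a 40-frame stream
    mem.add(torch.randn(64))
print(mem.context().shape)
```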
Fully convolutional networks (FCNs) have significantly improved the performance of many pixel-labeling tasks, such as semantic segmentation and depth estimation. However, it still remains non-trivial to thoroughly utilize multi-level feature maps and boundary information for salient object detection. In this paper, we propose a novel FCN framework to integrate multi-level features recurrently with boundary guidance information. First, a deep network is used to extract multi-level features and separately aggregate them into multiple resolutions, which...
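A minimal sketch of the "aggregate multi-level features into multiple resolutions" step: every encoder level is projected to a common width, resized to each reference resolution, and summed into one fused prediction per resolution. The backbone, fusion rule, and absence of the recurrent/boundary components are simplifying assumptions.

```python
# Hedged sketch of multi-level feature aggregation at multiple resolutions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelAggregator(nn.Module):
    def __init__(self, chs=(16, 32, 64)):
        super().__init__()
        c_in, self.stages, self.proj = 3, nn.ModuleList(), nn.ModuleList()
        for c in chs:
            self.stages.append(nn.Sequential(
                nn.Conv2d(c_in, c, 3, stride=2, padding=1), nn.ReLU()))
            self.proj.append(nn.Conv2d(c, 16, 1))  # common width for fusion
            c_in = c
        self.predict = nn.Conv2d(16, 1, 1)

    def forward(self, x):
        feats = []
        for stage in self.stages:      # extract multi-level features
            x = stage(x)
            feats.append(x)
        preds = []
        for ref in feats:              # aggregate into each resolution
            fused = sum(F.interpolate(p(f), size=ref.shape[-2:],
                                      mode='bilinear', align_corners=False)
                        for p, f in zip(self.proj, feats))
            preds.append(torch.sigmoid(self.predict(fused)))
        return preds                   # one saliency map per resolution

preds = MultiLevelAggregator()(torch.randn(1, 3, 64, 64))
print([tuple(p.shape) for p in preds])
```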
Benefiting from the rapid development of Convolutional Neural Networks (CNNs), some salient object detection methods have achieved remarkable results by utilizing multi-level convolutional features. However, saliency training datasets are limited in scale due to the high cost of pixel-level labeling, which leads to weak generalization of the trained model on new scenarios during testing. Besides, FCN-based methods directly integrate multi-level features, ignoring the fact that noise in the features is harmful to saliency detection. In this paper, we...
Few-shot Semantic Segmentation (FSS) is a challenging problem in computer vision. It aims at segmenting objects of unseen categories given only one or several annotated samples. The essence of FSS is to disseminate information from the support images to the query images for their mutual object categories. In this paper, we propose a Dynamic Reasoning Network (DRNet) to adaptively generate the parameters of the predicting layers and infer the segmentation mask for each category. More specifically, an Attentional Feature Integration...
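To ground "adaptively generate the parameters of the predicting layers", here is a hedged sketch: a category prototype is pooled from the masked support features and used directly as dynamically generated 1x1 convolution weights over the query features. The masked-average-pooling weight generator is my assumption, not DRNet's exact mechanism.

```python
# Hedged sketch: support-conditioned dynamic weights for query prediction.
import torch
import torch.nn.functional as F

def dynamic_segment(query_feat, support_feat, support_mask):
    """query_feat, support_feat: (B, C, H, W);
    support_mask: (B, 1, H, W) binary mask of the target category."""
    # Masked average pooling: a category prototype from the support image.
    proto = (support_feat * support_mask).sum(dim=(2, 3)) / \
            support_mask.sum(dim=(2, 3)).clamp(min=1e-6)       # (B, C)
    # Use the prototype as dynamically generated 1x1 conv weights.
    weight = proto[:, :, None, None]                           # (B, C, 1, 1)
    logits = torch.stack([
        F.conv2d(q[None], w[None]) for q, w in zip(query_feat, weight)
    ]).squeeze(1)                                              # (B, 1, H, W)
    return torch.sigmoid(logits)

mask = dynamic_segment(torch.randn(2, 32, 16, 16),
                       torch.randn(2, 32, 16, 16),
                       torch.randint(0, 2, (2, 1, 16, 16)).float())
print(mask.shape)  # (2, 1, 16, 16)
```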
This paper aims to design monocular depth estimation models with better generalization abilities. To this end, we have conducted a quantitative analysis and discovered two important insights. First, the Simulation Correlation phenomenon, commonly seen in long-tailed classification problems, also exists in depth estimation, indicating that the imbalanced distribution of training data may be the cause of limited generalization ability. Second, the long-tail distribution of depth values extends beyond the dataset scale, and manifests within each individual image,...
Continual learning can empower vision-language models to continuously acquire new knowledge, without the need for access to the entire historical dataset. However, mitigating performance degradation in large-scale models is non-trivial due to (i) parameter shifts throughout lifelong learning and (ii) significant computational burdens associated with full-model tuning. In this work, we present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models. Our approach involves dynamic...
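A minimal sketch of parameter-efficient continual adaptation: the backbone stays frozen and each new task adds a small low-rank adapter. The LoRA-style form is my assumption based only on "parameter-efficient"; the paper's dynamic mechanism is truncated out of the abstract.

```python
# Hedged sketch: frozen backbone + per-task low-rank adapters.
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    def __init__(self, dim, rank=4):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an identity-preserving delta

    def forward(self, x):
        return x + self.up(self.down(x))

frozen = nn.Linear(64, 64)              # stand-in for the pre-trained model
for p in frozen.parameters():
    p.requires_grad_(False)

adapters = nn.ModuleList()              # one adapter appended per task
for task in range(3):
    adapters.append(LowRankAdapter(64)) # only the new adapter is trained
    y = adapters[task](frozen(torch.randn(2, 64)))
    print(task, y.shape)
```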
In this paper, we address the challenges in unsupervised video object segmentation (UVOS) by proposing an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. Unlike previous methods that focus solely on integrating appearance with motion or on modeling temporal relations, our method combines both aspects by integrating them within a unified framework. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within the encoders, promoting a more complementary representation. To...
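The sketch below illustrates the "merge appearance and motion features inside the encoder" idea with a two-stream encoder that injects flow features into the appearance stream at every level. Treating optical flow as a 3-channel rendered image and fusing by concatenation plus a 1x1 conv are illustrative assumptions.

```python
# Hedged sketch of in-encoder appearance/motion feature merging.
import torch
import torch.nn as nn

class TwoStreamEncoder(nn.Module):
    def __init__(self, chs=(16, 32, 64)):
        super().__init__()
        def stage(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU())
        c = 3  # RGB frame and flow rendered as a 3-channel image
        self.app, self.mot, self.fuse = (nn.ModuleList() for _ in range(3))
        for c_out in chs:
            self.app.append(stage(c, c_out))
            self.mot.append(stage(c, c_out))
            self.fuse.append(nn.Conv2d(2 * c_out, c_out, 1))
            c = c_out

    def forward(self, frame, flow):
        feats, a, m = [], frame, flow
        for app, mot, fuse in zip(self.app, self.mot, self.fuse):
            a, m = app(a), mot(m)
            a = fuse(torch.cat([a, m], dim=1))  # inject motion into appearance
            feats.append(a)
        return feats

feats = TwoStreamEncoder()(torch.randn(1, 3, 64, 64),
                           torch.randn(1, 3, 64, 64))
print([tuple(f.shape) for f in feats])
```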
Parameter-efficient transfer learning (PETL) has emerged as a flourishing research field for adapting large pre-trained models to downstream tasks, greatly reducing trainable parameters while grappling with memory challenges during fine-tuning. To address them, memory-efficient series (METL) methods avoid backpropagating gradients through the large backbone. However, they compromise by exclusively relying on frozen intermediate outputs and limiting the exhaustive exploration of prior knowledge from pre-trained models....
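The memory saving in METL-style methods comes from never backpropagating through the backbone. A minimal sketch, with illustrative layer sizes: the frozen backbone runs under torch.no_grad(), so no backbone activations are kept for the backward pass, and only a small side network that consumes the frozen intermediate outputs is trained.

```python
# Hedged sketch: gradients flow only through a side network, not the backbone.
import torch
import torch.nn as nn

backbone = nn.ModuleList([nn.Linear(32, 32) for _ in range(4)])
for p in backbone.parameters():
    p.requires_grad_(False)

side = nn.ModuleList([nn.Linear(32, 32) for _ in range(4)])  # trainable
head = nn.Linear(32, 10)

def forward(x):
    s = torch.zeros_like(x)
    for blk, adapter in zip(backbone, side):
        with torch.no_grad():           # no gradient through the backbone
            x = torch.relu(blk(x))
        s = torch.relu(adapter(x + s))  # side path reuses frozen intermediates
    return head(s)

logits = forward(torch.randn(8, 32))
logits.sum().backward()                 # grads land only in `side` and `head`
print(backbone[0].weight.grad is None)  # True
```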
To achieve content-consistent results in text-conditioned image editing, existing methods typically employ a reconstruction branch to capture the source details via diffusion inversion and a generation branch to synthesize the target based on the given textual prompt and the masked details. However, accurately segmenting the editing region is challenging with the current fixed-threshold mask strategy. Additionally, inadequacies in the inversion process can lead to insufficient retention of source details. In this paper, we propose a method called SAMControl (Soft Attention Mask...
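To contrast with the fixed-threshold strategy criticized above, here is a hedged sketch of soft-mask blending: an attention map is normalized into continuous weights and used directly to mix reconstruction and generation latents, instead of being binarized. The temperature-sigmoid normalization is my assumption for illustration, not the paper's formulation.

```python
# Hedged sketch: soft attention weights replace a fixed 0/1 edit mask.
import torch

def soft_blend(attn, recon_latent, gen_latent, temperature=0.1):
    """attn: (H, W) cross-attention for the edited concept;
    recon_latent, gen_latent: (C, H, W) branch latents."""
    # Soft normalization in place of a fixed threshold.
    m = torch.sigmoid((attn - attn.mean()) / temperature)  # (H, W) in (0, 1)
    return m * gen_latent + (1 - m) * recon_latent

out = soft_blend(torch.rand(16, 16), torch.randn(4, 16, 16),
                 torch.randn(4, 16, 16))
print(out.shape)  # (4, 16, 16)
```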
Multimodal Large Language Models (MLLMs) have gained significant attention due to their impressive capabilities in multimodal understanding. However, existing methods rely heavily on extensive modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. In this paper, we propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities that enables MLLMs to continually EVolve on modalities for X-modal...
Subject-driven image inpainting has emerged as a popular task in image editing alongside recent advancements in diffusion models. Previous methods primarily focus on identity preservation but struggle to maintain the editability of inserted objects. In response, this paper introduces DreamMix, a diffusion-based generative model adept at inserting target objects into given scenes at user-specified locations while concurrently enabling arbitrary text-driven modifications to their attributes. In particular, we...