- Multimodal Machine Learning Applications
- Natural Language Processing Techniques
- Advanced Image and Video Retrieval Techniques
- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Topic Modeling
- Video Analysis and Summarization
- Handwritten Text Recognition Techniques
- Advanced Vision and Imaging
- Vehicle License Plate Recognition
- Image Retrieval and Classification Techniques
- Human Motion and Animation
- Machine Learning and Data Classification
- Speech Recognition and Synthesis
- Face and Expression Recognition
- Face Recognition and Analysis
- Infrared Target Detection Methodologies
- Semantic Web and Ontologies
- Multimedia Communication and Technology
- Machine Learning in Healthcare
- Data Stream Mining Techniques
- Subtitles and Audiovisual Media
- Intelligent Tutoring Systems and Adaptive Learning
- Text and Document Classification Technologies
- Digital Humanities and Scholarship
ShangHai JiAi Genetics & IVF Institute (2024)
Shanghai Artificial Intelligence Laboratory (2024)
Nanyang Technological University (2021-2024)
Group Sense (China) (2020)
Southerners on New Ground (2020)
University of Electronic Science and Technology of China (2018-2019)
North China Electric Power University (2017)
Scene text detection, an important step of scene text reading systems, has witnessed rapid development with convolutional neural networks. Nonetheless, two main challenges still exist and hamper its deployment to real-world applications. The first problem is the trade-off between speed and accuracy. The second is how to model arbitrary-shaped text instances. Recently, some methods have been proposed to tackle arbitrary-shaped text detection, but they rarely take the speed of the entire pipeline into consideration, which may fall short in practical applications. In this paper, we...
Scene text detection methods based on deep learning have achieved remarkable results over the past years. However, due to the high diversity and complexity of natural scenes, previous state-of-the-art text detection methods may still produce a considerable amount of false positives when applied to images captured in real-world environments. To tackle this issue, mainly inspired by Mask R-CNN, we propose in this paper an effective model for scene text detection, which is based on Feature Pyramid Network (FPN) and instance segmentation. We propose a supervised...
Instance segmentation has witnessed remarkable progress on class-balanced benchmarks. However, such models fail to perform as accurately in real-world scenarios, where the category distribution of objects naturally comes with a long tail. Instances of head classes dominate a long-tailed dataset and serve as negative samples of tail categories. The overwhelming gradients of these negative samples lead to a biased learning process for the classifiers. Consequently, objects of tail categories are more likely to be misclassified as backgrounds or as head categories. To tackle this...
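To make the imbalance concrete, here is a tiny numpy simulation (not the paper's method) of how many suppressing updates each classifier receives under a hypothetical long-tailed label distribution:

```python
# Every sample of a head class acts as a negative for each tail-class
# classifier, so tail classifiers see far more suppressing updates than
# positive ones. Class frequencies below are illustrative.
import numpy as np

num_classes = 10
counts = (1000 * 0.5 ** np.arange(num_classes)).astype(int)  # long tail

pos = counts.astype(float)        # positive updates per classifier
neg = counts.sum() - pos          # every other sample is a negative

ratio = neg / np.maximum(pos, 1.0)
for c in range(num_classes):
    print(f"class {c}: {counts[c]:4d} samples, neg/pos ratio {ratio[c]:7.1f}")
# The head class sees roughly one negative per positive, while the rarest
# tail class is suppressed ~2000 times for every positive update.
```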
Recent methods for long-tailed instance segmentation still struggle on rare object classes with few training data. We propose a simple yet effective method, Feature Augmentation and Sampling Adaptation (FASA), that addresses the data scarcity issue by augmenting the feature space, especially for rare classes. Both the Feature Augmentation (FA) and feature sampling components are adaptive to the actual training status — FA is informed by the feature mean and variance of observed real samples from past iterations, and we sample the generated virtual features in a loss-adapted manner...
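The adaptive statistics lend themselves to a compact implementation. Below is a minimal PyTorch sketch in the spirit of FASA; the class name, momentum value, and shapes are illustrative, not the released code:

```python
# Track per-class running feature statistics and sample "virtual" features
# for rare classes from N(mean, var).
import torch

class FeatureAugmenter:
    def __init__(self, num_classes: int, feat_dim: int, momentum: float = 0.1):
        self.momentum = momentum
        self.mean = torch.zeros(num_classes, feat_dim)
        self.var = torch.ones(num_classes, feat_dim)

    def update(self, feats: torch.Tensor, labels: torch.Tensor) -> None:
        # Update running mean/variance from observed real features.
        m = self.momentum
        for c in labels.unique():
            f = feats[labels == c]
            self.mean[c] = (1 - m) * self.mean[c] + m * f.mean(dim=0)
            if f.shape[0] > 1:
                self.var[c] = (1 - m) * self.var[c] + m * f.var(dim=0)

    def sample(self, cls: int, n: int) -> torch.Tensor:
        # Draw virtual features for class `cls` from the tracked Gaussian.
        std = self.var[cls].clamp_min(1e-6).sqrt()
        return self.mean[cls] + std * torch.randn(n, std.shape[0])

aug = FeatureAugmenter(num_classes=5, feat_dim=16)
feats, labels = torch.randn(32, 16), torch.randint(0, 5, (32,))
aug.update(feats, labels)
virtual = aug.sample(cls=4, n=8)   # extra samples for a rare class
print(virtual.shape)               # torch.Size([8, 16])
```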
Prompt tuning, a parameter- and data-efficient transfer learning paradigm that tunes only a small number of parameters in a model's input space, has become a trend in the vision community since the emergence of large vision-language models like CLIP. We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning. A major finding is that none of the unimodal prompt tuning methods performs consistently well: text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances. To...
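Both schemes share one mechanism: freeze the backbone and learn a handful of vectors prepended to the input sequence. A minimal PyTorch sketch follows (illustrative shapes and module names, not CLIP's actual interface):

```python
# Prompt tuning: the backbone is frozen; only a few prompt tokens train.
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    def __init__(self, encoder: nn.Module, num_prompts: int, dim: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # backbone stays frozen
            p.requires_grad_(False)
        # The only trainable parameters: learnable prompt tokens.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim) embeddings -- word tokens for text
        # prompt tuning, patch tokens for visual prompt tuning.
        batch = tokens.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return self.encoder(torch.cat([prompts, tokens], dim=1))

dim = 32
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
model = PromptedEncoder(backbone, num_prompts=4, dim=dim)
out = model(torch.randn(2, 10, dim))
print(out.shape)  # torch.Size([2, 14, 32])
```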
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations, through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously...
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of...
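The PLoRA idea can be sketched in a few lines of PyTorch; the names, rank, and shapes below are illustrative, not the released implementation. The low-rank update fires only at image-token positions, leaving text tokens on the frozen pre-trained path:

```python
# Partial LoRA sketch: apply the low-rank delta B @ A only where the token
# came from the image encoder.
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # frozen pre-trained weight
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_features); image_mask: (batch, seq) bool,
        # True where the token comes from the image encoder.
        out = self.base(x)
        delta = (x @ self.lora_a.T) @ self.lora_b.T
        return out + delta * image_mask.unsqueeze(-1)

layer = PartialLoRALinear(nn.Linear(32, 32), rank=4)
x = torch.randn(2, 6, 32)
mask = torch.tensor([[1, 1, 0, 0, 0, 0]] * 2, dtype=torch.bool)  # 2 image tokens
print(layer(x, mask).shape)  # torch.Size([2, 6, 32])
```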
The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow resolution range. This paper presents InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K...
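High-resolution LVLM inputs are typically handled by padding the image and cutting it into encoder-sized tiles, so the token budget grows with resolution. A minimal sketch of such tiling follows; the 336-pixel tile size and the function are assumptions for illustration, not the paper's exact scheme:

```python
# Pad a large image up to tile multiples, then cut fixed-size tiles that a
# ViT-style encoder can consume one by one.
import torch

def tile_image(img: torch.Tensor, tile: int = 336) -> torch.Tensor:
    # img: (3, H, W) -> (num_tiles, 3, tile, tile)
    _, h, w = img.shape
    pad_h, pad_w = (-h) % tile, (-w) % tile
    img = torch.nn.functional.pad(img, (0, pad_w, 0, pad_h))
    tiles = img.unfold(1, tile, tile).unfold(2, tile, tile)  # (3, nH, nW, t, t)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, 3, tile, tile)

img_4k = torch.rand(3, 2160, 3840)
print(tile_image(img_4k).shape)  # torch.Size([84, 3, 336, 336])
```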
Temporal awareness, the ability to reason dynamically based on the timestamp at which a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and adapt their responses at the moment the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel benchmark that emphasizes...
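The protocol this targets is easy to state in code: at query time the model may only see frames up to the question's timestamp, so the same question can have different correct answers at different times. A minimal sketch with a hypothetical harness and a stub model (not OVO-Bench's code):

```python
def answer_online(model, frames, fps, question, ask_time_s):
    """Answer using only the frames visible at ask_time_s."""
    visible = frames[: int(ask_time_s * fps)]   # future frames are off-limits
    return model(visible, question)

# Stub model: "counts" how many marked events have happened so far.
frames = ["-", "event", "-", "-", "event", "-", "-", "-", "event", "-"]
counter = lambda fs, q: sum(f == "event" for f in fs)
print(answer_online(counter, frames, fps=1,
                    question="How many events so far?", ask_time_s=5))   # -> 2
print(answer_online(counter, frames, fps=1,
                    question="How many events so far?", ask_time_s=10))  # -> 3
```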
Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary ones are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that...
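Reward models of this kind are commonly trained with a pairwise preference objective. Below is a minimal PyTorch sketch of a scalar scoring head under a Bradley-Terry loss; this is the generic recipe, not necessarily IXC-2.5-Reward's exact setup:

```python
# A scalar head scores a response; the pairwise loss pushes the chosen
# response's score above the rejected one's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, hidden) final-token state from a (multi-modal)
        # backbone; returns one scalar reward per sequence.
        return self.score(last_hidden).squeeze(-1)

head = RewardHead(hidden=64)
chosen = head(torch.randn(8, 64))     # features of preferred responses
rejected = head(torch.randn(8, 64))   # features of dispreferred responses
loss = -F.logsigmoid(chosen - rejected).mean()   # Bradley-Terry pairwise loss
loss.backward()
print(float(loss))
```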
While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce the challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task,...
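For intuition, the sketch below implements plain 1D RoPE in numpy plus a naive factorized spatio-temporal variant that assigns each token a (t, x, y) index with a separate frequency group per axis; the axis split is an illustrative baseline, not the paper's final design:

```python
import numpy as np

def rope_1d(x: np.ndarray, pos: np.ndarray) -> np.ndarray:
    # x: (n, d) with even d; pos: (n,) integer positions.
    d = x.shape[1]
    freqs = 1.0 / (10000 ** (np.arange(d // 2) / (d // 2)))
    ang = pos[:, None] * freqs[None, :]                  # (n, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                   # rotate each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x: np.ndarray, txy: np.ndarray) -> np.ndarray:
    # Factorized variant: split channels into three groups and apply 1D RoPE
    # with the temporal, horizontal, and vertical index respectively.
    d = x.shape[1] // 3 // 2 * 2   # even per-axis chunk
    out = x.copy()
    for axis in range(3):
        sl = slice(axis * d, (axis + 1) * d)
        out[:, sl] = rope_1d(x[:, sl], txy[:, axis])
    return out

tokens = np.random.randn(4, 12)
txy = np.array([[0, 0, 0], [0, 1, 0], [1, 0, 1], [2, 1, 1]])  # (t, x, y)
print(rope_3d(tokens, txy).shape)  # (4, 12)
```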
Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to excessive training costs and the scarcity of diverse, high-quality video relighting datasets. A simple application of image relighting models on a frame-by-frame basis leads to several issues: lighting source inconsistency and relighted appearance inconsistency, resulting in flickers in the generated videos. In this work, we propose Light-A-Video,...
Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical...
Cutting-edge large language models (LLMs) demonstrate promising performance in solving complex math problems with a divide-and-conquer pipeline and the assistance of in-context learning (ICL) examples. However, their potential for improvement is limited by two critical problems within their ICL examples: granularity-mismatch and the ensuing negative-effect noise problem. Specifically, the LLMs are capable of the dividing process yet mostly fail due to inaccurate reasoning within a few conquer steps, while the ICL examples retrieved...
Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing; 2) Decision: raising proactive interactions in proper situations; 3) Reaction: continuous interaction with users. However, inherent conflicts exist...
We present MVSGaussian, a new generalizable 3D Gaussian representation approach derived from Multi-View Stereo (MVS) that can efficiently reconstruct unseen scenes. Specifically, 1) we leverage MVS to encode geometry-aware Gaussian representations and decode them into Gaussian parameters. 2) To further enhance performance, we propose a hybrid Gaussian rendering that integrates an efficient volume rendering design for novel view synthesis. 3) To support fast fine-tuning for specific scenes, we introduce a multi-view geometric consistent aggregation...
This article introduces the solutions of the two champion teams, 'MMfruit' for the detection track and 'MMfruitSeg' for the segmentation track, in the OpenImage Challenge 2019. It is commonly known that for an object detector, the shared feature at the end of the backbone is not appropriate for both classification and regression, which greatly limits the performance of both the single-stage detector and the Faster R-CNN (Ren et al., 2015) based detector. In this competition, we observe that even with a shared feature, different locations within one object have completely inconsistent...
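The remedy this observation points toward is giving classification and regression their own branches. A minimal PyTorch sketch of such a decoupled head follows (a generic construction, not the competition model):

```python
# Instead of predicting class scores and box offsets from one shared feature,
# each task gets its own branch so the two objectives stop competing over
# the same representation.
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_ch: int, num_classes: int):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, num_classes, 1),
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, 4, 1),   # (dx, dy, dw, dh) per location
        )

    def forward(self, feat: torch.Tensor):
        return self.cls_branch(feat), self.reg_branch(feat)

head = DecoupledHead(in_ch=256, num_classes=80)
cls_map, box_map = head(torch.randn(1, 256, 32, 32))
print(cls_map.shape, box_map.shape)  # (1, 80, 32, 32) (1, 4, 32, 32)
```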
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or from the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any...
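The text-only probe implied here is simple to express; the harness below is hypothetical, with a stub standing in for an LLM:

```python
def visual_needed(model, sample) -> bool:
    # Ask with the image withheld; if the model still answers correctly,
    # the sample does not actually test visual understanding.
    blind = model(sample["question"], sample["options"], image=None)
    return blind != sample["answer"]

# Stub "LLM" that always picks the longest option -- a text-only shortcut.
longest = lambda q, opts, image: max(opts, key=len)
sample = {"question": "What is on the table?",
          "options": ["cat", "a bowl of fruit"], "answer": "a bowl of fruit"}
print(visual_needed(longest, sample))  # False: solvable without the image
```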
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for video understanding that capably understands arbitrary-length videos with a constant number of video tokens, streamingly encoded and adaptively selected. The challenge of video understanding in the vision-language area mainly lies in the significant computational burden caused by the great number of tokens extracted from long videos. Previous works rely on sparse sampling or frame compression to reduce tokens. However, such approaches either disregard...
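The fixed-budget streaming idea can be sketched with a learned-query condenser; all modules and sizes below are stand-ins rather than the paper's architecture:

```python
# Each incoming clip is compressed into a few summary tokens by
# cross-attending from learned queries, so the memory stays a constant size
# no matter how long the video grows.
import torch
import torch.nn as nn

class ClipCondenser(nn.Module):
    def __init__(self, dim: int, num_summary: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_summary, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, clip_tokens: torch.Tensor) -> torch.Tensor:
        # clip_tokens: (1, n_frames * patches, dim) -> (1, num_summary, dim)
        q = self.queries.unsqueeze(0)
        out, _ = self.attn(q, clip_tokens, clip_tokens)
        return out

dim, budget = 64, 16
condenser = ClipCondenser(dim, num_summary=budget)
memory = torch.zeros(1, 0, dim)
for _ in range(10):                     # ten incoming clips of 128 tokens
    clip = torch.randn(1, 128, dim)
    summary = condenser(clip)
    memory = torch.cat([memory, summary], dim=1)[:, -4 * budget:]  # cap size
print(memory.shape)  # torch.Size([1, 64, 64]) -- capped regardless of length
```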