Shoubin Yu

ORCID: 0009-0006-1670-0054
Research Areas
  • Human Pose and Action Recognition
  • Multimodal Machine Learning Applications
  • Video Analysis and Summarization
  • Anomaly Detection Techniques and Applications
  • Generative Adversarial Networks and Image Synthesis
  • Advanced Image and Video Retrieval Techniques
  • Artificial Immune Systems Applications
  • Natural Language Processing Techniques
  • Visual Attention and Saliency Detection
  • Digital Media Forensic Detection
  • Advanced Steganography and Watermarking Techniques
  • Domain Adaptation and Few-Shot Learning
  • Advanced Image Processing Techniques
  • Multimedia Communication and Technology
  • COVID-19 diagnosis using AI
  • Speech and dialogue systems
  • Advanced Vision and Imaging
  • Video Surveillance and Tracking Methods
  • Image Retrieval and Classification Techniques

University of North Carolina at Chapel Hill
2023-2024

Reasoning in the real world is not divorced from situations. How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark that evaluates situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos, called Situated Reasoning in Real-World Videos (STAR Benchmark). This benchmark is built upon real-world videos associated with human actions or interactions, which are naturally dynamic,...

10.48550/arxiv.2405.09711 preprint EN arXiv (Cornell University) 2024-05-15

Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware, temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus on...

10.48550/arxiv.2305.06988 preprint EN other-oa arXiv (Cornell University) 2023-01-01
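
The language-aware frame sampling described above can be illustrated with a minimal sketch (not the paper's actual localizer): score each uniformly sampled frame against the text query with any image-text embedding model and keep only the top-scoring frames as visual input. The select_keyframes helper and the toy embeddings below are hypothetical.

import numpy as np

def select_keyframes(frame_embeddings, query_embedding, k=4):
    """Keep the k frames whose embeddings align best with the language query.

    frame_embeddings: (num_frames, dim) array of frame features.
    query_embedding:  (dim,) array for the text query.
    Returns the indices of the selected frames in temporal order.
    """
    # Cosine similarity between each frame and the query.
    frames = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = frames @ query
    # Take the top-k scoring frames, then restore temporal order.
    top = np.argsort(scores)[-k:]
    return np.sort(top)

# Toy usage: 32 uniformly sampled frames with 512-d features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(32, 512))
query = rng.normal(size=512)
print(select_keyframes(frames, query, k=4))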

10.18653/v1/2024.emnlp-main.1209 article EN Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 2024-01-01

Anomaly detection in surveillance videos is challenging and important for ensuring public security. Different from pixel-based anomaly detection methods, pose-based methods utilize highly-structured skeleton data, which decreases the computational burden and also avoids the negative impact of background noise. However, unlike pixel-based methods, which can directly exploit explicit motion features such as optical flow, pose-based methods suffer from a lack of an alternative dynamic representation. In this paper, a novel Motion Embedder (ME) is proposed to provide a pose...

10.1109/tcsvt.2023.3296118 article EN IEEE Transactions on Circuits and Systems for Video Technology 2023-07-17
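
A rough sketch of the pose-motion idea, assuming skeletons arrive as (frames, joints, 2) keypoint arrays: frame-to-frame joint displacements stand in for pixel-level motion cues such as optical flow, and deviation from normal-data statistics gives an anomaly score. The helper names below are hypothetical, not the paper's Motion Embedder.

import numpy as np

def pose_motion_features(keypoints):
    """Compute a simple dynamic representation from a skeleton sequence.

    keypoints: (T, J, 2) array of J 2-D joints over T frames.
    Returns (T-1, J, 2) frame-to-frame joint displacements, which replace
    pixel-level motion cues such as optical flow.
    """
    return np.diff(keypoints, axis=0)

def anomaly_score(displacements, mu, sigma):
    """Score motion by its deviation from statistics of normal training data."""
    z = (displacements - mu) / (sigma + 1e-8)
    return float(np.mean(np.abs(z)))

# Toy usage: 16 frames, 17 joints (COCO-style skeleton).
rng = np.random.default_rng(0)
poses = rng.normal(size=(16, 17, 2))
disp = pose_motion_features(poses)
mu, sigma = disp.mean(), disp.std()
print(anomaly_score(disp, mu, sigma))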

Despite impressive advancements in multimodal compositional reasoning approaches, they are still limited in their flexibility and efficiency by processing fixed modality inputs while updating a lot of model parameters. This paper tackles these critical challenges and proposes CREMA, an efficient and modular modality-fusion framework for injecting any new modality into video reasoning. We first augment multiple informative modalities (such as optical flow, 3D point cloud, audio) from given videos without extra...

10.48550/arxiv.2402.05889 preprint EN arXiv (Cornell University) 2024-02-08
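
One way to picture the modular modality-fusion idea, as a sketch rather than the paper's architecture: each modality gets a small trainable adapter that compresses its features into a fixed number of query tokens, and the token sequences are concatenated for a frozen reasoning backbone. The ModalityAdapter module and all dimensions below are assumptions.

import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Lightweight per-modality module that maps raw features to a fixed set
    of query tokens; only these small adapters would need training."""
    def __init__(self, in_dim, token_dim=256, num_tokens=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.proj = nn.Linear(in_dim, token_dim)
        self.attn = nn.MultiheadAttention(token_dim, num_heads=4, batch_first=True)

    def forward(self, feats):  # feats: (B, N, in_dim)
        kv = self.proj(feats)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)
        return tokens  # (B, num_tokens, token_dim)

# Toy usage: fuse video, optical-flow, and audio features into one token sequence.
adapters = {
    "video": ModalityAdapter(in_dim=768),
    "flow": ModalityAdapter(in_dim=128),
    "audio": ModalityAdapter(in_dim=64),
}
feats = {
    "video": torch.randn(2, 32, 768),
    "flow": torch.randn(2, 32, 128),
    "audio": torch.randn(2, 50, 64),
}
fused = torch.cat([adapters[m](feats[m]) for m in feats], dim=1)
print(fused.shape)  # (2, 24, 256) -- ready for a frozen reasoning backbone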

Video-language understanding tasks have focused on short video clips, often struggling with long-form video understanding tasks. Recently, many long video-language understanding approaches have leveraged the reasoning capabilities of Large Language Models (LLMs) to perform long video QA, transforming videos into densely sampled frame captions and asking LLMs to respond to text queries over the captions. However, the frames used for captioning are often redundant and contain irrelevant information, making dense sampling inefficient and ignoring the fact that video QA requires...

10.48550/arxiv.2405.19209 preprint EN arXiv (Cornell University) 2024-05-29

We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling design (e.g., memory queues, state-space layers, etc.), our approach uses a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) coupled with a Large Language Model (GPT-3.5, GPT-4), leading to a simple yet surprisingly effective LVQA framework. Specifically, we decompose the short- and long-range modeling aspects of LVQA into two stages. First,...

10.48550/arxiv.2312.17235 preprint EN other-oa arXiv (Cornell University) 2023-01-01
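
A minimal sketch of the two-stage decomposition, with caption_clip and ask_llm standing in for an off-the-shelf captioner (e.g., LaViLa) and an LLM API; this is illustrative, not the released LLoVi code.

from typing import Callable, List

def llm_video_qa(
    clips: List[str],
    question: str,
    caption_clip: Callable[[str], str],
    ask_llm: Callable[[str], str],
) -> str:
    """Two-stage long-video QA: caption every short clip, then let an LLM
    reason over the concatenated captions to answer the question."""
    # Stage 1: short-range visual perception, one caption per clip.
    captions = [f"[{i}] {caption_clip(c)}" for i, c in enumerate(clips)]
    # Stage 2: long-range reasoning over language only.
    prompt = (
        "The following are captions of consecutive clips from one video:\n"
        + "\n".join(captions)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return ask_llm(prompt)

# Toy usage with stubbed-out models.
clips = ["clip_000.mp4", "clip_001.mp4"]
answer = llm_video_qa(
    clips,
    "What does the person do after opening the fridge?",
    caption_clip=lambda path: f"a person interacts with a kitchen appliance ({path})",
    ask_llm=lambda prompt: "takes out a bottle of milk",
)
print(answer)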

Recent video generative models primarily rely on carefully written text prompts for specific tasks, like inpainting or style editing. They require labor-intensive textual descriptions of input videos, hindering their flexibility to adapt personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities, such as removal, addition, and modification, through a unified pipeline. RACCooN...

10.48550/arxiv.2405.18406 preprint EN arXiv (Cornell University) 2024-05-28

Recent advances in diffusion models have significantly enhanced their ability to generate high-quality images and videos, but they have also increased the risk of producing unsafe content. Existing unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges: (1) They cannot instantly remove harmful concepts without training. (2) Their safe-generation capabilities depend on collected training data. (3) They alter model weights, risking degradation in quality for content unrelated to toxic concepts. To...

10.48550/arxiv.2410.12761 preprint EN arXiv (Cornell University) 2024-10-16
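
A hedged sketch of training-free concept filtering in embedding space, assuming the unsafe concepts are available as text embeddings: project the prompt embedding onto the subspace spanned by those concepts and remove that component, without touching any model weights. This illustrates the general idea only, not the paper's exact formulation.

import numpy as np

def filter_unsafe_direction(prompt_emb, concept_embs):
    """Project the prompt embedding onto the subspace spanned by unsafe-concept
    embeddings and subtract that component, leaving the remaining prompt
    semantics untouched.

    prompt_emb:   (d,) text embedding of the user prompt.
    concept_embs: (k, d) embeddings of concepts to suppress.
    """
    # Orthonormal basis of the unsafe-concept subspace via QR decomposition.
    q, _ = np.linalg.qr(concept_embs.T)        # (d, k)
    unsafe_component = q @ (q.T @ prompt_emb)  # projection onto that subspace
    return prompt_emb - unsafe_component

# Toy usage with random 16-d embeddings.
rng = np.random.default_rng(0)
prompt = rng.normal(size=16)
concepts = rng.normal(size=(3, 16))
print(np.round(filter_unsafe_direction(prompt, concepts), 3))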

Creating high-quality data for training robust language-instructed agents is a long-lasting challenge in embodied AI. In this paper, we introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale navigational instruction-trajectory pairs by iteratively refining the data pool through the collaboration between two models, the instruction generator and the navigator, without any human-in-the-loop annotation. Specifically, SRDF starts with using a base generator to create an initial pool, followed by applying the trained...

10.48550/arxiv.2412.08467 preprint EN arXiv (Cornell University) 2024-12-11
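
A structural sketch of the flywheel loop, with generator and navigator as placeholder objects exposing hypothetical write/fidelity/retrain methods: generate instruction-trajectory pairs, keep only those the navigator can follow faithfully, and retrain both models on the filtered pool each round.

def data_flywheel(generator, navigator, trajectories, rounds=3, threshold=0.8):
    """Self-refining loop sketch: the generator writes instructions for
    trajectories, the navigator keeps only pairs it can follow with high
    path fidelity, and both models retrain on the filtered pool each round.
    No human annotation is involved at any step.
    """
    pool = []
    for _ in range(rounds):
        # 1. Generate candidate instruction-trajectory pairs.
        candidates = [(generator.write(t), t) for t in trajectories]
        # 2. Keep pairs the navigator can reproduce faithfully.
        pool = [(ins, t) for ins, t in candidates
                if navigator.fidelity(ins, t) >= threshold]
        # 3. Retrain both models on the cleaner pool.
        navigator.retrain(pool)
        generator.retrain(pool)
    return pool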

In this paper, we introduce Motion-Grounded Video Reasoning, a new motion understanding task that requires generating visual answers (video segmentation masks) according to the input question, and hence needs implicit spatiotemporal reasoning and grounding. This task extends existing grounding work focusing on explicit action/motion grounding to a more general format by enabling implicit reasoning via questions. To facilitate the development of the new task, we collect a large-scale dataset called GROUNDMORE, which comprises 1,715 video...

10.48550/arxiv.2411.09921 preprint EN arXiv (Cornell University) 2024-11-14

Anomaly detection in surveillance videos is challenging and important for ensuring public security. Different from pixel-based anomaly detection methods, pose-based methods utilize highly-structured skeleton data, which decreases the computational burden and also avoids the negative impact of background noise. However, unlike pixel-based methods, which can directly exploit explicit motion features such as optical flow, pose-based methods suffer from a lack of an alternative dynamic representation. In this paper, a novel Motion Embedder (ME) is proposed to provide a pose...

10.48550/arxiv.2112.03649 preprint EN other-oa arXiv (Cornell University) 2021-01-01