Shoubin Yu

ORCID: 0009-0006-1670-0054
Research Areas
  • Human Pose and Action Recognition
  • Multimodal Machine Learning Applications
  • Video Analysis and Summarization
  • Anomaly Detection Techniques and Applications
  • Generative Adversarial Networks and Image Synthesis
  • Advanced Image and Video Retrieval Techniques
  • Artificial Immune Systems Applications
  • Natural Language Processing Techniques
  • Visual Attention and Saliency Detection
  • Digital Media Forensic Detection
  • Advanced Steganography and Watermarking Techniques
  • Domain Adaptation and Few-Shot Learning
  • Advanced Image Processing Techniques
  • Multimedia Communication and Technology
  • COVID-19 diagnosis using AI
  • Speech and dialogue systems
  • Advanced Vision and Imaging
  • Video Surveillance and Tracking Methods
  • Image Retrieval and Classification Techniques

University of North Carolina at Chapel Hill
2023-2024

Reasoning in the real world is not divorced from situations. How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark that evaluates situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos, called Situated Reasoning in Real-World Videos (STAR Benchmark). This benchmark is built upon real-world videos associated with human actions or interactions, which are naturally dynamic,...

10.48550/arxiv.2405.09711 preprint EN arXiv (Cornell University) 2024-05-15

Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware, temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus on...

10.48550/arxiv.2305.06988 preprint EN other-oa arXiv (Cornell University) 2023-01-01
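
The language-aware frame sampling described above can be illustrated with a minimal sketch (not the paper's actual localizer): score each uniformly sampled frame against the text query with any image-text embedding model and keep only the top-scoring frames as visual input. The select_keyframes helper and the toy embeddings below are hypothetical.

import numpy as np

def select_keyframes(frame_embeddings, query_embedding, k=4):
    """Keep the k frames whose embeddings align best with the language query.

    frame_embeddings: (num_frames, dim) array of frame features.
    query_embedding:  (dim,) array for the text query.
    Returns the indices of the selected frames in temporal order.
    """
    # Cosine similarity between each frame and the query.
    frames = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = frames @ query
    # Take the top-k scoring frames, then restore temporal order.
    top = np.argsort(scores)[-k:]
    return np.sort(top)

# Toy usage: 32 uniformly sampled frames with 512-d features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(32, 512))
query = rng.normal(size=512)
print(select_keyframes(frames, query, k=4))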

10.18653/v1/2024.emnlp-main.1209 article EN Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 2024-01-01

Anomaly detection in surveillance videos is challenging and important for ensuring public security. Different from pixel-based anomaly detection methods, pose-based methods utilize highly-structured skeleton data, which decreases the computational burden and also avoids the negative impact of background noise. However, unlike pixel-based methods, which can directly exploit explicit motion features such as optical flow, pose-based methods suffer from a lack of an alternative dynamic representation. In this paper, a novel Motion Embedder (ME) is proposed to provide a pose...

10.1109/tcsvt.2023.3296118 article EN IEEE Transactions on Circuits and Systems for Video Technology 2023-07-17
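
A rough sketch of the pose-motion idea, assuming skeletons arrive as (frames, joints, 2) keypoint arrays: frame-to-frame joint displacements stand in for pixel-level motion cues such as optical flow, and deviation from normal-data statistics gives an anomaly score. The helper names below are hypothetical, not the paper's Motion Embedder.

import numpy as np

def pose_motion_features(keypoints):
    """Compute a simple dynamic representation from a skeleton sequence.

    keypoints: (T, J, 2) array of J 2-D joints over T frames.
    Returns (T-1, J, 2) frame-to-frame joint displacements, which replace
    pixel-level motion cues such as optical flow.
    """
    return np.diff(keypoints, axis=0)

def anomaly_score(displacements, mu, sigma):
    """Score motion by its deviation from statistics of normal training data."""
    z = (displacements - mu) / (sigma + 1e-8)
    return float(np.mean(np.abs(z)))

# Toy usage: 16 frames, 17 joints (COCO-style skeleton).
rng = np.random.default_rng(0)
poses = rng.normal(size=(16, 17, 2))
disp = pose_motion_features(poses)
mu, sigma = disp.mean(), disp.std()
print(anomaly_score(disp, mu, sigma))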

Despite impressive advancements in multimodal compositional reasoning approaches, they are still limited in their flexibility and efficiency by processing fixed modality inputs while updating a lot of model parameters. This paper tackles these critical challenges and proposes CREMA, an efficient and modular modality-fusion framework for injecting any new modality into video reasoning. We first augment multiple informative modalities (such as optical flow, 3D point cloud, audio) from given videos without extra...

10.48550/arxiv.2402.05889 preprint EN arXiv (Cornell University) 2024-02-08
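
One way to picture the modular modality-fusion idea, as a sketch rather than the paper's architecture: each modality gets a small trainable adapter that compresses its features into a fixed number of query tokens, and the token sequences are concatenated for a frozen reasoning backbone. The ModalityAdapter module and all dimensions below are assumptions.

import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Lightweight per-modality module that maps raw features to a fixed set
    of query tokens; only these small adapters would need training."""
    def __init__(self, in_dim, token_dim=256, num_tokens=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.proj = nn.Linear(in_dim, token_dim)
        self.attn = nn.MultiheadAttention(token_dim, num_heads=4, batch_first=True)

    def forward(self, feats):  # feats: (B, N, in_dim)
        kv = self.proj(feats)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)
        return tokens  # (B, num_tokens, token_dim)

# Toy usage: fuse video, optical-flow, and audio features into one token sequence.
adapters = {
    "video": ModalityAdapter(in_dim=768),
    "flow": ModalityAdapter(in_dim=128),
    "audio": ModalityAdapter(in_dim=64),
}
feats = {
    "video": torch.randn(2, 32, 768),
    "flow": torch.randn(2, 32, 128),
    "audio": torch.randn(2, 50, 64),
}
fused = torch.cat([adapters[m](feats[m]) for m in feats], dim=1)
print(fused.shape)  # (2, 24, 256) -- ready for a frozen reasoning backbone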

Video-language understanding tasks have focused on short video clips, often struggling with long-form video understanding tasks. Recently, many long video-language understanding approaches have leveraged the reasoning capabilities of Large Language Models (LLMs) to perform long video QA, transforming videos into densely sampled frame captions and asking LLMs to respond to text queries over the captions. However, the frames used for captioning are often redundant and contain irrelevant information, making dense sampling inefficient and ignoring the fact that video QA requires...

10.48550/arxiv.2405.19209 preprint EN arXiv (Cornell University) 2024-05-29

We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling design (e.g., memory queues, state-space layers, etc.), our approach uses a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) coupled with a Large Language Model (GPT-3.5, GPT-4), leading to a simple yet surprisingly effective LVQA framework. Specifically, we decompose the short- and long-range modeling aspects of LVQA into two stages. First,...

10.48550/arxiv.2312.17235 preprint EN other-oa arXiv (Cornell University) 2023-01-01
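
A minimal sketch of the two-stage decomposition, with caption_clip and ask_llm standing in for an off-the-shelf captioner (e.g., LaViLa) and an LLM API; this is illustrative, not the released LLoVi code.

from typing import Callable, List

def llm_video_qa(
    clips: List[str],
    question: str,
    caption_clip: Callable[[str], str],
    ask_llm: Callable[[str], str],
) -> str:
    """Two-stage long-video QA: caption every short clip, then let an LLM
    reason over the concatenated captions to answer the question."""
    # Stage 1: short-range visual perception, one caption per clip.
    captions = [f"[{i}] {caption_clip(c)}" for i, c in enumerate(clips)]
    # Stage 2: long-range reasoning over language only.
    prompt = (
        "The following are captions of consecutive clips from one video:\n"
        + "\n".join(captions)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return ask_llm(prompt)

# Toy usage with stubbed-out models.
clips = ["clip_000.mp4", "clip_001.mp4"]
answer = llm_video_qa(
    clips,
    "What does the person do after opening the fridge?",
    caption_clip=lambda path: f"a person interacts with a kitchen appliance ({path})",
    ask_llm=lambda prompt: "takes out a bottle of milk",
)
print(answer)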

Recent video generative models primarily rely on carefully written text prompts for specific tasks, like inpainting or style editing. They require labor-intensive textual descriptions of input videos, hindering their flexibility to adapt personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities, such as removal, addition, and modification, through a unified pipeline. RACCooN...

10.48550/arxiv.2405.18406 preprint EN arXiv (Cornell University) 2024-05-28

Recent advances in diffusion models have significantly enhanced their ability to generate high-quality images and videos, but they have also increased the risk of producing unsafe content. Existing unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges: (1) They cannot instantly remove harmful concepts without training. (2) Their safe-generation capabilities depend on collected training data. (3) They alter model weights, risking degradation in quality for content unrelated to toxic concepts. To...

10.48550/arxiv.2410.12761 preprint EN arXiv (Cornell University) 2024-10-16
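
A hedged sketch of training-free concept filtering in embedding space, assuming the unsafe concepts are available as text embeddings: project the prompt embedding onto the subspace spanned by those concepts and remove that component, without touching any model weights. This illustrates the general idea only, not the paper's exact formulation.

import numpy as np

def filter_unsafe_direction(prompt_emb, concept_embs):
    """Project the prompt embedding onto the subspace spanned by unsafe-concept
    embeddings and subtract that component, leaving the remaining prompt
    semantics untouched.

    prompt_emb:   (d,) text embedding of the user prompt.
    concept_embs: (k, d) embeddings of concepts to suppress.
    """
    # Orthonormal basis of the unsafe-concept subspace via QR decomposition.
    q, _ = np.linalg.qr(concept_embs.T)        # (d, k)
    unsafe_component = q @ (q.T @ prompt_emb)  # projection onto that subspace
    return prompt_emb - unsafe_component

# Toy usage with random 16-d embeddings.
rng = np.random.default_rng(0)
prompt = rng.normal(size=16)
concepts = rng.normal(size=(3, 16))
print(np.round(filter_unsafe_direction(prompt, concepts), 3))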

Creating high-quality data for training robust language-instructed agents is a long-lasting challenge in embodied AI. In this paper, we introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale navigational instruction-trajectory pairs by iteratively refining the data pool through the collaboration between two models, the instruction generator and the navigator, without any human-in-the-loop annotation. Specifically, SRDF starts with using a base generator to create an initial pool, followed by applying the trained...

10.48550/arxiv.2412.08467 preprint EN arXiv (Cornell University) 2024-12-11
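
A structural sketch of the flywheel loop, with generator and navigator as placeholder objects exposing hypothetical write/fidelity/retrain methods: generate instruction-trajectory pairs, keep only those the navigator can follow faithfully, and retrain both models on the filtered pool each round.

def data_flywheel(generator, navigator, trajectories, rounds=3, threshold=0.8):
    """Self-refining loop sketch: the generator writes instructions for
    trajectories, the navigator keeps only pairs it can follow with high
    path fidelity, and both models retrain on the filtered pool each round.
    No human annotation is involved at any step.
    """
    pool = []
    for _ in range(rounds):
        # 1. Generate candidate instruction-trajectory pairs.
        candidates = [(generator.write(t), t) for t in trajectories]
        # 2. Keep pairs the navigator can reproduce faithfully.
        pool = [(ins, t) for ins, t in candidates
                if navigator.fidelity(ins, t) >= threshold]
        # 3. Retrain both models on the cleaner pool.
        navigator.retrain(pool)
        generator.retrain(pool)
    return pool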

In this paper, we introduce Motion-Grounded Video Reasoning, a new motion understanding task that requires generating visual answers (video segmentation masks) according to the input question, and hence needs implicit spatiotemporal reasoning and grounding. This task extends existing grounding work focusing on explicit action/motion grounding to a more general format by enabling implicit reasoning via questions. To facilitate the development of the new task, we collect a large-scale dataset called GROUNDMORE, which comprises 1,715 video...

10.48550/arxiv.2411.09921 preprint EN arXiv (Cornell University) 2024-11-14

Anomaly detection in surveillance videos is challenging and important for ensuring public security. Different from pixel-based anomaly detection methods, pose-based methods utilize highly-structured skeleton data, which decreases the computational burden and also avoids the negative impact of background noise. However, unlike pixel-based methods, which can directly exploit explicit motion features such as optical flow, pose-based methods suffer from a lack of an alternative dynamic representation. In this paper, a novel Motion Embedder (ME) is proposed to provide a pose...

10.48550/arxiv.2112.03649 preprint EN other-oa arXiv (Cornell University) 2021-01-01