- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Video Analysis and Summarization
- Anomaly Detection Techniques and Applications
- Generative Adversarial Networks and Image Synthesis
- Advanced Image and Video Retrieval Techniques
- Artificial Immune Systems Applications
- Natural Language Processing Techniques
- Visual Attention and Saliency Detection
- Digital Media Forensic Detection
- Advanced Steganography and Watermarking Techniques
- Domain Adaptation and Few-Shot Learning
- Advanced Image Processing Techniques
- Multimedia Communication and Technology
- COVID-19 diagnosis using AI
- Speech and dialogue systems
- Advanced Vision and Imaging
- Video Surveillance and Tracking Methods
- Image Retrieval and Classification Techniques
University of North Carolina at Chapel Hill
2023-2024
Reasoning in the real world is not divorced from situations. How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark that evaluates the situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos, called Situated Reasoning in Real-World Videos (STAR Benchmark). This benchmark is built upon the real-world videos associated with human actions or interactions, which are naturally dynamic,...
Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware, temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus...
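The uniform frame sampling this abstract critiques can be sketched in a few lines; the function below is a generic illustration of evenly spaced frame selection, not any particular model's sampler.

```python
import numpy as np

def uniform_frame_indices(num_frames: int, num_samples: int) -> np.ndarray:
    """Pick `num_samples` frame indices spread evenly across a video of
    `num_frames` frames, ignoring which moments the language query cares about."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

# A 300-frame video, 8 uniformly sampled frames:
print(uniform_frame_indices(300, 8).tolist())  # → [0, 43, 85, 128, 171, 214, 256, 299]
```

Because the indices depend only on the frame count, a query about a brief event can easily fall between two sampled frames, which is exactly the failure mode the paper targets.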
Anomaly detection in surveillance videos is challenging and important for ensuring public security. Different from pixel-based anomaly detection methods, pose-based methods utilize highly-structured skeleton data, which decreases the computational burden and also avoids the negative impact of background noise. However, unlike pixel-based methods, which could directly exploit explicit motion features such as optical flow, pose-based methods suffer from the lack of an alternative dynamic representation. In this paper, a novel Motion Embedder (ME) is proposed to provide a pose...
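To make the "dynamic representation" gap concrete: a minimal stand-in for the motion information that optical flow gives pixel-based methods is the frame-to-frame displacement of each skeleton joint. This is an illustrative sketch only, not the paper's Motion Embedder.

```python
import numpy as np

def joint_displacements(poses: np.ndarray) -> np.ndarray:
    """poses: (T, J, 2) array of J 2-D joint coordinates over T frames.
    Returns (T-1, J, 2) frame-to-frame joint displacements, a crude
    dynamic representation derived purely from skeleton data."""
    return np.diff(poses, axis=0)

# Toy 3-frame, 2-joint sequence: every joint moves +1 in x each frame.
poses = np.array([[[0., 0.], [1., 0.]],
                  [[1., 0.], [2., 0.]],
                  [[2., 0.], [3., 0.]]])
print(joint_displacements(poses).shape)  # → (2, 2, 2)
```

Unusually large or erratic displacements in such a representation are the kind of cue a pose-based anomaly detector can score without touching pixels.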
Despite impressive advancements in multimodal compositional reasoning approaches, they are still limited in their flexibility and efficiency by processing fixed modality inputs while updating a lot of model parameters. This paper tackles these critical challenges and proposes CREMA, an efficient and modular modality-fusion framework for injecting any new modality into video reasoning. We first augment multiple informative modalities (such as optical flow, 3D point cloud, audio) from given videos without extra...
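The modular-fusion idea can be illustrated with a toy scheme in which each modality owns its own projection into a shared space, so adding a modality only adds one projection. The function below is a hedged sketch of that general pattern, not CREMA's actual architecture.

```python
import numpy as np

def fuse_modalities(features: dict, projections: dict) -> np.ndarray:
    """Project each modality's feature vector into a shared space with its
    own (modality-specific) matrix, then average. Injecting a new modality
    only requires registering a new projection; nothing else changes."""
    shared = [features[m] @ projections[m] for m in features]
    return np.mean(shared, axis=0)

# Toy example: a 4-d "video" feature and a 6-d "flow" feature, both
# projected into a shared 2-d space by all-ones matrices.
feats = {"video": np.ones(4), "flow": np.ones(6)}
projs = {"video": np.ones((4, 2)), "flow": np.ones((6, 2))}
print(fuse_modalities(feats, projs))  # → [5. 5.]
```

The per-modality projections here play the role of the small, modality-specific modules that keep most parameters frozen in modular fusion designs.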
Video-language understanding tasks have focused on short video clips, often struggling with long-form video understanding tasks. Recently, many long video-language understanding approaches have leveraged the reasoning capabilities of Large Language Models (LLMs) to perform long video question answering (QA), transforming videos into densely sampled frame captions and asking LLMs to respond to text queries over the captions. However, the frames used for captioning are often redundant and contain irrelevant information, making dense sampling inefficient and ignoring the fact that video QA requires...
We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized modeling design (e.g., memory queues, state-space layers, etc.), our approach uses a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) coupled with a Large Language Model (GPT-3.5, GPT-4), leading to a simple yet surprisingly effective LVQA framework. Specifically, we decompose the short- and long-range modeling aspects of LVQA into two stages. First,...
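The caption-then-reason recipe shared by this and the preceding abstract can be sketched with placeholder callables standing in for the captioner and the LLM; the function and the toy models below are illustrative, not LLoVi's actual API.

```python
from typing import Callable, List

def lvqa_two_stage(clips: List[str],
                   caption: Callable[[str], str],
                   llm: Callable[[str], str],
                   question: str) -> str:
    """Stage 1: caption each short clip (short-range modeling).
    Stage 2: ask the LLM to answer over the aggregated captions
    (long-range reasoning)."""
    captions = [caption(c) for c in clips]
    context = " ".join(captions)
    prompt = f"Captions: {context}\nQuestion: {question}\nAnswer:"
    return llm(prompt)

# Deterministic toy stand-ins for a BLIP2-style captioner and a GPT-style LLM:
toy_caption = lambda clip: f"[{clip}]"
toy_llm = lambda prompt: "yes" if "door" in prompt else "no"
print(lvqa_two_stage(["man opens door", "man leaves"], toy_caption, toy_llm,
                     "Did anyone open a door?"))  # → yes
```

The split matters because the expensive visual model only ever sees short clips, while all cross-clip (long-range) reasoning happens in text.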
Recent video generative models primarily rely on carefully written text prompts for specific tasks, like inpainting or style editing. They require labor-intensive textual descriptions of input videos, hindering their flexibility to adapt personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities, such as removal, addition, and modification, through a unified pipeline. RACCooN...
Recent advances in diffusion models have significantly enhanced their ability to generate high-quality images and videos, but they have also increased the risk of producing unsafe content. Existing unlearning/editing-based methods for safe generation remove harmful concepts from the models but face several challenges: (1) They cannot instantly remove harmful concepts without training. (2) Their safe generation capabilities depend on collected training data. (3) They alter the model weights, risking degradation in quality for content unrelated to the toxic concepts. To...
Creating high-quality data for training robust language-instructed agents is a long-lasting challenge in embodied AI. In this paper, we introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality, large-scale navigational instruction-trajectory pairs by iteratively refining the data pool through the collaboration between two models, the instruction generator and the navigator, without any human-in-the-loop annotation. Specifically, SRDF starts with using a base generator to create an initial data pool for training a base navigator, followed by applying the trained...
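The generate-score-filter loop behind a data flywheel can be illustrated with a toy refinement function; `generate` and `score` below stand in for the instruction generator and the navigator-as-filter, and this is a deliberately simplified sketch, not SRDF's training recipe.

```python
def refine_pool(generate, score, pool, rounds=3, keep_ratio=0.5):
    """Toy self-refinement loop: each round, add newly generated candidate
    pairs to the pool, rank everything with the scorer, and keep only the
    top fraction, so pool quality ratchets up round over round."""
    for _ in range(rounds):
        candidates = pool + generate(len(pool))
        candidates.sort(key=score, reverse=True)
        pool = candidates[: max(1, int(len(candidates) * keep_ratio))]
    return pool

# Toy run: the "generator" proposes items 0..9, the "navigator" scores
# each item by its value; low-quality items are filtered out each round.
print(refine_pool(lambda n: list(range(10)), lambda x: x, [0], rounds=2))
# → [9, 9, 8, 8, 7, 7, 6]
```

In the real flywheel the scored objects are instruction-trajectory pairs and the scorer is a trained navigator, but the ratcheting structure of the loop is the same.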
In this paper, we introduce Motion-Grounded Video Reasoning, a new motion understanding task that requires generating visual answers (video segmentation masks) according to the input question, and hence needs implicit spatiotemporal reasoning and grounding. This task extends existing spatiotemporal grounding work, which focuses on explicit action/motion grounding, to a more general format by enabling implicit reasoning via questions. To facilitate the development of the new task, we collect a large-scale dataset called GROUNDMORE, which comprises 1,715 video...