- Music and Audio Processing
- Speech and Audio Processing
- Advanced Vision and Imaging
- Video Analysis and Summarization
- Generative Adversarial Networks and Image Synthesis
- Music Technology and Sound Studies
- Digital Media Forensic Detection
- Computer Graphics and Visualization Techniques
- Advanced Image and Video Retrieval Techniques
- Interactive and Immersive Displays
- Anomaly Detection Techniques and Applications
- Image and Signal Denoising Methods
- Video Surveillance and Tracking Methods
- Image Processing and 3D Reconstruction
- Tactile and Sensory Interactions
- Hearing Loss and Rehabilitation
- Target Tracking and Data Fusion in Sensor Networks
- Human Pose and Action Recognition
- Cell Image Analysis Techniques
- Hand Gesture Recognition Systems
- Advanced Optical Imaging Technologies
- Time Series Analysis and Forecasting
- Human Motion and Animation
- Visual Attention and Saliency Detection
- Animal Vocal Communication and Behavior
University of Michigan–Ann Arbor
2023-2024
Michigan United
2023
Massachusetts Institute of Technology
2016
Manipulated videos often contain subtle inconsistencies between their visual and audio signals. We propose a video forensics method, based on anomaly detection, that can identify these inconsistencies and that can be trained solely using real, unlabeled data. We train an autoregressive model to generate sequences of audio-visual features, using feature sets that capture the temporal synchronization between video frames and sound. At test time, we then flag videos that the model assigns low probability. Despite being trained entirely on real videos, our model obtains strong...
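
The abstract describes an autoregressive model over audio-visual features whose low-probability sequences are flagged as anomalous. Below is a minimal sketch of that recipe, assuming the synchronization features have been discretized into tokens; `SyncTokenLM`, the shapes, and the threshold are illustrative stand-ins, not the paper's implementation.

```python
# Sketch: train a causal Transformer on token sequences from real videos,
# then score test videos by the log-probability the model assigns them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyncTokenLM(nn.Module):  # hypothetical name
    def __init__(self, vocab_size=512, dim=256, n_layers=4, n_heads=4, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                       # tokens: (B, T) int64
        B, T = tokens.shape
        x = self.tok(tokens) + self.pos(torch.arange(T, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        h = self.encoder(x, mask=mask)               # causal self-attention
        return self.head(h)                          # (B, T, vocab)

def sequence_log_prob(model, tokens):
    """Average log-probability the model assigns to each token sequence."""
    logp = F.log_softmax(model(tokens[:, :-1]), dim=-1)
    target = tokens[:, 1:]
    return logp.gather(-1, target.unsqueeze(-1)).squeeze(-1).mean(dim=1)

# Training uses only real videos; at test time, flag low-probability clips.
model = SyncTokenLM()
real_tokens = torch.randint(0, 512, (8, 64))          # stand-in for quantized sync features
loss = F.cross_entropy(model(real_tokens[:, :-1]).transpose(1, 2), real_tokens[:, 1:])
loss.backward()

test_tokens = torch.randint(0, 512, (2, 64))
scores = sequence_log_prob(model, test_tokens)
flagged = scores < -6.0                               # threshold tuned on held-out real videos
```
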
How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the challenges of dealing with the large gaps that often exist between sight and sound. We design a model that works by scheduling the learning procedure of each component to associate audio-visual modalities despite their information gaps. The key idea is to enrich the audio features with visual information by learning to align them in a latent space. We translate the input audio to visual features, then use a pre-trained generator to produce an image. To further...
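
As a rough illustration of the alignment step described above, the sketch below trains an audio encoder to match the latent space of a frozen image encoder (a random network stands in for a real pre-trained one); the encoder names, shapes, and the cosine loss are assumptions for illustration, not the paper's training recipe.

```python
# Sketch: pull paired audio embeddings toward frozen visual embeddings so the
# audio latent can later be handed to a pre-trained image generator.
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))
for p in image_encoder.parameters():                  # frozen stand-in for a pre-trained encoder
    p.requires_grad_(False)

audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(128 * 64, 512),
                              nn.ReLU(), nn.Linear(512, 256))

audio = torch.rand(8, 128, 64)                        # paired audio spectrograms
images = torch.rand(8, 3, 64, 64)                     # and the corresponding video frames
z_audio = F.normalize(audio_encoder(audio), dim=-1)
z_image = F.normalize(image_encoder(images), dim=-1)
loss = 1 - F.cosine_similarity(z_audio, z_image).mean()   # align audio to the visual latent space
loss.backward()
# At generation time (hypothetical): image = pretrained_generator(z_audio)
```
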
The sound effects that designers add to videos are designed to convey a particular artistic effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges of creating a soundtrack for a video that differs from its true sound, but that nonetheless matches the actions occurring on screen, we propose the problem of conditional Foley. We present the following contributions to address this problem. First, we propose a pretext task for training our model to predict sound for an input clip using an audio-visual example sampled from another time within the same...
We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city. In particular, we train a diffusion model to generate images conditioned on both GPS and text. The learned model generates images that capture the distinctive appearance of different neighborhoods, parks, and landmarks. We also extract 3D models from 2D GPS-to-image models through score distillation sampling, using GPS conditioning to constrain the reconstruction...
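
One simple way GPS coordinates could enter a diffusion model as a conditioning signal is sketched below: Fourier features of normalized latitude/longitude are concatenated with a text embedding and fed to a toy denoiser for one noise-prediction step. The module names, the simplified noising rule, and the stand-in data are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class GPSFourierEmbedding(nn.Module):
    """Map (lat, lon) to sin/cos features at several frequencies."""
    def __init__(self, n_freqs=8):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs, dtype=torch.float32) * torch.pi)

    def forward(self, latlon):                        # (B, 2), roughly normalized to [-1, 1]
        angles = latlon.unsqueeze(-1) * self.freqs    # (B, 2, n_freqs)
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

class ToyConditionalDenoiser(nn.Module):
    def __init__(self, img_dim=3 * 32 * 32, text_dim=64, n_freqs=8):
        super().__init__()
        self.gps = GPSFourierEmbedding(n_freqs)
        cond_dim = 4 * n_freqs + text_dim + 1          # GPS feats + text emb + timestep
        self.net = nn.Sequential(nn.Linear(img_dim + cond_dim, 512), nn.SiLU(),
                                 nn.Linear(512, img_dim))

    def forward(self, noisy_img, t, latlon, text_emb):
        cond = torch.cat([self.gps(latlon), text_emb, t[:, None]], dim=1)
        return self.net(torch.cat([noisy_img.flatten(1), cond], dim=1))

# One noise-prediction training step on random stand-in data (simplified schedule).
model = ToyConditionalDenoiser()
img = torch.rand(4, 3, 32, 32)
latlon = torch.rand(4, 2) * 2 - 1
text_emb = torch.randn(4, 64)                          # stand-in for a frozen text encoder
t = torch.rand(4)
noise = torch.randn_like(img)
noisy = (1 - t)[:, None, None, None] * img + t[:, None, None, None] * noise
loss = ((model(noisy, t, latlon, text_emb) - noise.flatten(1)) ** 2).mean()
loss.backward()
```
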
An emerging line of work has sought to generate plausible imagery from touch. Existing approaches, however, tackle only narrow aspects of the visuo-tactile synthesis problem, and lag significantly behind the quality of cross-modal synthesis methods in other domains. We draw on recent advances in latent diffusion to create a model for synthesizing images from tactile signals (and vice versa) and apply it to a number of tasks. Using this model, we outperform prior work on tactile-driven stylization, i.e., manipulating an image to match a touch...
We propose a method that learns to camouflage 3D objects within scenes. Given an object's shape and a distribution of viewpoints from which it will be seen, we estimate a texture that will make it difficult to detect. Successfully solving this task requires a model that can accurately reproduce textures from the scene, while simultaneously dealing with the highly conflicting constraints imposed by each viewpoint. We address these challenges with a model based on texture fields and adversarial learning. Our model learns to camouflage a variety of object shapes from randomly sampled locations...
We learn a visual representation that captures information about the camera that recorded a given photo. To do this, we train a multimodal embedding between image patches and the EXIF metadata that cameras automatically insert into image files. Our model represents this metadata by simply converting it to text and then processing it with a transformer. The features we learn significantly outperform other self-supervised and supervised features on downstream image forensics and calibration tasks. In particular, we successfully localize spliced image regions "zero...
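
A CLIP-style reading of this setup is sketched below: a small CNN embeds image patches, a character-level transformer embeds the EXIF string rendered as text, and a symmetric InfoNCE loss pulls matching pairs together. The encoders, tokenization, and EXIF strings are illustrative assumptions, not the released model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class ExifTextEncoder(nn.Module):
    def __init__(self, dim=128, max_len=128):
        super().__init__()
        self.emb = nn.Embedding(256, dim)              # bytes of the EXIF string
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, 4, 4 * dim, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, 2)
        self.max_len = max_len
    def forward(self, exif_strings):
        ids = torch.zeros(len(exif_strings), self.max_len, dtype=torch.long)
        for i, s in enumerate(exif_strings):
            b = s.encode()[: self.max_len]
            ids[i, : len(b)] = torch.tensor(list(b))
        x = self.emb(ids) + self.pos(torch.arange(self.max_len))
        return F.normalize(self.enc(x).mean(dim=1), dim=-1)

patch_enc, exif_enc = PatchEncoder(), ExifTextEncoder()
patches = torch.rand(4, 3, 64, 64)
exif = [f"Model: NIKON D90, FocalLength: 35mm, ISO: {iso}" for iso in (100, 200, 400, 800)]
z_img, z_txt = patch_enc(patches), exif_enc(exif)
logits = z_img @ z_txt.T / 0.07                        # temperature-scaled similarities
labels = torch.arange(4)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
loss.backward()
```
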
The images and sounds that we perceive undergo subtle but geometrically consistent changes as we rotate our heads. In this paper, we use these cues to solve a problem we call Sound Localization from Motion (SLfM): jointly estimating camera rotation and localizing sound sources. We learn to solve these tasks solely through self-supervision. A visual model predicts camera rotation from a pair of images, while an audio model predicts the direction of sound sources from binaural sounds. We train these models to generate predictions that agree with one another. At test time, they can be deployed...
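
The agreement-based supervision can be sketched as follows: a visual network predicts the camera's yaw between two frames, an audio network predicts the source azimuth from binaural audio at each moment, and the loss asks the change in predicted azimuth to match the predicted rotation. The tiny networks, angle parameterization, and names below are placeholders, not the paper's models.

```python
import torch
import torch.nn as nn

class RotationNet(nn.Module):                          # image pair -> rotation angle (radians)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(2 * 3 * 32 * 32, 256),
                                 nn.ReLU(), nn.Linear(256, 1))
    def forward(self, img_a, img_b):
        return self.net(torch.cat([img_a, img_b], dim=1)).squeeze(-1)

class AzimuthNet(nn.Module):                           # binaural spectrogram -> source azimuth
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(2 * 64 * 32, 256),
                                 nn.ReLU(), nn.Linear(256, 1))
    def forward(self, spec):
        return self.net(spec).squeeze(-1)

def angle_diff(a, b):
    """Wrapped angular difference, insensitive to 2*pi jumps."""
    return torch.atan2(torch.sin(a - b), torch.cos(a - b))

rot_net, az_net = RotationNet(), AzimuthNet()
img_a, img_b = torch.rand(8, 3, 32, 32), torch.rand(8, 3, 32, 32)
spec_a, spec_b = torch.rand(8, 2, 64, 32), torch.rand(8, 2, 64, 32)

rotation = rot_net(img_a, img_b)                       # camera yaw between the two frames
az_a, az_b = az_net(spec_a), az_net(spec_b)
# If the camera yaws by `rotation`, a static source's azimuth in the camera frame
# shifts by -rotation; penalize disagreement between the two predictions.
loss = angle_diff(az_b, az_a - rotation).pow(2).mean()
loss.backward()
```
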
We propose a simple strategy for masking image patches during visual-language contrastive learning that improves the quality of the learned representations and the training speed. During each iteration of training, we randomly mask clusters of visually similar patches, as measured by their raw pixel intensities. This provides an extra learning signal, beyond the contrastive training itself, since it forces the model to predict words for masked visual structures solely from context. It also speeds up training by reducing the amount of data used in each image. We evaluate...
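
A minimal sketch of the masking idea, with implementation details assumed: split an image into patches, group them by raw pixel values with k-means, and drop whole clusters at once so that visually similar structures disappear together.

```python
import torch
from sklearn.cluster import KMeans

def mask_similar_patch_clusters(img, patch=16, n_clusters=8, n_masked=3, seed=0):
    """img: (3, H, W) tensor; returns the masked image and a boolean patch mask."""
    c, h, w = img.shape
    gh, gw = h // patch, w // patch
    # (gh*gw, 3*patch*patch): one row of raw pixels per patch
    patches = (img.unfold(1, patch, patch).unfold(2, patch, patch)
                  .permute(1, 2, 0, 3, 4).reshape(gh * gw, -1))
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed) \
        .fit_predict(patches.numpy())
    g = torch.Generator().manual_seed(seed)
    drop = torch.randperm(n_clusters, generator=g)[:n_masked].tolist()
    patch_mask = torch.tensor([l in drop for l in labels]).reshape(gh, gw)
    masked = img.clone()
    for i in range(gh):
        for j in range(gw):
            if patch_mask[i, j]:
                masked[:, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0.0
    return masked, patch_mask

img = torch.rand(3, 224, 224)
masked_img, patch_mask = mask_similar_patch_clusters(img)
print(patch_mask.float().mean())                       # fraction of patches removed
```
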
We present a scene representation, which we call a tactile-augmented radiance field (TaRF), that brings vision and touch into a shared 3D space. This representation can be used to estimate the visual and tactile signals for a given position within a scene. We capture a scene's TaRF from a collection of photos and sparsely sampled touch probes. Our approach makes use of two insights: (i) common vision-based touch sensors are built on ordinary cameras and can thus be registered to the images using methods from multi-view geometry, and (ii) visually...
We present a simple, self-supervised approach to the Tracking Any Point (TAP) problem. We train a global matching transformer to find cycle-consistent tracks through video via contrastive random walks, using the transformer's attention-based matching to define the transition matrices for a walk on a space-time graph. The ability to perform "all pairs" comparisons between points allows the model to obtain high spatial precision and a strong learning signal, while avoiding many of the complexities of recent approaches (such as coarse-to-fine...
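
The cycle-consistency objective behind the contrastive random walk can be sketched directly: attention between point features in adjacent frames defines row-stochastic transition matrices, and walking forward through the video and back again should return every point to where it started. Shapes and names below are illustrative, assuming per-frame point features are already extracted.

```python
import torch
import torch.nn.functional as F

def transition(feat_a, feat_b, tau=0.07):
    """Row-stochastic transition matrix between two frames' point features."""
    sim = F.normalize(feat_a, dim=-1) @ F.normalize(feat_b, dim=-1).transpose(-1, -2)
    return F.softmax(sim / tau, dim=-1)                # (B, N, N)

def cycle_consistency_loss(feats):
    """feats: (B, T, N, D) point features for T frames; walk T-1 steps out and back."""
    B, T, N, _ = feats.shape
    walk = torch.eye(N).expand(B, N, N)
    for t in range(T - 1):                             # forward pass through time
        walk = walk @ transition(feats[:, t], feats[:, t + 1])
    for t in reversed(range(T - 1)):                   # and back to the first frame
        walk = walk @ transition(feats[:, t + 1], feats[:, t])
    return_prob = walk.diagonal(dim1=1, dim2=2)        # chance each point returns to itself
    return -(return_prob + 1e-8).log().mean()

feats = torch.randn(2, 4, 64, 128, requires_grad=True)
loss = cycle_consistency_loss(feats)
loss.backward()
```
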
Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources, along with flexible control over the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow)....
Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions. To this end, we train a video generation model conditioned on spatio-temporally sparse or dense motion trajectories. In contrast to prior motion conditioning work, this flexible representation can encode any number of trajectories, object-specific or global scene motion, and temporally sparse motion; due...
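
To make the trajectory conditioning concrete, the sketch below shows one plausible way sparse point tracks could be rasterized into a spatio-temporal conditioning volume for a video generator (an occupancy flag plus per-frame displacement at each track's location). This is an illustration of the general idea under those assumptions, not the paper's encoding.

```python
import torch

def rasterize_tracks(tracks, T, H, W):
    """tracks: (N, T, 2) pixel coordinates (x, y). Returns (T, 3, H, W)."""
    cond = torch.zeros(T, 3, H, W)                     # channels: occupancy, dx, dy
    for t in range(T):
        delta = tracks[:, t] - tracks[:, max(t - 1, 0)]
        for n in range(tracks.shape[0]):
            x, y = tracks[n, t].round().long().tolist()
            if 0 <= x < W and 0 <= y < H:
                cond[t, 0, y, x] = 1.0
                cond[t, 1, y, x] = delta[n, 0] / W     # normalized displacement
                cond[t, 2, y, x] = delta[n, 1] / H
    return cond

tracks = torch.rand(5, 16, 2) * torch.tensor([128.0, 64.0])   # 5 tracks, 16 frames
cond = rasterize_tracks(tracks, T=16, H=64, W=128)
print(cond.shape)                                      # conditioning passed to the video model
```
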
We present Text2Room, a method for generating room-scale textured 3D meshes from a given text prompt as input. To this end, we leverage pre-trained 2D text-to-image models to synthesize a sequence of images from different poses. In order to lift these outputs into a consistent 3D scene representation, we combine monocular depth estimation with a text-conditioned inpainting model. The core idea of our approach is a tailored viewpoint selection such that the content of each image can be fused into a seamless, textured 3D mesh. More...
Humans are remarkably sensitive to the alignment of visual events with other stimuli, which makes synchronization one of the hardest tasks in video editing. A key observation of our work is that most of the alignment we do involves salient, localizable events that occur sparsely in time. By learning how to recognize these events, we can greatly reduce the space of possible synchronizations that an editor or an algorithm has to consider. Furthermore, by learning descriptors that capture additional properties of visible motion, we can build active tools that adapt their notion...
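
The way sparse events shrink the synchronization search space can be illustrated with a toy example (not the paper's method): detect peaks in an audio envelope and in a visual-motion energy signal, then let pairs of events vote for candidate time offsets, so only a handful of offsets remain to be considered.

```python
import numpy as np

def detect_events(signal, min_gap=5):
    """Indices of local peaks that stand out from the signal's typical level."""
    thresh = np.median(signal) + 2 * signal.std()
    peaks = [t for t in range(1, len(signal) - 1)
             if signal[t] > thresh and signal[t] >= signal[t - 1] and signal[t] >= signal[t + 1]]
    kept = []
    for t in peaks:                                    # enforce a minimum spacing
        if not kept or t - kept[-1] >= min_gap:
            kept.append(t)
    return np.array(kept)

def candidate_offsets(audio_events, video_events):
    """Offsets implied by pairs of events, ranked by how many pairs support them."""
    diffs = (audio_events[None, :] - video_events[:, None]).ravel()
    offsets, votes = np.unique(diffs, return_counts=True)
    order = np.argsort(-votes)
    return offsets[order], votes[order]

rng = np.random.default_rng(0)
video_events_true = np.array([40, 95, 160, 230])
audio = rng.normal(0, 0.1, 300)
video = rng.normal(0, 0.1, 300)
audio[video_events_true + 12] += 3.0                   # audio lags the video by 12 frames
video[video_events_true] += 3.0

offsets, votes = candidate_offsets(detect_events(audio), detect_events(video))
print("best offset:", offsets[0], "supported by", votes[0], "event pairs")
```
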
Human speech is often accompanied by hand and arm gestures. Given audio speech input, we generate plausible gestures to go along with the sound. Specifically, we perform cross-modal translation from "in-the-wild" monologue speech of a single speaker to their gesture motion. We train on unlabeled videos for which we only have noisy pseudo ground truth from an automatic pose detection system. Our proposed model significantly outperforms baseline methods in a quantitative comparison. To support research toward obtaining...
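
A minimal version of this cross-modal translation setup might look like the sketch below: temporal convolutions over a speech spectrogram feed a recurrent decoder that regresses 2D keypoint sequences against pseudo ground-truth poses from an off-the-shelf detector. The architecture and loss here are assumptions for illustration, not the paper's model.

```python
import torch
import torch.nn as nn

class SpeechToGesture(nn.Module):
    def __init__(self, n_mels=80, n_keypoints=49, hidden=256):
        super().__init__()
        self.audio_enc = nn.Sequential(                # temporal convs over the spectrogram
            nn.Conv1d(n_mels, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 5, padding=2), nn.ReLU())
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, n_keypoints * 2)

    def forward(self, mel):                            # mel: (B, n_mels, T)
        h = self.audio_enc(mel).transpose(1, 2)        # (B, T, hidden)
        h, _ = self.decoder(h)
        return self.to_pose(h)                         # (B, T, n_keypoints * 2)

model = SpeechToGesture()
mel = torch.rand(4, 80, 64)                            # speech spectrogram frames
pseudo_gt = torch.rand(4, 64, 49 * 2)                  # noisy poses from a pose detector
loss = (model(mel) - pseudo_gt).abs().mean()           # L1 regression to pseudo labels
loss.backward()
```
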
The sound of crashing waves, the roar of fast-moving cars -- sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other...
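
A sketch of this supervision scheme, under stated assumptions: the statistical sound summary is approximated by k-means cluster labels of simple per-clip audio statistics, and a small CNN is trained to predict its frame's sound cluster, so the audio acts purely as a free (if noisy) label for the image model.

```python
import torch
import torch.nn as nn
import numpy as np
from sklearn.cluster import KMeans

# 1) Build pseudo-labels: cluster per-clip audio statistics (stand-in features).
audio_stats = np.random.rand(256, 32)                  # e.g. band energies per clip
sound_labels = KMeans(n_clusters=16, n_init=10, random_state=0).fit_predict(audio_stats)

# 2) Train an image CNN to predict the sound cluster of its own clip.
class FrameToSoundCluster(nn.Module):
    def __init__(self, n_clusters=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Linear(64, n_clusters)
    def forward(self, x):
        return self.classifier(self.features(x))

model = FrameToSoundCluster()
frames = torch.rand(256, 3, 64, 64)                    # one frame per clip (stand-in data)
labels = torch.from_numpy(sound_labels).long()
loss = nn.functional.cross_entropy(model(frames), labels)
loss.backward()
# After training, `model.features` is the visual representation evaluated on
# object and scene recognition tasks.
```
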
Objects make distinctive sounds when they are hit or scratched. These sounds reveal aspects of an object's material properties, as well as the actions that produced them. In this paper, we propose the task of predicting what sound an object makes when struck as a way of studying physical interactions within a visual scene. We present an algorithm that synthesizes sound from silent videos of people hitting and scratching objects with a drumstick. The algorithm uses a recurrent neural network to predict sound features from the video and then produces a waveform from these features with an example-based...
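
The two-stage idea can be sketched as follows: an RNN regresses a sequence of sound features from video-frame features, and a waveform is produced by example-based synthesis, here approximated as retrieving the training clip whose features best match the prediction. Names, shapes, and the nearest-neighbor step are placeholders, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

class FramesToSoundFeatures(nn.Module):
    def __init__(self, frame_dim=512, sound_dim=42, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, sound_dim)
    def forward(self, frame_feats):                    # (B, T, frame_dim)
        h, _ = self.rnn(frame_feats)
        return self.out(h)                             # (B, T, sound_dim)

def retrieve_waveform(pred_feats, bank_feats, bank_waveforms):
    """Example-based synthesis: pick the bank clip with the closest features."""
    d = (bank_feats - pred_feats.unsqueeze(0)).pow(2).sum(dim=(1, 2))
    return bank_waveforms[d.argmin()]

model = FramesToSoundFeatures()
frame_feats = torch.rand(1, 30, 512)                   # CNN features of a silent video
target_feats = torch.rand(1, 30, 42)                   # cochleagram-style sound features
loss = (model(frame_feats) - target_feats).abs().mean()
loss.backward()

bank_feats = torch.rand(100, 30, 42)                   # features of training sound clips
bank_waveforms = torch.rand(100, 44100)                # their waveforms
with torch.no_grad():
    wav = retrieve_waveform(model(frame_feats)[0], bank_feats, bank_waveforms)
```
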
Advances in photo editing and manipulation tools have made it significantly easier to create fake imagery. Learning to detect such manipulations, however, remains a challenging problem due to the lack of sufficient amounts of manipulated training data. In this paper, we propose a learning algorithm for detecting visual image manipulations that is trained only using a large dataset of real photographs. The algorithm uses the automatically recorded EXIF metadata as a supervisory signal for training a model to determine whether an...
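
One way to read this self-consistency idea as code is sketched below: a siamese network scores whether two patches were captured with the same imaging pipeline, with labels obtained for free by checking whether the two source photos share an EXIF attribute; at test time, comparing a reference patch against every other patch yields a consistency map whose low-scoring regions suggest a splice. Networks, attributes, and the aggregation are assumed for illustration.

```python
import torch
import torch.nn as nn

class PatchNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
    def forward(self, x):
        return self.net(x)

class SameCameraHead(nn.Module):
    """Predicts P(same EXIF attribute, e.g. camera model) for a patch pair."""
    def __init__(self, dim=128):
        super().__init__()
        self.head = nn.Linear(2 * dim, 1)
    def forward(self, za, zb):
        return self.head(torch.cat([za, zb], dim=-1)).squeeze(-1)

encoder, head = PatchNet(), SameCameraHead()

# Training step: pairs of patches from real photos, labeled by EXIF agreement.
pa, pb = torch.rand(16, 3, 64, 64), torch.rand(16, 3, 64, 64)
same_exif = torch.randint(0, 2, (16,)).float()         # 1 if both photos share the attribute
logits = head(encoder(pa), encoder(pb))
loss = nn.functional.binary_cross_entropy_with_logits(logits, same_exif)
loss.backward()

# Test step: consistency of each patch in a suspect image against a reference patch.
with torch.no_grad():
    patches = torch.rand(64, 3, 64, 64)                # 8x8 grid of patches
    z = encoder(patches)
    scores = torch.sigmoid(head(z[:1].expand(64, -1), z))
    consistency_map = scores.reshape(8, 8)             # low values hint at spliced regions
```
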