- Music and Audio Processing
- Speech and Audio Processing
- Speech Recognition and Synthesis
- Video Analysis and Summarization
- Generative Adversarial Networks and Image Synthesis
- Attention Deficit Hyperactivity Disorder
- Sentiment Analysis and Opinion Mining
- Multimodal Machine Learning Applications
- Music Technology and Sound Studies
- Hate Speech and Cyberbullying Detection
- Face Recognition and Analysis
- Emotion and Mood Recognition
- Video Surveillance and Tracking Methods
- Functional Brain Connectivity Studies
- Media Influence and Health
- Image Retrieval and Classification Techniques
- Voice and Speech Disorders
- Human Pose and Action Recognition
- Autism Spectrum Disorder Research
- Crime, Deviance, and Social Control
- Face Recognition and Perception
- Evolutionary Psychology and Human Behavior
- Adversarial Robustness in Machine Learning
- Privacy-Preserving Technologies in Data
- Aesthetic Perception and Analysis
University of Southern California
2016-2024
Google (United States)
2021-2024
LAC+USC Medical Center
2020-2022
Southern California University for Professional Studies
2019-2020
New York University
2015-2019
NYU Langone Health
2017
To date, only one study has examined the test–retest reliability of resting state fMRI (R-fMRI) in children, and none in clinical developing groups. Here, we assessed short-term test–retest reliability in a sample of 46 children (ages 11–17.9 years) with attention-deficit/hyperactivity disorder (ADHD) and 57 typically developing children (TDC). Our primary measure was the intraclass correlation coefficient (ICC), quantified for a range of R-fMRI metrics. We aimed to (1) survey reliability within and across diagnostic groups, and (2) compare voxel-wise ICC between groups. We found...
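As a rough illustration of the reliability measure used in this study, the sketch below computes a single-measure consistency ICC for one metric observed in two sessions; the specific ICC variant, and everything else about the code, is an assumption rather than detail taken from the paper.

```python
# Minimal sketch of test-retest reliability via ICC(3,1), assuming two
# sessions per subject; the paper does not specify this exact variant here.
import numpy as np

def icc_3_1(session1, session2):
    """Two-way mixed-effects, consistency, single-measure ICC."""
    x = np.column_stack([session1, session2])  # subjects x sessions
    n, k = x.shape
    subj_means = x.mean(axis=1)
    sess_means = x.mean(axis=0)
    grand = x.mean()
    # Mean squares from a two-way ANOVA decomposition
    ms_subjects = k * np.sum((subj_means - grand) ** 2) / (n - 1)
    ss_total = np.sum((x - grand) ** 2)
    ss_sessions = n * np.sum((sess_means - grand) ** 2)
    ss_error = ss_total - k * np.sum((subj_means - grand) ** 2) - ss_sessions
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)

# Example: one metric measured for 5 subjects at two time points
print(icc_3_1(np.array([1.0, 2.1, 3.0, 3.9, 5.2]),
              np.array([1.1, 2.0, 3.2, 4.0, 5.0])))
```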
Media is created by humans, for humans, to tell stories. There exists a natural and imminent need for creating human-centered media analytics to illuminate the stories being told and to understand their impact on individuals and society at large. An objective understanding of media content has numerous applications for different stakeholders, from creators to decision-/policy-makers to consumers. Advances in multimodal signal processing and machine learning (ML) can enable a detailed and nuanced characterization (of who, what, how, where, and why)...
Longform media such as movies have complex narrative structures, with events spanning a rich variety of ambient visual scenes. Domain-specific challenges associated with visual scenes in movies include transitions, person coverage, and a wide array of real-life and fictional scenarios. Existing visual scene datasets in movies have limited taxonomies and don't consider the visual scene transition within movie clips. In this work, we address the problem of visual scene recognition in movies by first automatically curating a new and extensive movie-centric taxonomy of 179 scene labels derived from...
Core to understanding emotion are subjective experiences and their expression in facial behavior. Past studies have largely focused on six emotions and prototypical facial poses, reflecting limitations of scale and narrow assumptions about the variety of patterns of emotional expression. We examine 45,231 facial reactions to 2,185 evocative videos, collected largely in North America, Europe, and Japan, with participants’ self-reported experiences in English or Japanese and manual and automated annotations of facial movement. Guided by Semantic Space Theory, we uncover 21 dimensions...
Violent content in movies can influence viewers’ perception of society. For example, frequent depictions of certain demographics as perpetrators or victims of abuse can shape stereotyped attitudes. In this work, we propose to characterize aspects of violent content solely from the language used in movie scripts. This makes our method applicable to a movie in the earlier stages of content creation, even before it is produced. It is also complementary to previous works, which rely on audio or video signals available post production. Our approach is based on a broad range of features...
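To make the script-only idea concrete, here is a minimal stand-in pipeline, assuming a toy corpus and plain TF-IDF n-gram features with a logistic regression classifier; the paper's actual feature set and models are richer than this.

```python
# Minimal stand-in for script-based violence classification on a toy corpus;
# the paper's broad feature set is replaced here by simple TF-IDF n-grams.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

scripts = ["he slams the door and draws a knife",
           "they share coffee and talk about the trip",
           "a fight breaks out in the alley",
           "she reads quietly by the window"]
labels = [1, 0, 1, 0]  # 1 = violent scene description, 0 = non-violent

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(scripts, labels)
print(model.predict(["he pulls a knife in the alley"]))
```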
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive...
In this paper, we address the problem of speaker recognition in challenging acoustic conditions using a novel method to extract robust speaker-discriminative speech representations. We adopt a recently proposed unsupervised adversarial invariance architecture to train a network that maps speaker embeddings extracted using a pretrained model onto two lower-dimensional embedding spaces. The embedding spaces are learnt to disentangle speaker-discriminative information from all other information present in the audio recordings, without any supervision about the acoustic conditions....
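A minimal sketch of the two-branch split described above, assuming arbitrary input and bottleneck sizes; the full method additionally trains adversarial disentanglement losses between the two spaces, which are omitted here for brevity.

```python
# Sketch of mapping pretrained speaker embeddings onto two lower-dimensional
# spaces, one speaker-discriminative and one for everything else; the
# adversarial invariance objectives from the paper are not shown.
import torch
import torch.nn as nn

class TwoBranchEmbedding(nn.Module):
    def __init__(self, in_dim=512, h1_dim=128, h2_dim=128):
        super().__init__()
        self.speaker_head = nn.Linear(in_dim, h1_dim)   # speaker-discriminative space
        self.nuisance_head = nn.Linear(in_dim, h2_dim)  # nuisance factors (noise, channel)
        self.decoder = nn.Linear(h1_dim + h2_dim, in_dim)  # reconstruction ties both

    def forward(self, x):
        h1, h2 = self.speaker_head(x), self.nuisance_head(x)
        return h1, h2, self.decoder(torch.cat([h1, h2], dim=-1))

x = torch.randn(4, 512)                  # embeddings from a pretrained model
h1, h2, recon = TwoBranchEmbedding()(x)
loss = nn.functional.mse_loss(recon, x)  # one of several training losses
```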
Static anatomical and real-time dynamic magnetic resonance imaging (RT-MRI) of the upper airway is a valuable method for studying speech production in research and clinical settings. The test–retest repeatability of quantitative imaging biomarkers is an important parameter, since it limits the effect sizes and intragroup differences that can be studied. Therefore, this study aims to present a framework for determining test–retest repeatability of quantitative biomarkers from static MRI and RT-MRI, and to apply it to healthy volunteers. Subjects (n = 8, 4 females, 4 males) are imaged in two scans...
Speech emotion recognition (SER) models typically rely on costly human-labeled data for training, making scaling methods to large speech datasets and nuanced emotion taxonomies difficult. We present LanSER, a method that enables the use of unlabeled data by inferring weak emotion labels via pre-trained large language models through weakly-supervised learning. For inferring weak labels constrained to a taxonomy, we use a textual entailment approach that selects the emotion label with the highest entailment score for a speech transcript extracted via automatic speech recognition. Our experimental results show...
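A hedged sketch of the taxonomy-constrained entailment step, assuming an off-the-shelf NLI model served through the Hugging Face zero-shot pipeline; LanSER's exact models and prompt templates may differ.

```python
# Weak labeling of an ASR transcript via textual entailment, constrained to a
# fixed emotion taxonomy; model and hypothesis template are assumptions.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
taxonomy = ["anger", "joy", "sadness", "fear", "surprise", "neutral"]

transcript = "I can't believe you did this without even asking me!"  # from ASR
result = classifier(transcript, candidate_labels=taxonomy,
                    hypothesis_template="This person feels {}.")
weak_label = result["labels"][0]  # highest-entailment emotion in the taxonomy
print(weak_label, result["scores"][0])
```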
The prevalent audio-based Voice Activity Detection (VAD) systems are challenged by the presence of ambient noise and are sensitive to variations in the type of noise. The use of information from the visual modality, when available, can help overcome some of the problems of audio-based VAD. Existing visual-VAD systems however do not operate directly on the whole image but require intermediate face detection, landmark detection, and subsequent facial feature extraction from the lip region. In this work we present an end-to-end trainable Hierarchical Context...
Large scale databases with high-quality manual annotations are scarce in the audio domain. We thus explore a self-supervised graph approach to learning audio representations from highly limited labelled data. Considering each audio sample as a graph node, we propose a subgraph-based framework with novel self-supervision tasks that can learn effective audio representations. During training, subgraphs are constructed by sampling from the entire pool of available training data to exploit the relationship between labelled and unlabeled samples. During inference,...
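A minimal sketch of the subgraph idea, assuming simple kNN connectivity over a pooled set of labelled and unlabelled embeddings; the paper's sampling scheme and self-supervision tasks are more involved than this.

```python
# Constructing one training subgraph over pooled labelled + unlabelled audio
# embeddings; kNN connectivity and all sizes here are illustrative choices.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
labelled = rng.normal(size=(16, 128))    # e.g., pretrained audio embeddings
unlabelled = rng.normal(size=(64, 128))

pool = np.vstack([labelled, unlabelled])
idx = rng.choice(len(pool), size=32, replace=False)  # sample one subgraph
subgraph_adj = kneighbors_graph(pool[idx], n_neighbors=4, mode="connectivity")
print(subgraph_adj.shape)  # (32, 32) adjacency used as one training subgraph
```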
Automatic content analysis of animation movies can enable an objective understanding of character (actor) representations and their portrayals. It can also help illuminate potential markers of unconscious biases and their impact. However, multimedia analysis of movie content has predominantly focused on live-action features. A dearth of research in this field is because of the complexity and heterogeneity in the design of animated characters -- an extremely challenging problem to be generalized by a single method or model. In this paper, we address...
Arousal and valence have been widely used to represent emotions dimensionally and to measure them continuously in time. In this paper, we introduce a computational framework for tracking these affective dimensions from multimodal data as an entry to the Multimodal Affect Recognition Sub-Challenge of the 2016 Audio/Visual Emotion Challenge and Workshop (AVEC2016). We propose a linear dynamical system approach with a late fusion method that accounts for the dynamics of the affective state evolution (i.e., arousal or valence). To this end,...
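To illustrate the linear dynamical system view, here is a minimal one-dimensional Kalman filter over a noisy affect-rating sequence; the process and observation noise values are illustrative guesses, not the challenge entry's fitted parameters.

```python
# Tracking one affective dimension (arousal or valence) with a 1-D Kalman
# filter; noise parameters q and r are placeholder values.
import numpy as np

def kalman_track(observations, q=0.01, r=0.1):
    """1-D constant-state Kalman filter over a sequence of noisy ratings."""
    x, p = 0.0, 1.0                 # state estimate and its variance
    smoothed = []
    for z in observations:
        p += q                      # predict: state carries over, noise q
        k = p / (p + r)             # Kalman gain
        x += k * (z - x)            # update with observation z
        p *= (1 - k)
        smoothed.append(x)
    return np.array(smoothed)

noisy_valence = np.sin(np.linspace(0, 3, 50)) + 0.3 * np.random.randn(50)
print(kalman_track(noisy_valence)[:5])
```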
Speech activity detection in highly variable acoustic conditions is a challenging task. Many approaches to detecting speech in such conditions involve an inherent knowledge of the noise types involved. Movie audio can offer an excellent research test-bed for developing such models. A robust movie speech activity detection system is also a crucial step for subsequent content analyses such as speaker diarization. Obtaining labels for supervision on such data can be very expensive, and may not be scalable. In this paper, we employ a simple, yet effective approach to obtain labels by coarsely aligning...
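A minimal sketch of the weak-labelling idea, assuming subtitle time spans are coarsely treated as speech regions at a fixed frame rate; the timestamps below are made up for illustration.

```python
# Deriving weak speech-activity labels by marking audio frames that fall
# inside subtitle time spans; spans, duration, and frame rate are placeholders.
import numpy as np

subtitle_spans = [(1.2, 3.8), (5.0, 7.4), (9.1, 10.0)]  # (start_s, end_s)
frame_rate, duration_s = 100, 12.0                      # 10 ms frames

labels = np.zeros(int(duration_s * frame_rate), dtype=int)
for start, end in subtitle_spans:
    labels[int(start * frame_rate):int(end * frame_rate)] = 1  # speech = 1

print(labels.mean())  # fraction of frames weakly labelled as speech
```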
The process of human affect understanding involves the ability to infer person-specific emotional states from various sources including images, speech, and language. Affect perception from images has predominantly focused on expressions extracted from salient face crops. However, emotions perceived by humans rely on multiple contextual cues, including social settings, foreground interactions, and ambient visual scenes. In this work, we leverage pretrained vision-language (VLN) models to extract descriptions of foreground context...
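One plausible way to wire this up is sketched below with common open checkpoints (a BLIP captioner feeding an NLI-based zero-shot classifier); both models and the image path are assumptions, not the paper's actual components.

```python
# Pairing a scene description from a vision-language model with a text
# classifier for context-aware affect; "party_scene.jpg" is a placeholder.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
emotion = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

caption = captioner("party_scene.jpg")[0]["generated_text"]  # ambient scene cue
print(emotion(caption, candidate_labels=["joy", "anger", "sadness", "fear"]))
```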
Autism spectrum disorder (ASD) is exceptionally heterogeneous in both clinical and physiopathological presentations. The clinical variability applies to both ASD-specific symptoms and frequent comorbid psychopathology such as emotional lability (EL). To date, the underpinnings of the co-occurrence of EL and ASD are unknown. As a first step, we examined within-ASD inter-individual variability in EL and its neuronal correlates using resting-state functional magnetic resonance imaging (R-fMRI). We analyzed R-fMRI data from 58 children...
Core to understanding emotion are subjective experiences and their embodiment in facial behavior. Past studies have focused on six emotions and prototypical poses, reflecting limitations of scale and narrow assumptions about emotion. We examine 45,231 facial reactions to 2,185 evocative videos, collected largely in North America, Europe, and Japan, with participants’ self-reported experiences in English or Japanese and manual/automated annotations of facial movement. We uncover 21 dimensions underlying the experiences reported across languages. Facial expressions...
The ability to robustly cluster faces in movies is a necessary step in understanding media content representations of people along dimensions such as gender and age. Building upon the successes of sparse subspace clustering (SSC) in uncovering the underlying structure of data, in this paper we propose an algorithm called Constraint Propagation Sparse Subspace Clustering (CP-SSC) for applications to clustering faces in videos, where pairwise sample constraints (must-link and cannot-link pairs) are available from the processing pipeline since...
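A minimal sketch of how pairwise constraints can be injected into an affinity matrix before spectral clustering; the actual CP-SSC algorithm couples constraint propagation with the sparse self-expression step itself, which is not shown here.

```python
# Constrained clustering on synthetic "face embeddings": must-link pairs get
# maximal affinity, cannot-link pairs get zero, then spectral clustering runs
# on the modified affinity. All data and constraints are illustrative.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
faces = np.vstack([rng.normal(0, 1, (20, 64)), rng.normal(4, 1, (20, 64))])
affinity = rbf_kernel(faces)

must_link = [(0, 1), (20, 21)]  # e.g., same face track -> same identity
cannot_link = [(0, 20)]         # co-occurring faces -> different identities
for i, j in must_link:
    affinity[i, j] = affinity[j, i] = 1.0
for i, j in cannot_link:
    affinity[i, j] = affinity[j, i] = 0.0

clusters = SpectralClustering(n_clusters=2, affinity="precomputed",
                              random_state=0).fit_predict(affinity)
print(clusters)
```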
Audio event detection is a widely studied field, with applications ranging from self-driving cars to healthcare. In-the-wild datasets such as AudioSet have propelled research in this field. However, many efforts typically involve manual annotation and verification, which is expensive to perform at scale. Movies depict various real-life and fictional scenarios, which makes them a rich resource for mining a wide range of audio events. In this work, we present a dataset of audio events called Subtitle-Aligned Movie Sounds (SAM-S)....
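A hedged sketch of mining candidate sound events from captions, assuming non-speech sounds appear in brackets (a common subtitling convention); SAM-S's actual curation pipeline is more elaborate than this.

```python
# Extracting bracketed sound descriptions from timed captions; the caption
# data below is made up to illustrate the convention.
import re

captions = [
    ("00:01:02", "[DOOR SLAMS]"),
    ("00:01:10", "I told you to wait outside."),
    ("00:02:45", "[THUNDER RUMBLING]"),
]

sound_events = [(t, m.group(1).lower())
                for t, text in captions
                for m in [re.match(r"\[(.+)\]", text)] if m]
print(sound_events)  # [('00:01:02', 'door slams'), ('00:02:45', 'thunder rumbling')]
```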