Krishna Somandepalli

ORCID: 0000-0002-2845-1079
Research Areas
  • Music and Audio Processing
  • Speech and Audio Processing
  • Speech Recognition and Synthesis
  • Video Analysis and Summarization
  • Generative Adversarial Networks and Image Synthesis
  • Attention Deficit Hyperactivity Disorder
  • Sentiment Analysis and Opinion Mining
  • Multimodal Machine Learning Applications
  • Music Technology and Sound Studies
  • Hate Speech and Cyberbullying Detection
  • Face recognition and analysis
  • Emotion and Mood Recognition
  • Video Surveillance and Tracking Methods
  • Functional Brain Connectivity Studies
  • Media Influence and Health
  • Image Retrieval and Classification Techniques
  • Voice and Speech Disorders
  • Human Pose and Action Recognition
  • Autism Spectrum Disorder Research
  • Crime, Deviance, and Social Control
  • Face Recognition and Perception
  • Evolutionary Psychology and Human Behavior
  • Adversarial Robustness in Machine Learning
  • Privacy-Preserving Technologies in Data
  • Aesthetic Perception and Analysis

University of Southern California
2016-2024

Google (United States)
2021-2024

LAC+USC Medical Center
2020-2022

Southern California University for Professional Studies
2019-2020

New York University
2015-2019

NYU Langone Health
2017

To date, only one study has examined test–retest reliability of resting state fMRI (R-fMRI) in children, and none in clinically or atypically developing groups. Here, we assessed short-term test–retest reliability in a sample of 46 children (11–17.9 years) with attention-deficit/hyperactivity disorder (ADHD) and 57 typically developing children (TDC). Our primary measure was the intraclass correlation coefficient (ICC), quantified for a range of R-fMRI metrics. We aimed to (1) survey reliability within and across diagnostic groups, and (2) compare voxel-wise ICC between the groups...

10.1016/j.dcn.2015.08.003 article EN cc-by-nc-nd Developmental Cognitive Neuroscience 2015-08-11
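
The ICC is the workhorse metric here. Below is a minimal sketch of the common two-way mixed, single-measure variant, ICC(3,1), computed per metric from a subjects-by-sessions matrix; the abstract does not specify which ICC variant the study used, so that choice and the toy data are assumptions.

```python
# Minimal sketch of the intraclass correlation coefficient (ICC) used to
# quantify test-retest reliability. The variant is not specified above, so
# this implements the common two-way mixed, single-measure ICC(3,1).
import numpy as np

def icc_3_1(data: np.ndarray) -> float:
    """data: (n_subjects, n_sessions) array of one metric per voxel/ROI."""
    n, k = data.shape
    grand = data.mean()
    subj_means = data.mean(axis=1)
    sess_means = data.mean(axis=0)
    # ANOVA mean squares: between-subjects vs. residual
    ss_subj = k * np.sum((subj_means - grand) ** 2)
    ss_sess = n * np.sum((sess_means - grand) ** 2)
    ss_err = np.sum((data - grand) ** 2) - ss_subj - ss_sess
    ms_subj = ss_subj / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err)

# Toy example: 46 "subjects", 2 sessions; session 2 correlates with session 1.
rng = np.random.default_rng(0)
s1 = rng.normal(size=46)
s2 = 0.8 * s1 + 0.2 * rng.normal(size=46)
print(round(icc_3_1(np.column_stack([s1, s2])), 3))  # high ICC ~ reliable metric
```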

Media is created by humans, for humans, to tell stories. There exists a natural and imminent need for creating human-centered media analytics to illuminate the stories being told and understand their impact on individuals and society at large. An objective understanding of media content has numerous applications for different stakeholders, from creators and decision-/policy-makers to consumers. Advances in multimodal signal processing and machine learning (ML) can enable detailed and nuanced characterization (of who, what, how, where, why)...

10.1109/jproc.2020.3047978 article EN publisher-specific-oa Proceedings of the IEEE 2021-01-13

Longform media such as movies have complex narrative structures, with events spanning a rich variety of ambient visual scenes. Domain-specific challenges associated with scenes in movies include transitions, person coverage, and a wide array of real-life and fictional scenarios. Existing scene datasets have limited taxonomies and don't consider the transition of scenes within movie clips. In this work, we address the problem of scene recognition in movies by first automatically curating a new and extensive movie-centric taxonomy of 179 scene labels derived from...

10.1109/wacv56688.2023.00212 article EN 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023-01-01

Core to understanding emotion are subjective experiences and their expression in facial behavior. Past studies have largely focused on six emotions and prototypical facial poses, reflecting limitations in scale and narrow assumptions about the variety of patterns of facial expression. We examine 45,231 facial reactions to 2,185 evocative videos, largely in North America, Europe, and Japan, collecting participants’ self-reported experiences in English or Japanese and manual and automated annotations of facial movement. Guided by Semantic Space Theory, we uncover 21 dimensions...

10.3389/fpsyg.2024.1350631 article EN Frontiers in Psychology 2024-06-20

Violent content in movies can influence viewers’ perception of society. For example, frequent depictions of certain demographics as perpetrators or victims of abuse can shape stereotyped attitudes. In this work, we propose to characterize aspects of violent content solely from the language used in the scripts. This makes our method applicable to a movie in the earlier stages of content creation, even before it is produced. This is complementary to previous works which rely on audio or video signals obtained after production. Our approach is based on a broad range of features...

10.1609/aaai.v33i01.3301671 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2019-07-17
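
As a hedged illustration of rating violent content from script language alone (the paper uses a much broader feature set than this), a minimal TF-IDF plus logistic-regression pipeline over made-up scene snippets:

```python
# Hedged illustration only: a simple text-classification pipeline showing the
# core idea of characterizing violence from script language. Data is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

scripts = [
    "He slams the door and raises the gun, screaming threats.",
    "They share coffee and laugh about the old photographs.",
    "The brawl spills into the street; glass shatters everywhere.",
    "She waters the garden while humming a quiet tune.",
]
labels = [1, 0, 1, 0]  # 1 = violent scene, 0 = non-violent (toy labels)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(scripts, labels)
print(clf.predict(["A fist fight breaks out during the robbery."]))  # -> [1]
```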

We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of generative objectives within an autoregressive...

10.48550/arxiv.2312.14125 preprint EN cc-by arXiv (Cornell University) 2023-01-01
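
A minimal sketch of the decoder-only, shared-vocabulary idea: discrete tokens from each modality are offset into disjoint ranges of one vocabulary and modeled autoregressively. All vocab sizes, dimensions, and the tiny model below are illustrative assumptions, not VideoPoet's actual configuration.

```python
# Toy decoder-only multimodal token modeling, loosely in the spirit of the
# setup described above. Everything here is an illustrative assumption.
import torch
import torch.nn as nn

TEXT_VOCAB, VIDEO_VOCAB, AUDIO_VOCAB = 1000, 8192, 4096
VOCAB = TEXT_VOCAB + VIDEO_VOCAB + AUDIO_VOCAB  # one shared token space

class TinyDecoder(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_heads=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, ids):  # ids: (B, T) in the shared vocab
        B, T = ids.shape
        x = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        # causal mask so each position attends only to the past
        mask = torch.triu(torch.full((T, T), float("-inf"),
                                     device=ids.device), diagonal=1)
        return self.head(self.blocks(x, mask=mask))

# Offsets place each modality's discrete tokens in disjoint vocab ranges.
def to_shared_vocab(text_ids, video_ids, audio_ids):
    return torch.cat([text_ids,
                      video_ids + TEXT_VOCAB,
                      audio_ids + TEXT_VOCAB + VIDEO_VOCAB], dim=-1)

model = TinyDecoder()
ids = to_shared_vocab(torch.randint(0, TEXT_VOCAB, (2, 16)),
                      torch.randint(0, VIDEO_VOCAB, (2, 64)),
                      torch.randint(0, AUDIO_VOCAB, (2, 32)))
logits = model(ids)                                   # (2, 112, VOCAB)
loss = nn.functional.cross_entropy(                   # next-token objective
    logits[:, :-1].reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))
```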

In this paper, we address the problem of speaker recognition in challenging acoustic conditions using a novel method to extract robust speaker-discriminative speech representations. We adopt a recently proposed unsupervised adversarial invariance architecture to train a network that maps speaker embeddings extracted using a pretrained model onto two lower-dimensional embedding spaces. The embedding spaces are learnt to disentangle speaker-discriminative information from all other information present in the audio recordings, without supervision about the acoustic conditions....

10.1109/icassp40776.2020.9054601 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09
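
A compact, heavily simplified sketch of the disentanglement idea: project a pretrained speaker embedding onto two smaller spaces, classify speakers from the first, and train an adversary to recover the first space from the second so the encoder learns to keep them independent. Dimensions, losses, and the single training step shown are assumptions, not the paper's exact unsupervised adversarial invariance setup.

```python
# Toy sketch: split a pretrained embedding into e1 (speaker info) and e2
# (everything else), with an adversary trying to predict e1 from e2.
import torch
import torch.nn as nn

D_IN, D1, D2, N_SPK = 512, 128, 128, 100

split = nn.Sequential(nn.Linear(D_IN, 256), nn.ReLU(), nn.Linear(256, D1 + D2))
spk_head = nn.Linear(D1, N_SPK)          # speaker classifier on e1
adv = nn.Linear(D2, D1)                  # adversary: recover e1 from e2

opt_main = torch.optim.Adam(list(split.parameters()) +
                            list(spk_head.parameters()), lr=1e-3)
opt_adv = torch.optim.Adam(adv.parameters(), lr=1e-3)

x = torch.randn(32, D_IN)                # pretrained embeddings (toy batch)
y = torch.randint(0, N_SPK, (32,))       # speaker ids (toy)

e = split(x)
e1, e2 = e[:, :D1], e[:, D1:]
# adversary step: learn to predict e1 from e2 (encoder frozen via detach)
adv_loss = nn.functional.mse_loss(adv(e2.detach()), e1.detach())
opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()
# main step: classify speakers from e1 while making e2 uninformative about e1
main_loss = (nn.functional.cross_entropy(spk_head(e1), y)
             - 0.1 * nn.functional.mse_loss(adv(e2), e1))
opt_main.zero_grad(); main_loss.backward(); opt_main.step()
```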

Static anatomical and real-time dynamic magnetic resonance imaging (RT-MRI) of the upper airway is a valuable method for studying speech production in research and clinical settings. The test–retest repeatability of quantitative imaging biomarkers is an important parameter, since it limits the effect sizes and intragroup differences that can be studied. Therefore, this study aims to present a framework for determining test–retest repeatability from static MRI and RT-MRI, and to apply it to healthy volunteers. Subjects (n = 8, 4 females, 4 males) are imaged in two scans...

10.1121/1.4983081 article EN The Journal of the Acoustical Society of America 2017-05-01

Speech emotion recognition (SER) models typically rely on costly human-labeled data for training, making scaling methods to large speech datasets and nuanced emotion taxonomies difficult. We present LanSER, a method that enables the use of unlabeled data by inferring weak emotion labels via pre-trained large language models through weakly-supervised learning. For inferring weak labels constrained to a taxonomy, we use a textual entailment approach that selects the emotion label with the highest entailment score for a speech transcript extracted via automatic speech recognition. Our experimental results show...

10.21437/interspeech.2023-1832 article EN Interspeech 2023 2023-08-14
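
The weak-labeling step lends itself to a short sketch: score an ASR transcript against a fixed emotion taxonomy with an NLI (entailment) model and keep the top label. The model choice, prompt template, taxonomy, and transcript below are assumptions.

```python
# Sketch of entailment-based weak labeling: rank taxonomy labels by how well
# the transcript entails "The speaker feels {label}." and keep the best one.
from transformers import pipeline

nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
taxonomy = ["anger", "joy", "sadness", "fear", "surprise", "neutral"]

transcript = "I can't believe you did this again, I'm done talking to you."
result = nli(transcript, candidate_labels=taxonomy,
             hypothesis_template="The speaker feels {}.")
weak_label = result["labels"][0]  # highest-entailment emotion
print(weak_label, round(result["scores"][0], 3))
```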

The prevalent audio-based Voice Activity Detection (VAD) systems are challenged by the presence of ambient noise and are sensitive to variations in the type of noise. The use of information from the visual modality, when available, can help overcome some of these problems for VAD. Existing visual-VAD systems, however, do not operate directly on the whole image but require intermediate face detection, landmark detection, and subsequent facial feature extraction from the lip region. In this work we present an end-to-end trainable Hierarchical Context...

10.1109/icip.2019.8803248 article EN 2019 IEEE International Conference on Image Processing (ICIP) 2019-08-26

Large-scale databases with high-quality manual annotations are scarce in the audio domain. We thus explore a self-supervised graph approach to learning audio representations from highly limited labelled data. Considering each audio sample as a graph node, we propose a subgraph-based framework with novel self-supervision tasks that can learn effective audio representations. During training, subgraphs are constructed by sampling from the entire pool of available training data to exploit the relationship between labelled and unlabeled samples. During inference,...

10.1109/jstsp.2022.3190083 article EN IEEE Journal of Selected Topics in Signal Processing 2022-07-14
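
As a rough sketch of the subgraph construction (the embeddings, graph, and sampling scheme below are toy assumptions, not the paper's): build a k-NN graph over audio embeddings and sample neighborhoods around labelled seeds, so each subgraph mixes labelled and unlabelled samples.

```python
# Toy subgraph sampling over a k-NN graph of audio embeddings.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))          # audio clip embeddings (toy)
labelled = set(range(20))                 # only 10% of nodes have labels

adj = kneighbors_graph(emb, n_neighbors=5, mode="connectivity")  # sparse CSR

def sample_subgraph(seed: int, hops: int = 2) -> set:
    """Return node ids reachable from a labelled seed within `hops` hops."""
    nodes = {seed}
    for _ in range(hops):
        frontier = set()
        for n in nodes:
            frontier.update(adj[n].indices)   # neighbors in the k-NN graph
        nodes |= frontier
    return nodes

sub = sample_subgraph(seed=3)
print(len(sub), len(sub & labelled))  # subgraph mixes labelled/unlabelled nodes
```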

Automatic content analysis of animation movies can enable an objective understanding of character (actor) representations and their portrayals. It can also help illuminate potential markers of unconscious biases and their impact. However, multimedia analysis of movie content has predominantly focused on live-action features. A dearth of research in this field is because of the complexity and heterogeneity in the design of animated characters -- an extremely challenging problem to be generalized by a single method or model. In this paper, we address...

10.1109/tmm.2017.2745712 article EN IEEE Transactions on Multimedia 2017-08-28

Arousal and valence have been widely used to represent emotions dimensionally and measure them continuously in time. In this paper, we introduce a computational framework for tracking these affective dimensions from multimodal data as an entry to the Multimodal Affect Recognition Sub-Challenge of the 2016 Audio/Visual Emotion Challenge and Workshop (AVEC2016). We propose a linear dynamical system approach with a late fusion method that accounts for the dynamics of the affective state evolution (i.e., arousal or valence). To this end,...

10.1145/2988257.2988259 article EN 2016-10-12
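
A linear dynamical system of this kind is commonly implemented as a Kalman filter; below is a minimal 1-D sketch that smooths a noisy per-frame arousal predictor under an assumed random-walk state model. The noise variances and toy signal are illustrative, not the paper's.

```python
# Minimal 1-D Kalman filter for smoothing a continuous affect dimension.
import numpy as np

def kalman_1d(z, q=1e-3, r=1e-1):
    """z: noisy observations; q/r: process/observation noise variances."""
    x, p = 0.0, 1.0                 # state estimate and its variance
    out = []
    for zt in z:
        p = p + q                   # predict (random-walk state model)
        k = p / (p + r)             # Kalman gain
        x = x + k * (zt - x)        # update with the new observation
        p = (1 - k) * p
        out.append(x)
    return np.array(out)

t = np.linspace(0, 10, 500)
true_arousal = np.sin(t)                          # toy ground truth
noisy_pred = true_arousal + 0.3 * np.random.default_rng(1).normal(size=t.size)
smoothed = kalman_1d(noisy_pred)
print(float(np.mean((smoothed - true_arousal) ** 2)))  # lower MSE than raw
```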

Speech activity detection in highly variable acoustic conditions is a challenging task. Many approaches to detect speech in such conditions involve an inherent knowledge of the noise types involved. Movie audio can offer an excellent research test-bed for developing such models. A robust speech activity detector for movies is also a crucial step for subsequent content analyses such as speaker diarization. Obtaining labels for supervision at this scale can be very expensive, and may not be scalable. In this paper, we employ a simple, yet effective approach to obtain labels by coarsely aligning...

10.1109/icassp.2019.8682532 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17

The process of human affect understanding involves the ability to infer person-specific emotional states from various sources including images, speech, and language. Affect perception from images has predominantly focused on expressions extracted from salient face crops. However, emotions perceived by humans rely on multiple contextual cues, including social settings, foreground interactions, and ambient visual scenes. In this work, we leverage pretrained vision-language (VLN) models to extract descriptions of context...

10.1109/icassp49357.2023.10095728 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05

Autism spectrum disorder (ASD) is exceptionally heterogeneous in both clinical and physiopathological presentations. Clinical variability applies to both ASD-specific symptoms and frequent comorbid psychopathology such as emotional lability (EL). To date, the underpinnings of the co-occurrence of EL and ASD are unknown. As a first step, we examined within-ASD inter-individual variability in EL and its neuronal correlates using resting-state functional magnetic resonance imaging (R-fMRI). We analyzed R-fMRI data from 58 children...

10.1089/brain.2016.0472 article EN Brain Connectivity 2017-05-16

Core to understanding emotion are subjective experiences and their embodiment in facial behavior. Past studies have focused on six emotions and prototypical facial poses, reflecting limitations of scale and narrow assumptions about emotion. We examine 45,231 facial reactions to 2,185 evocative videos, largely in North America, Europe, and Japan, collecting participants’ self-reported experiences in English or Japanese and manual/automated annotations of facial movement. We uncover 21 dimensions underlying the experiences reported across languages. Facial expressions...

10.31234/osf.io/gbqtc preprint EN 2021-06-29

The ability to robustly cluster faces in movies is a necessary step in understanding media content representations of people along dimensions such as gender and age. Building upon the successes of sparse subspace clustering (SSC) in uncovering the underlying structure of data, in this paper we propose an algorithm called Constraint Propagation Sparse Subspace Clustering (CP-SSC) for applications such as face clustering in videos, where pairwise sample constraints (must-link and cannot-link pairs) are available in the processing pipeline since...

10.1109/icassp.2019.8682314 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17
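
A heavily simplified, hedged stand-in for the idea (the actual CP-SSC propagates constraints inside a sparse subspace clustering objective): bake must-link and cannot-link pairs into a precomputed affinity matrix before spectral clustering. Data and constraints below are toy assumptions.

```python
# Toy constraint-aware face clustering via an adjusted affinity matrix.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
faces = np.vstack([rng.normal(0, 1, (30, 16)), rng.normal(4, 1, (30, 16))])
aff = rbf_kernel(faces)                   # base pairwise affinity

must_link = [(0, 1), (30, 31)]            # same face track -> same person
cannot_link = [(0, 30)]                   # co-occurring faces -> different people
for i, j in must_link:
    aff[i, j] = aff[j, i] = 1.0
for i, j in cannot_link:
    aff[i, j] = aff[j, i] = 0.0

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(aff)
print(labels[:5], labels[30:35])          # the two groups separate
```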

Audio event detection is a widely studied field, with applications ranging from self-driving cars to healthcare. In-the-wild datasets such as AudioSet have propelled research in this field. However, many such efforts typically involve manual annotation and verification, which is expensive to perform at scale. Movies depict various real-life and fictional scenarios, which makes them a rich resource for mining a wide range of audio events. In this work, we present a dataset of audio events called Subtitle-Aligned Movie Sounds (SAM-S)....

10.1109/icassp49357.2023.10094781 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05
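
A hedged sketch of the underlying intuition: closed captions often mark non-speech sounds in brackets, so parsing those cues along with their timestamps yields weak audio-event labels. The SRT snippet and regex below are made up; the SAM-S pipeline itself is more involved.

```python
# Toy extraction of bracketed sound cues and timestamps from an SRT subtitle.
import re

srt = """1
00:01:02,000 --> 00:01:04,000
[door slams]

2
00:01:05,000 --> 00:01:07,500
[dog barking in the distance]
"""

pattern = re.compile(
    r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\n\[([^\]]+)\]")
events = [(start, end, sound) for start, end, sound in pattern.findall(srt)]
print(events)
# [('00:01:02,000', '00:01:04,000', 'door slams'), ...]
```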