Tanaya Guha

ORCID: 0000-0003-2167-4891
Research Areas
  • Music and Audio Processing
  • Speech and Audio Processing
  • Video Surveillance and Tracking Methods
  • Emotion and Mood Recognition
  • Video Analysis and Summarization
  • Human Pose and Action Recognition
  • Image and Video Quality Assessment
  • Speech Recognition and Synthesis
  • Visual Attention and Saliency Detection
  • Face Recognition and Analysis
  • Advanced Image and Video Retrieval Techniques
  • Anomaly Detection Techniques and Applications
  • Multimodal Machine Learning Applications
  • Image Retrieval and Classification Techniques
  • Autonomous Vehicle Technology and Safety
  • Face and Expression Recognition
  • Autism Spectrum Disorder Research
  • Music Technology and Sound Studies
  • Neuroscience and Music Perception
  • Sentiment Analysis and Opinion Mining
  • Mental Health via Writing
  • Mental Health Research Topics
  • Advanced Image Fusion Techniques
  • Advanced Graph Neural Networks
  • Generative Adversarial Networks and Image Synthesis

University of Glasgow
2022-2025

University of Warwick
2019-2022

Indian Institute of Technology Kanpur
2016-2018

University of Southern California
2014-2016

University of British Columbia
2010-2014

This paper explores the effectiveness of sparse representations obtained by learning a set of overcomplete basis (dictionary) in the context of action recognition in videos. Although this work concentrates on recognizing human movements - physical actions as well as facial expressions - the proposed approach is fairly general and can be used to address other classification problems. In order to model actions, three dictionary learning frameworks are investigated. An overcomplete dictionary is constructed using spatio-temporal descriptors (extracted...

10.1109/tpami.2011.253 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2011-12-29
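
As a rough illustration of the dictionary-learning idea described above, the sketch below learns one overcomplete dictionary per class and classifies a descriptor by its smallest sparse-reconstruction error. It is not the paper's pipeline; the toy descriptors, atom count and OMP sparsity level are assumptions.

```python
# Minimal sketch: per-class overcomplete dictionaries plus classification by
# reconstruction error. Random vectors stand in for spatio-temporal descriptors.
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
n_dim, n_atoms, n_nonzero = 64, 128, 5            # 128 atoms > 64 dims => overcomplete

def learn_dictionary(descriptors):
    """Learn an overcomplete dictionary from one class's descriptors."""
    dl = DictionaryLearning(n_components=n_atoms, transform_algorithm="omp",
                            transform_n_nonzero_coefs=n_nonzero, random_state=0)
    return dl.fit(descriptors).components_        # shape: (n_atoms, n_dim)

def reconstruction_error(x, dictionary):
    """Sparse-code x with OMP and return the residual norm."""
    code = sparse_encode(x[None, :], dictionary, algorithm="omp",
                         n_nonzero_coefs=n_nonzero)
    return np.linalg.norm(x - code @ dictionary)

# Toy data: two 'action classes' with different statistics.
class_data = {c: rng.normal(loc=c, size=(200, n_dim)) for c in (0, 1)}
dictionaries = {c: learn_dictionary(X) for c, X in class_data.items()}

test = rng.normal(loc=1, size=n_dim)              # sample drawn from class 1
pred = min(dictionaries, key=lambda c: reconstruction_error(test, dictionaries[c]))
print("predicted class:", pred)
```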

Depression is one of the most common mood disorders. Technology has the potential to assist in screening and treating people with depression by robustly modeling and tracking the complex behavioral cues associated with the disorder (e.g., speech, language, facial expressions, head movement, body language). Similarly, robust affect recognition is another challenge which stands to benefit from such cues. The Audio/Visual Emotion Challenge (AVEC) aims toward understanding the two phenomena and their correlation with observable...

10.1145/2661806.2661810 article EN 2014-11-03

Several studies have established that facial expressions of children with autism are often perceived as atypical, awkward or less engaging by typical adult observers. Despite this clear deficit in the quality of expression production, very little is understood about its underlying mechanisms and characteristics. This paper takes a computational approach to studying the details of facial expressions of children with high functioning autism (HFA). The objective is to uncover those characteristics of expressions, notably distinct from typically developing...

10.1109/taffc.2016.2578316 article EN publisher-specific-oa IEEE Transactions on Affective Computing 2016-06-08

The mainstream image captioning models rely on Convolutional Neural Network (CNN) features to generate captions via recurrent models. Recently, scene graphs have been used to augment captioning models so as to leverage their structural semantics, such as object entities, relationships and attributes. Several studies have noted that the naive use of scene graphs from a black-box scene graph generator harms performance, and that graph-based models incur the overhead of explicit image features to generate decent captions. Addressing these challenges, we propose SG2Caps, a framework that utilizes only...

10.1109/iccv48922.2021.00144 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01
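
To make the "captioning from scene-graph labels only" idea concrete, the toy sketch below flattens a scene graph (objects, attributes, relationships) into a token sequence that a standard caption decoder could consume. The example graph and the serialisation scheme are assumptions for illustration, not the paper's encoding.

```python
# Toy scene graph flattened into decoder-input tokens (illustrative only).
scene_graph = {
    "objects":    ["man", "horse", "field"],
    "attributes": {"horse": ["brown"], "field": ["grassy"]},
    "relations":  [("man", "riding", "horse"), ("horse", "standing in", "field")],
}

def serialize(graph):
    """Turn scene-graph labels into a flat token list for a sequence decoder."""
    tokens = []
    for obj in graph["objects"]:
        for attr in graph["attributes"].get(obj, []):
            tokens += [attr, obj]                 # attribute-object pairs
    for subj, rel, obj in graph["relations"]:
        tokens += [subj, rel, obj]                # relationship triples
    return tokens

print(serialize(scene_graph))
# ['brown', 'horse', 'grassy', 'field', 'man', 'riding', 'horse', 'horse', 'standing in', 'field']
```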

State-of-the-art crowd counting models follow an encoder-decoder approach. Images are first processed by the encoder to extract features. Then, to account for perspective distortion, the highest-level feature map is fed to extra components to extract multiscale features, which are the input to the decoder to generate crowd densities. However, in these methods, the features extracted at earlier stages during encoding are underutilised, and the multiscale modules can only capture a limited range of receptive fields, albeit with considerable computational cost....

10.1109/icip46576.2022.9897322 article EN 2022 IEEE International Conference on Image Processing (ICIP) 2022-10-16
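
The generic encoder-decoder counting pipeline the abstract refers to can be sketched as below: an encoder extracts features, a decoder produces a density map, and the count is the sum over that map. This is a minimal stand-in, not the architecture proposed in the paper.

```python
# Minimal encoder-decoder density-map sketch (PyTorch).
import torch
import torch.nn as nn

class TinyCounter(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                 # downsample, extract features
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                 # upsample back to a density map
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        density = self.decoder(self.encoder(x))       # (B, 1, H, W) density map
        return density, density.sum(dim=(1, 2, 3))    # per-image count estimate

model = TinyCounter()
density, counts = model(torch.randn(2, 3, 128, 128))
print(density.shape, counts.shape)                    # torch.Size([2, 1, 128, 128]) torch.Size([2])
```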

We propose a deep graph approach to address the task of speech emotion recognition. A compact, efficient and scalable way to represent data is in the form of graphs. Following the theory of graph signal processing, we model speech as a cycle graph or a line graph. Such a graph structure enables us to construct a Graph Convolution Network (GCN)-based architecture that can perform an accurate graph convolution, in contrast to the approximate convolution used in standard GCNs. We evaluated the performance of our model for speech emotion recognition on the popular IEMOCAP and MSP-IMPROV databases. Our model outperforms standard GCN...

10.1109/icassp39728.2021.9413876 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13
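
The cycle-graph construction and the exact spectral convolution it enables can be illustrated as follows. The sketch builds a cycle-graph adjacency over utterance frames and filters node features in the eigenbasis of the graph Laplacian, rather than using the first-order approximation of standard GCNs; feature extraction and the full trained model are omitted, and the filter here is random for illustration.

```python
import numpy as np

T, F = 8, 4                                     # frames (nodes) and feature dim
rng = np.random.default_rng(0)
X = rng.normal(size=(T, F))                     # per-frame acoustic features

# Cycle-graph adjacency: frame t connected to t-1 and t+1 (wrapping around).
A = np.zeros((T, T))
for t in range(T):
    A[t, (t - 1) % T] = A[t, (t + 1) % T] = 1.0

L = np.diag(A.sum(1)) - A                       # combinatorial graph Laplacian
eigvals, U = np.linalg.eigh(L)                  # exact spectral decomposition

h = rng.normal(size=T)                          # a (learnable) spectral filter
X_conv = U @ np.diag(h) @ U.T @ X               # exact graph convolution of X
print(X_conv.shape)                             # (8, 4)
```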

Human emotion is expressed, perceived and captured using a variety of dynamic data modalities, such as speech (verbal), videos (facial expressions) and motion sensors (body gestures). We propose a generalized approach to emotion recognition that can adapt across modalities by modeling dynamic data as structured graphs. The motivation behind the graph approach is to build compact models without compromising on performance. To alleviate the problem of optimal graph construction, we cast this as a joint graph learning and classification task. To this end, we present the Learnable...

10.1109/tmm.2021.3059169 article EN IEEE Transactions on Multimedia 2021-02-17
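
A minimal sketch of the joint graph-learning-and-classification idea, under assumed simplifications: the adjacency matrix is a trainable parameter optimised together with a one-layer graph convolution and a classifier. This is not the paper's architecture.

```python
import torch
import torch.nn as nn

class LearnableGraphClassifier(nn.Module):
    def __init__(self, n_nodes, in_dim, n_classes):
        super().__init__()
        self.adj_logits = nn.Parameter(torch.zeros(n_nodes, n_nodes))  # learned graph
        self.proj = nn.Linear(in_dim, 32)
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):                        # x: (batch, n_nodes, in_dim)
        A = torch.sigmoid(self.adj_logits)       # soft adjacency in [0, 1]
        h = torch.relu(A @ self.proj(x))         # one graph-convolution step
        return self.head(h.mean(dim=1))          # pool nodes, then classify

model = LearnableGraphClassifier(n_nodes=10, in_dim=20, n_classes=5)
x, y = torch.randn(4, 10, 20), torch.randint(0, 5, (4,))
loss = nn.functional.cross_entropy(model(x), y)  # graph and classifier trained jointly
loss.backward()
print(loss.item())
```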

Media is created by humans, for humans, to tell stories. There exists a natural and imminent need for creating human-centered media analytics to illuminate the stories being told and to understand their impact on individuals and society at large. An objective understanding of media content has numerous applications for different stakeholders, from content creators and decision-/policy-makers to consumers. Advances in multimodal signal processing and machine learning (ML) can enable a detailed and nuanced characterization (of who, what, how, where, why)...

10.1109/jproc.2020.3047978 article EN publisher-specific-oa Proceedings of the IEEE 2021-01-13

We explore the efficacy of multimodal behavioral cues for explainable prediction of personality and interview-specific traits. We utilize elementary head-motion units named kinemes, atomic facial movements termed action units, and speech features to estimate these human-centered traits. Empirical results confirm that kinemes and action units enable the discovery of multiple trait-specific behaviors while also enabling explainability in support of the predictions. For fusing cues, we explore decision and feature-level fusion, and an additive attention-based fusion...

10.1371/journal.pone.0313883 article EN cc-by PLoS ONE 2025-01-17
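
A hedged sketch of an additive attention-style fusion over the three behavioural cue streams mentioned above (kineme, facial action unit and speech features). Feature dimensions and the trait-regression head are illustrative assumptions, not the paper's exact fusion module.

```python
import torch
import torch.nn as nn

class AdditiveAttentionFusion(nn.Module):
    def __init__(self, dim, n_traits):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))
        self.head = nn.Linear(dim, n_traits)

    def forward(self, modality_feats):           # list of (batch, dim) tensors
        stacked = torch.stack(modality_feats, dim=1)         # (batch, n_mod, dim)
        weights = torch.softmax(self.score(stacked), dim=1)  # attention over modalities
        fused = (weights * stacked).sum(dim=1)               # additive weighted sum
        return self.head(fused), weights.squeeze(-1)

fusion = AdditiveAttentionFusion(dim=64, n_traits=5)
kineme, action_unit, speech = (torch.randn(8, 64) for _ in range(3))
traits, attn = fusion([kineme, action_unit, speech])
print(traits.shape, attn.shape)                  # torch.Size([8, 5]) torch.Size([8, 3])
```

The attention weights returned per modality are one way such a model can expose which cue drove a given prediction, in line with the explainability goal stated in the abstract.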

10.1109/icassp49660.2025.10889429 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

A new line of research uses compression methods to measure the similarity between signals. Two signals are considered similar if one can be compressed significantly when the information of the other is known. The existing compression-based methods, although successful in the discrete one-dimensional domain, do not work well in the context of images. This paper proposes a sparse representation-based approach to encode the content of an image using information from the other image, and the compactness (sparsity) of the representation as a measure of its compressibility (how...

10.1109/tmm.2014.2306175 article EN IEEE Transactions on Multimedia 2014-02-13
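
A simplified sketch of the cross-coding idea: patches of one image are coded with a dictionary learned from another image, and how well they can be represented acts as a similarity score. Using the mean OMP residual at a fixed sparsity is an assumption here, not the paper's exact compressibility measure.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode
from sklearn.feature_extraction.image import extract_patches_2d

def patches(img, size=8, n=500, seed=0):
    p = extract_patches_2d(img, (size, size), max_patches=n, random_state=seed)
    return p.reshape(len(p), -1)

def cross_coding_error(img_a, img_b, n_atoms=100, n_nonzero=5):
    """Mean residual when patches of img_a are coded with img_b's dictionary."""
    D = MiniBatchDictionaryLearning(n_components=n_atoms, random_state=0).fit(
        patches(img_b)).components_
    Xa = patches(img_a)
    code = sparse_encode(Xa, D, algorithm="omp", n_nonzero_coefs=n_nonzero)
    return np.mean(np.linalg.norm(Xa - code @ D, axis=1))

rng = np.random.default_rng(0)
img1 = rng.random((64, 64))
img2 = img1 + 0.05 * rng.random((64, 64))          # a near-duplicate of img1
img3 = rng.random((64, 64))                        # an unrelated image
# Lower error -> more "compressible" given img1 -> more similar to img1.
print(cross_coding_error(img2, img1), cross_coding_error(img3, img1))
```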

This paper introduces the problem of multiple object forecasting (MOF), in which the goal is to predict future bounding boxes of tracked objects. In contrast to existing works on trajectory forecasting, which primarily consider the problem from a birds-eye perspective, we formulate the problem from an object-level perspective and call for the prediction of full bounding boxes, rather than trajectories alone. Towards solving this task, we introduce the Citywalks dataset, which consists of over 200k high-resolution video frames. It comprises footage recorded in 21 cities in 10 European...

10.1109/wacv45572.2020.9093446 article EN 2020-03-01
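
A toy sketch of object-level forecasting: given a short history of bounding boxes (x, y, w, h) for a tracked object, a GRU rolls out the boxes for the next few frames. This is a generic baseline sketch under assumed dimensions, not the model proposed in the paper.

```python
import torch
import torch.nn as nn

class BoxForecaster(nn.Module):
    def __init__(self, hidden=64, horizon=5):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.GRU(input_size=4, hidden_size=hidden, batch_first=True)
        self.decoder = nn.GRUCell(4, hidden)
        self.out = nn.Linear(hidden, 4)

    def forward(self, past_boxes):                # (batch, T_past, 4)
        _, h = self.encoder(past_boxes)           # summarise the observed track
        h, box = h.squeeze(0), past_boxes[:, -1]  # start decoding from the last box
        preds = []
        for _ in range(self.horizon):             # roll out future boxes step by step
            h = self.decoder(box, h)
            box = self.out(h)
            preds.append(box)
        return torch.stack(preds, dim=1)          # (batch, horizon, 4)

future = BoxForecaster()(torch.randn(2, 10, 4))
print(future.shape)                               # torch.Size([2, 5, 4])
```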

We present an audio-visual multimodal approach for the task of zero-shot learning (ZSL) for classification and retrieval of videos. ZSL has been studied extensively in the recent past but has primarily been limited to the visual modality and to images. We demonstrate that both audio and visual modalities are important for this task. Since a dataset to study the task is currently not available, we also construct an appropriate dataset with 33 classes containing 156,416 videos, from an existing large scale event dataset. We empirically show that performance improves by adding the audio modality for both tasks...

10.1109/wacv45572.2020.9093438 article EN 2020-03-01
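
A hedged sketch of the audio-visual zero-shot idea: audio and video features are projected into a common space shared with class-label embeddings, and an unseen-class clip is labelled by its nearest class embedding. Dimensions, the fusion by addition and the projection design are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

audio_proj = nn.Linear(128, 64)                   # audio feature -> shared space
video_proj = nn.Linear(512, 64)                   # video feature -> shared space
class_embed = F.normalize(torch.randn(33, 64), dim=1)  # 33 class-label embeddings

def embed_clip(audio_feat, video_feat):
    """Fuse the two modality embeddings into one clip embedding."""
    return F.normalize(audio_proj(audio_feat) + video_proj(video_feat), dim=-1)

def zero_shot_classify(audio_feat, video_feat):
    z = embed_clip(audio_feat, video_feat)        # (batch, 64)
    scores = z @ class_embed.T                    # similarity to every class embedding
    return scores.argmax(dim=-1)

print(zero_shot_classify(torch.randn(4, 128), torch.randn(4, 512)))  # class index per clip
```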

Children with Autism Spectrum Disorder (ASD) are known to have difficulty in producing and perceiving emotional facial expressions. Their expressions are often perceived as atypical by adult observers. This paper focuses on data driven ways to analyze and quantify the atypicality of expressions of children with ASD. Our objective is to uncover those characteristics of facial gestures that induce the sense of atypicality. Using a carefully collected motion capture database, children with and without ASD are compared within six basic emotion categories employing methods from...

10.1109/icassp.2015.7178080 article EN 2015-04-01

This paper addresses the problem of continuous emotion prediction in movies from multimodal cues. The rich movie content is inherently multimodal, where emotion is evoked through both audio (music, speech) and video modalities. To capture such affective information, we put forth a set of features that includes several novel features such as Video Compressibility and Histogram of Facial Area (HFA). We propose a Mixture of Experts (MoE)-based fusion model that dynamically combines information from the two modalities for predicting emotion in movies. A learning...

10.1109/icassp.2016.7472192 article EN 2016-03-01
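
A sketch of a Mixture of Experts style fusion under assumed simplifications: separate audio and video experts each predict emotion (e.g. arousal/valence), and a gating network dynamically weights their outputs per sample. Feature dimensions and the two-dimensional output are illustrative.

```python
import torch
import torch.nn as nn

class MoEFusion(nn.Module):
    def __init__(self, audio_dim=40, video_dim=60, out_dim=2):
        super().__init__()
        self.audio_expert = nn.Linear(audio_dim, out_dim)
        self.video_expert = nn.Linear(video_dim, out_dim)
        self.gate = nn.Sequential(nn.Linear(audio_dim + video_dim, 2), nn.Softmax(dim=-1))

    def forward(self, audio, video):              # (batch, audio_dim), (batch, video_dim)
        w = self.gate(torch.cat([audio, video], dim=-1))         # per-sample expert weights
        preds = torch.stack([self.audio_expert(audio),
                             self.video_expert(video)], dim=1)   # (batch, 2, out_dim)
        return (w.unsqueeze(-1) * preds).sum(dim=1)              # dynamically fused emotion

emotion = MoEFusion()(torch.randn(8, 40), torch.randn(8, 60))
print(emotion.shape)                              # torch.Size([8, 2])
```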

We introduce the problem of learning affective correspondence between audio (music) and visual data (images). For this task, a music clip and an image are considered similar (having true correspondence) if they have similar emotion content. In order to estimate this crossmodal, emotion-centric similarity, we propose a deep neural network architecture that learns to project data from the two modalities to a common representation space, and performs a binary classification task of predicting the correspondence (true or false). To facilitate the current study,...

10.1109/icassp.2019.8683133 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17
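
A hedged sketch of the correspondence setup: a music branch and an image branch map their inputs into a common space, and a small classifier predicts whether the pair shares the same emotion (true/false correspondence). Input dimensions and layer sizes are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CorrespondenceNet(nn.Module):
    def __init__(self, music_dim=128, image_dim=2048, shared=64):
        super().__init__()
        self.music_branch = nn.Sequential(nn.Linear(music_dim, shared), nn.ReLU())
        self.image_branch = nn.Sequential(nn.Linear(image_dim, shared), nn.ReLU())
        self.classifier = nn.Linear(2 * shared, 1)     # true/false correspondence logit

    def forward(self, music_feat, image_feat):
        m = self.music_branch(music_feat)
        v = self.image_branch(image_feat)
        return self.classifier(torch.cat([m, v], dim=-1)).squeeze(-1)

net = CorrespondenceNet()
logit = net(torch.randn(4, 128), torch.randn(4, 2048))
labels = torch.tensor([1., 0., 1., 0.])               # same-emotion pair or not
loss = nn.functional.binary_cross_entropy_with_logits(logit, labels)
print(loss.item())
```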

10.1016/j.image.2014.09.010 article EN Signal Processing Image Communication 2014-10-16

The CLIP (Contrastive Language-Image Pretraining) model has exhibited outstanding performance in recognition problems, such as zero-shot image classification and object detection. However, its ability to count remains understudied due to the inherent challenges of transforming counting--a regression task--into a recognition task. In this paper, we investigate CLIP's potential for counting, focusing specifically on estimating crowd sizes. Existing classification-based crowd-counting methods have encountered...

10.48550/arxiv.2403.09281 preprint EN arXiv (Cornell University) 2024-03-14
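
An illustrative zero-shot sketch, not the paper's method: counting is treated as classification by comparing an image against text prompts describing count ranges with an off-the-shelf CLIP model (via Hugging Face transformers). The prompt wording, count bins and image path are assumptions.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

bins = ["0-10", "10-50", "50-200", "200-1000", "more than 1000"]
prompts = [f"a photo of a crowd of {b} people" for b in bins]

image = Image.open("crowd.jpg")                   # any crowd image on disk
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)

print(dict(zip(bins, probs.squeeze(0).tolist()))) # probability of each count range
```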

In general, popular films and screenplays follow a well defined storytelling paradigm that comprises three essential segments or acts: exposition (act I), conflict (act II) and resolution (act III). Deconstructing a movie into its narrative units can enrich the semantic understanding of movies, and help in summarization, navigation and detection of the key events. A multimodal framework for detecting such an act structure is developed in this paper. Various low-level features are designed and extracted from the video, audio and text...

10.1109/icassp.2015.7178374 article EN 2015-04-01

This work proposes a trajectory clustering-based approach for segmenting flow patterns in high density crowd videos. The goal is to produce a pixel-wise segmentation of a video sequence (static camera), where each segment corresponds to a different motion pattern. Unlike previous studies that use only motion vectors, we extract full trajectories so as to capture the complete temporal evolution of each region (block) in the sequence. The extracted trajectories are dense, complex and often overlapping. A novel clustering algorithm is developed...

10.1109/icip.2016.7532548 article EN 2016 IEEE International Conference on Image Processing (ICIP) 2016-08-17
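
A hedged stand-in for the flow-segmentation idea: each spatial block contributes a trajectory, trajectories are clustered by their per-step displacements, and the cluster label of a block gives its motion-pattern segment. Standard agglomerative clustering on synthetic trajectories is used here in place of the paper's own clustering algorithm.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
grid, T = 16, 20                                   # 16x16 blocks tracked over 20 frames

# Synthetic block trajectories: left half of the frame drifts right, right half drifts up.
trajs = np.zeros((grid * grid, T, 2))
for i, (r, c) in enumerate(np.ndindex(grid, grid)):
    drift = np.array([1.0, 0.0]) if c < grid // 2 else np.array([0.0, 1.0])
    steps = drift + 0.1 * rng.normal(size=(T, 2))
    trajs[i] = np.array([r, c]) + np.cumsum(steps, axis=0)

# Cluster per-step displacements so blocks sharing a motion pattern group together.
velocities = np.diff(trajs, axis=1).reshape(grid * grid, -1)
labels = AgglomerativeClustering(n_clusters=2).fit_predict(velocities)
segmentation = labels.reshape(grid, grid)          # per-block motion-pattern segment
print(segmentation)
```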

We introduce the task of multi-camera trajectory forecasting (MCTF), where the future trajectory of an object is predicted in a network of cameras. Prior works consider forecasting trajectories in a single camera view. Our work is the first to consider the challenging scenario of forecasting across multiple non-overlapping camera views. This has wide applicability in tasks such as re-identification and multi-target tracking. To facilitate research in this new area, we release the Warwick-NTU Multi-camera Forecasting Database (WNMF), a unique dataset of pedestrian trajectories from 15 synchronized...

10.1109/cvprw50498.2020.00516 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2020-06-01

Large scale databases with high-quality manual annotations are scarce in the audio domain. We thus explore a self-supervised graph approach to learning audio representations from highly limited labelled data. Considering each audio sample as a graph node, we propose a subgraph-based framework with novel self-supervision tasks that can learn effective audio representations. During training, subgraphs are constructed by sampling the entire pool of available training data to exploit the relationship between labelled and unlabeled samples. At inference,...

10.1109/jstsp.2022.3190083 article EN IEEE Journal of Selected Topics in Signal Processing 2022-07-14
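
A hedged sketch of the subgraph construction described above: each audio clip (represented by a pre-computed feature vector) is a node; a small subgraph is sampled from labelled and unlabelled clips, connected by feature similarity, and passed through one graph-convolution step. The similarity-based edges and single layer are assumptions, not the paper's exact framework or self-supervision tasks.

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 128))              # pool of audio-clip features
labelled_ids = np.arange(50)                      # only 50 clips have labels

def sample_subgraph(n_labelled=4, n_unlabelled=12, k=3):
    """Sample nodes and connect each to its k most similar nodes in the subgraph."""
    ids = np.concatenate([rng.choice(labelled_ids, n_labelled, replace=False),
                          rng.choice(np.arange(50, 1000), n_unlabelled, replace=False)])
    X = feats[ids]
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    A = np.zeros_like(sim)
    for i in range(len(ids)):
        A[i, np.argsort(sim[i])[-(k + 1):]] = 1.0 # k nearest neighbours (plus self)
    return ids, X, np.maximum(A, A.T)             # symmetrised subgraph adjacency

ids, X, A = sample_subgraph()
W = 0.1 * rng.normal(size=(128, 64))              # stand-in GCN weight matrix
H = np.maximum(np.diag(1.0 / A.sum(1)) @ A @ X @ W, 0)  # one normalised graph-conv step
print(H.shape)                                    # (16, 64)
```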

Non-verbal behavioral cues, such as head movement, play a significant role in human communication and affective expression. Although facial expressions and gestures have been extensively studied in the context of emotion understanding, head motion (which accompanies both) is relatively less understood. This paper studies the significance of head movement in adults' affect communication using videos from movies. These videos are taken from the Acted Facial Expressions in the Wild (AFEW) database and are labeled with seven basic emotion categories: anger, disgust, fear, joy,...

10.1109/icassp.2017.7952684 article EN 2017-03-01