- Music and Audio Processing
- Speech and Audio Processing
- Video Surveillance and Tracking Methods
- Emotion and Mood Recognition
- Video Analysis and Summarization
- Human Pose and Action Recognition
- Image and Video Quality Assessment
- Speech Recognition and Synthesis
- Visual Attention and Saliency Detection
- Face Recognition and Analysis
- Advanced Image and Video Retrieval Techniques
- Anomaly Detection Techniques and Applications
- Multimodal Machine Learning Applications
- Image Retrieval and Classification Techniques
- Autonomous Vehicle Technology and Safety
- Face and Expression Recognition
- Autism Spectrum Disorder Research
- Music Technology and Sound Studies
- Neuroscience and Music Perception
- Sentiment Analysis and Opinion Mining
- Mental Health via Writing
- Mental Health Research Topics
- Advanced Image Fusion Techniques
- Advanced Graph Neural Networks
- Generative Adversarial Networks and Image Synthesis
University of Glasgow
2022-2025
University of Warwick
2019-2022
Indian Institute of Technology Kanpur
2016-2018
University of Southern California
2014-2016
University of British Columbia
2010-2014
This paper explores the effectiveness of sparse representations obtained by learning a set of overcomplete bases (dictionary) in the context of action recognition in videos. Although this work concentrates on recognizing human movements - physical actions as well as facial expressions - the proposed approach is fairly general and can be used to address other classification problems. In order to model actions, three dictionary learning frameworks are investigated. An overcomplete dictionary is constructed using spatio-temporal descriptors (extracted...
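A minimal sketch of the class-specific dictionary idea, using scikit-learn's dictionary learning and OMP as stand-ins (the random "descriptors", dictionary sizes and nearest-reconstruction classification rule are illustrative placeholders, not the paper's pipeline):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, SparseCoder

# Toy stand-in for spatio-temporal descriptors: one matrix per action class,
# rows are descriptors extracted from training videos of that class.
rng = np.random.default_rng(0)
train = {c: rng.normal(size=(200, 64)) + c for c in range(3)}  # 3 classes, 64-D descriptors

# Learn one overcomplete dictionary per class (more atoms than descriptor dimensions).
dictionaries = {}
for c, X in train.items():
    dl = DictionaryLearning(n_components=96, max_iter=20, random_state=0)
    dl.fit(X)
    dictionaries[c] = dl.components_

def classify(descriptors):
    """Assign the class whose dictionary reconstructs the descriptors best."""
    errors = {}
    for c, D in dictionaries.items():
        coder = SparseCoder(dictionary=D, transform_algorithm="omp",
                            transform_n_nonzero_coefs=5)
        codes = coder.transform(descriptors)
        recon = codes @ D
        errors[c] = np.mean(np.sum((descriptors - recon) ** 2, axis=1))
    return min(errors, key=errors.get)

test = rng.normal(size=(50, 64)) + 2   # descriptors from an unseen class-2 video
print("predicted class:", classify(test))
```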
Depression is one of the most common mood disorders. Technology has the potential to assist in screening and treating people with depression by robustly modeling and tracking the complex behavioral cues associated with the disorder (e.g., speech, language, facial expressions, head movement, body language). Similarly, robust affect recognition is another challenge which stands to benefit from modeling such cues. The Audio/Visual Emotion Challenge (AVEC) aims toward understanding the two phenomena and their correlation with observable cues across...
Several studies have established that facial expressions of children with autism are often perceived as atypical, awkward or less engaging by typical adult observers. Despite this clear deficit in the quality of expression production, very little is understood about its underlying mechanisms and characteristics. This paper takes a computational approach to studying the details of facial expressions of children with high functioning autism (HFA). The objective is to uncover those characteristics of expressions that are notably distinct from those of typically developing...
The mainstream image captioning models rely on Convolutional Neural Network (CNN) features to generate captions via recurrent models. Recently, scene graphs have been used to augment captioning models so as to leverage their structural semantics, such as object entities, relationships and attributes. Several studies have noted that the naive use of scene graphs from a black-box scene graph generator harms image captioning performance, and that scene-graph-based captioning models incur the overhead of explicit use of image features to generate decent captions. Addressing these challenges, we propose SG2Caps, a framework that utilizes only...
State-of-the-art crowd counting models follow an encoder-decoder approach. Images are first processed by the encoder to extract features. Then, to account for perspective distortion, the highest-level feature map is fed to extra components that extract multiscale features, which are input to the decoder to generate crowd densities. However, in these methods, features extracted at earlier stages during encoding are underutilised, and the multiscale modules can only capture a limited range of receptive fields, albeit with considerable computational cost....
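A minimal PyTorch sketch of the generic encoder-decoder density-estimation pipeline the abstract refers to (the layer choices and the dilated multiscale block are illustrative, not the proposed architecture):

```python
import torch
import torch.nn as nn

class TinyCrowdCounter(nn.Module):
    """Minimal encoder-decoder density estimator (illustrative only)."""
    def __init__(self):
        super().__init__()
        # Encoder: progressively downsample and extract features.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        # Multiscale context on the highest-level feature map (dilated convolutions).
        self.context = nn.Sequential(
            nn.Conv2d(128, 128, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=4, dilation=4), nn.ReLU(),
        )
        # Decoder: upsample back and predict a single-channel density map.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, x):
        return self.decoder(self.context(self.encoder(x)))

model = TinyCrowdCounter()
img = torch.randn(1, 3, 256, 256)
density = model(img)
count = density.sum().item()          # predicted count = integral of the density map
print(density.shape, round(count, 2))
```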
We propose a deep graph approach to address the task of speech emotion recognition. A compact, efficient and scalable way to represent data is in the form of graphs. Following the theory of graph signal processing, we model speech signals as cycle or line graphs. Such a graph structure enables us to construct a Graph Convolution Network (GCN)-based architecture that can perform an accurate graph convolution, in contrast to the approximate convolution used in standard GCNs. We evaluated the performance of our model for speech emotion recognition on the popular IEMOCAP and MSP-IMPROV databases. Our model outperforms GCN...
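A toy sketch of the cycle-graph formulation, treating frame-level acoustic features as node signals (the normalized-adjacency propagation below is a standard GCN-style approximation for illustration, not the exact convolution the paper proposes):

```python
import torch
import torch.nn as nn

def cycle_graph_adjacency(num_nodes):
    """Row-normalized adjacency of a cycle graph: each frame links to its neighbours."""
    A = torch.zeros(num_nodes, num_nodes)
    idx = torch.arange(num_nodes)
    A[idx, (idx + 1) % num_nodes] = 1.0
    A[idx, (idx - 1) % num_nodes] = 1.0
    A = A + torch.eye(num_nodes)                      # add self-loops
    return A / A.sum(dim=1, keepdim=True)

class GraphEmotionClassifier(nn.Module):
    """One graph-convolution layer + mean pooling + linear classifier."""
    def __init__(self, in_dim=40, hidden=64, num_classes=4):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, X, A):
        # X: (num_frames, in_dim) frame-level acoustic features, A: graph adjacency.
        h = torch.relu(self.w1(A @ X))                # propagate along the cycle
        return self.out(h.mean(dim=0))                # utterance-level prediction

num_frames = 120
X = torch.randn(num_frames, 40)                       # e.g. 40-D log-mel frames
A = cycle_graph_adjacency(num_frames)
logits = GraphEmotionClassifier()(X, A)
print(logits.shape)                                    # torch.Size([4])
```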
Human emotion is expressed, perceived and captured using a variety of dynamic data modalities, such as speech (verbal), videos (facial expressions) and motion sensors (body gestures). We propose a generalized approach to emotion recognition that can adapt across modalities by modeling dynamic data as structured graphs. The motivation behind the graph approach is to build compact models without compromising on performance. To alleviate the problem of optimal graph construction, we cast this as a joint graph learning and classification task. To this end, we present the Learnable...
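A rough sketch of casting graph construction and classification as one learning problem, with a learnable row-stochastic adjacency trained jointly with the classifier (an illustrative simplification, not the paper's learnable-graph model):

```python
import torch
import torch.nn as nn

class LearnableGraphClassifier(nn.Module):
    """Jointly learns an adjacency over time steps and an emotion classifier."""
    def __init__(self, num_nodes, in_dim, hidden=32, num_classes=4):
        super().__init__()
        # Unconstrained edge scores turned into a normalized adjacency via softmax.
        self.edge_scores = nn.Parameter(torch.zeros(num_nodes, num_nodes))
        self.proj = nn.Linear(in_dim, hidden)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, X):
        A = torch.softmax(self.edge_scores, dim=1)     # learned, row-stochastic graph
        h = torch.relu(self.proj(A @ X))               # one graph-convolution step
        return self.out(h.mean(dim=0))

num_nodes, in_dim = 50, 20                             # e.g. 50 time steps of 20-D features
model = LearnableGraphClassifier(num_nodes, in_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
X, y = torch.randn(num_nodes, in_dim), torch.tensor(2)

for _ in range(5):                                     # graph and classifier trained together
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(X).unsqueeze(0), y.unsqueeze(0))
    loss.backward()
    opt.step()
print(float(loss))
```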
Media is created by humans, for humans, to tell stories. There exists a natural and imminent need for creating human-centered media analytics to illuminate the stories being told and understand their impact on individuals and society at large. An objective understanding of media content has numerous applications for different stakeholders, from creators and decision-/policy-makers to consumers. Advances in multimodal signal processing and machine learning (ML) can enable a detailed and nuanced characterization (of who, what, how, where, why)...
We explore the efficacy of multimodal behavioral cues for the explainable prediction of personality and interview-specific traits. We utilize elementary head-motion units named kinemes, atomic facial movements termed action units, and speech features to estimate these human-centered traits. Empirical results confirm that these cues enable the discovery of multiple trait-specific behaviors while also enabling explainability in support of the predictions. For fusing cues, we explore decision and feature-level fusion, and an additive attention-based fusion...
A new line of research uses compression methods to measure the similarity between signals. Two signals are considered similar if one can be compressed significantly when the information of the other is known. The existing compression-based methods, although successful in the discrete one-dimensional domain, do not work well in the context of images. This paper proposes a sparse representation-based approach to encode the information content of an image using information from the other image, and uses the compactness (sparsity) of the representation as a measure of its compressibility (how...
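An illustrative sketch of the compression-as-similarity idea: learn a patch dictionary from one image and measure how sparsely OMP can encode patches of another image with it (patch sizes, tolerances and the sparsity measure here are placeholder choices, not the paper's formulation):

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d
from sklearn.decomposition import MiniBatchDictionaryLearning, SparseCoder

def patch_dictionary(image, n_atoms=64, patch=(8, 8), n_patches=500, seed=0):
    """Learn a patch dictionary from one (grayscale) image."""
    P = extract_patches_2d(image, patch, max_patches=n_patches, random_state=seed)
    P = P.reshape(len(P), -1).astype(float)
    P -= P.mean(axis=1, keepdims=True)
    dl = MiniBatchDictionaryLearning(n_components=n_atoms, random_state=seed)
    return dl.fit(P).components_

def compression_cost(image_a, image_b, tol=1.0, patch=(8, 8)):
    """How compactly image_a can be encoded with atoms learned from image_b."""
    D = patch_dictionary(image_b, patch=patch)
    P = extract_patches_2d(image_a, patch, max_patches=200, random_state=1)
    P = P.reshape(len(P), -1).astype(float)
    P -= P.mean(axis=1, keepdims=True)
    coder = SparseCoder(dictionary=D, transform_algorithm="omp", transform_alpha=tol)
    codes = coder.transform(P)
    # Fewer non-zero coefficients per patch -> more compressible -> more similar.
    return float(np.count_nonzero(codes, axis=1).mean())

rng = np.random.default_rng(0)
img1 = rng.random((64, 64))
img2 = img1 + 0.05 * rng.random((64, 64))   # near-duplicate of img1
img3 = rng.random((64, 64))                 # unrelated image
print(compression_cost(img2, img1), compression_cost(img3, img1))
```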
This paper introduces the problem of multiple object forecasting (MOF), in which the goal is to predict the future bounding boxes of tracked objects. In contrast to existing works on trajectory forecasting, which primarily consider the problem from a birds-eye perspective, we formulate the problem from an object-level perspective and call for the prediction of full object bounding boxes, rather than trajectories alone. Towards solving this task, we introduce the Citywalks dataset, which consists of over 200k high-resolution video frames. Citywalks comprises footage recorded in 21 cities in 10 European...
We present an audio-visual multimodal approach for the task of zero-shot learning (ZSL) for classification and retrieval of videos. ZSL has been studied extensively in the recent past but has primarily been limited to the visual modality and to images. We demonstrate that both audio and visual modalities are important for this task. Since a dataset for this study is currently not available, we also construct an appropriate multimodal dataset with 33 classes containing 156,416 videos, from an existing large-scale event dataset. We empirically show that the performance improves by adding the audio modality for both tasks...
Children with Autism Spectrum Disorder (ASD) are known to have difficulty in producing and perceiving emotional facial expressions. Their expressions are often perceived as atypical by adult observers. This paper focuses on data-driven ways to analyze and quantify the atypicality of facial expressions of children with ASD. Our objective is to uncover those characteristics of facial gestures that induce the sense of atypicality. Using a carefully collected motion capture database, expressions of children with and without ASD are compared within six basic emotion categories, employing methods from...
This paper addresses the problem of continuous emotion prediction in movies from multimodal cues. The rich movie content is inherently multimodal, where emotion is evoked through both audio (music, speech) and video modalities. To capture such affective information, we put forth a set of features that includes several novel ones, such as Video Compressibility and Histogram of Facial Area (HFA). We propose a Mixture of Experts (MoE)-based fusion model that dynamically combines information from the modalities for predicting emotion evoked by movies. A learning...
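A minimal sketch of Mixture-of-Experts fusion for a continuous affect score, with one expert per modality and a gating network that weights them per sample (the feature dimensions and gating design are illustrative assumptions, not the paper's model):

```python
import torch
import torch.nn as nn

class MoEFusion(nn.Module):
    """Two modality experts whose predictions are mixed by a learned gate."""
    def __init__(self, audio_dim, video_dim, hidden=32):
        super().__init__()
        self.audio_expert = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU(),
                                          nn.Linear(hidden, 1))
        self.video_expert = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU(),
                                          nn.Linear(hidden, 1))
        # Gating network sees both modalities and outputs per-sample mixing weights.
        self.gate = nn.Sequential(nn.Linear(audio_dim + video_dim, 2), nn.Softmax(dim=-1))

    def forward(self, audio, video):
        preds = torch.cat([self.audio_expert(audio), self.video_expert(video)], dim=-1)
        weights = self.gate(torch.cat([audio, video], dim=-1))
        # Dynamic combination of the two experts.
        return (weights * preds).sum(dim=-1, keepdim=True)

model = MoEFusion(audio_dim=16, video_dim=24)
audio_feats = torch.randn(8, 16)     # e.g. per-clip audio features
video_feats = torch.randn(8, 24)     # e.g. per-clip video features
valence = model(audio_feats, video_feats)
print(valence.shape)                  # torch.Size([8, 1])
```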
We introduce the problem of learning affective correspondence between audio (music) and visual data (images). For this task, a music clip and an image are considered similar (having true correspondence) if they have similar emotion content. In order to estimate this crossmodal, emotion-centric similarity, we propose a deep neural network architecture that learns to project the data from the two modalities to a common representation space, and performs a binary classification task of predicting the correspondence (true or false). To facilitate the current study,...
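A small sketch of the two-branch setup described above: project each modality into a shared space and classify the pair as a true or false affective correspondence (dimensions and the concatenation-based classifier are assumptions for illustration, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class AffectiveCorrespondenceNet(nn.Module):
    """Projects music and image features to a shared space and predicts match/no-match."""
    def __init__(self, audio_dim=128, image_dim=512, embed_dim=64):
        super().__init__()
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU(),
                                          nn.Linear(embed_dim, embed_dim))
        self.image_branch = nn.Sequential(nn.Linear(image_dim, embed_dim), nn.ReLU(),
                                          nn.Linear(embed_dim, embed_dim))
        self.classifier = nn.Linear(2 * embed_dim, 2)   # true / false correspondence

    def forward(self, audio, image):
        za = self.audio_branch(audio)                    # music embedding
        zi = self.image_branch(image)                    # image embedding
        return self.classifier(torch.cat([za, zi], dim=-1))

net = AffectiveCorrespondenceNet()
music = torch.randn(4, 128)          # pre-extracted music-clip features
images = torch.randn(4, 512)         # pre-extracted image features
labels = torch.tensor([1, 0, 1, 0])  # 1 = similar emotion content, 0 = different
loss = nn.functional.cross_entropy(net(music, images), labels)
loss.backward()
print(float(loss))
```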
The CLIP (Contrastive Language-Image Pretraining) model has exhibited outstanding performance in recognition problems, such as zero-shot image classification and object detection. However, its ability to count remains understudied due to the inherent challenges of transforming counting--a regression task--into a recognition task. In this paper, we investigate CLIP's potential for counting, focusing specifically on estimating crowd sizes. Existing classification-based crowd-counting methods have encountered...
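A hedged example of the classification-style reformulation that CLIP allows: score an image against text prompts describing crowd-size ranges and pick the best-matching bin (the bins and prompt wording are illustrative, not the paper's method; this uses the Hugging Face transformers CLIP interface):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Discretize crowd size into ordered bins and phrase each bin as a text prompt.
bins = ["0 to 10", "10 to 50", "50 to 100", "100 to 500", "more than 500"]
prompts = [f"a photo of a crowd of {b} people" for b in bins]

image = Image.new("RGB", (224, 224))     # placeholder; use a real crowd photo in practice
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits_per_image   # image-text similarity scores
probs = logits.softmax(dim=-1)[0]
print("predicted size range:", bins[int(probs.argmax())])
```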
In general, popular films and screenplays follow a well-defined storytelling paradigm that comprises three essential segments or acts: exposition (act I), conflict (act II) and resolution (act III). Deconstructing a movie into its narrative units can enrich the semantic understanding of movies, and help in summarization, navigation and detection of the key events. A multimodal framework for detecting such a three-act structure is developed in this paper. Various low-level features are designed and extracted from the video, audio and text...
This work proposes a trajectory clustering-based approach for segmenting flow patterns in high density crowd videos. The goal is to produce a pixel-wise segmentation of a video sequence (from a static camera), where each segment corresponds to a different motion pattern. Unlike previous studies that use only motion vectors, we extract full trajectories so as to capture the complete temporal evolution of each region (block) in the video sequence. The extracted trajectories are dense, complex and often overlapping. A novel clustering algorithm is developed...
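An illustrative sketch of trajectory clustering for motion-pattern discovery: resample trajectories to a fixed length and group them with an off-the-shelf clustering algorithm (the paper develops its own clustering algorithm for dense, overlapping trajectories; this only shows the general idea on toy data):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def resample(traj, num_points=16):
    """Resample a variable-length (x, y) trajectory to a fixed number of points."""
    traj = np.asarray(traj, dtype=float)
    t_old = np.linspace(0, 1, len(traj))
    t_new = np.linspace(0, 1, num_points)
    return np.stack([np.interp(t_new, t_old, traj[:, d]) for d in range(2)], axis=1)

rng = np.random.default_rng(0)
# Toy trajectories: one group moving right, one moving down, with noise.
right = [np.stack([np.linspace(0, 50, n), 10 + rng.normal(0, 1, n)], axis=1)
         for n in rng.integers(20, 40, size=15)]
down = [np.stack([30 + rng.normal(0, 1, n), np.linspace(0, 50, n)], axis=1)
        for n in rng.integers(20, 40, size=15)]
trajectories = right + down

features = np.stack([resample(t).ravel() for t in trajectories])
labels = AgglomerativeClustering(n_clusters=2).fit_predict(features)
print(labels)   # trajectories sharing a motion pattern fall in the same cluster
```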
We introduce the task of multi-camera trajectory forecasting (MCTF), where the future trajectory of an object is predicted in a network of cameras. Prior works consider forecasting trajectories in a single camera view. Our work is the first to consider the challenging scenario of forecasting across multiple non-overlapping camera views. This has wide applicability in tasks such as re-identification and multi-target tracking. To facilitate research in this new area, we release the Warwick-NTU Multi-camera Forecasting Database (WNMF), a unique dataset of pedestrian trajectories from 15 synchronized...
Large scale databases with high-quality manual annotations are scarce in the audio domain. We thus explore a self-supervised graph approach to learning audio representations from highly limited labelled data. Considering each audio sample as a graph node, we propose a subgraph-based framework with novel self-supervision tasks that can learn effective audio representations. During training, subgraphs are constructed by sampling from the entire pool of available training data to exploit the relationship between labelled and unlabeled samples. During inference,...
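A rough sketch of the subgraph-sampling idea: build a k-NN subgraph over a mixed pool of labelled and unlabelled audio embeddings and train with a loss on the few labelled nodes (the paper's novel self-supervision tasks are omitted here; all names, sizes and features are placeholders):

```python
import torch
import torch.nn as nn

def knn_adjacency(X, k=5):
    """Row-normalized k-nearest-neighbour adjacency built from pairwise distances."""
    dist = torch.cdist(X, X)
    idx = dist.topk(k + 1, largest=False).indices[:, 1:]   # skip self-distance
    A = torch.zeros(len(X), len(X))
    A.scatter_(1, idx, 1.0)
    A = A + torch.eye(len(X))                              # self-loops
    return A / A.sum(dim=1, keepdim=True)

features = torch.randn(300, 128)                 # pre-extracted audio embeddings (placeholder)
labels = torch.randint(0, 10, (300,))
labelled = torch.arange(30)                      # only 30 of 300 samples carry labels

lin1, lin2 = nn.Linear(128, 64), nn.Linear(64, 10)
opt = torch.optim.Adam(list(lin1.parameters()) + list(lin2.parameters()), lr=1e-3)

for step in range(20):
    # Sample a subgraph that mixes a few labelled nodes with unlabelled ones.
    nodes = torch.cat([labelled[torch.randperm(30)[:8]],
                       30 + torch.randperm(270)[:56]])
    X, A = features[nodes], knn_adjacency(features[nodes])
    h = torch.relu(lin1(A @ X))                  # propagate features along the subgraph
    logits = lin2(A @ h)
    mask = nodes < 30                            # supervised loss only on labelled nodes
    loss = nn.functional.cross_entropy(logits[mask], labels[nodes][mask])
    opt.zero_grad(); loss.backward(); opt.step()
print("final loss:", float(loss))
```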
Non-verbal behavioral cues, such as head movement, play a significant role in human communication and affective expression. Although facial expressions and gestures have been extensively studied in the context of emotion understanding, head motion (which accompanies both) is relatively less understood. This paper studies the significance of head movement in adults' affect communication using videos from movies. These videos are taken from the Acted Facial Expressions in the Wild (AFEW) database and are labeled with seven basic emotion categories: anger, disgust, fear, joy,...