Mathew Magimai.-Doss

ORCID: 0000-0002-8714-1409
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Natural Language Processing Techniques
  • Speech and Dialogue Systems
  • Emotion and Mood Recognition
  • Phonetics and Phonology Research
  • Voice and Speech Disorders
  • Hand Gesture Recognition Systems
  • Blind Source Separation Techniques
  • Hearing Impairment and Communication
  • Neural Networks and Applications
  • Topic Modeling
  • Advanced Data Compression Techniques
  • Advanced Adaptive Filtering Techniques
  • Human Pose and Action Recognition
  • Gait Recognition and Analysis
  • Text and Document Classification Technologies
  • Advanced Memory and Neural Computing
  • Indoor and Outdoor Localization Technologies
  • Phonocardiography and Auscultation Techniques
  • Animal Vocal Communication and Behavior
  • Video Analysis and Summarization
  • Mental Health via Writing
  • Neural Dynamics and Brain Function

Idiap Research Institute
2016-2025

Radboud University Nijmegen
2012

International Computer Science Institute
2007-2008

École Polytechnique Fédérale de Lausanne
2002-2006

Dalle Molle Institute for Artificial Intelligence Research
2003-2004

Universitat Politècnica de Catalunya
2003

University of Washington
1991-1998

Chiron (Norway)
1998

Center Point
1998

IBM (United States)
1981

Automatic speech recognition systems typically model the relationship between the acoustic signal and phones in two separate steps: feature extraction and classifier training. In our recent works, we have shown that, in the framework of convolutional neural networks (CNN), the raw speech signal can be directly modeled and ASR systems competitive to the standard approach can be built. In this paper, we first analyze and show that, in its first layers, the CNN learns (in parts) and models phone-specific spectral envelope information from 2-4 ms of speech. Given that the CNN-based approach yields trends...

10.21437/interspeech.2015-3 article EN Interspeech 2015 2015-09-06
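
As a rough illustration of the architecture family this abstract describes, the sketch below is a minimal raw-waveform CNN acoustic model in PyTorch: convolution layers operate directly on speech samples and a small head outputs phone-class scores. All layer sizes and kernel widths here are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal raw-waveform CNN acoustic model (illustrative sizes, not the
# paper's exact configuration): conv layers see raw samples, the head
# outputs phone-class logits (softmax would give posteriors).
import torch
import torch.nn as nn

class RawSpeechCNN(nn.Module):
    def __init__(self, n_phones=40):
        super().__init__()
        self.features = nn.Sequential(
            # the first kernel spans only a few ms of 16 kHz signal
            nn.Conv1d(1, 80, kernel_size=30, stride=10), nn.ReLU(),
            nn.Conv1d(80, 60, kernel_size=7), nn.ReLU(),
            nn.Conv1d(60, 60, kernel_size=7), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(60, 500), nn.ReLU(),
            nn.Linear(500, n_phones),
        )

    def forward(self, wav):            # wav: (batch, 1, n_samples)
        return self.classifier(self.features(wav))

x = torch.randn(8, 1, 4000)            # 250 ms of signal at 16 kHz
logits = RawSpeechCNN()(x)             # (8, n_phones)
```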

In hybrid hidden Markov model/artificial neural network (HMM/ANN) automatic speech recognition (ASR) systems, the phoneme class conditional probabilities are estimated by first extracting acoustic features from the speech signal based on prior knowledge, such as speech perception and/or production knowledge, and then modeling those features with an ANN. Recent advances in machine learning techniques, more specifically in the fields of image processing and text processing, have shown that such a divide-and-conquer strategy (i.e., separating...

10.21437/interspeech.2013-438 article EN Interspeech 2013 2013-08-25
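
For context, hybrid HMM/ANN decoding turns the ANN's phoneme posteriors into scaled likelihoods via Bayes' rule. The formulation below is the standard one for hybrid systems in general, not a result specific to this paper:

```latex
% Scaled-likelihood relation in hybrid HMM/ANN decoding: the ANN estimates
% the posterior P(q_k | x_t); dividing by the class prior P(q_k) gives a
% quantity proportional to the emission likelihood, since p(x_t) is
% constant across classes at each frame.
\[
  p(x_t \mid q_k) = \frac{P(q_k \mid x_t)\, p(x_t)}{P(q_k)}
  \;\propto\; \frac{P(q_k \mid x_t)}{P(q_k)}
\]
```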

State-of-the-art automatic speech recognition systems model the relationship between the acoustic signal and phone classes in two stages, namely, extraction of spectral-based features based on prior knowledge followed by training of an acoustic model, typically an artificial neural network (ANN). In our recent work, it was shown that Convolutional Neural Networks (CNNs) can learn relevant features directly from the raw signal, reaching performance on par with other existing feature-based approaches. This paper extends the CNN-based approach to large...

10.1109/icassp.2015.7178781 article EN 2015-04-01
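
A quick worked computation connects kernel width to the "2-4 ms" span mentioned in the first abstract above: the time covered by a 1-D convolution kernel on raw speech is simply kernel size over sampling rate. The kernel sizes below are illustrative, not taken from the papers.

```python
# Temporal span (in ms) of a 1-D convolution kernel on raw speech.
# Kernel sizes are illustrative; the papers report that the first
# layers model roughly 2-4 ms of signal.
def kernel_span_ms(kernel_size: int, sample_rate: int = 16000) -> float:
    return 1000.0 * kernel_size / sample_rate

for k in (30, 50, 64):
    print(f"{k} samples at 16 kHz -> {kernel_span_ms(k):.2f} ms")
# 30 samples at 16 kHz -> 1.88 ms
# 50 samples at 16 kHz -> 3.12 ms
# 64 samples at 16 kHz -> 4.00 ms
```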

Speaker verification systems traditionally extract and model cepstral features or filter bank energies from the speech signal. In this paper, inspired by the success of neural network-based approaches that directly model the raw signal for applications such as speech recognition, emotion recognition and anti-spoofing, we propose a speaker verification approach where speaker-discriminative information is learned by: (a) first training a CNN-based speaker identification system that takes the raw signal as input and learns to classify speakers (unknown to the verification system); and then (b)...

10.1109/icassp.2018.8462165 article EN 2018-04-01
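
The verification stage that follows such a pre-trained speaker-ID network is often an embedding comparison. The sketch below is a hedged illustration of that idea, scoring a trial by cosine similarity between an enrollment and a test embedding; the threshold and dimensions are assumptions, and obtaining the embeddings (a forward pass through an internal CNN layer) is left abstract.

```python
# Illustrative verification scoring: compare an enrollment embedding and
# a test embedding with cosine similarity. Embedding extraction (a CNN
# forward pass) is assumed; dimensions and threshold are placeholders.
import numpy as np

def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    e = enroll_emb / np.linalg.norm(enroll_emb)
    t = test_emb / np.linalg.norm(test_emb)
    return float(e @ t)

rng = np.random.default_rng(0)
enroll, test = rng.normal(size=64), rng.normal(size=64)
accept = cosine_score(enroll, test) > 0.5   # threshold tuned on dev data
```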

We report on investigations, conducted at the 2006 Johns Hopkins Workshop, into the use of articulatory features (AFs) for observation and pronunciation models in speech recognition. In the area of observation modeling, we use the outputs of AF classifiers both directly, in an extension of hybrid HMM/neural network models, and as part of the observation vector, in an extension of the "tandem" approach. For pronunciation modeling, we investigate a model having multiple streams of AF states with soft synchrony constraints, for both audio-only and audio-visual recognition. The models are implemented as dynamic Bayesian networks, and tested on tasks from...

10.1109/icassp.2007.366989 article EN 2007-04-01
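
One common recipe for the "tandem" use of classifier outputs mentioned above is to log-transform the per-frame posteriors, decorrelate them (e.g., with a PCA-style projection), and use the result as acoustic features. The sketch below illustrates that recipe only; the dimensions and the specific transform in the paper may differ.

```python
# Sketch of a tandem feature construction: log posteriors, mean-centered,
# projected onto their principal directions via SVD. Dimensions are
# illustrative, not the paper's setup.
import numpy as np

def tandem_features(posteriors: np.ndarray, n_components: int = 25) -> np.ndarray:
    """posteriors: (n_frames, n_classes), rows summing to 1."""
    logp = np.log(posteriors + 1e-10)           # log reduces skew
    logp -= logp.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(logp, full_matrices=False)
    return logp @ vt[:n_components].T           # PCA-style decorrelation

rng = np.random.default_rng(1)
post = rng.dirichlet(np.ones(40), size=100)     # 100 frames, 40 classes
feats = tandem_features(post)                   # (100, 25)
```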

We analyze a simple hierarchical architecture consisting of two multilayer perceptron (MLP) classifiers in tandem to estimate the phonetic class conditional probabilities. In this setup, the first MLP classifier is trained using standard acoustic features. The second MLP is trained on the posterior probabilities of phonemes estimated by the first, but with a long temporal context of around 150-230 ms. Through extensive phoneme recognition experiments, and an analysis using Volterra series, we show that 1) the system yields higher...

10.1109/tasl.2010.2045943 article EN IEEE Transactions on Audio, Speech, and Language Processing 2010-03-19
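
A small sketch of how the second MLP's long-context input can be built: posterior frames from the first MLP are stacked over a window. Assuming a 10 ms frame shift, a ±10 frame context covers 21 frames, i.e. about 210 ms, inside the 150-230 ms range the abstract mentions; the window size here is an assumption.

```python
# Stack posterior frames over a long context window to form the second
# MLP's input. With a 10 ms frame shift, left=right=10 gives a 21-frame
# (~210 ms) window; the exact context in the paper may differ.
import numpy as np

def stack_context(posteriors: np.ndarray, left: int = 10, right: int = 10):
    padded = np.pad(posteriors, ((left, right), (0, 0)), mode="edge")
    n = posteriors.shape[0]
    return np.stack(
        [padded[t : t + left + right + 1].ravel() for t in range(n)]
    )  # (n_frames, (left + right + 1) * n_classes)

post = np.random.default_rng(2).dirichlet(np.ones(40), size=200)
ctx = stack_context(post)   # (200, 21 * 40), input to the second MLP
```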

Biometric systems are exposed to spoofing attacks which may compromise their security, and voice biometrics based on automatic speaker verification (ASV) is no exception. To increase the robustness against such attacks, anti-spoofing systems have been proposed for the detection of replay, synthesis and voice conversion-based attacks. However, most anti-spoofing techniques are loosely integrated with the ASV system. In this work, we develop a new integration neural network which jointly processes the embeddings extracted from the ASV and anti-spoofing systems in order to detect...

10.1109/tifs.2020.3039045 article EN IEEE Transactions on Information Forensics and Security 2020-11-18
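
The sketch below is a hedged illustration of the integration idea described in the abstract: a small network takes the ASV embedding and the anti-spoofing (countermeasure) embedding of a trial and produces a joint decision. The embedding dimensions, depth, and two-class output are assumptions for illustration, not the paper's architecture.

```python
# Illustrative integration network: concatenate the ASV and countermeasure
# embeddings of a trial and classify jointly. Sizes are placeholders.
import torch
import torch.nn as nn

class IntegrationNet(nn.Module):
    def __init__(self, asv_dim=192, cm_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(asv_dim + cm_dim, 128), nn.ReLU(),
            nn.Linear(128, 2),   # genuine target vs. everything else
        )

    def forward(self, asv_emb, cm_emb):
        return self.net(torch.cat([asv_emb, cm_emb], dim=-1))

logits = IntegrationNet()(torch.randn(4, 192), torch.randn(4, 128))
```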

10.1109/icassp49660.2025.10888697 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10890852 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10890800 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10889684 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

Embarrassment is a social emotion that shares many characteristics with social anxiety (SA). Most people experience embarrassment in their daily lives, but it is quite overlooked in research. We characterized embarrassment through an interdisciplinary approach, introducing a behavioral paradigm and applying machine learning approaches, including acoustic analyses. 33 participants wrote about an embarrassing experience and then, without knowing so beforehand, had to read it out loud to the conductor. Embarrassment was then examined using two different approaches:...

10.1038/s41598-025-94051-9 article EN cc-by-nc-nd Scientific Reports 2025-03-20

Posterior probabilities of sub-word units have been shown to be an effective front-end for ASR. However, attempts to model this type of features either do not benefit from modeling context-dependent phonemes, or use an inefficient distribution to estimate the state likelihood. This paper presents a novel acoustic model for posterior features that overcomes these limitations. The proposed model can be seen as an HMM where the score associated with each state is the KL divergence between a distribution characterizing the state and the test utterance. This KL-based model establishes...

10.21437/interspeech.2008-110 article EN Interspeech 2008 2008-09-22
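
As a sketch of the scoring rule described above: each HMM state s holds a categorical distribution y_s over sub-word units, each frame t yields a posterior feature vector z_t, and the local score is their KL divergence. The divergence direction shown here is one of the variants studied in this line of work:

```latex
% KL-based local score: y_s is the state's categorical distribution over
% D sub-word units, z_t the posterior feature vector at frame t. Decoding
% minimizes the divergence accumulated along the state sequence rather
% than maximizing a likelihood.
\[
  S(s, t) = \mathrm{KL}\left(y_s \,\middle\|\, z_t\right)
          = \sum_{d=1}^{D} y_s(d) \, \log \frac{y_s(d)}{z_t(d)}
\]
```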

Development of countermeasures to detect attacks performed on speaker verification systems through presentation of forged or altered speech samples is a challenging and open research problem. Typically, this problem is approached by extracting features through conventional short-term processing and feeding them to a binary classifier. In this article, we develop a convolutional neural network-based approach that learns, in an end-to-end manner, both the features and the classifier from the raw signal. Through investigations on two publicly...

10.1109/btas.2017.8272715 article EN 2017-10-01

Automatic Gender Recognition (AGR) is the task of identifying the gender of a speaker given a speech signal. Standard approaches extract features like fundamental frequency and cepstral features from the signal and train a binary classifier. Inspired by recent works in the areas of automatic speech recognition (ASR) and presentation attack detection, we present a novel approach where relevant features and the classifier are jointly learned from the raw signal in an end-to-end manner. We propose a convolutional neural network (CNN) based architecture that consists of: (1) convolution layers, which...

10.21437/interspeech.2018-1240 article EN Interspeech 2018 2018-08-28

Automatic speaker verification systems can be spoofed through recorded, synthetic, or voice-converted speech of target speakers. To make these systems practically viable, the detection of such attacks, referred to as presentation attacks, is of paramount interest. In that direction, this paper investigates two aspects: 1) a novel approach to detect presentation attacks where, unlike conventional approaches, no signal modeling related assumptions are made; rather, the attacks are detected by computing first-order and second-order spectral...

10.1109/taslp.2017.2743340 article EN IEEE/ACM Transactions on Audio, Speech, and Language Processing 2017-08-23
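
To make "first-order and second-order spectral statistics" concrete, the sketch below computes the mean and variance of the log-magnitude spectrum across frames of an utterance. The framing parameters are assumptions, and the exact statistics and features used in the paper may differ.

```python
# Per-utterance first-order (mean) and second-order (variance) statistics
# of the log-magnitude spectrum across frames. Framing parameters are
# illustrative; the paper's exact statistics may differ.
import numpy as np

def spectral_statistics(wav: np.ndarray, n_fft: int = 512, hop: int = 160):
    frames = np.lib.stride_tricks.sliding_window_view(wav, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))
    log_spec = np.log(spec + 1e-10)
    return log_spec.mean(axis=0), log_spec.var(axis=0)

wav = np.random.default_rng(3).normal(size=16000)   # 1 s at 16 kHz
mu, var = spectral_statistics(wav)                  # each of shape (257,)
```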

During depression, neurophysiological changes can occur which may affect laryngeal control, i.e. the behaviour of the vocal folds. Characterising these changes in a precise manner from speech signals is a non-trivial task, as this typically involves reliable separation of the voice source information from the signal. In this paper, by exploiting the abilities of CNNs to learn task-relevant information from raw input signals, we investigate several methods to model voice source related information for depression detection. Specifically, modelling the low-pass filtered linear prediction residual...

10.1109/icassp.2019.8683498 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17
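
A minimal sketch of one voice-source signal named in the abstract, the low-pass filtered linear prediction (LP) residual: estimate LP coefficients, inverse-filter the speech to obtain the residual, then low-pass filter it. The LP order and cutoff frequency below are illustrative assumptions, not the paper's settings.

```python
# LP residual extraction followed by low-pass filtering. The residual is
# obtained by inverse filtering speech with its LP coefficients; order
# and cutoff here are illustrative choices.
import numpy as np
import librosa
import scipy.signal as sig

def lowpass_lp_residual(wav, sr=16000, order=20, cutoff_hz=1000):
    a = librosa.lpc(wav, order=order)        # [1, a1, ..., ap]
    residual = sig.lfilter(a, [1.0], wav)    # inverse (whitening) filter
    b, a_lp = sig.butter(4, cutoff_hz / (sr / 2), btype="low")
    return sig.lfilter(b, a_lp, residual)    # low-pass filtered residual

wav = np.random.default_rng(4).normal(size=16000).astype(np.float64)
src = lowpass_lp_residual(wav)               # candidate CNN input signal
```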

In the light of the current COVID-19 pandemic, the need for remote digital health assessment tools is greater than ever. This statement is especially pertinent for elderly and vulnerable populations. In this regard, the INTERSPEECH 2020 Alzheimer's Dementia Recognition through Spontaneous Speech (ADReSS) Challenge offers competitors the opportunity to develop speech and language-based systems for the task of Alzheimer's Dementia (AD) recognition. The challenge data consists of speech recordings and their transcripts, and the work presented herein is an assessment of different contemporary...

10.21437/interspeech.2020-2635 article EN Interspeech 2020 2020-10-25

Respiration is an essential and primary mechanism for speech production. We first inhale and then produce speech while exhaling. When we run out of breath, we stop speaking and inhale. Though this process is involuntary, speech production involves a systematic outflow of air during exhalation, characterized by the linguistic content and prosodic factors of the utterance. Thus speech and respiration are closely related, and modeling this relationship makes sensing respiratory dynamics directly from speech plausible; this is, however, not well explored. In this article,...

10.1016/j.neunet.2021.03.029 article EN cc-by-nc-nd Neural Networks 2021-04-05

In this paper, we investigate the significance of contextual information in a phoneme recognition system using the hidden Markov model - artificial neural network paradigm. Contextual information is probed at the feature level as well as at the output of the multilayered perceptron. At the feature level, we analyze and compare different methods to model sub-phonemic classes. To exploit the contextual information at the output of the multilayered perceptron, we propose hierarchical estimation of posterior probabilities. The best phoneme recognition accuracy (excluding silence) of 73.4% on the TIMIT database is comparable to that of the state-of-the-art...

10.1109/icassp.2008.4518643 article EN IEEE International Conference on Acoustics, Speech and Signal Processing 2008-03-01