- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Speech and dialogue systems
- Emotion and Mood Recognition
- Phonetics and Phonology Research
- Voice and Speech Disorders
- Hand Gesture Recognition Systems
- Blind Source Separation Techniques
- Hearing Impairment and Communication
- Neural Networks and Applications
- Topic Modeling
- Advanced Data Compression Techniques
- Advanced Adaptive Filtering Techniques
- Human Pose and Action Recognition
- Gait Recognition and Analysis
- Text and Document Classification Technologies
- Advanced Memory and Neural Computing
- Indoor and Outdoor Localization Technologies
- Phonocardiography and Auscultation Techniques
- Animal Vocal Communication and Behavior
- Video Analysis and Summarization
- Mental Health via Writing
- Neural dynamics and brain function
Idiap Research Institute
2016-2025
Radboud University Nijmegen
2012
International Computer Science Institute
2007-2008
École Polytechnique Fédérale de Lausanne
2002-2006
Dalle Molle Institute for Artificial Intelligence Research
2003-2004
Universitat Politècnica de Catalunya
2003
University of Washington
1991-1998
Chiron (Norway)
1998
Center Point
1998
IBM (United States)
1981
Automatic speech recognition (ASR) systems typically model the relationship between the acoustic signal and phones in two separate steps: feature extraction and classifier training. In our recent works, we have shown that, in the framework of convolutional neural networks (CNN), the raw speech signal can be directly modeled and ASR systems competitive with the standard approach can be built. In this paper, we first analyze and show that, in its first convolution layers, the CNN learns (in parts) and models phone-specific spectral envelope information from 2-4 ms of speech. Given that the CNN-based approach yields trends...
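A minimal PyTorch sketch of the general idea (not the authors' exact architecture): a CNN whose first convolution acts as a learned filterbank over a few milliseconds of raw waveform, followed by phone classification. The layer sizes and the 40-phone inventory are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RawSpeechCNN(nn.Module):
    def __init__(self, n_phones: int = 40):
        super().__init__()
        self.features = nn.Sequential(
            # ~30 samples at 16 kHz is roughly 2 ms of speech per filter tap
            nn.Conv1d(1, 80, kernel_size=30, stride=10), nn.ReLU(),
            nn.Conv1d(80, 60, kernel_size=7), nn.ReLU(),
            nn.Conv1d(60, 60, kernel_size=7), nn.ReLU(),
        )
        self.classifier = nn.Linear(60, n_phones)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, samples) raw waveform
        h = self.features(wav)      # (batch, 60, frames)
        h = h.mean(dim=-1)          # pool over time for a window-level decision
        return self.classifier(h)   # unnormalized phone scores

model = RawSpeechCNN()
window = torch.randn(8, 1, 4000)    # eight 250 ms windows at 16 kHz
print(model(window).shape)          # torch.Size([8, 40])
```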
In hybrid hidden Markov model/artificial neural network (HMM/ANN) automatic speech recognition (ASR) systems, the phoneme class conditional probabilities are estimated by first extracting acoustic features from the speech signal based on prior knowledge, such as speech perception and/or production knowledge, and then modeling those features with an ANN. Recent advances in machine learning techniques, more specifically in the fields of image processing and text processing, have shown that the divide-and-conquer strategy (i.e., separating...
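The hybrid HMM/ANN scoring step can be made concrete with a small sketch: the ANN's phoneme posteriors p(q|x) are divided by the class priors p(q) to obtain scaled likelihoods p(x|q) ∝ p(q|x)/p(q) for HMM decoding. The numbers below are illustrative, not real model outputs.

```python
import numpy as np

posteriors = np.array([0.70, 0.20, 0.10])   # p(q|x) from the ANN for 3 phoneme classes
priors = np.array([0.50, 0.30, 0.20])       # p(q) estimated from training alignments

scaled_likelihoods = posteriors / priors     # proportional to p(x|q), up to p(x)
log_emission = np.log(scaled_likelihoods)    # HMM decoding works in the log domain
print(log_emission)
```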
State-of-the-art automatic speech recognition systems model the relationship between the acoustic signal and the phone classes in two stages, namely, extraction of spectral-based features based on prior knowledge, followed by training of an acoustic model, typically an artificial neural network (ANN). In our recent work, it was shown that convolutional neural networks (CNNs) can learn this relationship directly from the raw signal, reaching performance on par with other existing feature-based approaches. This paper extends the CNN-based approach to large...
Speaker verification systems traditionally extract and model cepstral features or filter bank energies from the speech signal. In this paper, inspired by the success of neural network-based approaches that directly model the raw signal for applications such as speech recognition, emotion recognition and anti-spoofing, we propose a speaker verification approach where speaker discriminative information is learned by: (a) first training a CNN-based speaker identification system that takes the raw signal as input and learns to classify speakers (unknown to the verification system); and then (b)...
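A sketch of the transfer idea under stated assumptions: train a CNN to classify speakers, then discard the classification head and use the penultimate representation as an utterance embedding, scoring verification trials by cosine similarity. `SpeakerIdCNN`, its dimensions, and the scoring rule are illustrative stand-ins, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerIdCNN(nn.Module):
    def __init__(self, n_speakers: int = 100, emb_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=30, stride=10), nn.ReLU(),
            nn.Conv1d(64, emb_dim, kernel_size=5), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.head = nn.Linear(emb_dim, n_speakers)  # used only while training identification

    def embed(self, wav: torch.Tensor) -> torch.Tensor:
        return self.encoder(wav)                    # fixed-length utterance embedding

model = SpeakerIdCNN()
enroll = model.embed(torch.randn(1, 1, 16000))      # enrollment utterance (1 s at 16 kHz)
test = model.embed(torch.randn(1, 1, 16000))        # test utterance
score = F.cosine_similarity(enroll, test)           # accept if above a tuned threshold
print(score.item())
```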
We report on investigations, conducted at the 2006 Johns Hopkins Workshop, into the use of articulatory features (AFs) for observation and pronunciation models in speech recognition. In the area of observation modeling, we use the outputs of AF classifiers both directly, in an extension of hybrid HMM/neural network models, and as part of the observation vector, in an extension of the "tandem" approach. For pronunciation modeling, we investigate a model having multiple streams of states with soft synchrony constraints, for both audio-only and audio-visual recognition. The models are implemented as dynamic Bayesian networks and tested on tasks from...
We analyze a simple hierarchical architecture consisting of two multilayer perceptron (MLP) classifiers in tandem to estimate the phonetic class conditional probabilities. In this setup, the first MLP classifier is trained using standard acoustic features. The second MLP is trained on the posterior probabilities of phonemes estimated by the first, but with a long temporal context of around 150-230 ms. Through extensive phoneme recognition experiments and an analysis based on Volterra series, we show that 1) the hierarchical system yields higher...
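A sketch of the two-stage hierarchy, assuming 10 ms frames so that a 19-frame window of first-stage posteriors covers roughly 190 ms (within the 150-230 ms range mentioned above); all dimensions are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

N_PHONES, FEAT_DIM, CONTEXT = 40, 39, 19   # 19 frames of 10 ms each ~ 190 ms

mlp1 = nn.Sequential(nn.Linear(FEAT_DIM, 512), nn.Sigmoid(),
                     nn.Linear(512, N_PHONES), nn.Softmax(dim=-1))
mlp2 = nn.Sequential(nn.Linear(N_PHONES * CONTEXT, 512), nn.Sigmoid(),
                     nn.Linear(512, N_PHONES), nn.Softmax(dim=-1))

frames = torch.randn(100, FEAT_DIM)        # 100 frames of standard acoustic features
post1 = mlp1(frames)                       # frame-level posteriors from stage 1
# stack a sliding window of stage-1 posteriors as the input to stage 2
windows = post1.unfold(0, CONTEXT, 1)      # (frames - CONTEXT + 1, N_PHONES, CONTEXT)
post2 = mlp2(windows.reshape(windows.size(0), -1))
print(post2.shape)                         # refined posteriors for the center frames
```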
Biometric systems are exposed to spoofing attacks which may compromise their security, and voice biometrics based on automatic speaker verification (ASV) is no exception. To increase the robustness against such attacks, anti-spoofing systems have been proposed for the detection of replay, synthesis and voice conversion-based attacks. However, most anti-spoofing techniques are loosely integrated with the ASV system. In this work, we develop a new integration neural network that jointly processes the embeddings extracted from both subsystems in order to detect...
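One plausible reading of embedding-level integration, sketched below: concatenate an ASV speaker embedding with a countermeasure (anti-spoofing) embedding and let a small network produce a single joint decision. The dimensions and the single-logit formulation are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class JointASVSpoofNet(nn.Module):
    def __init__(self, asv_dim: int = 192, cm_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(asv_dim + cm_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),   # single logit: genuine target vs. anything else
        )

    def forward(self, asv_emb: torch.Tensor, cm_emb: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([asv_emb, cm_emb], dim=-1))

net = JointASVSpoofNet()
logit = net(torch.randn(4, 192), torch.randn(4, 128))
print(torch.sigmoid(logit).squeeze(-1))      # joint accept probabilities
```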
Embarrassment is a social emotion that shares many characteristics with social anxiety (SA). Most people experience embarrassment in their daily lives, but it is quite overlooked in research. We characterized embarrassment through an interdisciplinary approach, introducing a behavioral paradigm and applying machine learning approaches, including acoustic analyses. 33 participants wrote about an embarrassing experience and then, without prior knowledge of this, had to read it out loud to the experiment conductor. The data was then examined using two different approaches:...
Posterior probabilities of sub-word units have been shown to be an effective front-end for ASR. However, attempts to model this type of features either do not benefit from modeling context-dependent phonemes, or use an inefficient distribution to estimate the state likelihood. This paper presents a novel posterior-based acoustic model that overcomes these limitations. The proposed model can be seen as an HMM where the score associated with each state is the KL divergence between a distribution characterizing the state and the posteriors of the test utterance. This KL-based approach establishes...
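The KL-based local score is easy to state concretely: if a state is characterized by a categorical distribution y over sub-word units and a test frame carries a posterior vector z, the state score is KL(y || z). A toy sketch (the vectors are made-up values):

```python
import numpy as np

def kl_divergence(y: np.ndarray, z: np.ndarray) -> float:
    """KL(y || z) for categorical distributions (natural log)."""
    return float(np.sum(y * np.log(y / z)))

state_dist = np.array([0.80, 0.15, 0.05])     # distribution characterizing one HMM state
frame_post = np.array([0.70, 0.20, 0.10])     # posterior features of one test frame

print(kl_divergence(state_dist, frame_post))  # lower divergence = better state match
```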
Development of countermeasures to detect attacks performed on speaker verification systems through presentation of forged or altered speech samples is a challenging and open research problem. Typically, this problem is approached by extracting features through conventional short-term processing and feeding them to a binary classifier. In this article, we develop a convolutional neural network-based approach that learns, in an end-to-end manner, both the features and the classifier from the raw signal. Through investigations on two publicly...
Automatic Gender Recognition (AGR) is the task of identifying the gender of a speaker given a speech signal. Standard approaches extract features like fundamental frequency and cepstral features from the signal and train a binary classifier. Inspired by recent works in the areas of automatic speech recognition (ASR) and presentation attack detection, we present a novel approach where the relevant features and the classifier are jointly learned from the raw signal in an end-to-end manner. We propose a convolutional neural network (CNN) based approach that consists of: (1) convolution layers, which...
Automatic speaker verification systems can be spoofed through recorded, synthetic, or voice converted speech of the target speakers. To make these systems practically viable, the detection of such attacks, referred to as presentation attacks, is of paramount interest. In that direction, this paper investigates two aspects: 1) a novel approach to detect presentation attacks where, unlike conventional approaches, no signal modeling related assumptions are made; rather, the attacks are detected by computing first-order and second-order spectral...
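A sketch of what first-order and second-order spectral statistics could look like in practice: the mean and standard deviation of the frame-wise log magnitude spectrum over an utterance, fed to a downstream classifier. Frame length, hop, and windowing below are assumed choices, not the paper's settings.

```python
import numpy as np

def spectral_statistics(wav: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    frames = [wav[i:i + frame_len] * np.hanning(frame_len)
              for i in range(0, len(wav) - frame_len, hop)]
    log_mag = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)
    return np.concatenate([log_mag.mean(axis=0),    # first-order statistics
                           log_mag.std(axis=0)])    # second-order statistics

utt = np.random.randn(16000)                        # 1 s of placeholder audio
feat = spectral_statistics(utt)
print(feat.shape)                                   # (2 * (frame_len // 2 + 1),)
```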
During depression, neurophysiological changes can occur which may affect laryngeal control, i.e. the behaviour of the vocal folds. Characterising these changes in a precise manner from speech signals is a non-trivial task, as this typically involves a reliable separation of the voice source information from the signals. In this paper, by exploiting the abilities of CNNs to learn task-relevant information from raw input signals, we investigate several methods to model voice source related information for depression detection. Specifically, modelling the low pass filtered linear prediction residual...
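One of the voice source signals mentioned, the linear prediction (LP) residual, can be sketched as follows: inverse-filter the speech with LP coefficients from the autocorrelation method, then low-pass filter the result to emphasize the slowly varying glottal component. The LP order and cutoff frequency are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.signal import butter, lfilter

def lp_residual(wav: np.ndarray, order: int = 16) -> np.ndarray:
    # autocorrelation method: solve the normal equations, then inverse filter
    r = np.correlate(wav, wav, mode="full")[len(wav) - 1:len(wav) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return lfilter(np.concatenate(([1.0], -a)), [1.0], wav)  # prediction error signal

fs = 16000
wav = np.random.randn(fs)                 # placeholder for a 1 s speech signal
residual = lp_residual(wav)
b, a = butter(4, 1000 / (fs / 2))         # 1 kHz low-pass, an assumed cutoff
lp_filtered_residual = lfilter(b, a, residual)
print(lp_filtered_residual.shape)
```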
In the light of the current COVID-19 pandemic, the need for remote digital health assessment tools is greater than ever. This statement is especially pertinent for elderly and vulnerable populations. In this regard, the INTERSPEECH 2020 Alzheimer's Dementia Recognition through Spontaneous Speech (ADReSS) Challenge offers competitors the opportunity to develop speech- and language-based systems for the task of Alzheimer's Dementia (AD) recognition. The challenge data consists of speech recordings and their transcripts, and the work presented herein is an assessment of different contemporary...
Respiration is an essential and primary mechanism for speech production. We first inhale and then produce speech while exhaling. When we run out of breath, we stop speaking and inhale. Though this process is involuntary, speech production involves a systematic outflow of air during exhalation, characterized by the linguistic content and prosodic factors of the utterance. Thus speech and respiration are closely related, and modeling that relationship makes sensing respiratory dynamics directly from speech plausible; this is, however, not well explored. In this article,...
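Read as a regression problem, the sensing idea could be sketched as a small convolutional network that maps a speech segment to a slowly varying breathing waveform; the architecture, sampling rates, and strides below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class Speech2Breathing(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # aggressive striding downsamples 16 kHz speech toward the
            # much lower rate of a respiratory signal
            nn.Conv1d(1, 32, kernel_size=80, stride=40), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=10, stride=10), nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=3, padding=1),  # breathing waveform estimate
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        return self.net(wav)

model = Speech2Breathing()
speech = torch.randn(2, 1, 16000)          # two 1 s segments at 16 kHz
breath = model(speech)                     # low-rate respiratory estimate
print(breath.shape)                        # torch.Size([2, 1, 39]): ~39 samples/s
```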
In this paper, we investigate the significance of contextual information in a phoneme recognition system using the hidden Markov model - artificial neural network paradigm. Contextual information is probed at the feature level as well as at the output of the multilayered perceptron. At the feature level, we analyze and compare different methods to model sub-phonemic classes. To exploit the contextual information at the output of the multilayered perceptron, we propose a hierarchical estimation of posterior probabilities. The best phoneme recognition accuracy (excluding silence) of 73.4% on the TIMIT database is comparable to that of state-of-the-art...