- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Speech and Dialogue Systems
- Voice and Speech Disorders
- Natural Language Processing Techniques
- Phonetics and Phonology Research
- Topic Modeling
- Advanced Data Compression Techniques
- Neural Networks and Applications
- Multimodal Machine Learning Applications
- Intelligent Tutoring Systems and Adaptive Learning
- Algorithms and Data Compression
- Advanced Adaptive Filtering Techniques
- Emotion and Mood Recognition
- Subtitles and Audiovisual Media
- Language, Metaphor, and Cognition
- ICT in Developing Communities
- Sentiment Analysis and Opinion Mining
- Multi-Agent Systems and Negotiation
- Social Robot Interaction and HRI
- Innovative Teaching and Learning Methods
Amazon (United States)
2022
Amazon (United Kingdom)
2021
Aalto University
2016-2019
KTH Royal Institute of Technology
2013-2014
International Institute of Information Technology, Hyderabad
2012
This paper proposes a method for generating speech from filterbank mel-frequency cepstral coefficients (MFCCs), which are widely used in applications such as ASR but are generally considered unusable for speech synthesis. First, we predict fundamental frequency and voicing information from MFCCs with an autoregressive recurrent neural net. Second, the spectral envelope information contained in the MFCCs is converted to all-pole filters, and a pitch-synchronous excitation model matched to these filters is trained. Finally, we introduce a generative adversarial...
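As background for the MFCC parameterization discussed above, the analysis chain (frame, window, power spectrum, mel filterbank, log, DCT) can be sketched in NumPy. This is a minimal illustrative implementation with assumed default parameters (16 kHz audio, 512-point FFT, 24 mel bands, 13 cepstra), not the configuration used in the paper:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=24, n_ceps=13):
    """Minimal MFCC analysis: frame -> Hann window -> power spectrum
    -> triangular mel filterbank -> log -> DCT-II. Illustrative only."""
    # Slice the signal into overlapping windowed frames
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular filters spaced uniformly on the mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[m - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)

    # DCT-II over the mel axis; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_ceps)[:, None])
    return log_mel @ dct.T
```

One second of 16 kHz audio yields a (97, 13) feature matrix with these settings; the synthesis task in the paper is the inverse problem of recovering a waveform from such features.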
Recently, generative neural network models which operate directly on raw audio, such as WaveNet, have improved the state of the art in text-to-speech (TTS) synthesis. Moreover, there is increasing interest in using these models as statistical vocoders, i.e., for generating speech waveforms from various acoustic features. However, there is also a need to reduce model complexity without compromising synthesis quality. Previously, glottal pulseforms (i.e., time-domain waveforms corresponding to the source of the human voice production mechanism) have been...
Recent studies have shown that text-to-speech synthesis quality can be improved by using glottal vocoding. This refers to vocoders that parameterize speech into two parts, the excitation and the vocal tract, that occur in the human speech production apparatus. Current glottal vocoders generate the excitation waveform with deep neural networks (DNNs). However, squared error-based training of the present models is limited to generating conditional average waveforms, which fails to capture the stochastic variation of the waveforms. As a result, shaped noise is added as...
A vocoder is used to express a speech waveform with a controllable parametric representation that can be converted back into a speech waveform. Vocoders representing the main vocoder categories (mixed excitation, glottal, and sinusoidal vocoders) were compared in this study with formal crowd-sourced listening tests. The quality of each vocoder was measured within the context of analysis-synthesis as well as text-to-speech (TTS) synthesis in a modern statistical framework. Furthermore, the TTS experiments were divided into vocoder-specific features...
Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., for generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning WaveNets with acoustic features allows sharing the waveform generator model across multiple speakers without additional speaker codes. However, multi-speaker WaveNet models require large amounts...
GlottHMM is a previously developed vocoder that has been successfully used in HMM-based speech synthesis by parameterizing speech into two parts (the glottal flow and the vocal tract) according to the functioning of the real human voice production mechanism. In this study, a new glottal vocoding method, GlottDNN, is proposed. The GlottDNN vocoder is built on the principles of its predecessor, GlottHMM, but introduces three main improvements: (1) it takes advantage of a new, more accurate glottal inverse filtering method, and (2) it uses a method based on a deep neural network...
for agreeing to read the book in detail, as well as for all the past discussions over the years and the many to come. I do not understand how we have so far managed to avoid writing a paper together. Thanks to friends and colleagues at the various Aalto Speech groups, the Acoustics laboratory, and the National Institute of Informatics in Tokyo. Working in such a diverse and nourishing environment has resulted in some very fruitful cross-pollination of ideas across our research disciplines.
Achieving high quality and naturalness in statistical parametric synthesis of female voices remains difficult despite recent advances in the study area. Vocoding is one key element of all statistical speech synthesizers that is known to affect naturalness. The present study focuses on a special type of vocoding, glottal vocoders, which aim to parameterize speech based on modelling the real excitation of (voiced) speech, the glottal flow. More specifically, we compare three different glottal vocoders with the aim of improved synthesis of female voices. Two of the vocoders are previously...
The state of the art in text-to-speech (TTS) synthesis has recently improved considerably due to novel neural waveform generation methods, such as WaveNet. However, these methods suffer from their slow sequential inference process, while their parallel versions are difficult to train and even more computationally expensive. Meanwhile, generative adversarial networks (GANs) have achieved impressive results in image generation and are making their way into audio applications; parallel inference is among their lucrative properties. By adopting recent...
CSMAPLR Constrained structured maximum a posteriori linear regression CSS
In statistical parametric speech synthesis (SPSS), a few studies have investigated the Lombard effect, specifically by using hidden Markov model (HMM)-based systems. Recently, artificial neural networks have demonstrated promising results in SPSS, in particular long short-term memory recurrent neural networks (LSTMs). The Lombard effect, however, has not been studied in LSTM-based synthesis. In this study, we propose three methods for Lombard speech adaptation in LSTM-based speech synthesis. In particular, we (1) augment Lombard-specific information with the linguistic features as input, (2) scale the activations...
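Adaptation method (2), scaling hidden activations, can be illustrated with an LHUC-style per-unit re-scaling, where each hidden unit is multiplied by a learned factor squashed into (0, 2). This is a generic sketch of the idea, not necessarily the exact formulation used in the paper:

```python
import numpy as np

def scale_activations(hidden, scalers):
    """Per-unit activation scaling in the spirit of LHUC: each hidden unit's
    output is multiplied by 2*sigmoid(scaler), keeping the factor in (0, 2).
    Zero scalers leave the activations unchanged (factor exactly 1)."""
    return hidden * (2.0 / (1.0 + np.exp(-scalers)))

h = np.array([1.0, -0.5, 2.0])
unadapted = scale_activations(h, np.zeros(3))  # identity when scalers are zero
```

At adaptation time only the scaler vector would be updated on style-specific (here, Lombard) data, leaving the base network weights frozen.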
The objective of this paper is to find the fundamental difference between breathy and modal voices based on differences in speech production as reflected in the speech signal. We propose signal processing methods for analyzing the phonation type of a voice. These include the technique of zero-frequency filtering, loudness measurement, computation of the periodic-to-aperiodic energy ratio, and extraction of formants and their amplitudes using the group-delay technique. Parameters derived from these capture the excitation source characteristics, which play a...
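Zero-frequency filtering, the first of the techniques listed, follows a well-known recipe (difference the signal, pass it twice through a resonator at 0 Hz, then repeatedly subtract a local mean to remove the polynomial trend). A sketch under assumed parameters (16 kHz audio, 10 ms half-window), not the paper's exact settings:

```python
import numpy as np

def zff(x, sr=16000, window_ms=10.0):
    """Zero-frequency filtering sketch: positive zero crossings of the
    output approximate the instants of significant excitation."""
    s = np.diff(x, prepend=0.0)                  # emphasize discontinuities
    y = s.astype(np.float64)
    for _ in range(2):                           # cascade of two 0-Hz resonators
        out = np.zeros_like(y)
        for n in range(len(y)):
            out[n] = y[n]
            if n >= 1:
                out[n] += 2.0 * out[n - 1]
            if n >= 2:
                out[n] -= out[n - 2]
        y = out
    # Resonator output grows polynomially; remove the trend by repeated
    # local-mean subtraction over roughly 1-2 pitch periods
    half = int(sr * window_ms / 1000.0)
    kernel = np.ones(2 * half + 1) / (2 * half + 1)
    for _ in range(3):
        y = y - np.convolve(y, kernel, mode="same")
    return y
```

With a synthetic impulse train as input, the filtered output oscillates at the impulse rate with near-zero local mean, which is what makes its zero crossings usable as epoch markers.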
Text-to-speech synthesis in Indian languages has seen a lot of progress over the decade, partly due to the annual Blizzard challenges. These systems assume the text to be written in Devanagari or Dravidian scripts, which are nearly phonemic orthography scripts. However, the most common form of computer interaction among Indians is ASCII transliterated text. Such text is generally noisy, with many variations in spelling for the same word. In this paper we evaluate three approaches to synthesize speech from such text: a naive Uni-Grapheme...
In this paper, we propose modeling a noisy channel for the task of voice conversion (VC). We have used artificial neural networks (ANNs) to capture the speaker-specific characteristics of a target speaker, which avoids the need for any training utterance from the source speaker. We use articulatory features (AFs) as a canonical form, or speaker-independent representation, of the speech signal. Our studies show that AFs contain a significant amount of speaker-specific information in their trajectories. Suitable techniques are proposed to normalize the AF...
This paper presents a novel data augmentation technique for text-to-speech (TTS) that allows generating new (text, audio) training examples without requiring any additional data. Our goal is to increase the diversity of text conditionings available during training, which helps reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a way that preserves syntactical correctness. We take measures to ensure that the synthesized speech does not contain artifacts caused by...
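The substitution idea can be sketched as swapping aligned fragments between two utterances at a matching syntactic slot, keeping each (word, audio) pair together so text and audio stay consistent. The data layout and the notion of a "slot" here are hypothetical simplifications, not the paper's actual algorithm:

```python
from copy import deepcopy

def substitute_fragments(utt_a, utt_b, slot):
    """Swap the first fragment tagged with `slot` between two utterances.
    Each fragment carries its word and audio span, so the swap produces
    two new, internally consistent (text, audio) training examples."""
    ia = next(i for i, f in enumerate(utt_a) if f["slot"] == slot)
    ib = next(i for i, f in enumerate(utt_b) if f["slot"] == slot)
    new_a, new_b = deepcopy(utt_a), deepcopy(utt_b)
    new_a[ia], new_b[ib] = deepcopy(utt_b[ib]), deepcopy(utt_a[ia])
    return new_a, new_b

# Toy aligned utterances (audio spans are placeholder sample lists)
utt_a = [{"word": "the", "slot": "DET", "audio": [0.1, 0.2]},
         {"word": "cat", "slot": "NOUN", "audio": [0.3, 0.4]},
         {"word": "sleeps", "slot": "VERB", "audio": [0.5]}]
utt_b = [{"word": "a", "slot": "DET", "audio": [0.6]},
         {"word": "dog", "slot": "NOUN", "audio": [0.7, 0.8]},
         {"word": "barks", "slot": "VERB", "audio": [0.9]}]
aug_a, aug_b = substitute_fragments(utt_a, utt_b, "NOUN")
```

Swapping at matching slots is what preserves syntactic correctness: "the dog sleeps" and "a cat barks" are both grammatical, while the originals remain untouched for reuse.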
Linear prediction (LP) is a prevalent method for the source-filter separation of speech. One of the drawbacks of conventional LP-based approaches is the biasing of the estimated formants by harmonic peaks. Methods such as discrete all-pole modeling and weighted LP have been proposed to overcome this problem, but they all use a linear frequency scale. This study proposes a new technique, frequency-warped time-weighted linear prediction (WWLP), to provide spectral envelope estimates robust to harmonic peaks that work on a warped frequency scale that approximates...
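For reference, the conventional LP baseline that WWLP builds on can be sketched with the autocorrelation method solved by the Levinson-Durbin recursion. This is plain, unwarped, unweighted LP, shown only to make the starting point concrete:

```python
import numpy as np

def lp_coefficients(x, order):
    """Autocorrelation-method linear prediction via Levinson-Durbin.
    Returns a = [1, a1, ..., ap] such that
    x[n] is predicted as -(a1*x[n-1] + ... + ap*x[n-p])."""
    # Autocorrelation at lags 0..order
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k                  # remaining prediction error
    return a

# Demo: recover the coefficients of a known 2nd-order all-pole process
rng = np.random.default_rng(0)
x = np.zeros(20000)
e = rng.standard_normal(20000)
for n in range(2, len(x)):
    x[n] = 0.75 * x[n - 1] - 0.5 * x[n - 2] + e[n]
a = lp_coefficients(x, 2)                   # a[1] near -0.75, a[2] near 0.5
```

The warped and weighted variants discussed in the abstract modify this same normal-equation setup: frequency warping changes the delay elements, and time weighting changes how each sample contributes to the (auto)correlation terms.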
This paper presents an experimental comparison of various leading vocoders for the application of HMM-based laughter synthesis. Four vocoders, commonly used in speech synthesis, are compared in copy-synthesis and HMM-based synthesis of both male and female laughter. Subjective evaluations are conducted to assess the performance of the vocoders. The results show that all vocoders perform relatively well in copy-synthesis. In HMM-based synthesis using the original phonetic transcriptions, the synthesized voices were of significantly lower quality than copy-synthesis, indicating a...
In this paper, we describe a project that explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring robot. A human-robot interaction setup is designed, and a human-human dialogue corpus is collected. The project targets the development of a system platform to study verbal and nonverbal tutoring strategies in spoken interactions with robots which are capable of dialogue. The task is centered on two participants involved in aiming to solve a card-ordering game. Alongside them sits a tutor (robot) that helps...
Neural network-based models that generate glottal excitation waveforms from acoustic features have been found to give improved quality in statistical parametric speech synthesis. Until now, however, these models have been trained separately from the acoustic model. This creates a mismatch between training and synthesis, as the synthesized acoustic features used for the excitation model input differ from the original inputs with which the excitation model was trained. Furthermore, due to errors in predicting the vocal tract filter, the models do not provide a perfect reconstruction of the waveform even if the excitation is predicted without...