- Speech Recognition and Synthesis
- Music and Audio Processing
- Speech and Audio Processing
- Speech and dialogue systems
- Topic Modeling
- Phonetics and Phonology Research
- Natural Language Processing Techniques
- Voice and Speech Disorders
- Model Reduction and Neural Networks
- Neural Networks and Applications
- Robotics and Automated Systems
- Second Language Acquisition and Learning
- Digital Accessibility for Disabilities
- Assistive Technology in Communication and Mobility
- Discourse Analysis and Cultural Communication
- Neural Networks and Reservoir Computing
Amazon (United States)
2019-2023
Gdańsk University of Technology
2020-2022
University of Zurich
2022
Boston College
2022
American Jewish Committee
2022
Rochester Institute of Technology
2022
Université Grenoble Alpes
2022
University of New Hampshire
2022
University of Groningen
2022
Nuance Communications (Austria)
2022
We present a universal neural vocoder based on Parallel WaveNet, with an additional conditioning network called Audio Encoder. Our vocoder offers real-time high-quality speech synthesis for a wide range of use cases. We tested it on 43 internal speakers of diverse age and gender, speaking 20 languages in 17 unique styles, of which 7 voices and 5 styles were not exposed during training. We show that the proposed vocoder significantly outperforms speaker-dependent vocoders overall. It also outperforms several existing architectures in terms...
The research community has long studied computer-assisted pronunciation training (CAPT) methods for non-native speech. Researchers have focused on studying various model architectures, such as Bayesian networks and deep learning methods, as well as the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods are not able to detect pronunciation errors with high accuracy (only 60% precision at 40%-80% recall). One of the key problems is the low availability of mispronounced...
We present a novel deep learning model for the detection and reconstruction of dysarthric speech. We train the model with a multi-task technique to jointly solve dysarthria detection and speech reconstruction tasks. The key feature of the model is a low-dimensional latent space that is meant to encode the properties of dysarthric speech. It is commonly believed that neural networks are black boxes that solve problems but do not provide interpretable outputs. On the contrary, we show that this latent space successfully encodes the characteristics of dysarthria, is effective at detecting it, and that its manipulation allows the model to reconstruct...
Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they typically require a large amount of recordings from the target speaker. In previous work, a 3-step method was proposed to generate TTS voices while greatly reducing the data required for training. However, we have observed a ceiling effect in the level of naturalness achievable for highly expressive voices when using this approach. In this paper, we present a method for building TTS voices with as little as 15 minutes of speech. Compared to the current state-of-the-art approach, ...
A common approach to the automatic detection of mispronunciations in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker. This approach makes two simplifying assumptions: a) phonemes can be recognized from speech with high accuracy, and b) there is a single correct way for a sentence to be pronounced. These assumptions do not always hold, which can result in a significant amount of false alarms. We propose a novel method to overcome this problem, based on two principles: taking into account...
Creating realistic and natural-sounding synthetic speech remains a big challenge for voice identities unseen during training. As there is growing interest in synthesizing voices of new speakers, here we investigate the ability of normalizing flows in text-to-speech (TTS) and voice conversion (VC) modes to extrapolate from speakers observed during training and create unseen speaker identities. Firstly, we create an approach for TTS and VC, and then we comprehensively evaluate our methods and baselines in terms of intelligibility, naturalness, similarity, ...
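The core idea behind using normalizing flows for new-speaker generation can be illustrated with a minimal sketch. The flow is an invertible map between speaker representations and a simple latent distribution, so novel identities can be created by sampling the latent prior and inverting the map. Everything here (a single elementwise affine step, the 4-dimensional embeddings, the random weights) is an illustrative assumption, not the paper's architecture:

```python
import numpy as np

# Minimal sketch of flow-based speaker generation: an invertible
# transform maps speaker embeddings x to a latent z; new speakers
# are created by sampling z from a standard normal and inverting.
# A real model would stack coupling layers; this is one affine step.

rng = np.random.default_rng(0)

log_scale = rng.normal(size=4) * 0.1   # stand-in for learned scale
shift = rng.normal(size=4) * 0.1       # stand-in for learned shift

def forward(x):
    # data -> latent
    return (x - shift) * np.exp(-log_scale)

def inverse(z):
    # latent -> data (used to synthesize unseen speaker embeddings)
    return z * np.exp(log_scale) + shift

x = rng.normal(size=4)                      # a training speaker's embedding
assert np.allclose(inverse(forward(x)), x)  # exact invertibility

z_new = rng.normal(size=4)   # sample from the flow's prior
new_speaker = inverse(z_new) # embedding of a novel "speaker"
```

Because the transform is exactly invertible, the same model supports both analysis (encoding observed speakers) and synthesis (decoding sampled latents), which is what lets a flow extrapolate beyond the training speakers.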
We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech. To train this model, phonetically transcribed L2 speech is not required and we only need to mark mispronounced words. The lack of phonetic transcriptions means that the model has to learn from a weak signal of word-level mispronunciations. Because of this, and due to the limited amount of L2 speech, the model is more likely to overfit. To limit this risk, we train it in a multi-task setup. In the first task, we estimate the probabilities of word-level mispronunciation. For the second task, we use a phoneme...
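The multi-task setup described above can be sketched as a weighted sum of two losses: a binary loss on the weak word-level labels plus an auxiliary phoneme-classification loss that regularizes the shared model. The loss names, toy numbers, and the weighting scheme are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Sketch of a two-task objective: word-level mispronunciation
# detection (weak labels) + auxiliary phoneme recognition.

def bce(p, y, eps=1e-9):
    # binary cross-entropy for word-level mispronunciation labels
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def ce(probs, targets, eps=1e-9):
    # cross-entropy for the auxiliary phoneme-recognition task
    probs = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.log(probs[np.arange(len(targets)), targets])))

# predicted word-level mispronunciation probabilities vs weak labels
p_word = np.array([0.9, 0.2, 0.7])
y_word = np.array([1.0, 0.0, 1.0])

# per-frame phoneme posteriors (3 frames x 4 phoneme classes)
phone_post = np.array([[0.70, 0.10, 0.10, 0.10],
                       [0.10, 0.80, 0.05, 0.05],
                       [0.25, 0.25, 0.25, 0.25]])
phone_tgt = np.array([0, 1, 2])

aux_weight = 0.3  # assumed trade-off between the two tasks
total = bce(p_word, y_word) + aux_weight * ce(phone_post, phone_tgt)
```

Sharing the encoder between the two losses is what limits overfitting: the auxiliary task supplies a dense training signal even where word-level labels are sparse.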
Non-parallel voice conversion (VC) is typically achieved using lossy representations of the source speech. However, ensuring that only speaker identity information is dropped whilst all other information from the source speech is retained is a large challenge. This is particularly challenging in the scenario where at inference-time we have no knowledge of the text being read, i.e., text-free VC. To mitigate this, we investigate information-preserving VC approaches. Normalising flows have gained attention for text-to-speech synthesis, however they have been...
Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency of its quality. The results of this evaluation are validated by repeating the procedure on a separate group of testers. Finally, an analysis of the nature of the errors, compared to hybrid unit selection synthesis, is carried out to identify the strengths and weaknesses of SSWS. Having a deeper...
This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose a deep learning model that automatically derives an optimal syllable-level representation from frame-level and phoneme-level audio features. Training this model is...
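The attention-based extraction idea can be illustrated in a few lines: rather than reading features from a fixed region such as the syllable nucleus, a learned score per frame decides how much each frame contributes to the pooled syllable-level representation. The shapes and the random scoring vector below are made-up stand-ins for learned parameters:

```python
import numpy as np

# Attention pooling over frame-level features: score each frame,
# softmax the scores, and take the weighted sum as the
# syllable-level representation.

rng = np.random.default_rng(1)

frames = rng.normal(size=(20, 8))   # 20 frames x 8 acoustic features
w = rng.normal(size=8)              # attention scoring vector (learned)

scores = frames @ w                 # one relevance score per frame
weights = np.exp(scores - scores.max())
weights /= weights.sum()            # softmax over frames

syllable_repr = weights @ frames    # weighted sum -> 8-dim vector
assert syllable_repr.shape == (8,)
```

The softmax makes the pooling differentiable, so the model can learn which frames matter for stress detection instead of relying on a hand-picked region.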
In expressive speech synthesis it is widely adopted to use latent prosody representations to deal with the variability of the data during training. The same text may correspond to various acoustic realizations, which is known as the one-to-many mapping problem in text-to-speech. Utterance-, word-, or phoneme-level representations are extracted from the target signal in an auto-encoding setup to complement the phonetic input and simplify that mapping. This paper compares prosodic embeddings at different levels of granularity and examines their...
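The auto-encoding setup can be sketched as follows: a reference encoder compresses the target acoustics into a small prosody embedding, which is concatenated with the phonetic input so the decoder is told which of the many possible realizations to produce. All dimensions and matrices below are random stand-ins for learned weights, chosen only to show the data flow:

```python
import numpy as np

# Toy utterance-level prosody embedding: pool the target frames,
# project to a low-dimensional embedding, and concatenate it with
# the phonetic input as decoder conditioning.

rng = np.random.default_rng(2)

acoustics = rng.normal(size=(50, 16))   # target mel frames (50 x 16)
enc = rng.normal(size=(16, 3)) * 0.1    # encoder projection: 16 -> 3

# utterance-level embedding: mean-pool frames, then project
prosody_emb = acoustics.mean(axis=0) @ enc        # shape (3,)

phonetic_in = rng.normal(size=10)                 # phonetic features
decoder_in = np.concatenate([phonetic_in, prosody_emb])
assert decoder_in.shape == (13,)
```

Word- or phoneme-level variants would pool over shorter spans and produce one embedding per unit instead of one per utterance, which is exactly the granularity trade-off the paper compares.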
Regional accents of the same language affect not only how words are pronounced (i.e., phonetic content), but also impact prosodic aspects of speech such as speaking rate and intonation. This paper investigates a novel flow-based approach to accent conversion using normalizing flows. The proposed approach revolves around three steps: remapping the phonetic conditioning to better match the target accent; warping the duration of the converted speech to better suit the target phonemes; and an attention mechanism that implicitly aligns source and target speech sequences. ...
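The duration-warping step can be illustrated with a simple re-timing of a feature track: the converted speech is stretched or compressed so its frame count better suits the target accent's phoneme durations. The linear-interpolation scheme, the per-utterance factor, and the toy pitch contour are illustrative assumptions, not the paper's method:

```python
import numpy as np

# Hedged sketch of duration warping: resample a 1-D feature track
# to factor * original length via linear interpolation.

def warp_duration(track, factor):
    n_out = max(1, int(round(len(track) * factor)))
    src = np.linspace(0, len(track) - 1, n_out)
    return np.interp(src, np.arange(len(track)), track)

f0 = np.linspace(100.0, 140.0, 20)   # toy pitch contour, 20 frames
slower = warp_duration(f0, 1.5)      # target accent speaks slower
assert len(slower) == 30
```

In a real system the factor would vary per phoneme (driven by target-accent duration statistics) rather than being one global constant.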
The purpose of the recordings was to create a speech corpus based on the ISLE dataset, extended with video and Lombard speech. From a set of 165 sentences, the 10 sentences evaluated as having the highest possibility to occur in a context evoking the Lombard effect were repeated in the presence of so-called babble noise to obtain Lombard-speech features. Altogether, 15 speakers were recorded, and the speech parameters were calculated and analyzed. First, a brief summary of research related to the Lombard effect is given. Then, the characteristics of the recording studio and the equipment utilized for the recordings are shown. Examples of analyses...
Artificial speech synthesis has made a great leap in terms of naturalness, as recent Text-to-Speech (TTS) systems are capable of producing speech with quality similar to human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even for modern TTS architectures, since there seems to be a trade-off between the expressiveness of the generated audio and its signal quality. In this paper, we present a set of techniques that can be leveraged to enhance the quality of a highly-expressive voice without the use of...
In this paper, we present a novel system to practice lexical stress in L2 English learning with the Amazon Alexa home assistant. Language learning for non-native speakers mostly focuses on practicing correct grammar, extending vocabulary, and improving pronunciation. The proposed system enables a person to practice lexical stress skills at home by having conversations with Alexa, which assesses the student's ability to enunciate words and automatically selects the next word to practice. After a series of exercises, the system informs the student about their improvement. The main scientific contribution of this work...