- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Speech and Dialogue Systems
- Advanced Adaptive Filtering Techniques
- Natural Language Processing Techniques
- Topic Modeling
- Neural Networks and Applications
- Phonetics and Phonology Research
- Advanced Data Compression Techniques
- Stochastic Gradient Optimization Techniques
- Music Technology and Sound Studies
- Image and Signal Denoising Methods
- Robotics and Automated Systems
- Machine Learning and ELM
- Text and Document Classification Technologies
- Advanced Neural Network Applications
- Social Robot Interaction and HRI
- Domain Adaptation and Few-Shot Learning
Amazon (United States)
2016-2022
Seattle University
2022
Amazon (Germany)
2018-2019
KTH Royal Institute of Technology
1997-2009
We introduce a new method for scaling up distributed Stochastic Gradient Descent (SGD) training of Deep Neural Networks (DNN). The method solves the well-known communication bottleneck problem that arises in data-parallel SGD because compute nodes frequently need to synchronize a replica of the model. We solve it by purposefully controlling the rate of weight-updates per individual weight, which is in contrast to the uniform update rate customarily imposed by the size of the mini-batch. It is shown empirically that the method can reduce the amount of communication by three orders...
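A minimal numpy sketch of one plausible reading of the per-weight update-rate idea: each weight's gradient is accumulated locally, and only entries whose accumulated magnitude crosses a threshold are synchronized (sign-quantized), with the remainder kept as a local residual. The function name, the threshold value `tau`, and the quantization are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def compress_gradient(grad, residual, tau=1.0):
    """Accumulate gradients per weight; transmit only entries whose
    accumulated magnitude reaches the threshold tau (sign-quantized),
    keeping the remainder as a local residual."""
    residual = residual + grad
    mask = np.abs(residual) >= tau
    update = np.sign(residual) * tau * mask    # sparse, quantized update to send
    residual = residual - update               # leftover stays on this node
    return update, residual

rng = np.random.default_rng(0)
res = np.zeros(10)
for step in range(5):
    g = rng.normal(scale=0.4, size=10)
    upd, res = compress_gradient(g, res)
    print(step, int(np.count_nonzero(upd)), "weights synchronized")
```

Because most per-step updates fall below the threshold, only a small, data-dependent subset of weights is communicated at each synchronization point.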
We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The max-pooling loss training can be further guided by initializing with a cross-entropy trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance. Our experimental results show that LSTM models trained using cross-entropy loss or max-pooling loss outperform a baseline feed-forward Deep Neural Network (DNN). In addition, a randomly initialized network...
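The max-pooling loss can be sketched in a few lines of PyTorch: for a keyword utterance, only the frame with the highest keyword posterior contributes to the cross-entropy, while background utterances are scored on every frame. The class indices, the background handling, and the function name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def max_pooling_loss(logits, is_keyword, keyword_class=1, background_class=0):
    """logits: (T, C) frame-level LSTM outputs for one utterance.
    Keyword utterance: cross-entropy only at the frame with the highest
    keyword posterior. Background utterance: cross-entropy on every frame."""
    log_post = F.log_softmax(logits, dim=-1)
    if is_keyword:
        t = torch.argmax(log_post[:, keyword_class])   # max-pooled frame
        return -log_post[t, keyword_class]
    return -log_post[:, background_class].mean()

# Toy usage with random frame posteriors.
logits = torch.randn(100, 2, requires_grad=True)
loss = max_pooling_loss(logits, is_keyword=True)
loss.backward()
```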
This is a report of our lessons learned building acoustic models from 1 Million hours of unlabeled speech, while labeled speech is restricted to 7,000 hours. We employ student/teacher training on the unlabeled data, helping scale out target generation in comparison to confidence model based methods, which require a decoder and a confidence model. To optimize storage and to parallelize target generation, we store only the high valued logits of the teacher model. Introducing the notion of scheduled learning, we interleave learning on unlabeled and labeled data. Training is distributed across a large number...
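A small numpy sketch of the "store only the high-valued teacher logits" idea: keep the top-k logits per frame (reducing storage and decoupling target generation from training), then expand them into a sparse soft-target distribution for the student. The value of k, the temperature, and the helper names are assumptions.

```python
import numpy as np

def top_k_targets(teacher_logits, k=20):
    """Store only the k highest-valued teacher logits per frame."""
    idx = np.argsort(teacher_logits, axis=-1)[..., -k:]
    vals = np.take_along_axis(teacher_logits, idx, axis=-1)
    return idx, vals

def soft_targets(idx, vals, num_classes, temperature=1.0):
    """Expand the stored top-k logits into a sparse soft-target distribution."""
    probs = np.exp(vals / temperature)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    full = np.zeros(idx.shape[:-1] + (num_classes,))
    np.put_along_axis(full, idx, probs, axis=-1)
    return full

teacher = np.random.default_rng(0).normal(size=(4, 3000))   # 4 frames, 3000 senones
idx, vals = top_k_targets(teacher, k=20)
targets = soft_targets(idx, vals, num_classes=3000)          # student CE targets
```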
In this work, we develop a technique for training features directly from the single-channel speech waveform in order to improve wake word (WW) detection performance. Conventional speech recognition systems typically extract a compact feature representation based on prior knowledge such as log-mel filter bank energy (LFBE). Such a feature is then used to train a deep neural network (DNN) acoustic model (AM). In contrast, we train the WW DNN AM directly from audio data in a stage-wise manner. We first build a feature extraction network with a small hidden bottleneck...
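A hedged PyTorch sketch of the stage-wise idea: a small convolutional front-end with a narrow bottleneck learns a compact feature directly from the waveform, and a wake-word classifier is then trained on top of it. The layer sizes, kernel/stride choices, and two-class output are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Stage 1: a small front-end with a narrow hidden bottleneck learns a compact
# feature directly from the 16 kHz waveform (~25 ms windows, 10 ms hop).
feature_net = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=400, stride=160),
    nn.ReLU(),
    nn.Conv1d(64, 16, kernel_size=1),      # small hidden bottleneck
)

# Stage 2: a wake-word classifier is trained on top of the learned feature.
classifier = nn.Sequential(
    nn.Flatten(1),
    nn.LazyLinear(128), nn.ReLU(),
    nn.Linear(128, 2),                     # wake word vs. background
)

wave = torch.randn(8, 1, 16000)            # batch of 1-second clips
scores = classifier(feature_net(wave))
```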
This paper presents a novel deep neural network (DNN) architecture with highway blocks (HWs) using a complex discrete Fourier transform (DFT) feature for keyword spotting. In our previous work, we showed that a feed-forward DNN with a time-delayed bottleneck layer (TDB-DNN), trained directly from the audio input, outperformed the model based on log-mel filter bank energy (LFBE) features, given a large amount of training data [1]. However, the deeper structure of such a network makes the optimization problem more difficult, which could easily...
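The highway block itself has a standard form, y = T(x)·H(x) + (1 − T(x))·x, which is what makes the deeper stack easier to optimize. Below is a generic PyTorch version fed with a real/imaginary DFT feature vector; the dimensions and depth are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HighwayBlock(nn.Module):
    """y = T(x) * H(x) + (1 - T(x)) * x, with a sigmoid transform gate T."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))
        return t * h + (1.0 - t) * x

# Complex DFT input for one frame, represented as concatenated real/imaginary
# parts of a 512-point DFT (257 bins).
frame = torch.randn(4, 2 * 257)
net = nn.Sequential(
    nn.Linear(2 * 257, 256), nn.ReLU(),
    *[HighwayBlock(256) for _ in range(5)],
    nn.Linear(256, 2),                      # keyword / background logits
)
scores = net(frame)
```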
Conventional far-field automatic speech recognition (ASR) systems typically employ microphone array techniques for speech enhancement in order to improve robustness against noise or reverberation. However, such techniques do not always yield an ASR accuracy improvement because the enhancement optimization criterion is not directly relevant to the ASR objective. In this work, we develop new acoustic modeling techniques that optimize spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input based on an ASR criterion directly. In contrast...
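A hedged sketch of the joint front-end/acoustic-model idea: a learnable per-frequency spatial filter over the microphone channels produces a single-channel log-power feature that feeds an LSTM, and the whole stack can be trained end-to-end on an ASR criterion. The filter parameterization, dimensions, and senone output size are assumptions.

```python
import torch
import torch.nn as nn

class SpatialFilterAM(nn.Module):
    """Jointly trainable multi-channel front-end + LSTM acoustic model: a
    per-frequency linear filter over microphone channels (a learnable
    beamformer) feeds an LSTM optimized on the ASR criterion."""
    def __init__(self, num_mics=4, num_bins=257, hidden=256, num_senones=3000):
        super().__init__()
        # Real and imaginary filter weights, one vector per frequency bin.
        self.w_re = nn.Parameter(torch.randn(num_bins, num_mics) * 0.1)
        self.w_im = nn.Parameter(torch.randn(num_bins, num_mics) * 0.1)
        self.lstm = nn.LSTM(num_bins, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, num_senones)

    def forward(self, x_re, x_im):
        # x_re/x_im: (batch, time, bins, mics) multi-channel STFT.
        y_re = (x_re * self.w_re).sum(-1) - (x_im * self.w_im).sum(-1)
        y_im = (x_re * self.w_im).sum(-1) + (x_im * self.w_re).sum(-1)
        power = torch.log1p(y_re ** 2 + y_im ** 2)    # log power feature
        h, _ = self.lstm(power)
        return self.out(h)                            # senone logits

model = SpatialFilterAM()
x_re = torch.randn(2, 50, 257, 4)
x_im = torch.randn(2, 50, 257, 4)
logits = model(x_re, x_im)                            # (2, 50, 3000)
```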
In the past, conventional i-vectors based on a Universal Background Model (UBM) have been successfully used as input features to adapt Deep Neural Network (DNN) Acoustic Models (AM) for Automatic Speech Recognition (ASR). In contrast, this paper introduces Hidden Markov Model (HMM) based i-vectors that use HMM state alignment information from an ASR system for estimating the i-vectors. Further, we propose passing these i-vectors through an explicit non-linear hidden layer of the DNN before combining them with standard acoustic features,...
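A minimal PyTorch sketch of the combination step described above: the i-vector passes through its own non-linear hidden layer before being concatenated with the per-frame acoustic features. Layer sizes, the sigmoid non-linearity, and the module name are illustrative assumptions; how the HMM-based i-vector itself is estimated is not shown.

```python
import torch
import torch.nn as nn

class IVectorAdaptedAM(nn.Module):
    """Pass the (HMM-based) i-vector through a non-linear hidden layer before
    concatenating it with the standard acoustic features."""
    def __init__(self, feat_dim=40, ivec_dim=100, ivec_hidden=64,
                 hidden=512, num_senones=3000):
        super().__init__()
        self.ivec_layer = nn.Sequential(nn.Linear(ivec_dim, ivec_hidden), nn.Sigmoid())
        self.trunk = nn.Sequential(
            nn.Linear(feat_dim + ivec_hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_senones),
        )

    def forward(self, feats, ivec):
        # feats: (batch, feat_dim) frames, ivec: (batch, ivec_dim) per speaker.
        z = self.ivec_layer(ivec)
        return self.trunk(torch.cat([feats, z], dim=-1))

model = IVectorAdaptedAM()
senone_logits = model(torch.randn(32, 40), torch.randn(32, 100))
```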
We investigate the problem of speaker adaptation of DNN acoustic models in two settings: traditional unsupervised adaptation and supervised adaptation (SuA), where a few minutes of transcribed speech is available. SuA presents additional difficulties when the test speaker's information does not match the registered information. Employing feature-space maximum likelihood linear regression (fMLLR) transformed features as side-information to the DNN, we reintroduce some classical ideas for combining adapted and unadapted features: early...
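The "early" combination can be illustrated with a short numpy sketch, assuming a per-speaker affine fMLLR transform: the adapted features are appended to the unadapted ones so the DNN receives both streams. The dimensions and the random transform are placeholders.

```python
import numpy as np

def early_fusion(unadapted, fmllr_transform):
    """Early combination: append fMLLR-adapted features to the unadapted ones,
    so the adapted stream acts as side-information to the DNN."""
    A, b = fmllr_transform                  # speaker-specific affine transform
    adapted = unadapted @ A.T + b
    return np.concatenate([unadapted, adapted], axis=-1)

# Hypothetical 40-dim features and a speaker-specific fMLLR transform.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 40))          # 100 frames
A = np.eye(40) + 0.01 * rng.normal(size=(40, 40))
b = 0.1 * rng.normal(size=40)
dnn_input = early_fusion(feats, (A, b))      # shape (100, 80)
```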
Accurate on-device keyword spotting (KWS) with low false accept and false reject rates is crucial to customer experience for far-field voice control of conversational agents. It is particularly challenging to maintain such performance in real world conditions where there is (a) ambient noise from external sources such as TV, household appliances, or other speech that is not directed at the device and (b) imperfect cancellation of the audio playback from the device, resulting in residual echo after being processed by Acoustic Echo Cancellation (AEC)...
This paper presents new methods for training large neural networks for phoneme probability estimation. A combination of the time-delay architecture and the recurrent network architecture is used to capture the important dynamic information of the speech signal. Motivated by the fact that the number of connections in a fully connected network grows super-linearly with the number of hidden units, schemes for sparse connection and connection pruning are explored. It is found that sparsely connected networks outperform their fully connected counterparts with an equal or smaller number of connections. The networks are evaluated in a hybrid HMM/ANN system...
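A numpy sketch of magnitude-based connection pruning, one simple way to obtain a sparsely connected network of the kind the abstract refers to: keep only the largest-magnitude weights of a fully connected layer. The keep fraction and layer size are assumptions.

```python
import numpy as np

def prune_smallest(weights, keep_fraction=0.25):
    """Magnitude-based connection pruning: zero out the smallest weights so
    that only `keep_fraction` of the connections remain."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * keep_fraction)
    threshold = np.partition(flat, -k)[-k]    # k-th largest magnitude
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(1)
W = rng.normal(size=(200, 200))               # fully connected hidden layer
W_sparse, mask = prune_smallest(W, keep_fraction=0.25)
print(mask.mean())                            # ~0.25 of connections kept
```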
The use of spatial information from multiple microphones can improve far-field automatic speech recognition (ASR) accuracy. However, conventional microphone array techniques degrade enhancement performance when there is an array geometry mismatch between design and test conditions. Moreover, such techniques do not always yield an ASR accuracy improvement due to the difference between the enhancement and ASR optimization objectives. In this work, we propose to unify them in an acoustic model framework by optimizing spatial filtering and long short-term memory (LSTM)...
Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency of its quality. The results of this evaluation are validated by repeating the procedure on a separate group of testers. Finally, an analysis of the nature of the errors, compared to hybrid unit selection synthesis, identifies the strengths and weaknesses of SSWS. Having a deeper...
Large scale machine learning (ML) systems such as the Alexa automatic speech recognition (ASR) system continue to improve with increasing amounts of manually transcribed training data. Instead of scaling manual transcription to impractical levels, we utilize semi-supervised learning (SSL) to learn acoustic models (AM) from a vast firehose of untranscribed audio data. Learning an AM from 1 Million hours of audio presents unique ML and system design challenges. We present the design and evaluation of a highly scalable and resource efficient SSL system for AM. Employing...
Anish Acharya, Suranjit Adhikari, Sanchit Agarwal, Vincent Auvray, Nehal Belgamwar, Arijit Biswas, Shubhra Chandra, Tagyoung Chung, Maryam Fazel-Zarandi, Raefer Gabriel, Shuyang Gao, Rahul Goel, Dilek Hakkani-Tur, Jan Jezabek, Abhay Jha, Jiun-Yu Kao, Prakash Krishnan, Peter Ku, Anuj Goyal, Chien-Wei Lin, Qing Liu, Arindam Mandal, Angeliki Metallinou, Vishal Naik, Yi Pan, Shachi Paul, Vittorio Perera, Abhishek Sethi, Minmin Shen, Nikko Strom, Eddie Wang. Proceedings of the 2021 Conference...
This paper presents our work on building a small-footprint keyword spotting system for a resource-limited language, which requires low CPU, memory, and latency. Our system consists of a deep neural network (DNN) and a hidden Markov model (HMM), combined in a hybrid DNN-HMM decoder. We investigate different transfer learning techniques to leverage knowledge and data from a resource-abundant source language to improve the DNN training for the target language, which has limited in-domain data. The approaches employed in this work include using the source-language model to initialize...
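A hedged PyTorch sketch of one of the transfer-learning approaches mentioned (initializing the target-language DNN from a source-language model): the shared hidden layers are copied, and only the output layer, whose senone set differs, is trained from scratch. Layer sizes and senone counts are hypothetical.

```python
import torch
import torch.nn as nn

def make_dnn(num_out):
    return nn.Sequential(
        nn.Linear(40, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, num_out),
    )

# Hypothetical sizes: source language has 3000 senones, target has 1500.
source_dnn = make_dnn(3000)
# ... assume source_dnn was trained on the resource-abundant language ...

target_dnn = make_dnn(1500)
# Copy the shared hidden layers; the output layer is trained from scratch
# on the limited in-domain target data.
target_dnn[0].load_state_dict(source_dnn[0].state_dict())
target_dnn[2].load_state_dict(source_dnn[2].state_dict())
```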
A method for unsupervised instantaneous speaker adaptation is presented and evaluated on a continuous speech recognition task in a man-machine dialogue system. The method is based on modeling of the systematic speaker variation. The variation is modeled by a low-dimensional speaker space and classification of speech segments conditioned on the position in that space. Because the effect of the variation is determined in an off-line training procedure using a multi-speaker database, complex variation can be modeled. Speaker adaptation is achieved with only the constraint that the speaker is constant over each utterance. Therefore, no...
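A loose numpy illustration of the low-dimensional speaker-space idea, under strong simplifying assumptions: the speaker's position is estimated from a single utterance by least-squares projection of its mean deviation onto an offline-learned basis, and the systematic component is removed while holding the position fixed over the utterance. The basis, dimensions, and linear-offset model are illustrative, not the paper's classification-based formulation.

```python
import numpy as np

# Hypothetical setup: speaker variation modeled as an offset spanned by a
# low-dimensional basis learned offline from a multi-speaker database.
rng = np.random.default_rng(0)
feat_dim, speaker_dim = 13, 2
basis = rng.normal(size=(speaker_dim, feat_dim))     # assumed learned offline

def estimate_speaker_position(utterance_feats, speaker_mean):
    """Project the utterance's average deviation from the speaker-independent
    mean onto the low-dimensional speaker space (least squares)."""
    deviation = utterance_feats.mean(axis=0) - speaker_mean
    coords, *_ = np.linalg.lstsq(basis.T, deviation, rcond=None)
    return coords

def adapt(utterance_feats, speaker_mean):
    """Instantaneous adaptation: remove the systematic speaker component,
    keeping the speaker position constant over the utterance."""
    coords = estimate_speaker_position(utterance_feats, speaker_mean)
    return utterance_feats - coords @ basis

si_mean = np.zeros(feat_dim)
utt = rng.normal(size=(50, feat_dim)) + np.array([0.5, -0.3]) @ basis
print(adapt(utt, si_mean).mean(axis=0).round(2))
```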
This paper describes an application database collected in Wizard-of-Oz experiments with a spoken dialogue system, WAXHOLM. The system provides information on boat traffic in the Stockholm archipelago ...