Nikko Ström

ORCID: 0000-0002-9295-7859
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Speech and Dialogue Systems
  • Advanced Adaptive Filtering Techniques
  • Natural Language Processing Techniques
  • Topic Modeling
  • Neural Networks and Applications
  • Phonetics and Phonology Research
  • Advanced Data Compression Techniques
  • Stochastic Gradient Optimization Techniques
  • Music Technology and Sound Studies
  • Image and Signal Denoising Methods
  • Robotics and Automated Systems
  • Machine Learning and ELM
  • Text and Document Classification Technologies
  • Advanced Neural Network Applications
  • Social Robot Interaction and HRI
  • Domain Adaptation and Few-Shot Learning

Amazon (United States)
2016-2022

Seattle University
2022

Amazon (Germany)
2018-2019

KTH Royal Institute of Technology
1997-2009

We introduce a new method for scaling up distributed Stochastic Gradient Descent (SGD) training of Deep Neural Networks (DNN). The method solves the well-known communication bottleneck problem that arises in data-parallel SGD because compute nodes must frequently synchronize their model replicas. We solve it by purposefully controlling the rate of weight updates per individual weight, in contrast to the uniform update rate customarily imposed by the mini-batch size. It is shown empirically that this can reduce the amount of communication by three orders...

10.21437/interspeech.2015-354 article EN Interspeech 2015 2015-09-06
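The per-weight update-rate control described above can be illustrated with a threshold-based gradient compression step in the spirit of the paper: each node accumulates gradients into a local residual and transmits only the entries whose magnitude crosses a threshold, quantized to a single sign bit. This is a minimal sketch, not the paper's exact algorithm; the function name, tensor layout, and threshold value are assumptions.

```python
import numpy as np

def threshold_compress(grad, residual, tau):
    """One compression step for a single (flattened) weight tensor.

    Accumulate the incoming gradient into a local residual, send only
    the elements whose magnitude reaches tau (quantized to +/-tau), and
    keep the remainder in the residual for later rounds.
    """
    residual = residual + grad
    send_mask = np.abs(residual) >= tau
    # Quantize transmitted elements to a single sign bit times tau.
    update = np.where(send_mask, np.sign(residual) * tau, 0.0)
    residual = residual - update          # leftover stays local
    indices = np.nonzero(send_mask)[0]    # sparse message: indices + signs
    signs = np.sign(update[indices])
    return indices, signs, residual
```

Only the (index, sign) pairs cross the network; the residual preserves the suppressed mass, so small gradients are delayed rather than lost.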

We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The training can be further guided by initializing with a cross-entropy trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance. Our experimental results show that LSTM models trained using the proposed loss outperform a baseline feed-forward Deep Neural Network (DNN). In addition, compared to a randomly initialized network...

10.1109/slt.2016.7846306 article EN 2016 IEEE Spoken Language Technology Workshop (SLT) 2016-12-01
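A toy illustration of a max-pooling style loss for KWS: for a keyword utterance only the single best-scoring frame is penalized, while a non-keyword utterance is penalized on every frame. A minimal sketch under assumed conventions (per-frame log-posteriors of the keyword class); the loss in the paper may differ in detail.

```python
import numpy as np

def max_pooling_loss(frame_logp_keyword, is_keyword):
    """Toy max-pooling loss for one utterance.

    frame_logp_keyword: log-posterior of the keyword class per frame.
    Keyword utterance: cross-entropy only at the best-scoring frame.
    Non-keyword utterance: mean penalty for keyword probability
    across all frames (an illustrative choice).
    """
    if is_keyword:
        return -np.max(frame_logp_keyword)
    p = np.exp(frame_logp_keyword)
    return -np.mean(np.log1p(-p))  # -mean(log(1 - p))
```

Pooling over frames removes the need for precise frame-level keyword alignments during training.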

This is a report of our lessons learned building acoustic models from 1 million hours of unlabeled speech, while labeled speech was restricted to 7,000 hours. We employ student/teacher training on the unlabeled data, which helps scale out target generation in comparison to confidence-model based methods, which require a decoder and a confidence model. To optimize storage and parallelize target generation, we store only the high-valued logits of the teacher. Introducing the notion of scheduled learning, we interleave learning on unlabeled and labeled data. Training is distributed across a large number...

10.1109/icassp.2019.8683690 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17
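Storing only the high-valued teacher logits can be sketched as a top-k soft-target step for student/teacher training. The renormalization over the kept classes is an assumption for illustration, not necessarily the paper's exact scheme.

```python
import numpy as np

def topk_soft_targets(teacher_logits, k):
    """Keep only the k highest teacher logits per frame (storage saving);
    probability mass is renormalized over the kept classes."""
    idx = np.argsort(teacher_logits)[-k:]
    kept = teacher_logits[idx]
    probs = np.exp(kept - kept.max())
    probs /= probs.sum()
    return idx, probs

def student_ce(student_logp, idx, probs):
    """Cross-entropy of the student against the sparse teacher targets."""
    return -np.sum(probs * student_logp[idx])
```

With k in the single digits and thousands of senone classes, the per-frame target storage shrinks by orders of magnitude, which is what makes the 1-million-hour scale tractable.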

In this work, we develop a technique for learning features directly from the single-channel speech waveform in order to improve wake word (WW) detection performance. Conventional recognition systems typically extract a compact feature representation based on prior knowledge, such as log-mel filter bank energies (LFBE). Such a representation is then fed to a deep neural network (DNN) acoustic model (AM). In contrast, we train the WW DNN AM directly from audio data in a stage-wise manner. We first build a feature-extraction network with a small hidden bottleneck...

10.1109/asru.2017.8268943 article EN 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2017-12-01

This paper presents a novel deep neural network (DNN) architecture with highway blocks (HWs) using complex discrete Fourier transform (DFT) features for keyword spotting. In our previous work, we showed that a feed-forward DNN with a time-delayed bottleneck layer (TDB-DNN), trained directly from the audio input, outperformed a model based on log-mel filter bank energies (LFBE), given a large amount of training data [1]. However, the deeper structure of such a network makes the optimization problem more difficult, which could easily...

10.1109/icassp.2018.8462166 article EN 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018-04-01
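A single highway block mixes a nonlinear transform with an identity path through a learned gate, which is what eases optimization of deep stacks like the one in the abstract. A minimal numpy sketch; the weights and activation choices here are illustrative, not the paper's.

```python
import numpy as np

def highway_layer(x, Wh, bh, Wt, bt):
    """One highway block: the transform gate T blends the nonlinear
    transform H(x) with the untouched input x, so gradients can flow
    through the identity path even in very deep stacks."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    H = np.tanh(Wh @ x + bh)       # candidate transform
    T = sigmoid(Wt @ x + bt)       # transform gate in (0, 1)
    return T * H + (1.0 - T) * x   # gated mix with skip path
```

With the gate bias initialized strongly negative, the block starts out close to the identity function, which is the usual trick for training deep highway stacks.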

Conventional far-field automatic speech recognition (ASR) systems typically employ microphone array techniques for speech enhancement in order to improve robustness against noise or reverberation. However, such techniques do not always yield an ASR accuracy improvement because the enhancement optimization criterion is not directly relevant to the recognition objective. In this work, we develop a new acoustic modeling approach that optimizes spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input based on an ASR criterion directly. In contrast...

10.1109/icassp.2019.8682977 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17
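For contrast with the jointly trained spatial filtering described above, the conventional front end it replaces is a fixed beamformer. A minimal delay-and-sum sketch (channel layout, delays, and sampling rate are illustrative):

```python
import numpy as np

def delay_and_sum(signals, delays, sr):
    """Classic delay-and-sum beamformer over M channels.

    signals: array of shape (M, n) with one row per microphone.
    delays:  per-channel steering delays in seconds (illustrative).
    Each channel is time-aligned toward the look direction, then the
    channels are averaged to suppress uncorrelated noise.
    """
    M, n = signals.shape
    out = np.zeros(n)
    for m in range(M):
        shift = int(round(delays[m] * sr))
        out += np.roll(signals[m], -shift)
    return out / M
```

The papers' point is that such a fixed, geometry-dependent front end optimizes a signal criterion rather than word error rate, which motivates learning the spatial filter jointly with the acoustic model.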

In the past, conventional i-vectors based on a Universal Background Model (UBM) have been successfully used as input features to adapt Deep Neural Network (DNN) Acoustic Models (AM) for Automatic Speech Recognition (ASR). In contrast, this paper introduces Hidden Markov Model (HMM) i-vectors that use HMM state alignment information from an ASR system for estimating the i-vectors. Further, we propose passing these i-vectors through an explicit non-linear hidden layer of the DNN before combining them with standard acoustic features,...

10.21437/interspeech.2015-605 article EN Interspeech 2015 2015-09-06

We investigate the problem of speaker adaptation of DNN acoustic models in two settings: traditional unsupervised adaptation, and a supervised adaptation (SuA) setting where a few minutes of transcribed speech is available. SuA presents additional difficulties when the test speaker's information does not match the registered information. Employing feature-space maximum likelihood linear regression (fMLLR) transformed features as side-information to the DNN, we reintroduce some classical ideas for combining adapted and unadapted features: early...

10.21437/interspeech.2015-720 article EN Interspeech 2015 2015-09-06

Accurate on-device keyword spotting (KWS) with low false-accept and false-reject rates is crucial to the customer experience of far-field voice control of conversational agents. It is particularly challenging to maintain in real-world conditions where there is (a) ambient noise from external sources such as a TV, household appliances, or other speech not directed at the device, and (b) imperfect cancellation of the audio playback of the device itself, resulting in residual echo after being processed by Acoustic Echo Cancellation (AEC)...

10.48550/arxiv.1808.00563 preprint EN arXiv (Cornell University) 2018-01-01

This paper presents new methods for training large neural networks for phoneme probability estimation. A combination of the time-delay architecture and a recurrent network is used to capture the important dynamic information of the speech signal. Motivated by the fact that the number of connections in a fully connected network grows super-linearly with the number of hidden units, schemes for sparse connection and pruning are explored. It is found that sparsely connected networks outperform their fully connected counterparts with an equal or smaller number of connections. The networks are evaluated in a hybrid HMM/ANN system...

10.21437/eurospeech.1997-708 article EN 1997-09-22
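One simple form of connection pruning is magnitude-based: remove the smallest-magnitude weights until a target sparsity is reached. This is a minimal stand-in sketch; the sparse-connection and pruning schemes in the paper differ in detail.

```python
import numpy as np

def prune_by_magnitude(W, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest
    magnitudes, returning a sparser copy of the weight matrix."""
    k = int(round(sparsity * W.size))
    if k == 0:
        return W.copy()
    # k-th smallest absolute value over the flattened matrix
    threshold = np.sort(np.abs(W), axis=None)[k - 1]
    pruned = W.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned
```

The abstract's finding is that, at an equal connection budget, spending the connections sparsely across more hidden units beats a smaller fully connected network.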

The use of spatial information from multiple microphones can improve far-field automatic speech recognition (ASR) accuracy. However, conventional microphone array techniques degrade enhancement performance when there is an array geometry mismatch between design and test conditions. Moreover, such techniques do not always yield an ASR accuracy improvement due to the difference in optimization objectives. In this work, we propose to unify these steps in an acoustic model framework by jointly optimizing spatial filtering and long short-term memory (LSTM)...

10.1109/icassp.2019.8682294 preprint EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17

10.21437/icslp.2000-353 article EN 6th International Conference on Spoken Language Processing (ICSLP 2000) 2000-10-16

Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency of its quality. The results of this evaluation are validated by repeating the procedure on a separate group of testers. Finally, an analysis of the nature of the errors, compared to hybrid unit selection synthesis, is conducted to identify the strengths and weaknesses of SSWS. Having a deeper...

10.1109/slt.2018.8639556 article EN 2018 IEEE Spoken Language Technology Workshop (SLT) 2018-12-01

Large-scale machine learning (ML) systems such as the Alexa automatic speech recognition (ASR) system continue to improve with increasing amounts of manually transcribed training data. Instead of scaling manual transcription to impractical levels, we utilize semi-supervised learning (SSL) to learn acoustic models (AM) from a vast firehose of untranscribed audio. Learning an AM from 1 million hours of audio presents unique ML and system design challenges. We present the design and evaluation of a highly scalable and resource-efficient SSL system for AM. Employing...

10.1109/jetcas.2019.2912353 article EN IEEE Journal on Emerging and Selected Topics in Circuits and Systems 2019-04-25

Anish Acharya, Suranjit Adhikari, Sanchit Agarwal, Vincent Auvray, Nehal Belgamwar, Arijit Biswas, Shubhra Chandra, Tagyoung Chung, Maryam Fazel-Zarandi, Raefer Gabriel, Shuyang Gao, Rahul Goel, Dilek Hakkani-Tur, Jan Jezabek, Abhay Jha, Jiun-Yu Kao, Prakash Krishnan, Peter Ku, Anuj Goyal, Chien-Wei Lin, Qing Liu, Arindam Mandal, Angeliki Metallinou, Vishal Naik, Yi Pan, Shachi Paul, Vittorio Perera, Abhishek Sethi, Minmin Shen, Nikko Strom, Eddie Wang. Proceedings of the 2021 Conference...

10.18653/v1/2021.naacl-demos.15 article EN 2021-01-01

This paper presents our work on building a small-footprint keyword spotting system for a resource-limited language, which requires low CPU, memory, and latency. Our system consists of a deep neural network (DNN), a hidden Markov model (HMM), and a hybrid DNN-HMM decoder. We investigate different transfer learning techniques to leverage knowledge and data from a resource-abundant source language to improve the DNN training for a target language that has limited in-domain data. The approaches employed in this paper include using a source-language model to initialize...

10.1109/icmla.2017.0-150 article EN 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA) 2017-12-01
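A common form of the cross-lingual initialization mentioned above copies the source-language hidden layers and re-initializes only the output layer for the target language's inventory. A minimal sketch; the layer representation and shapes are assumptions for illustration.

```python
import numpy as np

def transfer_init(source_weights, n_target_outputs, seed=None):
    """Initialize a target-language DNN from a source-language DNN.

    source_weights: list of weight matrices, one per layer, with shape
    (outputs, inputs). All hidden layers are copied; only the output
    layer is replaced, sized for the target output inventory.
    """
    rng = np.random.default_rng(seed)
    target = [W.copy() for W in source_weights[:-1]]  # reuse hidden layers
    hidden_dim = source_weights[-1].shape[1]
    # fresh, small random output layer for the target language
    target.append(rng.normal(0.0, 0.01, (n_target_outputs, hidden_dim)))
    return target
```

Fine-tuning then proceeds on the limited in-domain target data, with the copied hidden layers acting as a language-independent feature extractor.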

A method for unsupervised instantaneous speaker adaptation is presented and evaluated on a continuous speech recognition task in a man-machine dialogue system. The method is based on modeling of the systematic speaker variation. The variation is modeled by a low-dimensional speaker space, and classification of speech segments is conditioned on the position in this space. Because the effect of the position is determined in an off-line training procedure using a multi-speaker database, complex variation can be modeled. Speaker adaptation is achieved under the only constraint that the position is constant over each utterance. Therefore, no...

10.1109/icslp.1996.607769 article EN Fourth International Conference on Spoken Language Processing (ICSLP 1996) 2002-12-24

This paper describes an application database collected in Wizard-of-Oz experiments with a spoken dialogue system, WAXHOLM. The system provides information on boat traffic in the Stockholm archipelago ...

10.21437/eurospeech.1995-190 article EN 1995-09-18