- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Speech and Dialogue Systems
- Advanced Adaptive Filtering Techniques
- Natural Language Processing Techniques
- Topic Modeling
- Neural Networks and Applications
- Phonetics and Phonology Research
- Advanced Data Compression Techniques
- Stochastic Gradient Optimization Techniques
- Music Technology and Sound Studies
- Image and Signal Denoising Methods
- Robotics and Automated Systems
- Machine Learning and ELM
- Text and Document Classification Technologies
- Advanced Neural Network Applications
- Social Robot Interaction and HRI
- Domain Adaptation and Few-Shot Learning
Amazon (United States)
2016-2022
Seattle University
2022
Amazon (Germany)
2018-2019
KTH Royal Institute of Technology
1997-2009
We introduce a new method for scaling up distributed Stochastic Gradient Descent (SGD) training of Deep Neural Networks (DNN). The method solves the well-known communication bottleneck problem that arises in data-parallel SGD because compute nodes frequently need to synchronize a replica of the model. We solve it by purposefully controlling the rate of weight-updates per individual weight, which is in contrast to the uniform update rate customarily imposed by the size of the mini-batch. It is shown empirically that the method can reduce the amount of communication by three orders...
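A minimal numpy sketch of one plausible reading of the per-weight update-rate idea: each weight's gradient is accumulated locally, and only entries whose accumulated magnitude crosses a threshold are synchronized (sign-quantized), with the remainder kept as a local residual. The function name, the threshold value `tau`, and the quantization are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def compress_gradient(grad, residual, tau=1.0):
    """Accumulate gradients per weight; transmit only entries whose
    accumulated magnitude reaches the threshold tau (sign-quantized),
    keeping the remainder as a local residual."""
    residual = residual + grad
    mask = np.abs(residual) >= tau
    update = np.sign(residual) * tau * mask    # sparse, quantized update to send
    residual = residual - update               # leftover stays on this node
    return update, residual

rng = np.random.default_rng(0)
res = np.zeros(10)
for step in range(5):
    g = rng.normal(scale=0.4, size=10)
    upd, res = compress_gradient(g, res)
    print(step, int(np.count_nonzero(upd)), "weights synchronized")
```

Because most per-step updates fall below the threshold, only a small, data-dependent subset of weights is communicated at each synchronization point.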
We propose a max-pooling based loss function for training Long Short-Term Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low CPU, memory, and latency requirements. The max-pooling loss training can be further guided by initializing with a cross-entropy trained network. A posterior smoothing based evaluation approach is employed to measure keyword spotting performance. Our experimental results show that LSTM models trained using cross-entropy loss or max-pooling loss outperform a baseline feed-forward Deep Neural Network (DNN). In addition, a randomly initialized network...
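The max-pooling loss can be sketched in a few lines of PyTorch: for a keyword utterance, only the frame with the highest keyword posterior contributes to the cross-entropy, while background utterances are scored on every frame. The class indices, the background handling, and the function name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def max_pooling_loss(logits, is_keyword, keyword_class=1, background_class=0):
    """logits: (T, C) frame-level LSTM outputs for one utterance.
    Keyword utterance: cross-entropy only at the frame with the highest
    keyword posterior. Background utterance: cross-entropy on every frame."""
    log_post = F.log_softmax(logits, dim=-1)
    if is_keyword:
        t = torch.argmax(log_post[:, keyword_class])   # max-pooled frame
        return -log_post[t, keyword_class]
    return -log_post[:, background_class].mean()

# Toy usage with random frame posteriors.
logits = torch.randn(100, 2, requires_grad=True)
loss = max_pooling_loss(logits, is_keyword=True)
loss.backward()
```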
This is a report of our lessons learned building acoustic models from 1 Million hours of unlabeled speech, while labeled speech is restricted to 7,000 hours. We employ student/teacher training on the unlabeled data, helping scale out target generation in comparison to confidence model based methods, which require a decoder and a confidence model. To optimize storage and to parallelize target generation, we store only the high valued logits of the teacher model. Introducing the notion of scheduled learning, we interleave learning on unlabeled and labeled data. Training is distributed across a large number...
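A small numpy sketch of the "store only the high-valued teacher logits" idea: keep the top-k logits per frame (reducing storage and decoupling target generation from training), then expand them into a sparse soft-target distribution for the student. The value of k, the temperature, and the helper names are assumptions.

```python
import numpy as np

def top_k_targets(teacher_logits, k=20):
    """Store only the k highest-valued teacher logits per frame."""
    idx = np.argsort(teacher_logits, axis=-1)[..., -k:]
    vals = np.take_along_axis(teacher_logits, idx, axis=-1)
    return idx, vals

def soft_targets(idx, vals, num_classes, temperature=1.0):
    """Expand the stored top-k logits into a sparse soft-target distribution."""
    probs = np.exp(vals / temperature)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    full = np.zeros(idx.shape[:-1] + (num_classes,))
    np.put_along_axis(full, idx, probs, axis=-1)
    return full

teacher = np.random.default_rng(0).normal(size=(4, 3000))   # 4 frames, 3000 senones
idx, vals = top_k_targets(teacher, k=20)
targets = soft_targets(idx, vals, num_classes=3000)          # student CE targets
```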
In this work, we develop a technique for training features directly from the single-channel speech waveform in order to improve wake word (WW) detection performance. Conventional speech recognition systems typically extract a compact feature representation based on prior knowledge such as log-mel filter bank energy (LFBE). Such a feature is then used to train a deep neural network (DNN) acoustic model (AM). In contrast, we train the WW DNN AM directly from audio data in a stage-wise manner. We first build a feature extraction network with a small hidden bottleneck...
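A hedged PyTorch sketch of the stage-wise idea: a small convolutional front-end with a narrow bottleneck learns a compact feature directly from the waveform, and a wake-word classifier is then trained on top of it. The layer sizes, kernel/stride choices, and two-class output are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Stage 1: a small front-end with a narrow hidden bottleneck learns a compact
# feature directly from the 16 kHz waveform (~25 ms windows, 10 ms hop).
feature_net = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=400, stride=160),
    nn.ReLU(),
    nn.Conv1d(64, 16, kernel_size=1),      # small hidden bottleneck
)

# Stage 2: a wake-word classifier is trained on top of the learned feature.
classifier = nn.Sequential(
    nn.Flatten(1),
    nn.LazyLinear(128), nn.ReLU(),
    nn.Linear(128, 2),                     # wake word vs. background
)

wave = torch.randn(8, 1, 16000)            # batch of 1-second clips
scores = classifier(feature_net(wave))
```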
This paper presents a novel deep neural network (DNN) architecture with highway blocks (HWs) using a complex discrete Fourier transform (DFT) feature for keyword spotting. In our previous work, we showed that a feed-forward DNN with a time-delayed bottleneck layer (TDB-DNN), trained directly from the audio input, outperformed the model based on log-mel filter bank energy (LFBE) features, given a large amount of training data [1]. However, the deeper structure of such a network makes the optimization problem more difficult, which could easily...
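The highway block itself has a standard form, y = T(x)·H(x) + (1 − T(x))·x, which is what makes the deeper stack easier to optimize. Below is a generic PyTorch version fed with a real/imaginary DFT feature vector; the dimensions and depth are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HighwayBlock(nn.Module):
    """y = T(x) * H(x) + (1 - T(x)) * x, with a sigmoid transform gate T."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))
        return t * h + (1.0 - t) * x

# Complex DFT input for one frame, represented as concatenated real/imaginary
# parts of a 512-point DFT (257 bins).
frame = torch.randn(4, 2 * 257)
net = nn.Sequential(
    nn.Linear(2 * 257, 256), nn.ReLU(),
    *[HighwayBlock(256) for _ in range(5)],
    nn.Linear(256, 2),                      # keyword / background logits
)
scores = net(frame)
```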
Conventional far-field automatic speech recognition (ASR) systems typically employ microphone array techniques for speech enhancement in order to improve robustness against noise or reverberation. However, such techniques do not always yield an ASR accuracy improvement because the enhancement optimization criterion is not directly relevant to the ASR objective. In this work, we develop new acoustic modeling techniques that optimize spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input based on an ASR criterion directly. In contrast...
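A hedged sketch of the joint front-end/acoustic-model idea: a learnable per-frequency spatial filter over the microphone channels produces a single-channel log-power feature that feeds an LSTM, and the whole stack can be trained end-to-end on an ASR criterion. The filter parameterization, dimensions, and senone output size are assumptions.

```python
import torch
import torch.nn as nn

class SpatialFilterAM(nn.Module):
    """Jointly trainable multi-channel front-end + LSTM acoustic model: a
    per-frequency linear filter over microphone channels (a learnable
    beamformer) feeds an LSTM optimized on the ASR criterion."""
    def __init__(self, num_mics=4, num_bins=257, hidden=256, num_senones=3000):
        super().__init__()
        # Real and imaginary filter weights, one vector per frequency bin.
        self.w_re = nn.Parameter(torch.randn(num_bins, num_mics) * 0.1)
        self.w_im = nn.Parameter(torch.randn(num_bins, num_mics) * 0.1)
        self.lstm = nn.LSTM(num_bins, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, num_senones)

    def forward(self, x_re, x_im):
        # x_re/x_im: (batch, time, bins, mics) multi-channel STFT.
        y_re = (x_re * self.w_re).sum(-1) - (x_im * self.w_im).sum(-1)
        y_im = (x_re * self.w_im).sum(-1) + (x_im * self.w_re).sum(-1)
        power = torch.log1p(y_re ** 2 + y_im ** 2)    # log power feature
        h, _ = self.lstm(power)
        return self.out(h)                            # senone logits

model = SpatialFilterAM()
x_re = torch.randn(2, 50, 257, 4)
x_im = torch.randn(2, 50, 257, 4)
logits = model(x_re, x_im)                            # (2, 50, 3000)
```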
In the past, conventional i-vectors based on a Universal Background Model (UBM) have been successfully used as input features to adapt Deep Neural Network (DNN) Acoustic Models (AM) for Automatic Speech Recognition (ASR). In contrast, this paper introduces Hidden Markov Model (HMM) based i-vectors that use HMM state alignment information from an ASR system for estimating the i-vectors. Further, we propose passing these i-vectors through an explicit non-linear hidden layer of the DNN before combining them with standard acoustic features,...
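A minimal PyTorch sketch of the combination step described above: the i-vector passes through its own non-linear hidden layer before being concatenated with the per-frame acoustic features. Layer sizes, the sigmoid non-linearity, and the module name are illustrative assumptions; how the HMM-based i-vector itself is estimated is not shown.

```python
import torch
import torch.nn as nn

class IVectorAdaptedAM(nn.Module):
    """Pass the (HMM-based) i-vector through a non-linear hidden layer before
    concatenating it with the standard acoustic features."""
    def __init__(self, feat_dim=40, ivec_dim=100, ivec_hidden=64,
                 hidden=512, num_senones=3000):
        super().__init__()
        self.ivec_layer = nn.Sequential(nn.Linear(ivec_dim, ivec_hidden), nn.Sigmoid())
        self.trunk = nn.Sequential(
            nn.Linear(feat_dim + ivec_hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_senones),
        )

    def forward(self, feats, ivec):
        # feats: (batch, feat_dim) frames, ivec: (batch, ivec_dim) per speaker.
        z = self.ivec_layer(ivec)
        return self.trunk(torch.cat([feats, z], dim=-1))

model = IVectorAdaptedAM()
senone_logits = model(torch.randn(32, 40), torch.randn(32, 100))
```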
We investigate the problem of speaker adaptation of DNN acoustic models in two settings: traditional unsupervised adaptation and supervised adaptation (SuA), where a few minutes of transcribed speech is available. SuA presents additional difficulties when the test speaker's information does not match the registered information. Employing feature-space maximum likelihood linear regression (fMLLR) transformed features as side-information to the DNN, we reintroduce some classical ideas for combining adapted and unadapted features: early...
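The "early" combination can be illustrated with a short numpy sketch, assuming a per-speaker affine fMLLR transform: the adapted features are appended to the unadapted ones so the DNN receives both streams. The dimensions and the random transform are placeholders.

```python
import numpy as np

def early_fusion(unadapted, fmllr_transform):
    """Early combination: append fMLLR-adapted features to the unadapted ones,
    so the adapted stream acts as side-information to the DNN."""
    A, b = fmllr_transform                  # speaker-specific affine transform
    adapted = unadapted @ A.T + b
    return np.concatenate([unadapted, adapted], axis=-1)

# Hypothetical 40-dim features and a speaker-specific fMLLR transform.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 40))          # 100 frames
A = np.eye(40) + 0.01 * rng.normal(size=(40, 40))
b = 0.1 * rng.normal(size=40)
dnn_input = early_fusion(feats, (A, b))      # shape (100, 80)
```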
Accurate on-device keyword spotting (KWS) with low false accept and false reject rates is crucial to customer experience for far-field voice control of conversational agents. It is particularly challenging to maintain such performance in real world conditions where there is (a) ambient noise from external sources such as TV, household appliances, or other speech that is not directed at the device and (b) imperfect cancellation of the audio playback from the device, resulting in residual echo after being processed by Acoustic Echo Cancellation (AEC)...
This paper presents new methods for training large neural networks for phoneme probability estimation. A combination of the time-delay architecture and the recurrent network architecture is used to capture the important dynamic information of the speech signal. Motivated by the fact that the number of connections in a fully connected network grows super-linearly with the number of hidden units, schemes for sparse connection and connection pruning are explored. It is found that sparsely connected networks outperform their fully connected counterparts with an equal or smaller number of connections. The networks are evaluated in a hybrid HMM/ANN system...
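A numpy sketch of magnitude-based connection pruning, one simple way to obtain a sparsely connected network of the kind the abstract refers to: keep only the largest-magnitude weights of a fully connected layer. The keep fraction and layer size are assumptions.

```python
import numpy as np

def prune_smallest(weights, keep_fraction=0.25):
    """Magnitude-based connection pruning: zero out the smallest weights so
    that only `keep_fraction` of the connections remain."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * keep_fraction)
    threshold = np.partition(flat, -k)[-k]    # k-th largest magnitude
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(1)
W = rng.normal(size=(200, 200))               # fully connected hidden layer
W_sparse, mask = prune_smallest(W, keep_fraction=0.25)
print(mask.mean())                            # ~0.25 of connections kept
```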
The use of spatial information from multiple microphones can improve far-field automatic speech recognition (ASR) accuracy. However, conventional microphone array techniques degrade enhancement performance when there is an array geometry mismatch between design and test conditions. Moreover, such techniques do not always yield an ASR accuracy improvement due to the difference between the enhancement and ASR optimization objectives. In this work, we propose to unify them in an acoustic model framework by optimizing spatial filtering and long short-term memory (LSTM)...
Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency of its quality. The results of this evaluation are validated by repeating the procedure on a separate group of testers. Finally, an analysis of the nature of the errors, compared to hybrid unit selection synthesis, identifies the strengths and weaknesses of SSWS. Having a deeper...
Large scale machine learning (ML) systems such as the Alexa automatic speech recognition (ASR) system continue to improve with increasing amounts of manually transcribed training data. Instead of scaling manual transcription to impractical levels, we utilize semi-supervised learning (SSL) to learn acoustic models (AM) from a vast firehose of untranscribed audio data. Learning an AM from 1 Million hours of audio presents unique ML and system design challenges. We present the design and evaluation of a highly scalable and resource efficient SSL system for AM. Employing...
Anish Acharya, Suranjit Adhikari, Sanchit Agarwal, Vincent Auvray, Nehal Belgamwar, Arijit Biswas, Shubhra Chandra, Tagyoung Chung, Maryam Fazel-Zarandi, Raefer Gabriel, Shuyang Gao, Rahul Goel, Dilek Hakkani-Tur, Jan Jezabek, Abhay Jha, Jiun-Yu Kao, Prakash Krishnan, Peter Ku, Anuj Goyal, Chien-Wei Lin, Qing Liu, Arindam Mandal, Angeliki Metallinou, Vishal Naik, Yi Pan, Shachi Paul, Vittorio Perera, Abhishek Sethi, Minmin Shen, Nikko Strom, Eddie Wang. Proceedings of the 2021 Conference...
This paper presents our work on building a small-footprint keyword spotting system for a resource-limited language, which requires low CPU, memory, and latency. Our system consists of a deep neural network (DNN) and a hidden Markov model (HMM), combined in a hybrid DNN-HMM decoder. We investigate different transfer learning techniques to leverage knowledge and data from a resource-abundant source language to improve the DNN training for the target language, which has limited in-domain data. The approaches employed in this work include using the source-language model to initialize...
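A hedged PyTorch sketch of one of the transfer-learning approaches mentioned (initializing the target-language DNN from a source-language model): the shared hidden layers are copied, and only the output layer, whose senone set differs, is trained from scratch. Layer sizes and senone counts are hypothetical.

```python
import torch
import torch.nn as nn

def make_dnn(num_out):
    return nn.Sequential(
        nn.Linear(40, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, num_out),
    )

# Hypothetical sizes: source language has 3000 senones, target has 1500.
source_dnn = make_dnn(3000)
# ... assume source_dnn was trained on the resource-abundant language ...

target_dnn = make_dnn(1500)
# Copy the shared hidden layers; the output layer is trained from scratch
# on the limited in-domain target data.
target_dnn[0].load_state_dict(source_dnn[0].state_dict())
target_dnn[2].load_state_dict(source_dnn[2].state_dict())
```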
A method for unsupervised instantaneous speaker adaptation is presented and evaluated on a continuous speech recognition task in a man-machine dialogue system. The method is based on modeling of the systematic speaker variation. The variation is modeled by a low-dimensional speaker space and classification of speech segments conditioned on the position in that space. Because the effect of the variation is determined in an off-line training procedure using a multi-speaker database, complex variation can be modeled. Speaker adaptation is achieved with only the constraint that the speaker is constant over each utterance. Therefore, no...
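A loose numpy illustration of the low-dimensional speaker-space idea, under strong simplifying assumptions: the speaker's position is estimated from a single utterance by least-squares projection of its mean deviation onto an offline-learned basis, and the systematic component is removed while holding the position fixed over the utterance. The basis, dimensions, and linear-offset model are illustrative, not the paper's classification-based formulation.

```python
import numpy as np

# Hypothetical setup: speaker variation modeled as an offset spanned by a
# low-dimensional basis learned offline from a multi-speaker database.
rng = np.random.default_rng(0)
feat_dim, speaker_dim = 13, 2
basis = rng.normal(size=(speaker_dim, feat_dim))     # assumed learned offline

def estimate_speaker_position(utterance_feats, speaker_mean):
    """Project the utterance's average deviation from the speaker-independent
    mean onto the low-dimensional speaker space (least squares)."""
    deviation = utterance_feats.mean(axis=0) - speaker_mean
    coords, *_ = np.linalg.lstsq(basis.T, deviation, rcond=None)
    return coords

def adapt(utterance_feats, speaker_mean):
    """Instantaneous adaptation: remove the systematic speaker component,
    keeping the speaker position constant over the utterance."""
    coords = estimate_speaker_position(utterance_feats, speaker_mean)
    return utterance_feats - coords @ basis

si_mean = np.zeros(feat_dim)
utt = rng.normal(size=(50, feat_dim)) + np.array([0.5, -0.3]) @ basis
print(adapt(utt, si_mean).mean(axis=0).round(2))
```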
This paper describes an application database collected in Wizard-of-Oz experiments with a spoken dialogue system, WAXHOLM. The system provides information on boat traffic in the Stockholm archipelago ...