Jasha Droppo

ORCID: 0000-0001-6097-0090
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Natural Language Processing Techniques
  • Nuclear and radioactivity studies
  • Speech and dialogue systems
  • Graphite, nuclear technology, radiation studies
  • Topic Modeling
  • Advanced Adaptive Filtering Techniques
  • Advanced Data Compression Techniques
  • Neural Networks and Applications
  • Radioactive contamination and transfer
  • Radioactivity and Radon Measurements
  • Blind Source Separation Techniques
  • Risk and Safety Analysis
  • Machine Fault Diagnosis Techniques
  • Atmospheric chemistry and aerosols
  • Music Technology and Sound Studies
  • Wind and Air Flow Studies
  • Meteorological Phenomena and Simulations
  • Underwater Acoustics Research
  • Phonetics and Phonology Research
  • Domain Adaptation and Few-Shot Learning
  • Image and Signal Denoising Methods
  • Voice and Speech Disorders

Amazon (United States)
2020-2024

Amazon (Germany)
2021-2022

Seattle University
2021

Microsoft (United States)
2010-2019

Microsoft Research (United Kingdom)
2001-2018

Shanghai Jiao Tong University
2017

Pacific Northwest National Laboratory
1981-2007

Phoenix Contact (United States)
2007

University of Washington
1997-2002

Office of Scientific and Technical Information
1996

We show empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively—to but one bit per value—if the quantization error is carried forward across minibatches (error feedback). This size reduction makes it feasible to parallelize SGD through data-parallelism with fast processors like recent GPUs. We implement data-parallel deterministically distributed SGD by combining this finding with AdaGrad, automatic minibatch-size selection, double...

10.21437/interspeech.2014-274 article EN Interspeech 2014 2014-09-14
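The gradient quantization described above is easy to sketch. Below is a minimal NumPy illustration of 1-bit quantization with error feedback; the function name, the mean-of-sign-class reconstruction, and all constants are illustrative choices, not the paper's exact scheme.

```python
import numpy as np

def one_bit_quantize(grad, residual):
    """Quantize a gradient to one bit per value, carrying the
    quantization error forward into the next minibatch."""
    # Add the error left over from the previous quantization step.
    compensated = grad + residual
    # Encode only the sign; reconstruct each sign class with its mean
    # magnitude so the decoded values have roughly the right scale.
    positive = compensated >= 0
    pos_mean = compensated[positive].mean() if positive.any() else 0.0
    neg_mean = compensated[~positive].mean() if (~positive).any() else 0.0
    decoded = np.where(positive, pos_mean, neg_mean)
    # The new residual is whatever the 1-bit code failed to represent.
    new_residual = compensated - decoded
    return positive, decoded, new_residual

# Toy usage: the residual accumulates across minibatches.
rng = np.random.default_rng(0)
residual = np.zeros(8)
for step in range(3):
    grad = rng.normal(size=8)
    bits, decoded, residual = one_bit_quantize(grad, residual)
```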

Conversational speech recognition has served as a flagship task for speech and language research since the release of the Switchboard corpus in the 1990s. In this paper, we measure the human error rate on the widely used NIST 2000 test set, and find that our latest automated system has reached parity. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic, and 11.3% for the CallHome portion, where friends and family members have open-ended conversations. In both cases, our automated system establishes a new state of the art, and edges past...

10.48550/arxiv.1610.05256 preprint EN other-oa arXiv (Cornell University) 2016-01-01
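The error rates quoted here are word error rates computed by Levenshtein alignment against the reference transcript. A self-contained sketch of that standard measurement (not the NIST scoring tooling itself):

```python
def word_error_rate(ref, hyp):
    """Word error rate: Levenshtein edit distance over words,
    normalized by the reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

assert word_error_rate("the cat sat", "the cat sat") == 0.0
assert abs(word_error_rate("the cat sat", "a cat sat") - 1 / 3) < 1e-9
```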

We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the art on the Switchboard task. The system adds a CNN-BLSTM acoustic model to the set of model architectures combined previously, and includes character-based and dialog session aware LSTM language models in rescoring. For system combination we adopt a two-stage approach, whereby subsets of models are first combined at the senone/frame level, followed by word-level...

10.1109/icassp.2018.8461870 preprint EN 2018-04-01

We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard task. Inspired by machine learning ensemble techniques, the system uses a range of convolutional and recurrent neural networks. I-vector modeling and lattice-free MMI training provide significant gains for all acoustic model architectures. Language model rescoring with multiple forward and backward running RNNLMs, and word posterior-based...

10.1109/icassp.2017.7953159 preprint EN 2017-03-01

In this paper we demonstrate how to improve the performance of deep neural network (DNN) acoustic models using multi-task learning. In multi-task learning, the network is trained to perform both a primary classification task and one or more secondary tasks using a shared representation. The additional model parameters associated with the secondary tasks represent a very small increase in the number of parameters, and can be discarded at runtime. In this paper, we explore three natural choices for the secondary task: the phone label, phone context, and state context. We show that, even on a strong...

10.1109/icassp.2013.6639012 article EN IEEE International Conference on Acoustics Speech and Signal Processing 2013-05-01
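A minimal NumPy sketch of the multi-task setup described above: a shared representation feeds a primary senone head and a secondary head whose parameters can be discarded at runtime. Layer sizes, the tanh nonlinearity, and the weighting `alpha` are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One shared hidden layer feeding two output heads.
W_shared = rng.normal(scale=0.1, size=(40, 128))      # features -> shared
W_primary = rng.normal(scale=0.1, size=(128, 9000))   # senone classification
W_secondary = rng.normal(scale=0.1, size=(128, 40))   # e.g. phone-label task

def forward(x):
    h = np.tanh(x @ W_shared)   # shared representation trained by both tasks
    return softmax(h @ W_primary), softmax(h @ W_secondary)

def mtl_loss(x, y_primary, y_secondary, alpha=0.1):
    p_pri, p_sec = forward(x)
    # Cross-entropy per task; the secondary head only shapes the shared
    # weights during training and is dropped at decode time.
    ce_pri = -np.log(p_pri[np.arange(len(x)), y_primary]).mean()
    ce_sec = -np.log(p_sec[np.arange(len(x)), y_secondary]).mean()
    return ce_pri + alpha * ce_sec

x = rng.normal(size=(4, 40))
loss = mtl_loss(x, np.array([1, 2, 3, 4]), np.array([0, 1, 2, 3]))
```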

Conversational speech recognition has served as a flagship task for speech and language research since the release of the Switchboard corpus in the 1990s. In this paper, we measure a human error rate on the widely used NIST 2000 test set for commercial bulk transcription. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic, and 11.3% for the CallHome portion, where friends and family members have open-ended conversations. In both cases, our automated system edges past the human benchmark, achieving...

10.1109/taslp.2017.2756440 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2017-09-25

We investigate techniques based on deep neural networks (DNNs) for attacking the single-channel multi-talker speech recognition problem. Our proposed approach contains five key ingredients: a multi-style training strategy on artificially mixed speech data, a separate DNN to estimate senone posterior probabilities of the louder and softer speakers at each frame, a weighted finite-state transducer (WFST)-based two-talker decoder to jointly decode the mixed speech, a speaker switching penalty estimated from the energy pattern...

10.1109/taslp.2015.2444659 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2015-06-12

This paper advances the design of CTC-based all-neural (or end-to-end) speech recognizers. We propose a novel symbol inventory, and an iterated-CTC method in which a second system is used to transform a noisy initial output into a cleaner version. We present a number of stabilization and initialization methods we have found useful in training these networks. We evaluate our system on the commonly used NIST 2000 conversational telephony test set, and significantly exceed the previously published performance of similar systems, both with and without...

10.1109/icassp.2017.7953069 preprint EN 2017-03-01

In this paper, we propose two approaches to improve deep neural network (DNN) acoustic models for speech recognition in reverberant environments. Both methods utilize auxiliary information in training the DNN but differ in the type of information and the manner in which it is used. The first method uses parallel training data for multi-task learning, in which the network is trained to perform both a primary senone classification task and a secondary feature enhancement task using a shared representation. The second method uses a parameterization of the reverberant environment extracted from the observed signal...

10.1109/icassp.2015.7178925 article EN 2015-04-01

In this paper, we propose a deep convolutional neural network (CNN) with layer-wise context expansion and location-based attention, for large vocabulary speech recognition. In our model each higher layer uses information from broader contexts, along both the time and frequency dimensions, than its immediate lower layer. We show that both the context expansion and the attention can be implemented using the element-wise matrix product and the convolution operation. For this reason, contrary to other CNNs, no pooling operation is used in our model....

10.21437/interspeech.2016-251 article EN Interspeech 2016 2016-08-29

This paper presents a new technique for dynamic, frame-by-frame compensation of the Gaussian variances in the hidden Markov model (HMM), exploiting the feature variance or uncertainty estimated during the speech enhancement process, to improve noise-robust speech recognition. The technique provides an alternative to the Bayesian predictive classification decision rule by carrying out the integration over the feature space instead of the model-parameter space, offering a much simpler system implementation, lower computational cost, and dynamic...

10.1109/tsa.2005.845814 article EN IEEE Transactions on Speech and Audio Processing 2005-04-19
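The core of this style of variance compensation is simple: the decoder evaluates each Gaussian with its variance inflated, frame by frame, by the enhancement uncertainty. A minimal sketch, assuming diagonal covariances; all names are illustrative.

```python
import numpy as np

def log_gauss_with_uncertainty(x_hat, mu, var_hmm, var_enh):
    """Diagonal-Gaussian log-likelihood of the enhanced feature x_hat,
    with the HMM variance inflated by the per-frame enhancement
    uncertainty (dynamic variance compensation)."""
    var = var_hmm + var_enh
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x_hat - mu) ** 2 / var)

# With var_enh = 0 this reduces to ordinary Gaussian scoring; a large
# var_enh flattens the score, discounting unreliable frames.
mu, var_hmm = np.zeros(13), np.ones(13)
score = log_gauss_with_uncertainty(np.zeros(13), mu, var_hmm, 0.5 * np.ones(13))
```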

Speech recognition front end noise removal algorithms have, in the past, estimated clean speech features from corrupted features. The accuracy of this process varies from frame to frame, and across each dimension of the feature stream, due in part to the instantaneous SNR of the input. In this paper, we show that this localized knowledge of accuracy can be directly incorporated into the Gaussian evaluation within the decoder, to produce higher recognition accuracies. To prove the concept, we modify the SPLICE algorithm to output uncertainty information, and show that its combination with uncertainty decoding can remove...

10.1109/icassp.2002.5743653 article EN IEEE International Conference on Acoustics Speech and Signal Processing 2002-05-01

This paper compares the theoretical efficiency of model-parallel and data-parallel distributed stochastic gradient descent training of DNNs. For a typical Switchboard DNN with 46M parameters, the results are not pretty: With modern GPUs and interconnects, model parallelism is optimal with only 3 GPUs in a single server, while data parallelism with a minibatch size of 1024 does not even scale to 2 GPUs. We further show that the situation can be improved by increasing the minibatch size (through a combination of AdaGrad and automatic adjustments of learning rate and minibatch size) and gradient compression....

10.1109/icassp.2014.6853593 article EN 2014-05-01
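A toy cost model makes the scaling argument concrete: per-minibatch time is compute divided across GPUs plus the gradient exchange, so communication quickly dominates. The ring all-reduce term, bandwidth, and compute time below are illustrative assumptions, not the paper's measurements.

```python
def minibatch_time(k, t_compute, grad_bytes, bandwidth):
    """Per-minibatch wall time with k GPUs: compute is split k ways,
    while a ring all-reduce moves ~2*(k-1)/k of the gradient per GPU."""
    t_comm = 2 * (k - 1) / k * grad_bytes / bandwidth
    return t_compute / k + t_comm

grad_bytes = 46e6 * 4   # 46M parameters, float32 gradients
bandwidth = 6e9         # bytes/s over the interconnect (assumed)
for k in (1, 2, 4, 8):
    t_full = minibatch_time(k, 0.05, grad_bytes, bandwidth)
    t_1bit = minibatch_time(k, 0.05, grad_bytes / 32, bandwidth)
    # Speedup vs. 1 GPU: uncompressed gradients barely scale,
    # 1-bit compression restores near-linear scaling.
    print(k, round(0.05 / t_full, 2), round(0.05 / t_1bit, 2))
```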

Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). Permutation invariant training (PIT) is a state of the art model-based approach, which applies a single neural network to solve this single-input, multiple-output modeling problem. We propose to advance the current state of the art by imposing a modular structure on the neural network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion. The modular structure splits the problem into...

10.1109/taslp.2017.2765834 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2017-10-23
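The PIT idea referenced above can be sketched compactly: evaluate the loss under every assignment of output streams to reference streams and keep the minimum. MSE stands in here for the actual training criterion, and the brute-force permutation search is only practical for small talker counts.

```python
import itertools
import numpy as np

def pit_loss(outputs, targets):
    """Permutation invariant training: score every assignment of
    network output streams to reference streams, keep the best."""
    n = len(outputs)
    best = np.inf
    for perm in itertools.permutations(range(n)):
        loss = sum(np.mean((outputs[i] - targets[p]) ** 2)
                   for i, p in enumerate(perm))
        best = min(best, loss / n)
    return best

# Two-talker example: the loss is identical whichever order the
# reference talkers appear in.
a, b = np.ones(5), np.zeros(5)
assert pit_loss([a, b], [b, a]) == pit_loss([a, b], [a, b])
```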

Neural network-based speech separation has received a surge of interest in recent years. Previously proposed methods either are speaker independent or extract a target speaker's voice by using his or her voice snippet. In applications such as home devices or office meeting transcriptions, a list of possible speakers is available, which can be leveraged for the speech separation. This paper proposes a novel speech extraction method that utilizes an inventory of voice snippets of possible interfering speakers, or enrollment data, in addition to that of the target speaker....

10.1109/icassp.2019.8682245 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17
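One way to read the inventory idea is as attention over enrollment embeddings. The sketch below is generic dot-product attention over speaker profiles, not the paper's exact architecture; all names are illustrative.

```python
import numpy as np

def attend_inventory(mixture_emb, inventory):
    """Weight each enrolled speaker profile by its similarity to an
    embedding of the mixed signal, then pool the profiles."""
    scores = inventory @ mixture_emb            # dot-product similarity
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over speakers
    # The attended profile would condition the separation network.
    return weights @ inventory, weights

rng = np.random.default_rng(0)
inventory = rng.normal(size=(5, 64))            # 5 enrolled speakers
profile, weights = attend_inventory(rng.normal(size=64), inventory)
```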

This paper describes recent improvements to SPLICE, Stereo-based Piecewise Linear Compensation for Environments, which produces an estimate of the cepstrum of undistorted speech given the observed cepstrum of distorted speech. For distributed speech recognition applications, SPLICE can be placed at the server, thus limiting the processing that would take place on the client. We evaluated this algorithm on the Aurora2 task, which consists of digit sequences within the TIDigits database that have been digitally corrupted by passing them through a linear...

10.21437/eurospeech.2001-77 article EN 2001-09-03
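A minimal sketch of the SPLICE-style MMSE correction: a GMM over the distorted cepstrum weights a set of corrections learned from stereo data, and the clean estimate is their posterior-weighted sum. Bias-only (rather than full linear) corrections are assumed here for brevity.

```python
import numpy as np

def splice_enhance(y, means, variances, priors, corrections):
    """MMSE-style SPLICE: x_hat = y + sum_k p(k|y) r_k, where p(k|y)
    comes from a diagonal-covariance GMM over the distorted cepstrum
    and r_k are bias corrections learned from stereo data."""
    log_post = (-0.5 * (np.log(2 * np.pi * variances)
                        + (y - means) ** 2 / variances).sum(axis=1)
                + np.log(priors))
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                      # mixture posteriors p(k|y)
    return y + post @ corrections

rng = np.random.default_rng(0)
K, d = 8, 13                                # illustrative sizes
x_hat = splice_enhance(rng.normal(size=d), rng.normal(size=(K, d)),
                       np.ones((K, d)), np.full(K, 1 / K),
                       rng.normal(scale=0.1, size=(K, d)))
```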

We describe a novel technique of SPLICE (Stereo-based Piecewise Linear Compensation for Environments) for high performance robust speech recognition. It is an efficient noise reduction and channel distortion compensation technique that makes effective use of stereo training data. We present a new version of SPLICE using the minimum-mean-square-error decision, and an extension using clusters of hidden Markov models (HMMs). Comprehensive results on the Wall Street Journal large vocabulary recognition task across a wide range of noise types...

10.1109/icassp.2001.940827 article EN 2002-11-13

This paper presents a novel speech feature enhancement technique based on a probabilistic, nonlinear model of the acoustic environment that effectively incorporates the phase relationship (hence phase sensitive) between the clean speech and the corrupting noise in the distortion process. The core of the algorithm is an MMSE (minimum mean square error) estimator for the log Mel power spectra of clean speech based on the phase-sensitive model, using a highly efficient single-point, second-order Taylor series expansion to approximate the joint probability of clean and noisy speech modeled as...

10.1109/tsa.2003.820201 article EN IEEE Transactions on Speech and Audio Processing 2004-03-01
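The phase-sensitive distortion model itself is compact enough to write down. In the log Mel power domain, with alpha the cosine of the phase angle between speech and noise, the noisy observation follows from |Y|^2 = |X|^2 + |N|^2 + 2*alpha*|X||N|:

```python
import numpy as np

def noisy_log_mel(x, n, alpha):
    """Phase-sensitive model of the acoustic environment:
    y = x + log(1 + e^(n-x) + 2*alpha*e^((n-x)/2)) in the log Mel
    power domain, where alpha is the speech/noise phase cosine."""
    return x + np.log1p(np.exp(n - x) + 2 * alpha * np.exp((n - x) / 2))

# alpha = 0 recovers the familiar phase-insensitive power-sum model:
# log(4) power speech plus log(1) power noise gives log(5).
x, n = np.log(4.0), np.log(1.0)
assert np.isclose(noisy_log_mel(x, n, 0.0), np.log(5.0))
```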

We describe a novel algorithm for recursive estimation of the nonstationary acoustic noise which corrupts clean speech, and a successful application of the algorithm in the speech feature enhancement framework of noise-normalized SPLICE for robust speech recognition. The algorithm makes use of a nonlinear model of the acoustic environment in the cepstral domain. Central to the algorithm is an innovative iterative stochastic approximation technique that improves the piecewise linear approximation to the nonlinearity involved and subsequently increases the accuracy of noise estimation. We report comprehensive experiments on...

10.1109/tsa.2003.818076 article EN IEEE Transactions on Speech and Audio Processing 2003-11-01

We present a non-linear feature-domain noise reduction algorithm based on the minimum mean square error (MMSE) criterion on Mel-frequency cepstra (MFCC) for environment-robust speech recognition. Distinguishing it from the MMSE enhancement of the log spectral amplitude proposed by Ephraim and Malah (E&M) [7], the new algorithm presented in this paper develops a suppression rule that applies to the power of the magnitude of the filter-banks' outputs and to MFCC directly, making it demonstrably more effective for noise-robust speech recognition. The variance contains...

10.1109/icassp.2008.4518541 article EN Proceedings of the ... IEEE International Conference on Acoustics, Speech, and Signal Processing 2008-03-01

In traditional methods for noise robust automatic speech recognition, the acoustic models are typically trained using clean speech or using multi-condition data that is processed by the same feature enhancement algorithm expected to be used in decoding. In this paper, we propose a noise adaptive training (NAT) algorithm that can be applied to all training data and normalizes the environmental distortion as part of the model training. In contrast to feature enhancement methods, NAT estimates the underlying "pseudo-clean" model parameters directly without relying on point estimates of the clean speech features as an intermediate...

10.1109/tasl.2010.2040522 article EN IEEE Transactions on Audio Speech and Language Processing 2010-03-30

Recent work in automatic recognition of conversational telephone speech (CTS) has achieved accuracy levels comparable to human transcribers, although there is some debate about how to precisely quantify human performance on this task, using the NIST 2000 CTS evaluation set. This raises the question of what systematic differences, if any, may be found differentiating human from machine transcription errors. In this paper we approach this question by comparing the output of our most accurate system to that of a standard vendor transcription pipeline. We find...

10.21437/interspeech.2017-1544 preprint EN Interspeech 2017 2017-08-16

In this work, we propose a novel and efficient minimum word error rate (MWER) training method for the RNN-Transducer (RNN-T). Unlike previous work on this topic, which performs on-the-fly limited-size beam-search decoding and generates alignment scores for expected edit-distance computation, our proposed method re-calculates and sums the scores of all the possible alignments for each hypothesis in the N-best lists. The hypothesis probability scores and back-propagated gradients are calculated efficiently using the forward-backward algorithm. Moreover, the proposed method allows...

10.21437/interspeech.2020-1557 article EN Interspeech 2020 2020-10-25
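A minimal sketch of an MWER-style objective over an N-best list: renormalize the hypothesis scores, then weight each hypothesis by its word errors relative to the mean. In the method above, the hypothesis log-probabilities would come from the forward-backward summation over all alignments; here they are simply given as inputs.

```python
import numpy as np

def mwer_loss(log_probs, word_errors):
    """Expected word errors under the renormalized N-best
    distribution, with the mean error subtracted as a baseline."""
    p = np.exp(log_probs - log_probs.max())
    p /= p.sum()                            # renormalize over the N-best
    errors = np.asarray(word_errors, dtype=float)
    return float(p @ (errors - errors.mean()))

# Toy usage: the loss is negative when probability mass already
# favors the low-error hypotheses.
loss = mwer_loss(np.array([-1.0, -2.0, -5.0]), [0, 2, 4])
```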

Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encoding using a contrastive loss in a way similar to wav2vec 2.0. However, the quantization process is regularized by an additional consistency network that learns to reconstruct the input features, as in a VQ-VAE model. The proposed self-supervised model is trained on 10k hours of unlabeled data and subsequently used as the speech encoder in an RNN-T ASR model and fine-tuned with...

10.21437/interspeech.2021-717 article EN Interspeech 2021 2021-08-27
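The contrastive part of this objective can be sketched as follows: at each masked frame, the context vector must pick out its own quantized target among the quantized vectors at the other masked frames. The consistency (reconstruction) term that distinguishes Wav2vec-C is omitted; the temperature and cosine similarity are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(context, quantized, masked_idx, temperature=0.1):
    """wav2vec 2.0-style contrastive objective: each masked frame's
    context vector must identify the true quantized target among the
    quantized vectors at the other masked frames (the distractors)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    loss = 0.0
    for t in masked_idx:
        sims = np.array([cos(context[t], quantized[j]) / temperature
                         for j in masked_idx])
        true_pos = list(masked_idx).index(t)
        # Negative log-softmax of the true target's similarity.
        loss += -(sims[true_pos] - np.log(np.sum(np.exp(sims))))
    return loss / len(masked_idx)

rng = np.random.default_rng(0)
ctx, q = rng.normal(size=(20, 32)), rng.normal(size=(20, 32))
loss = contrastive_loss(ctx, q, masked_idx=[3, 7, 11, 15])
```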