Jasha Droppo

ORCID: 0000-0001-6097-0090
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Natural Language Processing Techniques
  • Nuclear and radioactivity studies
  • Speech and dialogue systems
  • Graphite, nuclear technology, radiation studies
  • Topic Modeling
  • Advanced Adaptive Filtering Techniques
  • Advanced Data Compression Techniques
  • Neural Networks and Applications
  • Radioactive contamination and transfer
  • Radioactivity and Radon Measurements
  • Blind Source Separation Techniques
  • Risk and Safety Analysis
  • Machine Fault Diagnosis Techniques
  • Atmospheric chemistry and aerosols
  • Music Technology and Sound Studies
  • Wind and Air Flow Studies
  • Meteorological Phenomena and Simulations
  • Underwater Acoustics Research
  • Phonetics and Phonology Research
  • Domain Adaptation and Few-Shot Learning
  • Image and Signal Denoising Methods
  • Voice and Speech Disorders

Amazon (United States)
2020-2024

Amazon (Germany)
2021-2022

Seattle University
2021

Microsoft (United States)
2010-2019

Microsoft Research (United Kingdom)
2001-2018

Shanghai Jiao Tong University
2017

Pacific Northwest National Laboratory
1981-2007

Phoenix Contact (United States)
2007

University of Washington
1997-2002

Office of Scientific and Technical Information
1996

We show empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively—to but one bit per value—if the quantization error is carried forward across minibatches (error feedback). This size reduction makes it feasible to parallelize SGD through data-parallelism with fast processors like recent GPUs. We implement data-parallel deterministically distributed SGD by combining this finding with AdaGrad, automatic minibatch-size selection, double...

10.21437/interspeech.2014-274 article EN Interspeech 2014 2014-09-14
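The gradient quantization described above is easy to sketch. Below is a minimal NumPy illustration of 1-bit quantization with error feedback; the function name, the mean-of-sign-class reconstruction, and all constants are illustrative choices, not the paper's exact scheme.

```python
import numpy as np

def one_bit_quantize(grad, residual):
    """Quantize a gradient to one bit per value, carrying the
    quantization error forward into the next minibatch."""
    # Add the error left over from the previous quantization step.
    compensated = grad + residual
    # Encode only the sign; reconstruct each sign class with its mean
    # magnitude so the decoded values have roughly the right scale.
    positive = compensated >= 0
    pos_mean = compensated[positive].mean() if positive.any() else 0.0
    neg_mean = compensated[~positive].mean() if (~positive).any() else 0.0
    decoded = np.where(positive, pos_mean, neg_mean)
    # The new residual is whatever the 1-bit code failed to represent.
    new_residual = compensated - decoded
    return positive, decoded, new_residual

# Toy usage: the residual accumulates across minibatches.
rng = np.random.default_rng(0)
residual = np.zeros(8)
for step in range(3):
    grad = rng.normal(size=8)
    bits, decoded, residual = one_bit_quantize(grad, residual)
```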

Conversational speech recognition has served as a flagship task for speech and language research since the release of the Switchboard corpus in the 1990s. In this paper, we measure the human error rate on the widely used NIST 2000 test set, and find that our latest automated system has reached parity. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic, and 11.3% for the CallHome portion, where friends and family members have open-ended conversations. In both cases, our automated system establishes a new state of the art, and edges past...

10.48550/arxiv.1610.05256 preprint EN other-oa arXiv (Cornell University) 2016-01-01
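The error rates quoted here are word error rates computed by Levenshtein alignment against the reference transcript. A self-contained sketch of that standard measurement (not the NIST scoring tooling itself):

```python
def word_error_rate(ref, hyp):
    """Word error rate: Levenshtein edit distance over words,
    normalized by the reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

assert word_error_rate("the cat sat", "the cat sat") == 0.0
assert abs(word_error_rate("the cat sat", "a cat sat") - 1 / 3) < 1e-9
```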

We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the art on the Switchboard task. The system adds a CNN-BLSTM acoustic model to the set of model architectures combined previously, and includes character-based and dialog session aware LSTM language models in rescoring. For system combination we adopt a two-stage approach, whereby subsets of models are first combined at the senone/frame level, followed by word-level...

10.1109/icassp.2018.8461870 preprint EN 2018-04-01

We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard task. Inspired by machine learning ensemble techniques, the system uses a range of convolutional and recurrent neural networks. I-vector modeling and lattice-free MMI training provide significant gains for all acoustic model architectures. Language model rescoring with multiple forward and backward running RNNLMs, and word posterior-based...

10.1109/icassp.2017.7953159 preprint EN 2017-03-01

In this paper we demonstrate how to improve the performance of deep neural network (DNN) acoustic models using multi-task learning. In multi-task learning, the network is trained to perform both a primary classification task and one or more secondary tasks using a shared representation. The additional model parameters associated with the secondary tasks represent a very small increase in the number of parameters, and can be discarded at runtime. In this paper, we explore three natural choices for the secondary task: the phone label, phone context, and state context. We show that, even on a strong...

10.1109/icassp.2013.6639012 article EN IEEE International Conference on Acoustics Speech and Signal Processing 2013-05-01
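A minimal NumPy sketch of the multi-task setup described above: a shared representation feeds a primary senone head and a secondary head whose parameters can be discarded at runtime. Layer sizes, the tanh nonlinearity, and the weighting `alpha` are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One shared hidden layer feeding two output heads.
W_shared = rng.normal(scale=0.1, size=(40, 128))      # features -> shared
W_primary = rng.normal(scale=0.1, size=(128, 9000))   # senone classification
W_secondary = rng.normal(scale=0.1, size=(128, 40))   # e.g. phone-label task

def forward(x):
    h = np.tanh(x @ W_shared)   # shared representation trained by both tasks
    return softmax(h @ W_primary), softmax(h @ W_secondary)

def mtl_loss(x, y_primary, y_secondary, alpha=0.1):
    p_pri, p_sec = forward(x)
    # Cross-entropy per task; the secondary head only shapes the shared
    # weights during training and is dropped at decode time.
    ce_pri = -np.log(p_pri[np.arange(len(x)), y_primary]).mean()
    ce_sec = -np.log(p_sec[np.arange(len(x)), y_secondary]).mean()
    return ce_pri + alpha * ce_sec

x = rng.normal(size=(4, 40))
loss = mtl_loss(x, np.array([1, 2, 3, 4]), np.array([0, 1, 2, 3]))
```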

Conversational speech recognition has served as a flagship task for speech and language research since the release of the Switchboard corpus in the 1990s. In this paper, we measure a human error rate on the widely used NIST 2000 test set for commercial bulk transcription. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic, and 11.3% for the CallHome portion, where friends and family members have open-ended conversations. In both cases, our automated system edges past the human benchmark, achieving...

10.1109/taslp.2017.2756440 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2017-09-25

We investigate techniques based on deep neural networks (DNNs) for attacking the single-channel multi-talker speech recognition problem. Our proposed approach contains five key ingredients: a multi-style training strategy on artificially mixed speech data, a separate DNN to estimate senone posterior probabilities of the louder and softer speakers at each frame, a weighted finite-state transducer (WFST)-based two-talker decoder to jointly decode the mixed speech, a speaker switching penalty estimated from the energy pattern...

10.1109/taslp.2015.2444659 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2015-06-12

This paper advances the design of CTC-based all-neural (or end-to-end) speech recognizers. We propose a novel symbol inventory, and an iterated-CTC method in which a second system is used to transform a noisy initial output into a cleaner version. We present a number of stabilization and initialization methods we have found useful in training these networks. We evaluate our system on the commonly used NIST 2000 conversational telephony test set, and significantly exceed the previously published performance of similar systems, both with and without...

10.1109/icassp.2017.7953069 preprint EN 2017-03-01

In this paper, we propose two approaches to improve deep neural network (DNN) acoustic models for speech recognition in reverberant environments. Both methods utilize auxiliary information in training the DNN but differ in the type of information and the manner in which it is used. The first method uses parallel training data for multi-task learning, in which the network is trained to perform both a primary senone classification task and a secondary feature enhancement task using a shared representation. The second method uses a parameterization of the reverberant environment extracted from the observed signal...

10.1109/icassp.2015.7178925 article EN 2015-04-01

In this paper, we propose a deep convolutional neural network (CNN) with layer-wise context expansion and location-based attention, for large vocabulary speech recognition. In our model each higher layer uses information from broader contexts, along both the time and frequency dimensions, than its immediate lower layer. We show that both the context expansion and the attention can be implemented using the element-wise matrix product and the convolution operation. For this reason, contrary to other CNNs, no pooling operation is used in our model....

10.21437/interspeech.2016-251 article EN Interspeech 2016 2016-08-29

This paper presents a new technique for dynamic, frame-by-frame compensation of the Gaussian variances in the hidden Markov model (HMM), exploiting the feature variance or uncertainty estimated during the speech enhancement process, to improve noise-robust speech recognition. The technique provides an alternative to the Bayesian predictive classification decision rule by carrying out the integration over the feature space instead of the model-parameter space, offering a much simpler system implementation, lower computational cost, and dynamic...

10.1109/tsa.2005.845814 article EN IEEE Transactions on Speech and Audio Processing 2005-04-19
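The core of this style of variance compensation is simple: the decoder evaluates each Gaussian with its variance inflated, frame by frame, by the enhancement uncertainty. A minimal sketch, assuming diagonal covariances; all names are illustrative.

```python
import numpy as np

def log_gauss_with_uncertainty(x_hat, mu, var_hmm, var_enh):
    """Diagonal-Gaussian log-likelihood of the enhanced feature x_hat,
    with the HMM variance inflated by the per-frame enhancement
    uncertainty (dynamic variance compensation)."""
    var = var_hmm + var_enh
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x_hat - mu) ** 2 / var)

# With var_enh = 0 this reduces to ordinary Gaussian scoring; a large
# var_enh flattens the score, discounting unreliable frames.
mu, var_hmm = np.zeros(13), np.ones(13)
score = log_gauss_with_uncertainty(np.zeros(13), mu, var_hmm, 0.5 * np.ones(13))
```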

Speech recognition front end noise removal algorithms have, in the past, estimated clean speech features from corrupted features. The accuracy of this process varies from frame to frame, and across each dimension of the feature stream, due in part to the instantaneous SNR of the input. In this paper, we show that this localized knowledge of accuracy can be directly incorporated into the Gaussian evaluation within the decoder, to produce higher recognition accuracies. To prove the concept, we modify the SPLICE algorithm to output uncertainty information, and show that its combination with uncertainty decoding can remove...

10.1109/icassp.2002.5743653 article EN IEEE International Conference on Acoustics Speech and Signal Processing 2002-05-01

This paper compares the theoretical efficiency of model-parallel and data-parallel distributed stochastic gradient descent training of DNNs. For a typical Switchboard DNN with 46M parameters, the results are not pretty: With modern GPUs and interconnects, model parallelism is optimal with only 3 GPUs in a single server, while data parallelism with a minibatch size of 1024 does not even scale to 2 GPUs. We further show that the situation can be improved by increasing the minibatch size (through a combination of AdaGrad and automatic adjustments of learning rate and minibatch size) and gradient compression....

10.1109/icassp.2014.6853593 article EN 2014-05-01
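A toy cost model makes the scaling argument concrete: per-minibatch time is compute divided across GPUs plus the gradient exchange, so communication quickly dominates. The ring all-reduce term, bandwidth, and compute time below are illustrative assumptions, not the paper's measurements.

```python
def minibatch_time(k, t_compute, grad_bytes, bandwidth):
    """Per-minibatch wall time with k GPUs: compute is split k ways,
    while a ring all-reduce moves ~2*(k-1)/k of the gradient per GPU."""
    t_comm = 2 * (k - 1) / k * grad_bytes / bandwidth
    return t_compute / k + t_comm

grad_bytes = 46e6 * 4   # 46M parameters, float32 gradients
bandwidth = 6e9         # bytes/s over the interconnect (assumed)
for k in (1, 2, 4, 8):
    t_full = minibatch_time(k, 0.05, grad_bytes, bandwidth)
    t_1bit = minibatch_time(k, 0.05, grad_bytes / 32, bandwidth)
    # Speedup vs. 1 GPU: uncompressed gradients barely scale,
    # 1-bit compression restores near-linear scaling.
    print(k, round(0.05 / t_full, 2), round(0.05 / t_1bit, 2))
```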

Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). Permutation invariant training (PIT) is a state of the art model-based approach, which applies a single neural network to solve this single-input, multiple-output modeling problem. We propose to advance the current state of the art by imposing a modular structure on the neural network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion. The modular structure splits the problem into...

10.1109/taslp.2017.2765834 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2017-10-23
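The PIT idea referenced above can be sketched compactly: evaluate the loss under every assignment of output streams to reference streams and keep the minimum. MSE stands in here for the actual training criterion, and the brute-force permutation search is only practical for small talker counts.

```python
import itertools
import numpy as np

def pit_loss(outputs, targets):
    """Permutation invariant training: score every assignment of
    network output streams to reference streams, keep the best."""
    n = len(outputs)
    best = np.inf
    for perm in itertools.permutations(range(n)):
        loss = sum(np.mean((outputs[i] - targets[p]) ** 2)
                   for i, p in enumerate(perm))
        best = min(best, loss / n)
    return best

# Two-talker example: the loss is identical whichever order the
# reference talkers appear in.
a, b = np.ones(5), np.zeros(5)
assert pit_loss([a, b], [b, a]) == pit_loss([a, b], [a, b])
```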

Neural network-based speech separation has received a surge of interest in recent years. Previously proposed methods either are speaker independent or extract a target speaker's voice by using his or her voice snippet. In applications such as home devices or office meeting transcriptions, a list of possible speakers is available, which can be leveraged for the speech separation. This paper proposes a novel speech extraction method that utilizes an inventory of voice snippets of possible interfering speakers, or enrollment data, in addition to that of the target speaker....

10.1109/icassp.2019.8682245 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17
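One way to read the inventory idea is as attention over enrollment embeddings. The sketch below is generic dot-product attention over speaker profiles, not the paper's exact architecture; all names are illustrative.

```python
import numpy as np

def attend_inventory(mixture_emb, inventory):
    """Weight each enrolled speaker profile by its similarity to an
    embedding of the mixed signal, then pool the profiles."""
    scores = inventory @ mixture_emb            # dot-product similarity
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over speakers
    # The attended profile would condition the separation network.
    return weights @ inventory, weights

rng = np.random.default_rng(0)
inventory = rng.normal(size=(5, 64))            # 5 enrolled speakers
profile, weights = attend_inventory(rng.normal(size=64), inventory)
```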

This paper describes recent improvements to SPLICE, Stereo-based Piecewise Linear Compensation for Environments, which produces an estimate of the cepstrum of undistorted speech given the observed cepstrum of distorted speech. For distributed speech recognition applications, SPLICE can be placed at the server, thus limiting the processing that would take place on the client. We evaluated this algorithm on the Aurora2 task, which consists of digit sequences within the TIDigits database that have been digitally corrupted by passing them through a linear...

10.21437/eurospeech.2001-77 article EN 2001-09-03
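A minimal sketch of the SPLICE-style MMSE correction: a GMM over the distorted cepstrum weights a set of corrections learned from stereo data, and the clean estimate is their posterior-weighted sum. Bias-only (rather than full linear) corrections are assumed here for brevity.

```python
import numpy as np

def splice_enhance(y, means, variances, priors, corrections):
    """MMSE-style SPLICE: x_hat = y + sum_k p(k|y) r_k, where p(k|y)
    comes from a diagonal-covariance GMM over the distorted cepstrum
    and r_k are bias corrections learned from stereo data."""
    log_post = (-0.5 * (np.log(2 * np.pi * variances)
                        + (y - means) ** 2 / variances).sum(axis=1)
                + np.log(priors))
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                      # mixture posteriors p(k|y)
    return y + post @ corrections

rng = np.random.default_rng(0)
K, d = 8, 13                                # illustrative sizes
x_hat = splice_enhance(rng.normal(size=d), rng.normal(size=(K, d)),
                       np.ones((K, d)), np.full(K, 1 / K),
                       rng.normal(scale=0.1, size=(K, d)))
```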

We describe a novel technique of SPLICE (Stereo-based Piecewise Linear Compensation for Environments) for high performance robust speech recognition. It is an efficient noise reduction and channel distortion compensation technique that makes effective use of stereo training data. We present a new version of SPLICE using the minimum-mean-square-error decision, and an extension using clusters of hidden Markov models (HMMs). Comprehensive results on the Wall Street Journal large vocabulary recognition task across a wide range of noise types...

10.1109/icassp.2001.940827 article EN 2002-11-13

This paper presents a novel speech feature enhancement technique based on a probabilistic, nonlinear model of the acoustic environment that effectively incorporates the phase relationship (hence phase sensitive) between the clean speech and the corrupting noise in the distortion process. The core of the algorithm is an MMSE (minimum mean square error) estimator for the log Mel power spectra of clean speech based on the phase-sensitive model, using a highly efficient single-point, second-order Taylor series expansion to approximate the joint probability of clean and noisy speech modeled as...

10.1109/tsa.2003.820201 article EN IEEE Transactions on Speech and Audio Processing 2004-03-01
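The phase-sensitive distortion model itself is compact enough to write down. In the log Mel power domain, with alpha the cosine of the phase angle between speech and noise, the noisy observation follows from |Y|^2 = |X|^2 + |N|^2 + 2*alpha*|X||N|:

```python
import numpy as np

def noisy_log_mel(x, n, alpha):
    """Phase-sensitive model of the acoustic environment:
    y = x + log(1 + e^(n-x) + 2*alpha*e^((n-x)/2)) in the log Mel
    power domain, where alpha is the speech/noise phase cosine."""
    return x + np.log1p(np.exp(n - x) + 2 * alpha * np.exp((n - x) / 2))

# alpha = 0 recovers the familiar phase-insensitive power-sum model:
# log(4) power speech plus log(1) power noise gives log(5).
x, n = np.log(4.0), np.log(1.0)
assert np.isclose(noisy_log_mel(x, n, 0.0), np.log(5.0))
```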

We describe a novel algorithm for recursive estimation of the nonstationary acoustic noise which corrupts clean speech, and a successful application of the algorithm in the speech feature enhancement framework of noise-normalized SPLICE for robust speech recognition. The algorithm makes use of a nonlinear model of the acoustic environment in the cepstral domain. Central to the algorithm is an innovative iterative stochastic approximation technique that improves the piecewise linear approximation to the nonlinearity involved and subsequently increases the accuracy of noise estimation. We report comprehensive experiments on...

10.1109/tsa.2003.818076 article EN IEEE Transactions on Speech and Audio Processing 2003-11-01

We present a non-linear feature-domain noise reduction algorithm based on the minimum mean square error (MMSE) criterion on Mel-frequency cepstra (MFCC) for environment-robust speech recognition. Distinguishing it from the MMSE enhancement of the log spectral amplitude proposed by Ephraim and Malah (E&M) [7], the new algorithm presented in this paper develops a suppression rule that applies to the power of the magnitude of the filter-banks' outputs and to MFCC directly, making it demonstrably more effective for noise-robust speech recognition. The variance contains...

10.1109/icassp.2008.4518541 article EN Proceedings of the ... IEEE International Conference on Acoustics, Speech, and Signal Processing 2008-03-01

In traditional methods for noise robust automatic speech recognition, the acoustic models are typically trained using clean speech or using multi-condition data that is processed by the same feature enhancement algorithm expected to be used in decoding. In this paper, we propose a noise adaptive training (NAT) algorithm that can be applied to all training data and normalizes the environmental distortion as part of the model training. In contrast to feature enhancement methods, NAT estimates the underlying "pseudo-clean" model parameters directly without relying on point estimates of the clean speech features as an intermediate...

10.1109/tasl.2010.2040522 article EN IEEE Transactions on Audio Speech and Language Processing 2010-03-30

Recent work in automatic recognition of conversational telephone speech (CTS) has achieved accuracy levels comparable to human transcribers, although there is some debate about how to precisely quantify human performance on this task, using the NIST 2000 CTS evaluation set. This raises the question of what systematic differences, if any, may be found differentiating human from machine transcription errors. In this paper we approach this question by comparing the output of our most accurate system to that of a standard vendor transcription pipeline. We find...

10.21437/interspeech.2017-1544 preprint EN Interspeech 2017 2017-08-16

In this work, we propose a novel and efficient minimum word error rate (MWER) training method for the RNN-Transducer (RNN-T). Unlike previous work on this topic, which performs on-the-fly limited-size beam-search decoding and generates alignment scores for expected edit-distance computation, our proposed method re-calculates and sums the scores of all the possible alignments for each hypothesis in the N-best lists. The hypothesis probability scores and back-propagated gradients are calculated efficiently using the forward-backward algorithm. Moreover, the proposed method allows...

10.21437/interspeech.2020-1557 article EN Interspeech 2020 2020-10-25
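A minimal sketch of an MWER-style objective over an N-best list: renormalize the hypothesis scores, then weight each hypothesis by its word errors relative to the mean. In the method above, the hypothesis log-probabilities would come from the forward-backward summation over all alignments; here they are simply given as inputs.

```python
import numpy as np

def mwer_loss(log_probs, word_errors):
    """Expected word errors under the renormalized N-best
    distribution, with the mean error subtracted as a baseline."""
    p = np.exp(log_probs - log_probs.max())
    p /= p.sum()                            # renormalize over the N-best
    errors = np.asarray(word_errors, dtype=float)
    return float(p @ (errors - errors.mean()))

# Toy usage: the loss is negative when probability mass already
# favors the low-error hypotheses.
loss = mwer_loss(np.array([-1.0, -2.0, -5.0]), [0, 2, 4])
```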

Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encoding using a contrastive loss in a way similar to wav2vec 2.0. However, the quantization process is regularized by an additional consistency network that learns to reconstruct the input features, as in a VQ-VAE model. The proposed self-supervised model is trained on 10k hours of unlabeled data and subsequently used as the speech encoder in an RNN-T ASR model and fine-tuned with...

10.21437/interspeech.2021-717 article EN Interspeech 2021 2021-08-27
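The contrastive part of this objective can be sketched as follows: at each masked frame, the context vector must pick out its own quantized target among the quantized vectors at the other masked frames. The consistency (reconstruction) term that distinguishes Wav2vec-C is omitted; the temperature and cosine similarity are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(context, quantized, masked_idx, temperature=0.1):
    """wav2vec 2.0-style contrastive objective: each masked frame's
    context vector must identify the true quantized target among the
    quantized vectors at the other masked frames (the distractors)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    loss = 0.0
    for t in masked_idx:
        sims = np.array([cos(context[t], quantized[j]) / temperature
                         for j in masked_idx])
        true_pos = list(masked_idx).index(t)
        # Negative log-softmax of the true target's similarity.
        loss += -(sims[true_pos] - np.log(np.sum(np.exp(sims))))
    return loss / len(masked_idx)

rng = np.random.default_rng(0)
ctx, q = rng.normal(size=(20, 32)), rng.normal(size=(20, 32))
loss = contrastive_loss(ctx, q, masked_idx=[3, 7, 11, 15])
```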