- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Topic Modeling
- Speech and Dialogue Systems
- Phonetics and Phonology Research
- Blind Source Separation Techniques
- Advanced Adaptive Filtering Techniques
- Neural Networks and Applications
- Advanced Data Compression Techniques
- Language and Cultural Evolution
- Privacy, Security, and Data Protection
- Domain Adaptation and Few-Shot Learning
- Text Readability and Simplification
- Indoor and Outdoor Localization Technologies
- Law, AI, and Intellectual Property
- Digital Transformation in Law
- Advanced Image Fusion Techniques
- Image and Video Quality Assessment
- Markov Chains and Monte Carlo Methods
- Explainable Artificial Intelligence (XAI)
- Machine Learning and ELM
- Machine Learning in Healthcare
- Subtitles and Audiovisual Media
Microsoft (United States)
2016-2025
Microsoft (Finland)
2020-2025
Microsoft Research Asia (China)
2024
Laboratoire de Phonétique et Phonologie
2018-2024
Université Sorbonne Nouvelle
2024
Laboratoire de Physique des Plasmas
2024
Microsoft Research (United Kingdom)
2012-2023
Xuzhou Medical College
2023
Industrial and Commercial Bank of China
2022
China University of Geosciences
2022
Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances that shed light on the basic capabilities and limitations of current deep learning technology. We organize the overview along the feature-domain and model-domain dimensions, following the conventional approach to analyzing speech systems. Selected experimental results, including those from related applications such as spoken dialogue...
Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As the speech signal contains multi-faceted information, including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising during pre-training. By this means,...
In a deep neural network (DNN), the hidden layers can be considered as increasingly complex feature transformations, and the final softmax layer as a log-linear classifier that makes use of the most abstract features computed in the hidden layers. While the log-linear classifier should be different for different languages, the feature transformations can be shared across languages. In this paper we propose a shared-hidden-layer multilingual DNN (SHL-MDNN), in which the hidden layers are made common across many languages while the softmax layers remain language dependent. We demonstrate that the SHL-MDNN can reduce errors by 3-5%, relatively, across all...
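The shared-hidden-layer idea above can be sketched as a forward pass: one common hidden stack plus a per-language softmax head. This is a minimal NumPy illustration with made-up layer sizes and random weights, not the paper's trained model; `SHLMDNN`, the dimensions, and the language names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class SHLMDNN:
    """Shared-hidden-layer multilingual DNN sketch: the hidden stack is
    common to all languages; only the softmax layer is language specific."""
    def __init__(self, feat_dim, hidden_dims, senones_per_lang):
        dims = [feat_dim] + hidden_dims
        self.shared = [rng.standard_normal((i, o)) * 0.1
                       for i, o in zip(dims[:-1], dims[1:])]
        # one log-linear classifier (softmax head) per language
        self.heads = {lang: rng.standard_normal((hidden_dims[-1], n)) * 0.1
                      for lang, n in senones_per_lang.items()}

    def forward(self, x, lang):
        for w in self.shared:          # language-universal transformations
            x = relu(x @ w)
        return softmax(x @ self.heads[lang])

model = SHLMDNN(feat_dim=40, hidden_dims=[256, 256],
                senones_per_lang={"en": 100, "fr": 120})
frame = rng.standard_normal((1, 40))
p_en = model.forward(frame, "en")   # posterior over 100 English senones
p_fr = model.forward(frame, "fr")   # same hidden features, French head
```

Note how both languages reuse exactly the same `self.shared` matrices; only the small output head differs, which is what enables cross-lingual transfer.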
New waves of consumer-centric applications, such as voice search and voice interaction with mobile devices and home entertainment systems, increasingly require automatic speech recognition (ASR) to be robust to the full range of real-world noise and other acoustically distorting conditions. Despite its practical importance, however, the inherent links between and distinctions among the myriad methods for noise-robust ASR have yet to be carefully studied in order to advance the field further. To this end, it is critical to establish a solid,...
The recently proposed deep neural network (DNN) obtains significant accuracy improvements in many large vocabulary continuous speech recognition (LVCSR) tasks. However, a DNN requires many more parameters than traditional systems, which brings huge cost during online evaluation and also limits its application in a lot of scenarios. In this paper we present our new effort aiming at reducing the model size while keeping the accuracy improvements. We apply singular value decomposition (SVD) to the weight matrices in the DNN, and then...
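The SVD restructuring described above can be illustrated in a few lines: factorize a weight matrix, keep only the top singular values, and replace the matrix with a product of two much smaller ones. A minimal sketch, assuming a synthetic low-rank matrix and arbitrary sizes (512x512, rank 64); the rank in practice is chosen from the singular value spectrum of the trained model.

```python
import numpy as np

def svd_compress(w, rank):
    """Approximate weight matrix w (m x n) by a product of two low-rank
    matrices, keeping only the top `rank` singular values."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # m x rank (singular values folded in)
    b = vt[:rank, :]             # rank x n
    return a, b

rng = np.random.default_rng(0)
# Synthetic weight matrix that is exactly rank 64, so rank-64 SVD is lossless.
w = rng.standard_normal((512, 64)) @ rng.standard_normal((64, 512))
a, b = svd_compress(w, rank=64)

params_before = w.size               # 512 * 512 = 262144
params_after = a.size + b.size       # 2 * (512 * 64) = 65536
rel_err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
```

At inference time the layer computes `x @ a @ b` instead of `x @ w`, cutting both the parameter count and the matrix-multiply cost to a quarter in this example.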
The deep neural network (DNN) obtains significant accuracy improvements on many speech recognition tasks, and its power comes from the deep and wide network structure with a very large number of parameters. It becomes challenging when we deploy DNNs on devices that have limited computational and storage resources. The common practice is to train a DNN with a small number of hidden nodes and a small senone set using the standard training process, leading to accuracy loss. In this study, we propose to better address these issues by utilizing the DNN output distribution. To learn...
Recently, the speech community has been seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve state-of-the-art results on most benchmarks in terms of ASR accuracy, hybrid models are still used in a large proportion of commercial ASR systems at the current time. There are lots of practical factors that affect the production model deployment decision. Traditional hybrid models, having been optimized for production for decades, are usually good at these factors. Without...
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms. Most prior studies use pre-segmented audio signals, which are typically generated by mixing utterances on computers so that they fully overlap. Also, the algorithms have often been evaluated based on signal-based metrics such as signal-to-distortion ratio. However, in natural conversations, speech signals contain both overlapped and overlap-free regions. In addition, signal-based metrics have only weak correlation with automatic...
In the last few years, an emerging trend in automatic speech recognition research has been the study of end-to-end (E2E) systems. Connectionist Temporal Classification (CTC), Attention Encoder-Decoder (AED), and RNN Transducer (RNN-T) are the three most popular methods. Among these methods, RNN-T has advantages for online streaming, which is challenging for AED, and it does not have CTC's frame-independence assumption. In this paper, we improve RNN-T training in two aspects. First, we optimize the training algorithm to reduce memory consumption so...
In this paper, we summarize recent progress made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques. We first discuss models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) that can effectively exploit variable-length contextual information, and their various combinations with other models. We then describe models that are optimized end-to-end, with an emphasis on feature representations learned jointly with the rest of the system, such as connectionist temporal classification (CTC)...
A new type of end-to-end system for text-dependent speaker verification is presented in this paper. Previously, using a phonetic-discriminant or speaker-discriminant DNN as a feature extractor has shown promising results. The extracted frame-level (bottleneck, posterior, or d-vector) features are equally weighted and aggregated to compute an utterance-level representation (d-vector or i-vector). In this work we use a CNN to extract noise-robust frame-level features. These features are smartly combined to form an utterance-level vector through an attention...
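The contrast the abstract draws, between equal weighting and learned attention weights over frame-level features, can be sketched as a pooling function: each frame gets a scalar score, the scores are softmax-normalized, and the utterance vector is the weighted sum. This is a generic attentive-pooling sketch with a hypothetical learned vector `v`, not the paper's exact attention network.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pooling(frames, v):
    """Aggregate frame-level features (n_frames x dim) into one
    utterance-level vector using per-frame attention weights instead
    of equal weighting (which would be a plain mean)."""
    scores = frames @ v        # one scalar score per frame
    w = softmax(scores)        # normalized attention weights
    return w @ frames          # weighted sum -> (dim,)

frames = rng.standard_normal((50, 128))  # 50 frames of 128-dim CNN features
v = rng.standard_normal(128)             # hypothetical learned scoring vector
utt = attentive_pooling(frames, v)       # utterance-level speaker vector
```

With `v` set to all zeros the weights become uniform and the function degrades to mean pooling, which makes the role of attention easy to see.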
Recently, Transformer based end-to-end models have achieved great success in many areas including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key issue preventing their application. In this work, we explored the potential of the Transformer Transducer (T-T) for first-pass decoding with low latency and fast speed on a large-scale dataset. We combine the ideas of Transformer-XL and chunk-wise streaming processing to design a streamable model. We demonstrate that T-T...
We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the training data to 60K hours of English speech, which is hundreds of times larger than existing systems. VALL-E exhibits in-context learning capabilities and can be used to synthesize...
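The core reframing above, TTS as next-token prediction over discrete codec codes instead of waveform regression, can be shown with a toy autoregressive loop. This is a deliberately tiny sketch: `toy_logits` is a random stand-in for the Transformer decoder, the 32-entry codebook is far smaller than a real neural codec's, and nothing here reproduces VALL-E itself.

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK = 32   # toy codec vocabulary (real neural codecs use ~1024 codes)

def toy_logits(phonemes, codes):
    """Stand-in for a decoder network: scores the next acoustic code
    given the text (phoneme) prompt and the codes emitted so far."""
    h = sum(phonemes) + sum(codes)
    return rng.standard_normal(CODEBOOK) + 5.0 * (np.arange(CODEBOOK) == h % CODEBOOK)

def synthesize(phonemes, n_codes):
    """Conditional language modeling view of TTS: generate discrete
    codec codes autoregressively; a codec decoder (not shown) would
    turn the code sequence back into a waveform."""
    codes = []
    for _ in range(n_codes):
        logits = toy_logits(phonemes, codes)
        codes.append(int(np.argmax(logits)))   # greedy decoding
    return codes

codes = synthesize(phonemes=[3, 7, 7, 1], n_codes=10)
```

The key point the sketch makes is structural: the output is a sequence of integers from a fixed codebook, so standard language-model machinery (prompting, in-context conditioning, sampling) applies directly.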
Continuous speech separation was recently proposed to deal with overlapped speech in natural conversations. While it was shown to significantly improve recognition performance for multichannel conversation transcription, its effectiveness has yet to be proven in a single-channel recording scenario. This paper examines the use of the Conformer architecture in lieu of recurrent neural networks for the separation model. Conformer allows the model to efficiently capture both local and global context information, which is helpful for separation. Experimental...
Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
The large number of parameters in deep neural networks (DNNs) for automatic speech recognition (ASR) makes speaker adaptation very challenging. It also limits the use of speaker personalization due to the huge storage cost in large-scale deployments. In this paper we address the DNN adaptation and personalization issues by presenting two methods based on the singular value decomposition (SVD). The first method uses an SVD to replace the weight matrix of a speaker-independent DNN with the product of two low-rank matrices. Adaptation is then performed by updating a square matrix inserted between...
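The adaptation scheme described above can be sketched directly: factorize the speaker-independent weight matrix with an SVD, insert a small square matrix (initialized to identity) between the two low-rank factors, and let per-speaker adaptation update only that square matrix. The sizes (512x512, rank 128) and the 0.01 perturbation standing in for adaptation training are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def svd_factorize(w, rank):
    """Split w into two low-rank factors a (m x rank) and b (rank x n)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]

# Synthetic speaker-independent weight matrix of exact rank 128.
w = rng.standard_normal((512, 128)) @ rng.standard_normal((128, 512))
a, b = svd_factorize(w, rank=128)
recon_err = np.linalg.norm(a @ b - w) / np.linalg.norm(w)

# Speaker-independent model: y = x @ a @ I @ b (square matrix = identity).
# Adaptation updates only the small rank x rank matrix, so each speaker
# stores rank*rank values instead of a full 512*512 matrix.
s_speaker = np.eye(128) + 0.01 * rng.standard_normal((128, 128))

def adapted_forward(x):
    return x @ a @ s_speaker @ b

per_speaker_params = s_speaker.size   # 16384 vs 262144 for the full matrix
```

Storing one small square matrix per speaker is what makes large-scale personalization affordable: the shared factors `a` and `b` are kept once for everyone.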
Although advances in close-talk speech recognition have resulted in relatively low error rates, performance in far-field environments is still limited due to low signal-to-noise ratio, reverberation, and overlapped speech from simultaneous speakers, which is especially difficult. To solve these problems, beamforming and speech separation networks were previously proposed. However, they tend to suffer from leakage of interfering speech or limited generalizability. In this work, we propose a simple yet effective method for multi-channel...
High-accuracy speech recognition requires a large amount of transcribed data for supervised training. In the absence of such data, domain adaptation of a well-trained acoustic model can be performed, but even here, high accuracy usually requires significant labeled data from the target domain. In this work, we propose an approach to adaptation that does not require transcriptions but instead uses a corpus of unlabeled parallel data, consisting of pairs of samples from the source domain and the desired target domain. To perform adaptation, we employ teacher/student (T/S) learning, in which...
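The T/S objective the abstract refers to can be sketched as a frame-level KL divergence: the teacher runs on the source-domain sample, the student runs on the parallel target-domain sample, and the teacher's senone posteriors act as soft labels, so no transcriptions are needed. The shapes (8 frames, 100 senones) and random logits below are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ts_loss(teacher_logits, student_logits):
    """Teacher/student loss: mean per-frame KL divergence between the
    teacher's senone posteriors (on the source-domain audio) and the
    student's posteriors (on the parallel target-domain audio)."""
    p = softmax(teacher_logits)
    log_p = np.log(p)
    log_q = np.log(softmax(student_logits))
    return float(np.mean(np.sum(p * (log_p - log_q), axis=-1)))

rng = np.random.default_rng(0)
t = rng.standard_normal((8, 100))          # teacher logits: 8 frames x 100 senones
loss_match = ts_loss(t, t)                 # identical posteriors -> zero loss
loss_diff = ts_loss(t, rng.standard_normal((8, 100)))
```

Minimizing this loss pulls the student's posteriors on noisy/target audio toward the teacher's posteriors on the clean/source version of the same utterance, which is the mechanism that transfers the teacher's knowledge across domains.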
Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non-streaming and streaming modes. We use 65 thousand hours of Microsoft anonymized training data to train these models. As more...
Despite the fact that several sites have reported the effectiveness of convolutional neural networks (CNNs) on some tasks, there is no deep analysis regarding why CNNs perform well and in which cases we should see CNNs' advantage. In light of this, this paper aims to provide a detailed analysis of CNNs. By visualizing the localized filters learned in the convolutional layer, we show that edge detectors in varying directions can be automatically learned. We then identify four domains in which we think CNNs can consistently provide advantages over fully-connected DNNs:...
The context-dependent deep neural network hidden Markov model (CD-DNN-HMM) is a recently proposed acoustic model that has significantly outperformed Gaussian mixture model (GMM)-HMM systems in many large vocabulary speech recognition (LVSR) tasks. In this paper we present our strategy of using mixed-bandwidth training data to improve wideband accuracy in the CD-DNN-HMM framework. We show that DNNs provide the flexibility of using arbitrary features. By using Mel-scale log-filter-bank features we not only achieve higher accuracy than with MFCCs, but also...
Long short-term memory (LSTM) recurrent neural networks (RNNs) have recently shown significant performance improvements over deep feed-forward networks (DNNs). A key aspect of these models is the use of time recurrence, combined with a gating architecture that ameliorates the vanishing gradient problem. Inspired by human spectrogram reading, in this paper we propose an extension to LSTMs that performs recurrence in frequency as well as in time. This model first scans the frequency bands to generate a summary of the spectral information, and then...
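The two-stage recurrence described above, frequency first, then time, can be sketched with a simple tanh RNN cell standing in for the LSTM (the gating is omitted to keep the sketch short). All dimensions (10 frequency bands of 4 bins each, 16 hidden units, 20 time frames) are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_scan(seq, w_x, w_h):
    """Simple tanh RNN (a stand-in for an LSTM cell): returns the
    hidden state after each step of the input sequence."""
    h = np.zeros(w_h.shape[0])
    out = []
    for x in seq:
        h = np.tanh(w_x @ x + w_h @ h)
        out.append(h)
    return np.stack(out)

n_bands, band_dim, hid = 10, 4, 16
wfx = rng.standard_normal((hid, band_dim)) * 0.1   # frequency-RNN weights
wfh = rng.standard_normal((hid, hid)) * 0.1
wtx = rng.standard_normal((hid, hid)) * 0.1        # time-RNN weights
wth = rng.standard_normal((hid, hid)) * 0.1

def ft_rnn(spectrogram):
    """Frequency recurrence first: for each time frame, scan the
    frequency bands to build a spectral summary; then run the usual
    time recurrence over the sequence of summaries."""
    summaries = []
    for frame in spectrogram:              # frame: (n_bands, band_dim)
        f_states = rnn_scan(frame, wfx, wfh)
        summaries.append(f_states[-1])     # final state summarizes the spectrum
    return rnn_scan(np.stack(summaries), wtx, wth)

spec = rng.standard_normal((20, n_bands, band_dim))  # 20 time frames
out = ft_rnn(spec)                                   # (20, hid)
```

The design point is that the time recurrence never sees raw frequency bins; it consumes a learned spectral summary per frame, mirroring how a reader scans a spectrogram column before following it through time.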