Jinyu Li

ORCID: 0000-0002-1089-9748
Research Areas
  • Speech Recognition and Synthesis
  • Speech and Audio Processing
  • Music and Audio Processing
  • Natural Language Processing Techniques
  • Topic Modeling
  • Speech and Dialogue Systems
  • Phonetics and Phonology Research
  • Blind Source Separation Techniques
  • Advanced Adaptive Filtering Techniques
  • Neural Networks and Applications
  • Advanced Data Compression Techniques
  • Language and Cultural Evolution
  • Privacy, Security, and Data Protection
  • Domain Adaptation and Few-Shot Learning
  • Text Readability and Simplification
  • Indoor and Outdoor Localization Technologies
  • Law, AI, and Intellectual Property
  • Digital Transformation in Law
  • Advanced Image Fusion Techniques
  • Image and Video Quality Assessment
  • Markov Chains and Monte Carlo Methods
  • Explainable Artificial Intelligence (XAI)
  • Machine Learning and ELM
  • Machine Learning in Healthcare
  • Subtitles and Audiovisual Media

Microsoft (United States)
2016-2025

Microsoft (Finland)
2020-2025

Microsoft Research Asia (China)
2024

Laboratoire de Phonétique et Phonologie
2018-2024

Université Sorbonne Nouvelle
2024

Laboratoire de Physique des Plasmas
2024

Microsoft Research (United Kingdom)
2012-2023

Xuzhou Medical College
2023

Industrial and Commercial Bank of China
2022

China University of Geosciences
2022

Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances which shed light on the basic capabilities and limitations of the current deep learning technology. We organize the overview along the feature-domain and model-domain dimensions according to the conventional approach to analyzing speech systems. Selected experimental results, including speech recognition and related applications such as spoken dialogue...

10.1109/icassp.2013.6639345 article EN IEEE International Conference on Acoustics Speech and Signal Processing 2013-05-01

Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As the speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising in pre-training. By this means,...

10.1109/jstsp.2022.3188113 article EN IEEE Journal of Selected Topics in Signal Processing 2022-07-04

In a deep neural network (DNN), the hidden layers can be considered as increasingly complex feature transformations and the final softmax layer as a log-linear classifier making use of the most abstract features computed in the hidden layers. While the log-linear classifier should be different for different languages, the feature transformations can be shared across languages. In this paper we propose a shared-hidden-layer multilingual DNN (SHL-MDNN), in which the hidden layers are made common across many languages while the softmax layers are made language dependent. We demonstrate that the SHL-MDNN can reduce errors by 3-5%, relatively, for all...

10.1109/icassp.2013.6639081 article EN IEEE International Conference on Acoustics Speech and Signal Processing 2013-05-01
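
The shared-hidden-layer idea above can be sketched in a few lines: hidden weights are shared across languages while each language keeps its own softmax layer. All sizes and parameter values below are hypothetical placeholders, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: 40-dim features, two shared hidden layers,
# and a language-dependent senone set per softmax head.
SHARED = [rng.standard_normal((40, 64)) * 0.1,
          rng.standard_normal((64, 64)) * 0.1]
HEADS = {  # language-dependent softmax layers
    "en": rng.standard_normal((64, 100)) * 0.1,
    "fr": rng.standard_normal((64, 120)) * 0.1,
}

def forward(features, lang):
    """Shared hidden layers act as a language-universal feature transform;
    only the final softmax layer is language specific."""
    h = features
    for w in SHARED:
        h = relu(h @ w)
    return softmax(h @ HEADS[lang])

frames = rng.standard_normal((5, 40))
print(forward(frames, "en").shape)  # (5, 100)
print(forward(frames, "fr").shape)  # (5, 120)
```

Training all languages through the same hidden stack while switching heads is what lets the hidden layers learn a representation transferable across languages.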

New waves of consumer-centric applications, such as voice search and voice interaction with mobile devices and home entertainment systems, increasingly require automatic speech recognition (ASR) to be robust to the full range of real-world noise and other acoustic distorting conditions. Despite its practical importance, however, the inherent links between and distinctions among the myriad methods for noise-robust ASR have yet to be carefully studied in order to advance the field further. To this end, it is critical to establish a solid,...

10.1109/taslp.2014.2304637 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2014-02-05

The recently proposed deep neural network (DNN) obtains significant accuracy improvements in many large vocabulary continuous speech recognition (LVCSR) tasks. However, the DNN requires many more parameters than traditional systems, which brings huge cost during online evaluation and also limits its application in a lot of scenarios. In this paper we present our new effort aiming at reducing the model size while keeping the accuracy improvements. We apply singular value decomposition (SVD) to the weight matrices of the DNN, and then...

10.21437/interspeech.2013-552 article EN Interspeech 2013 2013-08-25
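
A minimal NumPy sketch of the SVD restructuring described above: keep the k largest singular values and replace one large weight matrix with two thin factors. The layer size and rank below are illustrative, not the paper's settings.

```python
import numpy as np

def svd_compress(W, k):
    """Replace an m-by-n weight matrix W with two low-rank factors
    A (m-by-k) and B (k-by-n), keeping the k largest singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # absorb singular values into the left factor
    B = Vt[:k, :]
    return A, B

rng = np.random.default_rng(1)
W = rng.standard_normal((2048, 2048))  # hypothetical hidden-layer matrix
A, B = svd_compress(W, 256)
orig = W.size
comp = A.size + B.size
print(f"parameters: {orig} -> {comp} ({comp / orig:.1%})")
```

Replacing the layer `h @ W` with `(h @ A) @ B` is mathematically a rank-k approximation of the original transform, which is why the accuracy can be largely recovered after fine-tuning the factored model.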

The deep neural network (DNN) obtains significant accuracy improvements on many speech recognition tasks, and its power comes from the deep and wide network structure with a very large number of parameters. It becomes challenging when we deploy the DNN on devices which have limited computational and storage resources. The common practice is to train a DNN with a small number of hidden nodes and a small senone set using the standard training process, leading to accuracy loss. In this study, we propose to better address these issues by utilizing the output distribution. To learn...

10.21437/interspeech.2014-432 article EN Interspeech 2014 2014-09-14
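
One common way to realize "utilizing the output distribution" is to train the small DNN toward the large DNN's softened posteriors. The KL-divergence sketch below illustrates that idea under the assumption of temperature-softened softmax targets; the temperature value is a placeholder, not the paper's setting.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_to_teacher(student_logits, teacher_logits, T=2.0):
    """Frame-level KL divergence between the teacher's softened output
    distribution and the student's: the training signal that stands in
    for hard senone labels when learning a small model from a large one."""
    p = softmax(teacher_logits, T)  # soft targets from the large DNN
    q = softmax(student_logits, T)  # small DNN being trained
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(2)
teacher = rng.standard_normal((10, 50))
print(kl_to_teacher(teacher, teacher))               # 0.0 for identical outputs
print(kl_to_teacher(teacher * 2.0, teacher) > 0.0)   # True: mismatch is penalized
```

Minimizing this divergence pushes the small model to mimic the full posterior of the large one, which carries more information than one-hot labels.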

Recently, the speech community has seen a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve state-of-the-art results in most benchmarks in terms of ASR accuracy, hybrid models are still used in a large proportion of commercial systems at the current time. There are lots of practical factors that affect the production model deployment decision. Traditional hybrid models, having been optimized for production for decades, are usually good at these factors. Without...

10.1561/116.00000050 article EN cc-by-nc APSIPA Transactions on Signal and Information Processing 2022-01-01

This paper describes a dataset and protocols for evaluating continuous speech separation algorithms. Most prior studies use pre-segmented audio signals, which are typically generated by mixing utterances on computers so that they fully overlap. Also, the algorithms have often been evaluated based on signal-based metrics such as signal-to-distortion ratio. However, in natural conversations, speech signals contain both overlapped and overlap-free regions. In addition, signal-based metrics have only a weak correlation with automatic...

10.1109/icassp40776.2020.9053426 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09
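
The signal-to-distortion ratio mentioned above can be computed, in its simplest form, as the energy ratio between the reference signal and the residual error. The sketch below uses this basic definition rather than the full BSS-eval decomposition, with a synthetic tone standing in for real speech.

```python
import numpy as np

def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB, in its simplest energy-ratio form:
    one of the signal-based metrics the paper argues correlates only weakly
    with ASR accuracy on real conversations."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

# Synthetic example: a 440 Hz tone corrupted by a weaker 1 kHz tone.
t = np.linspace(0.0, 1.0, 16000)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * np.sin(2 * np.pi * 1000 * t)
print(round(sdr_db(clean, noisy), 1))  # roughly 20 dB for this mixture
```

A separation front-end can score well on such a metric while leaving exactly the kinds of distortion an ASR back-end is sensitive to, which motivates the paper's ASR-based evaluation protocol.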

In the last few years, an emerging trend in automatic speech recognition research has been the study of end-to-end (E2E) systems. Connectionist Temporal Classification (CTC), Attention Encoder-Decoder (AED), and the RNN Transducer (RNN-T) are the three most popular methods. Among these methods, RNN-T has the advantage of supporting online streaming, which is challenging for AED, and it doesn't have CTC's frame-independence assumption. In this paper, we improve RNN-T training in two aspects. First, we optimize the training algorithm to reduce memory consumption so...

10.1109/asru46091.2019.9003906 article EN 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019-12-01

In this paper, we summarize recent progress made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques. We first discuss models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) that can effectively exploit variable-length contextual information, and their various combinations with other models. We then describe models that are optimized end-to-end, with emphasis on feature representations learned jointly with the rest of the system, such as connectionist temporal classification (CTC)...

10.1109/jas.2017.7510508 article EN IEEE/CAA Journal of Automatica Sinica 2017-01-01

A new type of end-to-end system for text-dependent speaker verification is presented in this paper. Previously, using a phonetically discriminative/speaker discriminative DNN as a feature extractor has shown promising results. The extracted frame-level (bottleneck, posterior or d-vector) features are equally weighted and aggregated to compute an utterance-level representation (d-vector or i-vector). In this work we use a CNN to extract noise-robust frame-level features. These features are smartly combined to form an utterance-level speaker vector through an attention...

10.1109/slt.2016.7846261 article EN 2016 IEEE Spoken Language Technology Workshop (SLT) 2016-12-01
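
The attention-based aggregation described above replaces equal weighting of frame-level features with a learned weighted sum. Below is a minimal sketch with a single hypothetical attention parameter vector; the paper's actual attention mechanism may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pool(frame_feats, w):
    """Combine frame-level features into one utterance-level vector using
    attention weights instead of equal weighting (plain averaging)."""
    scores = frame_feats @ w  # one scalar relevance score per frame
    alpha = softmax(scores)   # attention weights, summing to 1
    return alpha @ frame_feats  # weighted sum -> utterance-level vector

rng = np.random.default_rng(3)
frames = rng.standard_normal((100, 64))  # 100 frames of 64-dim features
w = rng.standard_normal(64)              # hypothetical attention parameter
utt = attentive_pool(frames, w)
print(utt.shape)  # (64,)
```

With a zero parameter vector the weights become uniform and the pooling reduces to plain averaging, which makes the learned version a strict generalization of the equal-weighting baseline.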

Recently, Transformer based end-to-end models have achieved great success in many areas including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key issue preventing their application. In this work, we explore the potential of the Transformer Transducer (T-T) for first-pass decoding with low latency and fast speed on a large-scale dataset. We combine the idea of Transformer-XL and chunk-wise streaming processing to design a streamable model. We demonstrate that T-T...

10.1109/icassp39728.2021.9413535 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. Vall-E emerges with in-context learning capabilities and can be used to synthesize...

10.48550/arxiv.2301.02111 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Continuous speech separation was recently proposed to deal with the overlapped speech in natural conversations. While it was shown to significantly improve recognition performance for multichannel conversation transcription, its effectiveness has yet to be proven in a single-channel recording scenario. This paper examines the use of the Conformer architecture in lieu of recurrent neural networks for the separation model. Conformer allows the model to efficiently capture both local and global context information, which is helpful for speech separation. Experimental...

10.1109/icassp39728.2021.9413423 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13

Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.

10.18653/v1/2022.acl-long.393 article EN cc-by Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022-01-01

The large number of parameters in deep neural networks (DNN) for automatic speech recognition (ASR) makes speaker adaptation very challenging. It also limits the use of speaker personalization due to the huge storage cost in large-scale deployments. In this paper we address the DNN adaptation and personalization issues by presenting two methods based on the singular value decomposition (SVD). The first method uses an SVD to replace the weight matrix of a speaker-independent DNN with the product of two low-rank matrices. Adaptation is then performed by updating a square matrix inserted between...

10.1109/icassp.2014.6854828 article EN IEEE International Conference on Acoustics Speech and Signal Processing 2014-05-01
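
A sketch of the first SVD-based method: the speaker-independent weight matrix is replaced by low-rank factors with a small square matrix in between, and only that square matrix is updated per speaker. Sizes are illustrative, and initializing the square matrix from the singular values is one natural choice rather than necessarily the paper's exact recipe.

```python
import numpy as np

def svd_bottleneck(W, k):
    """Factor a speaker-independent weight matrix as W ~= A @ S @ B, where
    S is a small k-by-k matrix. Speaker adaptation then updates only S,
    so each speaker stores k*k values instead of a full matrix."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k]
    S = np.diag(s[:k])  # the per-speaker adapted block
    B = Vt[:k, :]
    return A, S, B

rng = np.random.default_rng(4)
W = rng.standard_normal((1024, 1024))  # hypothetical layer size
A, S, B = svd_bottleneck(W, 128)
print(S.size / W.size)  # per-speaker storage fraction: 0.015625
```

Storing only `S` per speaker is what makes large-scale personalization affordable: here each speaker costs about 1.6% of the full matrix.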

Although advances in close-talk speech recognition have resulted in relatively low error rates, performance in far-field environments is still limited due to low signal-to-noise ratio, reverberation, and overlapped speech from simultaneous speakers, which is especially difficult. To solve these problems, beamforming and speech separation networks were previously proposed. However, they tend to suffer from leakage of interfering speech or limited generalizability. In this work, we propose a simple yet effective method for multi-channel...

10.1109/slt.2018.8639593 article EN 2018 IEEE Spoken Language Technology Workshop (SLT) 2018-12-01

High accuracy speech recognition requires a large amount of transcribed data for supervised training. In the absence of such data, domain adaptation of a well-trained acoustic model can be performed, but even here, high accuracy usually requires significant labeled data from the target domain. In this work, we propose an approach to adaptation that does not require transcriptions but instead uses a corpus of unlabeled parallel data, consisting of pairs of samples from the source and the desired target domain. To perform adaptation, we employ teacher/student (T/S) learning, in which...

10.21437/interspeech.2017-519 preprint EN Interspeech 2017 2017-08-16
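
The T/S objective can be sketched as follows: the student's posteriors on a target-domain sample are pushed toward the teacher's posteriors on the parallel source-domain sample, so no transcriptions are needed. This is a schematic loss computation with random stand-in logits, not the paper's training recipe.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ts_loss(teacher_logits, student_logits):
    """Cross-entropy between the teacher's posteriors (on the source-domain
    sample) and the student's posteriors (on the parallel target-domain
    sample). No transcription of either sample is used."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return float(-np.mean(np.sum(p * np.log(q), axis=-1)))

rng = np.random.default_rng(6)
clean_logits = rng.standard_normal((20, 100))                       # teacher, clean speech
noisy_logits = clean_logits + 0.5 * rng.standard_normal((20, 100))  # student, parallel noisy speech
# Cross-entropy is minimized exactly when the student matches the teacher.
print(ts_loss(clean_logits, noisy_logits) >= ts_loss(clean_logits, clean_logits))  # True
```

Because the supervision is the teacher's own posterior distribution, any untranscribed parallel corpus (e.g. clean speech plus its artificially noised copy) can drive the adaptation.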

Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non-streaming and streaming modes. We use 65 thousand hours of Microsoft anonymized training data to train these models. As more...

10.21437/interspeech.2020-2846 article EN Interspeech 2020 2020-10-25

Despite the fact that several sites have reported the effectiveness of convolutional neural networks (CNNs) on some tasks, there is no deep analysis regarding why CNNs perform well and in which cases we should see CNNs' advantage. In light of this, this paper aims to provide a detailed analysis of CNNs. By visualizing the localized filters learned in the convolutional layer, we show that edge detectors in varying directions can be automatically learned. We then identify four domains in which we think CNNs have consistent advantages over fully-connected DNNs:...

10.1109/icassp.2015.7178920 article EN IEEE International Conference on Acoustics Speech and Signal Processing 2015-04-01

The context-dependent deep neural network hidden Markov model (CD-DNN-HMM) is a recently proposed acoustic model that has significantly outperformed Gaussian mixture model (GMM)-HMM systems in many large vocabulary speech recognition (LVSR) tasks. In this paper we present our strategy of using mixed-bandwidth training data to improve wideband speech recognition accuracy in the CD-DNN-HMM framework. We show that DNNs provide the flexibility of using arbitrary features. By using Mel-scale log-filter bank features we not only achieve higher accuracy than with MFCCs, but also...

10.1109/slt.2012.6424210 article EN 2012 IEEE Spoken Language Technology Workshop (SLT) 2012-12-01

Long short-term memory (LSTM) recurrent neural networks (RNNs) have recently shown significant performance improvements over deep feed-forward networks (DNNs). A key aspect of these models is the use of time recurrence, combined with a gating architecture that ameliorates the vanishing gradient problem. Inspired by human spectrogram reading, in this paper we propose an extension to LSTMs that performs the recurrence in frequency as well as in time. This model first scans the frequency bands to generate a summary of the spectral information, and then...

10.1109/asru.2015.7404793 article EN 2015 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2015-12-01
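
The two-stage recurrence can be sketched as follows, with a plain tanh recurrence standing in for the LSTM cell and all sizes chosen arbitrarily: scan the frequency bands within each frame to produce a spectral summary, then run a time recurrence over those per-frame summaries.

```python
import numpy as np

rng = np.random.default_rng(5)
H = 32  # hidden size (hypothetical)

# Hypothetical parameters; a simple tanh recurrence stands in for the LSTM cell.
Wf_x, Wf_h = rng.standard_normal((8, H)) * 0.1, rng.standard_normal((H, H)) * 0.1
Wt_x, Wt_h = rng.standard_normal((H, H)) * 0.1, rng.standard_normal((H, H)) * 0.1

def scan(xs, Wx, Wh):
    """Run a recurrence over the first axis of xs and return the last state."""
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(x @ Wx + h @ Wh)
    return h

def freq_then_time(spectrogram, band=8):
    """Scan frequency bands within each frame to summarize spectral
    structure, then run a time recurrence over the per-frame summaries."""
    summaries = []
    for frame in spectrogram:                     # frame: (freq,)
        bands = frame.reshape(-1, band)           # chunk into frequency bands
        summaries.append(scan(bands, Wf_x, Wf_h)) # frequency recurrence
    return scan(np.array(summaries), Wt_x, Wt_h)  # time recurrence

spec = rng.standard_normal((50, 40))  # 50 frames x 40 frequency bins
print(freq_then_time(spec).shape)  # (32,)
```

The frequency scan gives the time recurrence a compact, ordering-aware view of each frame's spectrum, which is the structural idea behind the F-T LSTM.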