- Speech Recognition and Synthesis
- Speech and Audio Processing
- Music and Audio Processing
- Natural Language Processing Techniques
- Topic Modeling
- Speech and Dialogue Systems
- Phonetics and Phonology Research
- Blind Source Separation Techniques
- Advanced Adaptive Filtering Techniques
- Neural Networks and Applications
- Advanced Data Compression Techniques
- Language and Cultural Evolution
- Privacy, Security, and Data Protection
- Domain Adaptation and Few-Shot Learning
- Text Readability and Simplification
- Indoor and Outdoor Localization Technologies
- Law, AI, and Intellectual Property
- Digital Transformation in Law
- Advanced Image Fusion Techniques
- Image and Video Quality Assessment
- Markov Chains and Monte Carlo Methods
- Explainable Artificial Intelligence (XAI)
- Machine Learning and ELM
- Machine Learning in Healthcare
- Subtitles and Audiovisual Media
Microsoft (United States)
2016-2025
Microsoft (Finland)
2020-2025
Microsoft Research Asia (China)
2024
Laboratoire de Phonétique et Phonologie
2018-2024
Université Sorbonne Nouvelle
2024
Laboratoire de Physique des Plasmas
2024
Microsoft Research (United Kingdom)
2012-2023
Xuzhou Medical College
2023
Industrial and Commercial Bank of China
2022
China University of Geosciences
2022
Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances that shed light on the basic capabilities and limitations of current deep learning technology. We organize the overview along the feature-domain and model-domain dimensions, following the conventional approach to analyzing speech systems. Selected experimental results, including those from related applications such as spoken dialogue...
Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As the speech signal contains multi-faceted information, including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising during pre-training. By this means,...
In a deep neural network (DNN), the hidden layers can be considered as increasingly complex feature transformations, and the final softmax layer as a log-linear classifier that makes use of the most abstract features computed in the hidden layers. While the log-linear classifier should be different for different languages, the feature transformations can be shared across languages. In this paper we propose a shared-hidden-layer multilingual DNN (SHL-MDNN), in which the hidden layers are made common across many languages while the softmax layers remain language dependent. We demonstrate that the SHL-MDNN can reduce errors by 3-5%, relatively, across all...
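The shared-hidden-layer idea above can be sketched as a forward pass: one common hidden stack plus a per-language softmax head. This is a minimal NumPy illustration with made-up layer sizes and random weights, not the paper's trained model; `SHLMDNN`, the dimensions, and the language names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class SHLMDNN:
    """Shared-hidden-layer multilingual DNN sketch: the hidden stack is
    common to all languages; only the softmax layer is language specific."""
    def __init__(self, feat_dim, hidden_dims, senones_per_lang):
        dims = [feat_dim] + hidden_dims
        self.shared = [rng.standard_normal((i, o)) * 0.1
                       for i, o in zip(dims[:-1], dims[1:])]
        # one log-linear classifier (softmax head) per language
        self.heads = {lang: rng.standard_normal((hidden_dims[-1], n)) * 0.1
                      for lang, n in senones_per_lang.items()}

    def forward(self, x, lang):
        for w in self.shared:          # language-universal transformations
            x = relu(x @ w)
        return softmax(x @ self.heads[lang])

model = SHLMDNN(feat_dim=40, hidden_dims=[256, 256],
                senones_per_lang={"en": 100, "fr": 120})
frame = rng.standard_normal((1, 40))
p_en = model.forward(frame, "en")   # posterior over 100 English senones
p_fr = model.forward(frame, "fr")   # same hidden features, French head
```

Note how both languages reuse exactly the same `self.shared` matrices; only the small output head differs, which is what enables cross-lingual transfer.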
New waves of consumer-centric applications, such as voice search and voice interaction with mobile devices and home entertainment systems, increasingly require automatic speech recognition (ASR) to be robust to the full range of real-world noise and other acoustically distorting conditions. Despite its practical importance, however, the inherent links between and distinctions among the myriad methods for noise-robust ASR have yet to be carefully studied in order to advance the field further. To this end, it is critical to establish a solid,...
The recently proposed deep neural network (DNN) obtains significant accuracy improvements in many large vocabulary continuous speech recognition (LVCSR) tasks. However, a DNN requires many more parameters than traditional systems, which brings huge cost during online evaluation and also limits its application in a lot of scenarios. In this paper we present our new effort aiming at reducing the model size while keeping the accuracy improvements. We apply singular value decomposition (SVD) to the weight matrices in the DNN, and then...
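The SVD restructuring described above can be illustrated in a few lines: factorize a weight matrix, keep only the top singular values, and replace the matrix with a product of two much smaller ones. A minimal sketch, assuming a synthetic low-rank matrix and arbitrary sizes (512x512, rank 64); the rank in practice is chosen from the singular value spectrum of the trained model.

```python
import numpy as np

def svd_compress(w, rank):
    """Approximate weight matrix w (m x n) by a product of two low-rank
    matrices, keeping only the top `rank` singular values."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # m x rank (singular values folded in)
    b = vt[:rank, :]             # rank x n
    return a, b

rng = np.random.default_rng(0)
# Synthetic weight matrix that is exactly rank 64, so rank-64 SVD is lossless.
w = rng.standard_normal((512, 64)) @ rng.standard_normal((64, 512))
a, b = svd_compress(w, rank=64)

params_before = w.size               # 512 * 512 = 262144
params_after = a.size + b.size       # 2 * (512 * 64) = 65536
rel_err = np.linalg.norm(w - a @ b) / np.linalg.norm(w)
```

At inference time the layer computes `x @ a @ b` instead of `x @ w`, cutting both the parameter count and the matrix-multiply cost to a quarter in this example.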
The deep neural network (DNN) obtains significant accuracy improvements on many speech recognition tasks, and its power comes from the deep and wide network structure with a very large number of parameters. It becomes challenging when we deploy DNNs on devices that have limited computational and storage resources. The common practice is to train a DNN with a small number of hidden nodes and a small senone set using the standard training process, leading to accuracy loss. In this study, we propose to better address these issues by utilizing the DNN output distribution. To learn...
Recently, the speech community has been seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve state-of-the-art results on most benchmarks in terms of ASR accuracy, hybrid models are still used in a large proportion of commercial ASR systems at the current time. There are lots of practical factors that affect the production model deployment decision. Traditional hybrid models, having been optimized for production for decades, are usually good at these factors. Without...
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms. Most prior studies use pre-segmented audio signals, which are typically generated by mixing utterances on computers so that they fully overlap. Also, the algorithms have often been evaluated based on signal-based metrics such as signal-to-distortion ratio. However, in natural conversations, speech signals contain both overlapped and overlap-free regions. In addition, signal-based metrics have only weak correlation with automatic...
In the last few years, an emerging trend in automatic speech recognition research has been the study of end-to-end (E2E) systems. Connectionist Temporal Classification (CTC), Attention Encoder-Decoder (AED), and RNN Transducer (RNN-T) are the three most popular methods. Among these methods, RNN-T has advantages for online streaming, which is challenging for AED, and it does not have CTC's frame-independence assumption. In this paper, we improve RNN-T training in two aspects. First, we optimize the training algorithm to reduce memory consumption so...
In this paper, we summarize recent progress made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques. We first discuss models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) that can effectively exploit variable-length contextual information, and their various combinations with other models. We then describe models that are optimized end-to-end, with an emphasis on feature representations learned jointly with the rest of the system, such as connectionist temporal classification (CTC)...
A new type of end-to-end system for text-dependent speaker verification is presented in this paper. Previously, using a phonetic-discriminant or speaker-discriminant DNN as a feature extractor has shown promising results. The extracted frame-level (bottleneck, posterior, or d-vector) features are equally weighted and aggregated to compute an utterance-level representation (d-vector or i-vector). In this work we use a CNN to extract noise-robust frame-level features. These features are smartly combined to form an utterance-level vector through an attention...
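The contrast the abstract draws, between equal weighting and learned attention weights over frame-level features, can be sketched as a pooling function: each frame gets a scalar score, the scores are softmax-normalized, and the utterance vector is the weighted sum. This is a generic attentive-pooling sketch with a hypothetical learned vector `v`, not the paper's exact attention network.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pooling(frames, v):
    """Aggregate frame-level features (n_frames x dim) into one
    utterance-level vector using per-frame attention weights instead
    of equal weighting (which would be a plain mean)."""
    scores = frames @ v        # one scalar score per frame
    w = softmax(scores)        # normalized attention weights
    return w @ frames          # weighted sum -> (dim,)

frames = rng.standard_normal((50, 128))  # 50 frames of 128-dim CNN features
v = rng.standard_normal(128)             # hypothetical learned scoring vector
utt = attentive_pooling(frames, v)       # utterance-level speaker vector
```

With `v` set to all zeros the weights become uniform and the function degrades to mean pooling, which makes the role of attention easy to see.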
Recently, Transformer based end-to-end models have achieved great success in many areas including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key issue preventing their application. In this work, we explored the potential of the Transformer Transducer (T-T) for first-pass decoding with low latency and fast speed on a large-scale dataset. We combine the ideas of Transformer-XL and chunk-wise streaming processing to design a streamable model. We demonstrate that T-T...
We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the training data to 60K hours of English speech, which is hundreds of times larger than existing systems. VALL-E exhibits in-context learning capabilities and can be used to synthesize...
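The core reframing above, TTS as next-token prediction over discrete codec codes instead of waveform regression, can be shown with a toy autoregressive loop. This is a deliberately tiny sketch: `toy_logits` is a random stand-in for the Transformer decoder, the 32-entry codebook is far smaller than a real neural codec's, and nothing here reproduces VALL-E itself.

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK = 32   # toy codec vocabulary (real neural codecs use ~1024 codes)

def toy_logits(phonemes, codes):
    """Stand-in for a decoder network: scores the next acoustic code
    given the text (phoneme) prompt and the codes emitted so far."""
    h = sum(phonemes) + sum(codes)
    return rng.standard_normal(CODEBOOK) + 5.0 * (np.arange(CODEBOOK) == h % CODEBOOK)

def synthesize(phonemes, n_codes):
    """Conditional language modeling view of TTS: generate discrete
    codec codes autoregressively; a codec decoder (not shown) would
    turn the code sequence back into a waveform."""
    codes = []
    for _ in range(n_codes):
        logits = toy_logits(phonemes, codes)
        codes.append(int(np.argmax(logits)))   # greedy decoding
    return codes

codes = synthesize(phonemes=[3, 7, 7, 1], n_codes=10)
```

The key point the sketch makes is structural: the output is a sequence of integers from a fixed codebook, so standard language-model machinery (prompting, in-context conditioning, sampling) applies directly.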
Continuous speech separation was recently proposed to deal with overlapped speech in natural conversations. While it was shown to significantly improve recognition performance for multichannel conversation transcription, its effectiveness has yet to be proven in a single-channel recording scenario. This paper examines the use of the Conformer architecture in lieu of recurrent neural networks for the separation model. Conformer allows the model to efficiently capture both local and global context information, which is helpful for separation. Experimental...
Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
The large number of parameters in deep neural networks (DNNs) for automatic speech recognition (ASR) makes speaker adaptation very challenging. It also limits the use of speaker personalization due to the huge storage cost in large-scale deployments. In this paper we address the DNN adaptation and personalization issues by presenting two methods based on the singular value decomposition (SVD). The first method uses an SVD to replace the weight matrix of a speaker-independent DNN with the product of two low-rank matrices. Adaptation is then performed by updating a square matrix inserted between...
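The adaptation scheme described above can be sketched directly: factorize the speaker-independent weight matrix with an SVD, insert a small square matrix (initialized to identity) between the two low-rank factors, and let per-speaker adaptation update only that square matrix. The sizes (512x512, rank 128) and the 0.01 perturbation standing in for adaptation training are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def svd_factorize(w, rank):
    """Split w into two low-rank factors a (m x rank) and b (rank x n)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]

# Synthetic speaker-independent weight matrix of exact rank 128.
w = rng.standard_normal((512, 128)) @ rng.standard_normal((128, 512))
a, b = svd_factorize(w, rank=128)
recon_err = np.linalg.norm(a @ b - w) / np.linalg.norm(w)

# Speaker-independent model: y = x @ a @ I @ b (square matrix = identity).
# Adaptation updates only the small rank x rank matrix, so each speaker
# stores rank*rank values instead of a full 512*512 matrix.
s_speaker = np.eye(128) + 0.01 * rng.standard_normal((128, 128))

def adapted_forward(x):
    return x @ a @ s_speaker @ b

per_speaker_params = s_speaker.size   # 16384 vs 262144 for the full matrix
```

Storing one small square matrix per speaker is what makes large-scale personalization affordable: the shared factors `a` and `b` are kept once for everyone.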
Although advances in close-talk speech recognition have resulted in relatively low error rates, performance in far-field environments is still limited due to low signal-to-noise ratio, reverberation, and overlapped speech from simultaneous speakers, which is especially difficult. To solve these problems, beamforming and speech separation networks were previously proposed. However, they tend to suffer from leakage of interfering speech or limited generalizability. In this work, we propose a simple yet effective method for multi-channel...
High-accuracy speech recognition requires a large amount of transcribed data for supervised training. In the absence of such data, domain adaptation of a well-trained acoustic model can be performed, but even here, high accuracy usually requires significant labeled data from the target domain. In this work, we propose an approach to adaptation that does not require transcriptions but instead uses a corpus of unlabeled parallel data, consisting of pairs of samples from the source domain and the desired target domain. To perform adaptation, we employ teacher/student (T/S) learning, in which...
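The T/S objective the abstract refers to can be sketched as a frame-level KL divergence: the teacher runs on the source-domain sample, the student runs on the parallel target-domain sample, and the teacher's senone posteriors act as soft labels, so no transcriptions are needed. The shapes (8 frames, 100 senones) and random logits below are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ts_loss(teacher_logits, student_logits):
    """Teacher/student loss: mean per-frame KL divergence between the
    teacher's senone posteriors (on the source-domain audio) and the
    student's posteriors (on the parallel target-domain audio)."""
    p = softmax(teacher_logits)
    log_p = np.log(p)
    log_q = np.log(softmax(student_logits))
    return float(np.mean(np.sum(p * (log_p - log_q), axis=-1)))

rng = np.random.default_rng(0)
t = rng.standard_normal((8, 100))          # teacher logits: 8 frames x 100 senones
loss_match = ts_loss(t, t)                 # identical posteriors -> zero loss
loss_diff = ts_loss(t, rng.standard_normal((8, 100)))
```

Minimizing this loss pulls the student's posteriors on noisy/target audio toward the teacher's posteriors on the clean/source version of the same utterance, which is the mechanism that transfers the teacher's knowledge across domains.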
Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non-streaming and streaming modes. We use 65 thousand hours of Microsoft anonymized training data to train these models. As more...
Despite the fact that several sites have reported the effectiveness of convolutional neural networks (CNNs) on some tasks, there is no deep analysis regarding why CNNs perform well and in which cases we should see CNNs' advantage. In light of this, this paper aims to provide a detailed analysis of CNNs. By visualizing the localized filters learned in the convolutional layer, we show that edge detectors in varying directions can be automatically learned. We then identify four domains in which we think CNNs can consistently provide advantages over fully-connected DNNs:...
The context-dependent deep neural network hidden Markov model (CD-DNN-HMM) is a recently proposed acoustic model that has significantly outperformed Gaussian mixture model (GMM)-HMM systems in many large vocabulary speech recognition (LVSR) tasks. In this paper we present our strategy of using mixed-bandwidth training data to improve wideband accuracy in the CD-DNN-HMM framework. We show that DNNs provide the flexibility of using arbitrary features. By using Mel-scale log-filter-bank features we not only achieve higher accuracy than with MFCCs, but also...
Long short-term memory (LSTM) recurrent neural networks (RNNs) have recently shown significant performance improvements over deep feed-forward networks (DNNs). A key aspect of these models is the use of time recurrence, combined with a gating architecture that ameliorates the vanishing gradient problem. Inspired by human spectrogram reading, in this paper we propose an extension to LSTMs that performs recurrence in frequency as well as in time. This model first scans the frequency bands to generate a summary of the spectral information, and then...
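The two-stage recurrence described above, frequency first, then time, can be sketched with a simple tanh RNN cell standing in for the LSTM (the gating is omitted to keep the sketch short). All dimensions (10 frequency bands of 4 bins each, 16 hidden units, 20 time frames) are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_scan(seq, w_x, w_h):
    """Simple tanh RNN (a stand-in for an LSTM cell): returns the
    hidden state after each step of the input sequence."""
    h = np.zeros(w_h.shape[0])
    out = []
    for x in seq:
        h = np.tanh(w_x @ x + w_h @ h)
        out.append(h)
    return np.stack(out)

n_bands, band_dim, hid = 10, 4, 16
wfx = rng.standard_normal((hid, band_dim)) * 0.1   # frequency-RNN weights
wfh = rng.standard_normal((hid, hid)) * 0.1
wtx = rng.standard_normal((hid, hid)) * 0.1        # time-RNN weights
wth = rng.standard_normal((hid, hid)) * 0.1

def ft_rnn(spectrogram):
    """Frequency recurrence first: for each time frame, scan the
    frequency bands to build a spectral summary; then run the usual
    time recurrence over the sequence of summaries."""
    summaries = []
    for frame in spectrogram:              # frame: (n_bands, band_dim)
        f_states = rnn_scan(frame, wfx, wfh)
        summaries.append(f_states[-1])     # final state summarizes the spectrum
    return rnn_scan(np.stack(summaries), wtx, wth)

spec = rng.standard_normal((20, n_bands, band_dim))  # 20 time frames
out = ft_rnn(spec)                                   # (20, hid)
```

The design point is that the time recurrence never sees raw frequency bins; it consumes a learned spectral summary per frame, mirroring how a reader scans a spectrogram column before following it through time.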