Zhehuai Chen

ORCID: 0000-0003-4400-5340
Research Areas
  • Speech Recognition and Synthesis
  • Natural Language Processing Techniques
  • Topic Modeling
  • Speech and Audio Processing
  • Music and Audio Processing
  • Speech and dialogue systems
  • Algorithms and Data Compression
  • Image Processing and 3D Reconstruction
  • DNA and Biological Computing
  • Time Series Analysis and Forecasting
  • Hate Speech and Cyberbullying Detection
  • Anomaly Detection Techniques and Applications
  • Text Readability and Simplification
  • Phonetics and Phonology Research
  • Domain Adaptation and Few-Shot Learning
  • Intelligent Tutoring Systems and Adaptive Learning
  • Organoboron and organosilicon chemistry
  • Handwritten Text Recognition Techniques
  • Genomics and Phylogenetic Studies
  • Cellular Automata and Applications
  • Advanced biosensing and bioanalysis techniques
  • Network Security and Intrusion Detection

Nvidia (United States)
2024

Google (United States)
2020-2023

Shanghai Jiao Tong University
2014-2021

Institute for Language and Speech Processing
2021

Johns Hopkins University
2021

Menlo School
2019

Meta (United States)
2019

Shanghai Municipal Education Commission
2017-2018

Microsoft (United States)
2018

We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite...

10.48550/arxiv.2303.01037 preprint EN other-oa arXiv (Cornell University) 2023-01-01
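The random-projection quantization mentioned in this abstract (in the style of BEST-RQ) can be illustrated with a minimal sketch: a frozen random projection and a frozen random codebook map each speech frame to a discrete id that serves as a masked-prediction target. The names, dimensions, and cosine-similarity lookup below are illustrative assumptions, not the USM implementation.

```python
import numpy as np

# Minimal sketch of random-projection quantization (BEST-RQ style):
# a frozen random projection plus a frozen random codebook turn each
# speech frame into a discrete target id for masked prediction.
# Shapes and names are illustrative only.

rng = np.random.default_rng(0)
feat_dim, proj_dim, codebook_size = 80, 16, 8192

# Both the projection matrix and the codebook are randomly initialized
# and never trained.
projection = rng.normal(size=(feat_dim, proj_dim))
codebook = rng.normal(size=(codebook_size, proj_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map (T, feat_dim) speech features to (T,) discrete target ids."""
    projected = frames @ projection                    # (T, proj_dim)
    projected /= np.linalg.norm(projected, axis=1, keepdims=True)
    # Nearest codebook entry by cosine similarity (dot product of unit vectors).
    return np.argmax(projected @ codebook.T, axis=1)   # (T,)

targets = quantize(rng.normal(size=(100, feat_dim)))   # e.g. 100 frames
print(targets.shape, targets[:5])
```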

We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task. Previous work either implicitly enforced the representations from these two modalities to be in the same space through multitasking and parameter sharing, or explicitly through conversion of modalities via speech synthesis. While the former suffers from interference between...

10.21437/interspeech.2022-10937 article EN Interspeech 2022 2022-09-16

Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). Permutation invariant training (PIT) is a state-of-the-art model-based approach, which applies a single neural network to solve this single-input, multiple-output modeling problem. We propose to advance the current state of the art by imposing a modular structure on the neural network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion. The modular structure splits the problem into...

10.1109/taslp.2017.2765834 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2017-10-23
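The permutation invariant training referenced above can be sketched as follows: for a fixed number of speakers, the loss is evaluated under every assignment of output streams to reference label sequences, and only the cheapest assignment is optimized. The toy shapes and the cross-entropy criterion below are assumptions for illustration, not the paper's exact recipe.

```python
import itertools
import torch
import torch.nn.functional as F

# Minimal sketch of permutation invariant training (PIT) for two
# overlapped speakers: compute the loss under every speaker-to-output
# assignment and back-propagate only the cheapest one.

def pit_loss(outputs, targets):
    """outputs: list of (T, num_classes) logits; targets: list of (T,) labels."""
    num_spk = len(targets)
    best = None
    for perm in itertools.permutations(range(num_spk)):
        loss = sum(
            F.cross_entropy(outputs[o], targets[t])
            for o, t in enumerate(perm)
        ) / num_spk
        best = loss if best is None else torch.minimum(best, loss)
    return best

# Toy usage: two output streams, two reference label sequences.
T, C = 50, 100
outs = [torch.randn(T, C, requires_grad=True) for _ in range(2)]
refs = [torch.randint(0, C, (T,)) for _ in range(2)]
pit_loss(outs, refs).backward()
```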

Speech synthesis has advanced to the point of being close to indistinguishable from human speech. However, efforts to train speech recognition systems on synthesized utterances have not been able to show that synthesized data can be effectively used to augment or replace human speech. In this work, we demonstrate that promoting consistent predictions in response to real and synthesized speech enables significantly improved recognition performance. We also find that, training on the 460 hours of LibriSpeech augmented with 500 hours of transcripts (without audio), performance is within 0.2% WER...

10.1109/icassp40776.2020.9053831 article EN ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020-04-09
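One way to read the "consistent predictions" idea above is as an auxiliary term that pulls the model's output distributions on real and synthesized renderings of the same transcript toward each other, alongside the usual ASR losses. The sketch below uses a symmetric KL divergence; the function name, time-alignment assumption, and loss choice are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch: a consistency term between per-step output
# distributions computed on real audio and on TTS audio of the same
# transcript. Assumes the two logit sequences share the same steps.

def consistency_loss(logits_real, logits_tts):
    """Symmetric KL between posteriors on real and synthesized speech.

    Both tensors: (T, vocab) logits for the same transcript.
    """
    p = F.log_softmax(logits_real, dim=-1)
    q = F.log_softmax(logits_tts, dim=-1)
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)

# Toy usage with random logits standing in for model outputs.
total = consistency_loss(torch.randn(20, 512), torch.randn(20, 512))
print(float(total))
```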

We present JOIST, an algorithm to train a streaming, cascaded, encoder end-to-end (E2E) model with both speech-text paired inputs and text-only unpaired inputs. Unlike previous works, we explore joint training of both modalities, rather than pre-training and fine-tuning. In addition, we explore JOIST using a streaming E2E model with an order of magnitude more data, which are also novelties compared to previous works. Through a series of ablation studies, we explore different types of text modeling, including how to model the length of the text sequence and the appropriate subword unit...

10.1109/slt54892.2023.10022774 article EN 2022 IEEE Spoken Language Technology Workshop (SLT) 2023-01-09

End-to-end modeling (E2E) of automatic speech recognition (ASR) blends all the components of a traditional speech recognition system into a single, unified model. Although it simplifies ASR systems, the unified model is hard to adapt when training and testing data mismatch. In this work, we focus on contextual speech recognition, which is particularly challenging for E2E models because contextual information is only available at inference time. To improve performance in the presence of contextual information during training, we propose to use a class-based language model (CLM) that can populate...

10.1109/icassp.2019.8683573 article EN ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019-04-17

Connectionist temporal classification (CTC) has recently shown improved performance and efficiency in automatic speech recognition. One popular decoding implementation is to use a CTC model to predict the phone posteriors at each frame and then perform Viterbi beam search on a modified WFST network. This is still within the traditional frame-synchronous decoding framework. In this paper, the peaky posterior property of CTC is carefully investigated and it is found that ignoring blank frames will not introduce additional errors. Based...

10.1109/taslp.2016.2625459 article EN IEEE/ACM Transactions on Audio Speech and Language Processing 2016-11-04
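The blank-skipping observation can be made concrete with a small sketch: frames whose CTC blank posterior dominates are dropped before the search, so the decoder only advances on frames that carry phone evidence. The threshold, blank index, and shapes below are assumptions for illustration, not the paper's decoder.

```python
import numpy as np

# Minimal sketch of blank skipping for CTC decoding: because CTC
# posteriors are peaky, frames dominated by the blank symbol can be
# removed before the beam search.

BLANK = 0  # assumed blank index

def drop_blank_frames(posteriors: np.ndarray, blank_thresh: float = 0.95):
    """posteriors: (T, num_labels) CTC softmax outputs.

    Returns the reduced posterior sequence and the kept frame indices.
    """
    keep = posteriors[:, BLANK] < blank_thresh
    return posteriors[keep], np.nonzero(keep)[0]

# Toy example: random posteriors with most mass on blank, CTC-like.
T, L = 200, 40
logits = np.random.randn(T, L)
logits[:, BLANK] += 4.0
post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
reduced, kept = drop_blank_frames(post)
print(f"kept {len(kept)}/{T} frames for the search")
```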

End-to-end (E2E) automatic speech recognition (ASR) systems directly map acoustics to words using a unified model. Previous works mostly focus on E2E training of a single model which integrates acoustic and language models into a whole. Although this benefits from sequence modeling and simplified decoding pipelines, a large amount of transcribed data is usually required, and traditional acoustic and language modelling techniques cannot be utilized. In this paper, a novel modular training framework for E2E ASR is proposed to separately train neural acoustic and language models during...

10.1109/icassp.2018.8461361 preprint EN 2018-04-01

We propose a novel method to accelerate the training and inference process of the recurrent neural network transducer (RNN-T) based on the guidance from a co-trained connectionist temporal classification (CTC) model. We made a key assumption that if an encoder embedding frame is classified as blank by the CTC model, it is likely that this frame will be aligned to blank for all partial alignments or hypotheses in RNN-T and it can be discarded from the decoder input. We also show that this reduction operation can be applied in the middle of the encoder, which can result in significant speed up...

10.1109/icassp49357.2023.10096065 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05
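A minimal sketch of the frame-reduction idea described above, assuming a 0-indexed blank and random tensors standing in for a real encoder and CTC head: encoder frames whose CTC argmax is blank are discarded before they reach the RNN-T decoder and joint network, shrinking the lattice for both training and search.

```python
import torch

# Minimal sketch of CTC-guided frame reduction for RNN-T: a CTC head
# co-trained on the encoder labels each frame, and blank-dominated
# frames are dropped before the RNN-T joint network.

BLANK = 0  # assumed blank index

def reduce_encoder_frames(enc_out: torch.Tensor, ctc_logits: torch.Tensor):
    """enc_out: (T, D) encoder embeddings; ctc_logits: (T, vocab) CTC outputs.

    Keeps only frames whose most likely CTC label is not blank.
    """
    keep = ctc_logits.argmax(dim=-1) != BLANK
    return enc_out[keep], keep

# Toy usage with random tensors in place of a real encoder / CTC head.
T, D, V = 120, 256, 1024
enc = torch.randn(T, D)
ctc = torch.randn(T, V)
ctc[:, BLANK] += 3.0                 # emulate blank-dominated frames
reduced, mask = reduce_encoder_frames(enc, ctc)
print(f"RNN-T joint now sees {reduced.size(0)} of {T} frames")
```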

Neural speech editing advancements have raised concerns about their misuse in spoofing attacks. Traditional partially edited corpora primarily focus on cut-and-paste edits, which, while maintaining speaker consistency, often introduce detectable discontinuities. Recent methods, like A3T and Voicebox, improve transitions by leveraging contextual information. To foster detection research, we introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. We detail the process...

10.48550/arxiv.2501.03805 preprint EN arXiv (Cornell University) 2025-01-07

An ideal multimodal agent should be aware of the quality of its input modalities. Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks. However, most audio LLMs remain unaware of the quality of the speech they process. This limitation arises because speech quality evaluation is typically excluded from multi-task training due to the lack of suitable datasets. To address this, we introduce the first natural language-based speech evaluation corpus, generated from authentic human ratings. In...

10.48550/arxiv.2501.17202 preprint EN arXiv (Cornell University) 2025-01-27

10.1109/icassp49660.2025.10890560 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10889444 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

10.1109/icassp49660.2025.10890312 article EN ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025-03-12

Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied degrees of success. In this paper, we propose to jointly learn representations during pretraining from two different modalities: speech and text. The proposed method, tts4pretrain, complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speech, effectively learning from untranscribed speech and unspoken text. Lexical learning in the speech encoder is enforced through an additional sequence loss term that is coupled with the contrastive loss during pretraining...

10.1109/asru51503.2021.9688018 article EN 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2021-12-13

An effective way to learn representations from untranscribed speech and unspoken text, with linguistic/lexical representations derived from synthesized speech, was introduced in tts4pretrain [1]. However, the representations learned from synthesized and real speech are likely to be different, potentially limiting the improvements from incorporating unspoken text. In this paper, we introduce supervised learning earlier in the training process via consistency-based regularization between real and synthesized speech. This allows for better learning of shared representations. Thus, a new objective, encoder-decoder consistency...

10.1109/icassp43922.2022.9746475 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27

Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that the modality-matched joint speech and text model introduced in [1] can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero supervised speech, real-world setting to expand the set of languages covered by ASR with only...

10.1109/slt54892.2023.10022791 article EN 2022 IEEE Spoken Language Technology Workshop (SLT) 2023-01-09

This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty in scaling to hundreds of languages is collecting high-quality speech-text paired data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train the model from various types of speech and text data,...

10.1109/icassp49357.2023.10095702 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05

Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). The problem can be modularized into three sub-problems: frame-wise interpreting, sequence-level speaker tracing, and speech recognition. Nevertheless, previous acoustic models formulate the correlation between sequential labels implicitly, which limits the modeling effect. In this work, we include explicit models for the sequential label correlation during training, making the prediction relevant to the context given by both the input feature and the output of the last frame. Moreover, we propose...

10.1109/icassp.2018.8461939 article EN 2018-04-01

10.21437/interspeech.2016-831 article EN Interspeech 2016 2016-08-29