- Speech Recognition and Synthesis
- Natural Language Processing Techniques
- Topic Modeling
- Speech and Audio Processing
- Music and Audio Processing
- Speech and Dialogue Systems
- Algorithms and Data Compression
- Image Processing and 3D Reconstruction
- DNA and Biological Computing
- Time Series Analysis and Forecasting
- Hate Speech and Cyberbullying Detection
- Anomaly Detection Techniques and Applications
- Text Readability and Simplification
- Phonetics and Phonology Research
- Domain Adaptation and Few-Shot Learning
- Intelligent Tutoring Systems and Adaptive Learning
- Organoboron and Organosilicon Chemistry
- Handwritten Text Recognition Techniques
- Genomics and Phylogenetic Studies
- Cellular Automata and Applications
- Advanced Biosensing and Bioanalysis Techniques
- Network Security and Intrusion Detection
- Nvidia (United States), 2024
- Google (United States), 2020-2023
- Shanghai Jiao Tong University, 2014-2021
- Institute for Language and Speech Processing, 2021
- Johns Hopkins University, 2021
- Menlo School, 2019
- Meta (United States), 2019
- Shanghai Municipal Education Commission, 2017-2018
- Microsoft (United States), 2018
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite...
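The random-projection quantization mentioned above generates discrete pre-training targets by projecting speech features with a frozen random matrix and matching them to the nearest entry of a frozen random codebook. The sketch below illustrates only that target-generation step; the dimensions, `num_codes`, and the stand-in features are illustrative assumptions, not the USM configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen (never trained) projection matrix and codebook, as in random-projection quantization.
feat_dim, proj_dim, num_codes = 80, 16, 512
projection = rng.standard_normal((feat_dim, proj_dim))
codebook = rng.standard_normal((num_codes, proj_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(features: np.ndarray) -> np.ndarray:
    """Map each frame of speech features to the index of its nearest codebook entry."""
    projected = features @ projection                                  # (T, proj_dim)
    projected /= np.linalg.norm(projected, axis=1, keepdims=True) + 1e-8
    distances = np.linalg.norm(projected[:, None, :] - codebook[None, :, :], axis=-1)
    return distances.argmin(axis=1)                                    # (T,) discrete targets

# During pre-training, masked frames are fed to the encoder and the model is trained
# to predict these frozen quantizer indices at the masked positions.
log_mel = rng.standard_normal((200, feat_dim))                         # stand-in for one utterance
targets = quantize(log_mel)
print(targets[:10])
```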
We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task. Previous work either implicitly enforced the representations learnt from these two modalities to be aligned in the latent space through multitasking and parameter sharing, or explicitly through conversion of modalities via speech synthesis. While the former suffers from interference between...
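One common way to realize this kind of speech-text alignment is to upsample text-token embeddings to the speech frame rate with predicted durations and apply a frame-level matching loss against the speech encoder output. The sketch below shows only that matching step, under assumed shapes and an assumed duration source; it is not the paper's exact recipe.

```python
import numpy as np

def upsample_text(token_embeddings: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each token embedding by its predicted duration so the text sequence
    lives on the same frame grid as the speech encoder output."""
    return np.repeat(token_embeddings, durations, axis=0)

def modality_matching_loss(speech_frames: np.ndarray, text_frames: np.ndarray) -> float:
    """Mean squared error pulling the two modalities toward a shared embedding space."""
    T = min(len(speech_frames), len(text_frames))   # guard against small length mismatches
    return float(np.mean((speech_frames[:T] - text_frames[:T]) ** 2))

rng = np.random.default_rng(0)
speech = rng.standard_normal((12, 32))              # 12 encoder frames, 32-dim
tokens = rng.standard_normal((4, 32))               # 4 text tokens
durations = np.array([3, 2, 4, 3])                  # frames per token (sums to 12)
print(modality_matching_loss(speech, upsample_text(tokens, durations)))
```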
Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). Permutation invariant training (PIT) is a state-of-the-art model-based approach, which applies a single neural network to solve this single-input, multiple-output modeling problem. We propose to advance the current state of the art by imposing a modular structure on the network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion. The modular structure splits the problem into...
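For readers unfamiliar with the PIT objective: every assignment of network output streams to reference speakers is scored, and only the cheapest assignment contributes to the loss. The sketch below uses frame-level cross-entropy as the per-pair cost, which is one common instantiation and an assumption here.

```python
import numpy as np
from itertools import permutations

def frame_ce(logits: np.ndarray, labels: np.ndarray) -> float:
    """Average frame-level cross-entropy for one (output stream, reference) pairing."""
    logp = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
    return float(-np.mean(logp[np.arange(len(labels)), labels]))

def pit_loss(stream_logits: list[np.ndarray], references: list[np.ndarray]) -> float:
    """Permutation invariant training: score every assignment of output streams to
    reference speakers and keep the best (minimum-cost) one."""
    costs = []
    for perm in permutations(range(len(references))):
        costs.append(sum(frame_ce(stream_logits[s], references[r])
                         for s, r in enumerate(perm)))
    return min(costs)

rng = np.random.default_rng(0)
T, num_classes = 50, 40
logits = [rng.standard_normal((T, num_classes)) for _ in range(2)]   # two output streams
refs = [rng.integers(0, num_classes, T) for _ in range(2)]           # two reference label sequences
print(pit_loss(logits, refs))
```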
Speech synthesis has advanced to the point of being close to indistinguishable from human speech. However, efforts to train speech recognition systems on synthesized utterances have not been able to show that synthesized data can be effectively used to augment or replace real speech. In this work, we demonstrate that promoting consistent predictions in response to real and synthesized speech enables significantly improved recognition performance. We also find that when training on 460 hours of LibriSpeech augmented with 500 hours of transcripts (without audio), performance is within 0.2% WER...
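The consistency idea can be expressed as a divergence penalty between the model's posteriors on a real utterance and on a TTS rendering of the same transcript. The sketch below uses a symmetric KL term and assumes the two passes are frame-aligned, both of which are simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_real: np.ndarray, logits_synth: np.ndarray) -> float:
    """Symmetric KL divergence between frame posteriors computed on real audio and on
    a synthesized rendering of the same transcript."""
    p, q = softmax(logits_real), softmax(logits_synth)
    kl_pq = np.sum(p * (np.log(p + 1e-9) - np.log(q + 1e-9)), axis=-1)
    kl_qp = np.sum(q * (np.log(q + 1e-9) - np.log(p + 1e-9)), axis=-1)
    return float(np.mean(kl_pq + kl_qp))

rng = np.random.default_rng(0)
real = rng.standard_normal((30, 100))                      # frame logits on the real utterance
synth = real + 0.1 * rng.standard_normal((30, 100))        # logits on its TTS counterpart
print(consistency_loss(real, synth))
```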
We present JOIST, an algorithm to train a streaming, cascaded, encoder end-to-end (E2E) model with both speech-text paired inputs and text-only unpaired inputs. Unlike previous works, we explore joint training of both modalities, rather than pre-training and fine-tuning. In addition, JOIST uses a streaming E2E model trained with an order of magnitude more data, which are also novelties compared to previous works. Through a series of ablation studies, we explore different types of text modeling, including how to model the length of the text sequence and the appropriate subword unit...
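A minimal sketch of the two ingredients this abstract names: turning unpaired text into a speech-like input (here by token repetition plus masking, one of the variants such ablations typically compare) and combining the paired and text-only losses in a single weighted objective. All names, weights, and numbers below are illustrative assumptions, not the JOIST settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def repeat_and_mask(token_ids: np.ndarray, repeat: int = 2, mask_prob: float = 0.15) -> np.ndarray:
    """Crudely map a text sequence onto a frame-like grid: repeat each token to mimic
    durations, then randomly mask positions (-1 marks a masked frame)."""
    frames = np.repeat(token_ids, repeat)
    mask = rng.random(len(frames)) < mask_prob
    return np.where(mask, -1, frames)

def joint_step(paired_loss: float, text_only_loss: float, text_weight: float = 0.25) -> float:
    """Joint objective: one weighted sum over the paired speech-text loss and the
    unpaired text-only loss, optimized together rather than in separate stages."""
    return paired_loss + text_weight * text_only_loss

tokens = rng.integers(0, 1000, size=8)
print(repeat_and_mask(tokens))
print(joint_step(paired_loss=1.32, text_only_loss=2.10))   # placeholder loss values
```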
End-to-end modeling (E2E) of automatic speech recognition (ASR) blends all the components of a traditional system into a single, unified model. Although it simplifies ASR systems, the unified model is hard to adapt when training and testing data mismatch. In this work, we focus on contextual speech recognition, which is particularly challenging for E2E models because the contextual information is only available at inference time. To improve performance in the presence of contextual information during training, we propose to use a class-based language model (CLM) that can populate...
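To make the class-based LM idea concrete: the training-time LM only sees a class token (here a hypothetical "@contact"), and at inference that token's probability mass is shared across the user's actual entities. The toy bigram scorer below, with invented tokens and probabilities, is a minimal sketch of that population step, not the paper's decoder.

```python
import math

# Toy class-based bigram LM: training only covers the class token "@contact";
# the concrete entities are supplied at inference time.
class_lm_logprob = {
    ("call",): math.log(0.1),
    ("call", "@contact"): math.log(0.5),   # p(@contact | "call")
}

def score_with_context(words: list[str], context_entities: list[str]) -> float:
    """Score a word sequence, routing contextual entities through the class token and
    sharing its probability mass uniformly across the supplied entities."""
    total, prev = 0.0, ()
    for w in words:
        token = "@contact" if w in context_entities else w
        total += class_lm_logprob.get(prev + (token,), math.log(1e-6))
        if w in context_entities:
            total -= math.log(len(context_entities))
        prev = (token,)                    # bigram history only
    return total

print(score_with_context(["call", "alice"], context_entities=["alice", "bob"]))
```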
Connectionist temporal classification (CTC) has recently shown improved performance and efficiency in automatic speech recognition. One popular decoding implementation is to use a CTC model to predict the phone posteriors at each frame and then perform Viterbi beam search on a modified WFST network. This is still within the traditional frame-synchronous decoding framework. In this paper, the peaky posterior property of CTC is carefully investigated and it is found that ignoring blank frames will not introduce additional errors. Based...
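The blank-skipping observation can be shown in a few lines: because CTC posteriors are peaky, frames whose blank posterior exceeds a threshold can be dropped before the beam search. The sketch below fabricates peaky logits purely to demonstrate the filtering step; the threshold and shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def drop_blank_frames(logits: np.ndarray, blank_id: int = 0, threshold: float = 0.95) -> np.ndarray:
    """Keep only frames whose blank posterior is below the threshold, shrinking the
    input to the (more expensive) search stage."""
    posteriors = softmax(logits)
    keep = posteriors[:, blank_id] < threshold
    return logits[keep]

rng = np.random.default_rng(0)
T, num_phones = 300, 50
logits = rng.standard_normal((T, num_phones))
logits[:, 0] += 10.0                       # make blank dominate most frames, as CTC tends to
logits[::25, 0] -= 20.0                    # a few non-blank "spikes"
kept = drop_blank_frames(logits)
print(f"{len(kept)} of {T} frames survive blank skipping")
```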
End-to-end (E2E) automatic speech recognition (ASR) systems directly map acoustics to words using a unified model. Previous works mostly focus on E2E training of a single model which integrates acoustic and language modeling into a whole. Although E2E modeling benefits from sequence-level modeling and simplified decoding pipelines, a large amount of transcribed data is usually required, and traditional acoustic and language modelling techniques cannot be utilized. In this paper, a novel modular framework for E2E ASR is proposed to separately train neural models during...
We propose a novel method to accelerate the training and inference process of the recurrent neural network transducer (RNN-T) based on guidance from a co-trained connectionist temporal classification (CTC) model. We make the key assumption that if an encoder embedding frame is classified as blank by the CTC model, it is likely that this frame will be aligned to blank for all partial alignments or hypotheses in RNN-T and can be discarded from the decoder input. We also show that this reduction operation can be applied in the middle of the encoder, which results in a significant speed up...
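A minimal sketch of the reduction step described above: a co-trained CTC head marks confidently-blank frames, and only the surviving encoder frames are passed on to the RNN-T joint network (or to later encoder layers). The fabricated logits, threshold, and shapes below are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def reduce_encoder_frames(encoder_out: np.ndarray, ctc_logits: np.ndarray,
                          blank_id: int = 0, threshold: float = 0.9) -> np.ndarray:
    """Drop encoder frames that the co-trained CTC head classifies as blank with high
    confidence, so downstream layers and the decoder only see the remaining frames."""
    blank_prob = softmax(ctc_logits)[:, blank_id]
    return encoder_out[blank_prob < threshold]

rng = np.random.default_rng(0)
T, dim, vocab = 200, 256, 128
encoder_out = rng.standard_normal((T, dim))
ctc_logits = rng.standard_normal((T, vocab))
ctc_logits[:, 0] += 10.0                   # most frames look blank to the CTC head
ctc_logits[::20, 0] -= 20.0                # occasional label-bearing frames
reduced = reduce_encoder_frames(encoder_out, ctc_logits)
print(f"encoder frames: {T} -> {len(reduced)}")
```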
Neural speech editing advancements have raised concerns about their misuse in spoofing attacks. Traditional partially edited corpora primarily focus on cut-and-paste edits, which, while maintaining speaker consistency, often introduce detectable discontinuities. Recent methods, like A\textsuperscript{3}T and Voicebox, improve transitions by leveraging contextual information. To foster detection research, we introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. We detail the process...
An ideal multimodal agent should be aware of the quality of its input modalities. Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks. However, most audio LLMs remain unaware of the quality of the speech they process. This limitation arises because quality evaluation is typically excluded from multi-task training due to the lack of suitable datasets. To address this, we introduce the first natural language-based corpus of its kind, generated from authentic human ratings. In...
Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied degrees of success. In this paper, we propose to jointly learn representations during pretraining from two different modalities: speech and text. The proposed method, tts4pretrain, complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speech, effectively learning from untranscribed speech and unspoken text. Lexical learning in the speech encoder is enforced through an additional sequence loss term that is coupled with the contrastive loss during pretraining...
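The coupling described above pairs a contrastive term on untranscribed speech with a sequence loss on speech synthesized from unspoken text. The sketch below shows a generic InfoNCE-style contrastive term and a weighted combination of the two losses; the function names, weighting, and placeholder sequence-loss value are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(anchor: np.ndarray, positive: np.ndarray, negatives: np.ndarray, temp: float = 0.1) -> float:
    """Contrastive term: the masked-position encoder output (anchor) should be closer
    to its quantized target (positive) than to the distractors."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives]) / temp
    logp = logits - np.log(np.sum(np.exp(logits)))
    return float(-logp[0])

def joint_pretrain_objective(contrastive: float, lexical_seq_loss: float, weight: float = 1.0) -> float:
    """Pretraining objective sketch: contrastive loss on untranscribed speech coupled
    with a sequence loss computed on speech synthesized from unspoken text."""
    return contrastive + weight * lexical_seq_loss

anchor, positive = rng.standard_normal(64), rng.standard_normal(64)
negatives = rng.standard_normal((10, 64))
c = info_nce(anchor, positive, negatives)
print(joint_pretrain_objective(c, lexical_seq_loss=2.4))   # placeholder sequence loss
```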
An effective way to learn representations from untranscribed speech and unspoken text, with linguistic/lexical representations derived from synthesized speech, was introduced in tts4pretrain [1]. However, the representations learned from synthesized and real speech are likely to be different, potentially limiting the improvements from incorporating unspoken text. In this paper, we introduce supervised learning earlier in the training process, together with consistency-based regularization between real and synthesized speech. This allows for better learning of shared representations. Thus, a new objective, encoder decoder consistency...
Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that the modality-matched joint speech and text model introduced in [1] can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero supervised speech, real-world setting to expand the set of languages covered by ASR with only...
This paper proposes Virtuoso, a massively multilingual speech–text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty in scaling to hundreds of languages is collecting high-quality paired data for low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train the model from various types of speech and text data,...
Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). The problem can be modularized into three sub-problems: frame-wise interpreting, sequence-level speaker tracing, and recognition. Nevertheless, previous acoustic models formulate the correlation between sequential labels implicitly, which limits the modeling effect. In this work, we include an explicit model for the label correlation during training: the prediction at each frame is conditioned on the information given by both the input feature and the output of the last frame. Moreover, we propose...
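To illustrate the explicit label correlation described above, the toy decoder below conditions each frame's prediction on both the current acoustic feature and a one-hot encoding of the label emitted at the previous frame. The tiny random parameters and dimensions are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, num_labels, hidden = 40, 30, 64

# Toy parameters: the prediction at frame t depends on the feature at t and on the
# label emitted at t-1 (explicit modeling of the label correlation).
W_feat = rng.standard_normal((feat_dim, hidden)) * 0.1
W_prev = rng.standard_normal((num_labels, hidden)) * 0.1
W_out = rng.standard_normal((hidden, num_labels)) * 0.1

def decode(features: np.ndarray) -> list[int]:
    """Greedy frame-by-frame decoding with the previous output fed back as input."""
    prev = np.zeros(num_labels)                    # one-hot of the previously emitted label
    outputs = []
    for x in features:
        h = np.tanh(x @ W_feat + prev @ W_prev)
        label = int(np.argmax(h @ W_out))
        outputs.append(label)
        prev = np.eye(num_labels)[label]
    return outputs

print(decode(rng.standard_normal((10, feat_dim))))
```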