- Speech Recognition and Synthesis
- Natural Language Processing Techniques
- Topic Modeling
- Speech and Audio Processing
- Music and Audio Processing
- Speech and Dialogue Systems
- Algorithms and Data Compression
- Image Processing and 3D Reconstruction
- DNA and Biological Computing
- Time Series Analysis and Forecasting
- Hate Speech and Cyberbullying Detection
- Anomaly Detection Techniques and Applications
- Text Readability and Simplification
- Phonetics and Phonology Research
- Domain Adaptation and Few-Shot Learning
- Intelligent Tutoring Systems and Adaptive Learning
- Organoboron and Organosilicon Chemistry
- Handwritten Text Recognition Techniques
- Genomics and Phylogenetic Studies
- Cellular Automata and Applications
- Advanced Biosensing and Bioanalysis Techniques
- Network Security and Intrusion Detection
- Nvidia (United States), 2024
- Google (United States), 2020-2023
- Shanghai Jiao Tong University, 2014-2021
- Institute for Language and Speech Processing, 2021
- Johns Hopkins University, 2021
- Menlo School, 2019
- Meta (United States), 2019
- Shanghai Municipal Education Commission, 2017-2018
- Microsoft (United States), 2018
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite...
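The random-projection quantization mentioned above generates discrete pre-training targets by projecting speech features with a frozen random matrix and matching them to the nearest entry of a frozen random codebook. The sketch below illustrates only that target-generation step; the dimensions, `num_codes`, and the stand-in features are illustrative assumptions, not the USM configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen (never trained) projection matrix and codebook, as in random-projection quantization.
feat_dim, proj_dim, num_codes = 80, 16, 512
projection = rng.standard_normal((feat_dim, proj_dim))
codebook = rng.standard_normal((num_codes, proj_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(features: np.ndarray) -> np.ndarray:
    """Map each frame of speech features to the index of its nearest codebook entry."""
    projected = features @ projection                                  # (T, proj_dim)
    projected /= np.linalg.norm(projected, axis=1, keepdims=True) + 1e-8
    distances = np.linalg.norm(projected[:, None, :] - codebook[None, :, :], axis=-1)
    return distances.argmin(axis=1)                                    # (T,) discrete targets

# During pre-training, masked frames are fed to the encoder and the model is trained
# to predict these frozen quantizer indices at the masked positions.
log_mel = rng.standard_normal((200, feat_dim))                         # stand-in for one utterance
targets = quantize(log_mel)
print(targets[:10])
```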
We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task. Previous work either implicitly enforced the representations learnt from these two modalities to be aligned in the latent space through multitasking and parameter sharing, or explicitly through conversion of modalities via speech synthesis. While the former suffers from interference between...
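One common way to realize this kind of speech-text alignment is to upsample text-token embeddings to the speech frame rate with predicted durations and apply a frame-level matching loss against the speech encoder output. The sketch below shows only that matching step, under assumed shapes and an assumed duration source; it is not the paper's exact recipe.

```python
import numpy as np

def upsample_text(token_embeddings: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each token embedding by its predicted duration so the text sequence
    lives on the same frame grid as the speech encoder output."""
    return np.repeat(token_embeddings, durations, axis=0)

def modality_matching_loss(speech_frames: np.ndarray, text_frames: np.ndarray) -> float:
    """Mean squared error pulling the two modalities toward a shared embedding space."""
    T = min(len(speech_frames), len(text_frames))   # guard against small length mismatches
    return float(np.mean((speech_frames[:T] - text_frames[:T]) ** 2))

rng = np.random.default_rng(0)
speech = rng.standard_normal((12, 32))              # 12 encoder frames, 32-dim
tokens = rng.standard_normal((4, 32))               # 4 text tokens
durations = np.array([3, 2, 4, 3])                  # frames per token (sums to 12)
print(modality_matching_loss(speech, upsample_text(tokens, durations)))
```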
Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). Permutation invariant training (PIT) is a state-of-the-art model-based approach, which applies a single neural network to solve this single-input, multiple-output modeling problem. We propose to advance the current state of the art by imposing a modular structure on the network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion. The modular structure splits the problem into...
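For readers unfamiliar with the PIT objective: every assignment of network output streams to reference speakers is scored, and only the cheapest assignment contributes to the loss. The sketch below uses frame-level cross-entropy as the per-pair cost, which is one common instantiation and an assumption here.

```python
import numpy as np
from itertools import permutations

def frame_ce(logits: np.ndarray, labels: np.ndarray) -> float:
    """Average frame-level cross-entropy for one (output stream, reference) pairing."""
    logp = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
    return float(-np.mean(logp[np.arange(len(labels)), labels]))

def pit_loss(stream_logits: list[np.ndarray], references: list[np.ndarray]) -> float:
    """Permutation invariant training: score every assignment of output streams to
    reference speakers and keep the best (minimum-cost) one."""
    costs = []
    for perm in permutations(range(len(references))):
        costs.append(sum(frame_ce(stream_logits[s], references[r])
                         for s, r in enumerate(perm)))
    return min(costs)

rng = np.random.default_rng(0)
T, num_classes = 50, 40
logits = [rng.standard_normal((T, num_classes)) for _ in range(2)]   # two output streams
refs = [rng.integers(0, num_classes, T) for _ in range(2)]           # two reference label sequences
print(pit_loss(logits, refs))
```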
Speech synthesis has advanced to the point of being close to indistinguishable from human speech. However, efforts to train speech recognition systems on synthesized utterances have not been able to show that synthesized data can be effectively used to augment or replace real speech. In this work, we demonstrate that promoting consistent predictions in response to real and synthesized speech enables significantly improved recognition performance. We also find that when training on 460 hours of LibriSpeech augmented with 500 hours of transcripts (without audio), performance is within 0.2% WER...
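The consistency idea can be expressed as a divergence penalty between the model's posteriors on a real utterance and on a TTS rendering of the same transcript. The sketch below uses a symmetric KL term and assumes the two passes are frame-aligned, both of which are simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_real: np.ndarray, logits_synth: np.ndarray) -> float:
    """Symmetric KL divergence between frame posteriors computed on real audio and on
    a synthesized rendering of the same transcript."""
    p, q = softmax(logits_real), softmax(logits_synth)
    kl_pq = np.sum(p * (np.log(p + 1e-9) - np.log(q + 1e-9)), axis=-1)
    kl_qp = np.sum(q * (np.log(q + 1e-9) - np.log(p + 1e-9)), axis=-1)
    return float(np.mean(kl_pq + kl_qp))

rng = np.random.default_rng(0)
real = rng.standard_normal((30, 100))                      # frame logits on the real utterance
synth = real + 0.1 * rng.standard_normal((30, 100))        # logits on its TTS counterpart
print(consistency_loss(real, synth))
```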
We present JOIST, an algorithm to train a streaming, cascaded, encoder end-to-end (E2E) model with both speech-text paired inputs and text-only unpaired inputs. Unlike previous works, we explore joint training of both modalities, rather than pre-training and fine-tuning. In addition, JOIST uses a streaming E2E model trained with an order of magnitude more data, which are also novelties compared to previous works. Through a series of ablation studies, we explore different types of text modeling, including how to model the length of the text sequence and the appropriate subword unit...
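A minimal sketch of the two ingredients this abstract names: turning unpaired text into a speech-like input (here by token repetition plus masking, one of the variants such ablations typically compare) and combining the paired and text-only losses in a single weighted objective. All names, weights, and numbers below are illustrative assumptions, not the JOIST settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def repeat_and_mask(token_ids: np.ndarray, repeat: int = 2, mask_prob: float = 0.15) -> np.ndarray:
    """Crudely map a text sequence onto a frame-like grid: repeat each token to mimic
    durations, then randomly mask positions (-1 marks a masked frame)."""
    frames = np.repeat(token_ids, repeat)
    mask = rng.random(len(frames)) < mask_prob
    return np.where(mask, -1, frames)

def joint_step(paired_loss: float, text_only_loss: float, text_weight: float = 0.25) -> float:
    """Joint objective: one weighted sum over the paired speech-text loss and the
    unpaired text-only loss, optimized together rather than in separate stages."""
    return paired_loss + text_weight * text_only_loss

tokens = rng.integers(0, 1000, size=8)
print(repeat_and_mask(tokens))
print(joint_step(paired_loss=1.32, text_only_loss=2.10))   # placeholder loss values
```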
End-to-end modeling (E2E) of automatic speech recognition (ASR) blends all the components of a traditional system into a single, unified model. Although it simplifies ASR systems, the unified model is hard to adapt when training and testing data mismatch. In this work, we focus on contextual speech recognition, which is particularly challenging for E2E models because the contextual information is only available at inference time. To improve performance in the presence of contextual information during training, we propose to use a class-based language model (CLM) that can populate...
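To make the class-based LM idea concrete: the training-time LM only sees a class token (here a hypothetical "@contact"), and at inference that token's probability mass is shared across the user's actual entities. The toy bigram scorer below, with invented tokens and probabilities, is a minimal sketch of that population step, not the paper's decoder.

```python
import math

# Toy class-based bigram LM: training only covers the class token "@contact";
# the concrete entities are supplied at inference time.
class_lm_logprob = {
    ("call",): math.log(0.1),
    ("call", "@contact"): math.log(0.5),   # p(@contact | "call")
}

def score_with_context(words: list[str], context_entities: list[str]) -> float:
    """Score a word sequence, routing contextual entities through the class token and
    sharing its probability mass uniformly across the supplied entities."""
    total, prev = 0.0, ()
    for w in words:
        token = "@contact" if w in context_entities else w
        total += class_lm_logprob.get(prev + (token,), math.log(1e-6))
        if w in context_entities:
            total -= math.log(len(context_entities))
        prev = (token,)                    # bigram history only
    return total

print(score_with_context(["call", "alice"], context_entities=["alice", "bob"]))
```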
Connectionist temporal classification (CTC) has recently shown improved performance and efficiency in automatic speech recognition. One popular decoding implementation is to use a CTC model to predict the phone posteriors at each frame and then perform Viterbi beam search on a modified WFST network. This is still within the traditional frame-synchronous decoding framework. In this paper, the peaky posterior property of CTC is carefully investigated and it is found that ignoring blank frames will not introduce additional errors. Based...
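The blank-skipping observation can be shown in a few lines: because CTC posteriors are peaky, frames whose blank posterior exceeds a threshold can be dropped before the beam search. The sketch below fabricates peaky logits purely to demonstrate the filtering step; the threshold and shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def drop_blank_frames(logits: np.ndarray, blank_id: int = 0, threshold: float = 0.95) -> np.ndarray:
    """Keep only frames whose blank posterior is below the threshold, shrinking the
    input to the (more expensive) search stage."""
    posteriors = softmax(logits)
    keep = posteriors[:, blank_id] < threshold
    return logits[keep]

rng = np.random.default_rng(0)
T, num_phones = 300, 50
logits = rng.standard_normal((T, num_phones))
logits[:, 0] += 10.0                       # make blank dominate most frames, as CTC tends to
logits[::25, 0] -= 20.0                    # a few non-blank "spikes"
kept = drop_blank_frames(logits)
print(f"{len(kept)} of {T} frames survive blank skipping")
```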
End-to-end (E2E) automatic speech recognition (ASR) systems directly map acoustics to words using a unified model. Previous works mostly focus on E2E training of a single model which integrates acoustic and language modeling into a whole. Although E2E modeling benefits from sequence-level modeling and simplified decoding pipelines, a large amount of transcribed data is usually required, and traditional acoustic and language modelling techniques cannot be utilized. In this paper, a novel modular framework for E2E ASR is proposed to separately train neural models during...
We propose a novel method to accelerate the training and inference process of the recurrent neural network transducer (RNN-T) based on guidance from a co-trained connectionist temporal classification (CTC) model. We make the key assumption that if an encoder embedding frame is classified as blank by the CTC model, it is likely that this frame will be aligned to blank for all partial alignments or hypotheses in RNN-T and can be discarded from the decoder input. We also show that this reduction operation can be applied in the middle of the encoder, which results in a significant speed up...
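A minimal sketch of the reduction step described above: a co-trained CTC head marks confidently-blank frames, and only the surviving encoder frames are passed on to the RNN-T joint network (or to later encoder layers). The fabricated logits, threshold, and shapes below are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def reduce_encoder_frames(encoder_out: np.ndarray, ctc_logits: np.ndarray,
                          blank_id: int = 0, threshold: float = 0.9) -> np.ndarray:
    """Drop encoder frames that the co-trained CTC head classifies as blank with high
    confidence, so downstream layers and the decoder only see the remaining frames."""
    blank_prob = softmax(ctc_logits)[:, blank_id]
    return encoder_out[blank_prob < threshold]

rng = np.random.default_rng(0)
T, dim, vocab = 200, 256, 128
encoder_out = rng.standard_normal((T, dim))
ctc_logits = rng.standard_normal((T, vocab))
ctc_logits[:, 0] += 10.0                   # most frames look blank to the CTC head
ctc_logits[::20, 0] -= 20.0                # occasional label-bearing frames
reduced = reduce_encoder_frames(encoder_out, ctc_logits)
print(f"encoder frames: {T} -> {len(reduced)}")
```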
Neural speech editing advancements have raised concerns about their misuse in spoofing attacks. Traditional partially edited corpora primarily focus on cut-and-paste edits, which, while maintaining speaker consistency, often introduce detectable discontinuities. Recent methods, like A\textsuperscript{3}T and Voicebox, improve transitions by leveraging contextual information. To foster detection research, we introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. We detail the process...
An ideal multimodal agent should be aware of the quality of its input modalities. Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks. However, most audio LLMs remain unaware of the quality of the speech they process. This limitation arises because quality evaluation is typically excluded from multi-task training due to the lack of suitable datasets. To address this, we introduce the first natural language-based corpus of its kind, generated from authentic human ratings. In...
Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied degrees of success. In this paper, we propose to jointly learn representations during pretraining from two different modalities: speech and text. The proposed method, tts4pretrain, complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speech, effectively learning from untranscribed speech and unspoken text. Lexical learning in the speech encoder is enforced through an additional sequence loss term that is coupled with the contrastive loss during pretraining...
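The coupling described above pairs a contrastive term on untranscribed speech with a sequence loss on speech synthesized from unspoken text. The sketch below shows a generic InfoNCE-style contrastive term and a weighted combination of the two losses; the function names, weighting, and placeholder sequence-loss value are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(anchor: np.ndarray, positive: np.ndarray, negatives: np.ndarray, temp: float = 0.1) -> float:
    """Contrastive term: the masked-position encoder output (anchor) should be closer
    to its quantized target (positive) than to the distractors."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives]) / temp
    logp = logits - np.log(np.sum(np.exp(logits)))
    return float(-logp[0])

def joint_pretrain_objective(contrastive: float, lexical_seq_loss: float, weight: float = 1.0) -> float:
    """Pretraining objective sketch: contrastive loss on untranscribed speech coupled
    with a sequence loss computed on speech synthesized from unspoken text."""
    return contrastive + weight * lexical_seq_loss

anchor, positive = rng.standard_normal(64), rng.standard_normal(64)
negatives = rng.standard_normal((10, 64))
c = info_nce(anchor, positive, negatives)
print(joint_pretrain_objective(c, lexical_seq_loss=2.4))   # placeholder sequence loss
```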
An effective way to learn representations from untranscribed speech and unspoken text, with linguistic/lexical representations derived from synthesized speech, was introduced in tts4pretrain [1]. However, the representations learned from synthesized and real speech are likely to be different, potentially limiting the improvements from incorporating unspoken text. In this paper, we introduce supervised learning earlier in the training process, together with consistency-based regularization between real and synthesized speech. This allows for better learning of shared representations. Thus, a new objective, encoder decoder consistency...
Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that the modality-matched joint speech and text model introduced in [1] can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero supervised speech, real-world setting to expand the set of languages covered by ASR with only...
This paper proposes Virtuoso, a massively multilingual speech–text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty in scaling to hundreds of languages is collecting high-quality paired data for low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train the model from various types of speech and text data,...
Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). The problem can be modularized into three sub-problems: frame-wise interpreting, sequence-level speaker tracing, and recognition. Nevertheless, previous acoustic models formulate the correlation between sequential labels implicitly, which limits the modeling effect. In this work, we include an explicit model for the label correlation during training: the prediction at each frame is conditioned on the information given by both the input feature and the output of the last frame. Moreover, we propose...
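To illustrate the explicit label correlation described above, the toy decoder below conditions each frame's prediction on both the current acoustic feature and a one-hot encoding of the label emitted at the previous frame. The tiny random parameters and dimensions are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, num_labels, hidden = 40, 30, 64

# Toy parameters: the prediction at frame t depends on the feature at t and on the
# label emitted at t-1 (explicit modeling of the label correlation).
W_feat = rng.standard_normal((feat_dim, hidden)) * 0.1
W_prev = rng.standard_normal((num_labels, hidden)) * 0.1
W_out = rng.standard_normal((hidden, num_labels)) * 0.1

def decode(features: np.ndarray) -> list[int]:
    """Greedy frame-by-frame decoding with the previous output fed back as input."""
    prev = np.zeros(num_labels)                    # one-hot of the previously emitted label
    outputs = []
    for x in features:
        h = np.tanh(x @ W_feat + prev @ W_prev)
        label = int(np.argmax(h @ W_out))
        outputs.append(label)
        prev = np.eye(num_labels)[label]
    return outputs

print(decode(rng.standard_normal((10, feat_dim))))
```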