- Speech Recognition and Synthesis
- Voice and Speech Disorders
- Phonetics and Phonology Research
- Speech and Audio Processing
- Speech and Dialogue Systems
- Natural Language Processing Techniques
- Text Readability and Simplification
- Language Development and Disorders
- Topic Modeling
- Music Technology and Sound Studies
Google (United States)
2022-2025
This study investigates the performance of personalized automatic speech recognition (ASR) for recognizing disordered speech using small amounts of per-speaker adaptation data. We trained personalized models for 195 individuals with different types and severities of speech impairment, with training sets ranging in size from <1 minute to 18-20 minutes of speech. Word error rate (WER) thresholds were selected to determine Success Percentage (the percentage of speakers reaching the target WER) in different application scenarios. For the home automation scenario, 79% of speakers reached...
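The two quantities in this abstract, per-speaker WER and Success Percentage, can be sketched in a few lines. This is an illustrative implementation, not the paper's code; the example threshold below is hypothetical.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # Levenshtein distance over words, computed row by row.
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                         # deletion
                       d[j - 1] + 1,                     # insertion
                       prev + (r[i - 1] != h[j - 1]))    # substitution / match
            prev = cur
    return d[len(h)] / len(r)

def success_percentage(speaker_wers, target: float) -> float:
    """Percentage of speakers whose personalized model reaches the target WER."""
    return 100.0 * sum(w <= target for w in speaker_wers) / len(speaker_wers)
```

For example, `wer("turn on the lights", "turn on lights")` is 0.25 (one deletion over four reference words), and `success_percentage([0.05, 0.12, 0.30, 0.08], target=0.15)` is 75.0.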
Automatic Speech Recognition (ASR) systems, despite significant advances in recent years, still have much room for improvement, particularly in the recognition of disordered speech. Even so, erroneous transcripts from ASR models can help people with speech impairments be better understood, especially if the transcription doesn't significantly change the intended meaning. Evaluating the efficacy of this use case requires a methodology for measuring the impact of transcription errors on meaning and comprehensibility. Human evaluation is the gold...
This study examines the effectiveness of automatic speech recognition (ASR) for individuals with speech disorders, addressing the gap in performance between read and conversational ASR. We analyze the factors influencing this disparity and the effect of mode-specific training on ASR accuracy.
We developed dysarthric speech intelligibility classifiers on 551,176 disordered speech samples contributed by a diverse set of 468 speakers, with a range of self-reported speaking disorders, each rated for overall intelligibility on a five-point scale. We trained three models following different deep learning approaches and evaluated them on ~94K utterances from 100 speakers. We further found the models to generalize well (without further training) to the TORGO database [1] (100% accuracy), UASpeech [2] (0.93 correlation), and ALS-TDI PMP [3] (0.81 AUC)...
Word Error Rate (WER) is the primary metric used to assess automatic speech recognition (ASR) model quality. It has been shown that ASR models tend to have much higher WER on speakers with speech impairments than on typical English speakers. It is hard to determine if a model can be useful at such high error rates. This study investigates the use of BERTScore, an evaluation metric for text generation, to provide a more informative measure of ASR quality and usefulness. Both BERTScore and WER were compared to prediction errors manually annotated by...
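The limitation motivating a semantic metric like BERTScore can be seen with a toy example (hypothetical utterances; a simplified substitution-only WER for equal-length word sequences): WER scores a meaning-preserving error and a meaning-reversing error identically, whereas an embedding-based metric would rate the first hypothesis as much closer to the reference.

```python
def word_errors(ref: str, hyp: str) -> float:
    """Substitution-only word error rate for equal-length word sequences
    (a simplification; full WER also counts insertions and deletions)."""
    r, h = ref.split(), hyp.split()
    assert len(r) == len(h)
    return sum(a != b for a, b in zip(r, h)) / len(r)

ref       = "please turn on the kitchen lights"
hyp_minor = "please turn on the kitchen light"    # meaning largely preserved
hyp_major = "please turn off the kitchen lights"  # intent reversed

# WER is blind to which word was substituted: both score one error in six.
assert word_errors(ref, hyp_minor) == word_errors(ref, hyp_major)
```

A semantic similarity score computed from contextual embeddings, which is what BERTScore provides, distinguishes these two cases; WER by construction cannot.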
Automatic classification of disordered speech can provide an objective tool for identifying the presence and severity of a speech impairment. Classification approaches can also help identify hard-to-recognize speech samples to teach ASR systems about the variable manifestations of impaired speech. Here, we develop and compare different deep learning techniques to classify the intelligibility of disordered speech on selected phrases. We collected samples from a diverse set of 661 speakers with a variety of self-reported disorders speaking 29 words or phrases, which...
Project Euphonia, a Google initiative, is dedicated to improving automatic speech recognition (ASR) of disordered speech. A central objective of the project is to create a large, high-quality, and diverse speech corpus. This report describes the project's latest advancements in data collection and annotation methodologies, such as expanding speaker diversity in the database, adding human-reviewed transcript corrections and audio quality tags to 350K (of 1.2M total) recordings, and amassing a comprehensive set of metadata (including more...
This study investigates the impact of integrating a dataset of disordered speech recordings ($\sim$1,000 hours) into the fine-tuning of a near state-of-the-art ASR baseline system. Contrary to what one might expect, despite this data being less than 1% of the system's training data, we find considerable improvement in disordered speech recognition accuracy. Specifically, we observe a 33% improvement on prompted speech, and 26% on newly gathered spontaneous, conversational speech. Importantly, there is no significant performance decline on standard benchmarks....
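The "less than 1%" figure describes a data-mixing ratio during fine-tuning. A minimal sketch of such a mixing scheme, assuming simple per-example weighted sampling (the paper's actual recipe is not specified here; names and the 1% fraction below are illustrative):

```python
import random

def make_mixture(main_pool, disordered_pool, disordered_frac=0.01,
                 n_examples=10_000, seed=0):
    """Draw a fine-tuning example stream in which disordered-speech
    examples make up roughly `disordered_frac` of the total."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n_examples):
        pool = disordered_pool if rng.random() < disordered_frac else main_pool
        stream.append(rng.choice(pool))
    return stream
```

With `disordered_frac=0.01`, about 1 in 100 sampled examples comes from the disordered-speech pool, matching the "less than 1% of training data" regime the abstract describes.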
We introduce a large language model (LLM) capable of processing speech inputs and show that tuning it further with reinforcement learning on human preference (RLHF) enables it to adapt better to disordered speech than traditional fine-tuning. Our method replaces low-frequency text tokens in the LLM's vocabulary with audio tokens and teaches the model to recognize speech by fine-tuning on transcripts. We then use RL with rewards based on syntactic and semantic accuracy measures, further generalizing the LLM to disordered speech. While the resulting model does not outperform existing systems for...
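The vocabulary-reuse step, repurposing the rarest text-token ids as audio-token ids, can be sketched as follows. This is an illustrative reading of that idea, not the paper's implementation; `vocab_freq` is an assumed mapping from token id to corpus frequency.

```python
def repurpose_low_freq_tokens(vocab_freq: dict, n_audio_tokens: int) -> dict:
    """Reassign the n least-frequent text-token ids to serve as audio tokens.

    Returns a mapping audio_token_index -> reused vocabulary id, so the
    model's embedding table keeps its size while gaining audio slots.
    """
    # Sort ids by ascending corpus frequency; the rarest ids are reused.
    rarest = sorted(vocab_freq, key=vocab_freq.get)[:n_audio_tokens]
    return {i: tok_id for i, tok_id in enumerate(rarest)}
```

For example, with frequencies `{0: 100, 1: 1, 2: 50, 3: 2}` and two audio tokens, ids 1 and 3 (the rarest) are reassigned, so the mapping is `{0: 1, 1: 3}`.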