Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

DOI: 10.48550/arXiv.2405.14161
Publication Date: 2024-05-23
ABSTRACT
We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents. STAR is developed for prevalent speech foundation models based on Transformer-related architectures with auto-regressive decoding (e.g., Whisper, Canary). Specifically, we propose a novel indicator that empirically integrates step-wise information during decoding to assess the token-level quality of pseudo labels without ground truth, thereby guiding model updates for effective unsupervised adaptation. Experimental results show that STAR achieves an average 13.5% relative reduction in word error rate across 14 target domains, and it sometimes even approaches the upper-bound performance of supervised adaptation. Surprisingly, we also observe that STAR prevents the adapted model from the common catastrophic forgetting problem without recalling source-domain data. Furthermore, STAR exhibits high data efficiency, requiring less than one hour of unlabeled data, and seamless generality to alternative large speech models and speech translation tasks. The code is intended to be open-sourced to the research community.
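To make the core idea concrete, below is a minimal sketch of quality-weighted pseudo-label training for an auto-regressive ASR decoder. It is not the authors' exact STAR indicator: the paper integrates step-wise decoding information into its token-level score, whereas this sketch uses plain softmax confidence as a stand-in quality weight. The function names (token_confidence, weighted_pseudo_label_loss), the pad_id convention, and the toy tensors are illustrative assumptions.

```python
# Sketch of confidence-weighted pseudo-label training (PyTorch only).
# Assumption: per-token softmax confidence stands in for the paper's
# token-level quality indicator; STAR's actual score also uses
# step-wise decoding information.

import torch
import torch.nn.functional as F


def token_confidence(logits: torch.Tensor, pseudo_labels: torch.Tensor) -> torch.Tensor:
    """Softmax probability the model assigns to each pseudo-label token.

    logits: (batch, seq_len, vocab) decoder outputs from a forward pass
    over the pseudo transcript; pseudo_labels: (batch, seq_len).
    """
    probs = logits.softmax(dim=-1)
    return probs.gather(-1, pseudo_labels.unsqueeze(-1)).squeeze(-1)  # (batch, seq_len)


def weighted_pseudo_label_loss(logits, pseudo_labels, pad_id=0):
    """Cross-entropy over pseudo labels, re-weighted by token-level quality.

    Tokens the model itself finds unreliable contribute less to the update,
    which is the general idea behind quality-guided self-training.
    """
    weights = token_confidence(logits, pseudo_labels).detach()  # no grad through weights
    ce = F.cross_entropy(
        logits.transpose(1, 2), pseudo_labels, reduction="none"  # (batch, seq_len)
    )
    mask = (pseudo_labels != pad_id).float()  # ignore padding positions
    return (weights * ce * mask).sum() / mask.sum().clamp(min=1.0)


if __name__ == "__main__":
    # Toy shapes only: in practice, logits would come from a speech foundation
    # model (e.g., Whisper) decoding over its own pseudo transcripts.
    batch, seq_len, vocab = 2, 5, 100
    logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
    pseudo = torch.randint(1, vocab, (batch, seq_len))
    loss = weighted_pseudo_label_loss(logits, pseudo)
    loss.backward()
    print(f"weighted pseudo-label loss: {loss.item():.4f}")
```

Detaching the weights is a deliberate choice in this sketch: the quality score should gate the gradient, not be optimized itself, otherwise the model could trivially inflate its own confidence.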