Adopting Whisper for Confidence Estimation
FOS: Computer and information sciences
Computer Science - Machine Learning
Audio and Speech Processing (eess.AS)
FOS: Electrical engineering, electronic engineering, information engineering
Electrical Engineering and Systems Science - Audio and Speech Processing
Machine Learning (cs.LG)
DOI:
10.48550/arxiv.2502.13446
Publication Date:
2025-02-19
AUTHORS (4)
ABSTRACT
Recent research on word-level confidence estimation for speech recognition systems has primarily focused on lightweight models known as Confidence Estimation Modules (CEMs), which rely on hand-engineered features derived from Automatic Speech Recognition (ASR) outputs. In contrast, we propose a novel end-to-end approach that leverages the ASR model itself (Whisper) to generate confidence scores. Specifically, we introduce a method in which Whisper is fine-tuned to produce scalar confidence scores given an audio input and its corresponding hypothesis transcript. Our experiments demonstrate that the Whisper-tiny model, comparable in size to a strong CEM baseline, achieves similar performance on the in-domain dataset and surpasses the baseline on eight out-of-domain datasets, whereas Whisper-large consistently outperforms the baseline by a substantial margin across all datasets.
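The abstract does not say how the scalar scores are read out of the model, so the following is only an illustrative sketch under stated assumptions. It assumes that each hypothesis token has a decoder hidden vector, that a single linear layer followed by a sigmoid maps that vector to a score in (0, 1), and that training minimizes binary cross-entropy against correct/incorrect labels. The names `confidence_scores` and `binary_cross_entropy`, the toy vectors, and the weights are all hypothetical and are not taken from the paper or from any Whisper library.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def confidence_scores(
    hidden_states: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
) -> List[float]:
    """Map each token's hidden vector to a scalar confidence in (0, 1).

    Assumption: a linear projection plus sigmoid stands in for whatever
    output head the fine-tuned model actually uses.
    """
    return [
        sigmoid(sum(w * h for w, h in zip(weights, vec)) + bias)
        for vec in hidden_states
    ]


def binary_cross_entropy(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy; label 1 = token correct, 0 = token wrong."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(scores, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(scores)


# Toy stand-ins for decoder hidden states of a three-token hypothesis.
hidden_states = [[0.5, -1.0, 2.0], [-1.5, 0.2, -0.3], [0.0, 0.0, 0.0]]
weights = [1.0, 0.5, 1.0]  # hypothetical projection weights
bias = 0.0

scores = confidence_scores(hidden_states, weights, bias)
labels = [1, 0, 1]  # hypothetical correctness labels for the three tokens
loss = binary_cross_entropy(scores, labels)
```

In a real setup the hidden vectors would come from the ASR model's decoder run on the audio and the hypothesis transcript, and token-level scores would still need to be aggregated to word level; both steps are outside this sketch.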