Multi-view Self-supervised Learning and Multi-scale Feature Fusion for Automatic Speech Recognition

Keywords: feature learning, word error rate, representation
DOI: 10.1007/s11063-024-11614-z Publication Date: 2024-05-08T12:01:46Z
ABSTRACT
To address the poor representation capability and low data utilization of end-to-end speech recognition models in deep learning, this study proposes a model based on multi-scale feature fusion and multi-view self-supervised learning (MM-ASR), trained under a multi-task paradigm. The proposed method emphasizes the importance of inter-layer information within the shared encoder, aiming to enhance the model's representation capability via a multi-scale feature fusion module. Moreover, we apply multi-view self-supervised learning to exploit the available information effectively. Our approach is rigorously evaluated on the Aishell-1 dataset, and its effectiveness is further validated on the English corpus WSJ. The experimental results demonstrate a noteworthy 4.6% reduction in character error rate, indicating significantly improved performance. These findings showcase the potential of our MM-ASR for speech recognition tasks.
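The abstract describes fusing inter-layer information from a shared encoder. The paper's exact module is not given here, but a common realization of multi-scale layer fusion is a learned softmax-weighted sum of per-layer encoder features; the sketch below illustrates that idea. All names, shapes, and the uniform initial weighting are assumptions for illustration, not details from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D weight vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_layers(layer_feats, fusion_logits):
    """Fuse per-layer encoder outputs (each frames x dims) with
    softmax-normalized scalar weights -- one hypothetical form of
    multi-scale feature fusion."""
    w = softmax(fusion_logits)
    return sum(wi * f for wi, f in zip(w, layer_feats))

# Toy example: 3 encoder layers, 4 frames, 8 feature dims.
feats = [np.random.randn(4, 8) for _ in range(3)]
fused = fuse_layers(feats, np.zeros(3))  # zero logits -> uniform weights
print(fused.shape)  # (4, 8)
```

With zero logits the weights are uniform, so the fused output is simply the mean of the layer features; training would adjust the logits to emphasize the most useful layers.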