A comparison of end-to-end models for long-form speech recognition
DOI:
10.48550/arXiv.1911.02242
Publication Date:
2019-01-01
AUTHORS (14)
ABSTRACT
End-to-end automatic speech recognition (ASR) models, including both attention-based models and the recurrent neural network transducer (RNN-T), have shown superior performance compared to conventional systems. However, previous studies focused primarily on short utterances that typically last for just a few seconds or, at most, tens of seconds. Whether such architectures are practical for long recordings lasting from minutes to hours remains an open question. In this paper, we investigate and improve end-to-end long-form transcription. We first present an empirical comparison of different end-to-end models on a real-world long-form task and demonstrate that the RNN-T model is much more robust than attention-based systems in this regime. We next explore two improvements to attention-based models that significantly improve their performance: restricting the attention to be monotonic, and applying a novel decoding algorithm that breaks long recordings into shorter overlapping segments. Combining these improvements, we show that attention-based models can be very competitive on long-form speech recognition.
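The overlapping-segment decoding mentioned in the abstract starts by chopping a long recording into fixed-length windows that share some overlap, so each window stays within the utterance lengths the model was trained on. The sketch below illustrates only that segmentation step; it is not the paper's implementation, and the function name, segment length, and overlap length are illustrative assumptions.

```python
# Minimal sketch of splitting long-form audio into overlapping segments
# for independent decoding. Segment/overlap durations are assumed values,
# not the settings used in the paper.

from typing import List, Tuple


def split_into_overlapping_segments(
    num_samples: int,
    sample_rate: int = 16000,
    segment_seconds: float = 15.0,
    overlap_seconds: float = 2.0,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices that cover the audio with overlap."""
    segment_len = int(segment_seconds * sample_rate)
    hop = segment_len - int(overlap_seconds * sample_rate)
    assert hop > 0, "overlap must be shorter than the segment length"

    segments = []
    start = 0
    while start < num_samples:
        end = min(start + segment_len, num_samples)
        segments.append((start, end))
        if end == num_samples:
            break
        start += hop
    return segments


if __name__ == "__main__":
    # One hour of 16 kHz audio yields a few hundred 15 s segments,
    # each sharing 2 s of audio with its neighbor.
    one_hour = 3600 * 16000
    segs = split_into_overlapping_segments(one_hour)
    print(f"{len(segs)} segments, first: {segs[0]}, last: {segs[-1]}")
```

Each segment would then be decoded separately and the overlapping hypotheses stitched back into a single transcript; the stitching logic is the more delicate part and is not shown here.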