A comparison of end-to-end models for long-form speech recognition
DOI:
10.48550/arXiv.1911.02242
Publication Date:
2019-01-01
AUTHORS (14)
ABSTRACT
End-to-end automatic speech recognition (ASR) models, including both attention-based models and the recurrent neural network transducer (RNN-T), have shown superior performance compared to conventional systems. However, previous studies focused primarily on short utterances that typically last for just a few seconds or, at most, tens of seconds. Whether such architectures are practical for long recordings lasting from minutes to hours remains an open question. In this paper, we investigate and improve end-to-end long-form transcription. We first present an empirical comparison of different end-to-end models on a real-world long-form task and demonstrate that the RNN-T model is much more robust than attention-based systems in this regime. We next explore two improvements to attention-based models that significantly improve their performance: restricting the attention to be monotonic, and applying a novel decoding algorithm that breaks long recordings into shorter overlapping segments. Combining these improvements, we show that attention-based models can be very competitive on long-form speech recognition.
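The overlapping-segment decoding mentioned in the abstract starts by chopping a long recording into fixed-length windows that share some overlap, so each window stays within the utterance lengths the model was trained on. The sketch below illustrates only that segmentation step; it is not the paper's implementation, and the function name, segment length, and overlap length are illustrative assumptions.

```python
# Minimal sketch of splitting long-form audio into overlapping segments
# for independent decoding. Segment/overlap durations are assumed values,
# not the settings used in the paper.

from typing import List, Tuple


def split_into_overlapping_segments(
    num_samples: int,
    sample_rate: int = 16000,
    segment_seconds: float = 15.0,
    overlap_seconds: float = 2.0,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices that cover the audio with overlap."""
    segment_len = int(segment_seconds * sample_rate)
    hop = segment_len - int(overlap_seconds * sample_rate)
    assert hop > 0, "overlap must be shorter than the segment length"

    segments = []
    start = 0
    while start < num_samples:
        end = min(start + segment_len, num_samples)
        segments.append((start, end))
        if end == num_samples:
            break
        start += hop
    return segments


if __name__ == "__main__":
    # One hour of 16 kHz audio yields a few hundred 15 s segments,
    # each sharing 2 s of audio with its neighbor.
    one_hour = 3600 * 16000
    segs = split_into_overlapping_segments(one_hour)
    print(f"{len(segs)} segments, first: {segs[0]}, last: {segs[-1]}")
```

Each segment would then be decoded separately and the overlapping hypotheses stitched back into a single transcript; the stitching logic is the more delicate part and is not shown here.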