Streaming Multi-Talker ASR with Token-Level Serialized Output Training

Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
DOI: 10.21437/interspeech.2022-7
Publication Date: 2022-09-16T15:42:06Z
ABSTRACT
Comments: 6 pages, 1 figure, 7 tables; v2: minor fixes; v3: Appendix D added; v4: citation to [27] added; v5: citations to [28][29][30] added with minor fixes; short version accepted for presentation at Interspeech 2022.

This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models that use multiple output branches, the t-SOT model has only a single output branch, which generates the recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token indicating a change of "virtual" output channel is introduced to keep track of overlapping utterances. Compared to prior streaming multi-talker ASR models, the t-SOT model has the advantages of lower inference cost and a simpler model architecture. Moreover, in our experiments with the LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves state-of-the-art word error rates, outperforming prior results by a significant margin. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost, opening the door to deploying one model for both single- and multi-talker scenarios.
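The serialization idea in the abstract can be illustrated with a minimal sketch (not the authors' code): tokens from overlapping utterances are merged into a single stream in order of emission time, and a special channel-change token is inserted whenever consecutive tokens come from different virtual output channels. The token name `<cc>` and the function below are assumptions for illustration only.

```python
# Minimal sketch of t-SOT-style target serialization.
# Assumption: "<cc>" is an illustrative name for the channel-change token;
# the paper's actual token and implementation details may differ.

CC_TOKEN = "<cc>"

def serialize_t_sot(utterances):
    """utterances: list of (channel, [(emission_time, token), ...]).

    Returns a single token sequence with CC_TOKEN inserted whenever the
    virtual output channel switches between consecutive tokens.
    """
    # Flatten all utterances into (time, channel, token) events and sort
    # by emission time, so tokens appear in chronological order.
    events = sorted(
        (t, ch, tok)
        for ch, toks in utterances
        for t, tok in toks
    )
    out, prev_ch = [], None
    for _, ch, tok in events:
        if prev_ch is not None and ch != prev_ch:
            out.append(CC_TOKEN)  # mark a switch of virtual channel
        out.append(tok)
        prev_ch = ch
    return out

# Two partially overlapping utterances on two virtual channels:
utts = [
    (0, [(0.0, "hello"), (0.4, "world")]),
    (1, [(0.2, "good"), (0.6, "morning")]),
]
print(serialize_t_sot(utts))
# ['hello', '<cc>', 'good', '<cc>', 'world', '<cc>', 'morning']
```

Because the merged stream is a flat token sequence, it can serve as the training target for an ordinary single-branch streaming model, which is the source of the simpler architecture and lower inference cost claimed in the abstract.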