Self-and-Mixed Attention Decoder with Deep Acoustic Structure for Transformer-Based LVCSR

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
DOI: 10.21437/interspeech.2020-2556 Publication Date: 2020-10-27T05:22:11Z
ABSTRACT
The Transformer has shown impressive performance in automatic speech recognition. It uses an encoder-decoder structure with self-attention to learn the relationship between the high-level representation of source inputs and the embedding of target outputs. In this paper, we propose a novel decoder structure that features self-and-mixed attention (SMAD) with a deep acoustic structure (DAS) to improve Transformer-based LVCSR. Specifically, we introduce a self-attention mechanism to learn a multi-layer deep acoustic structure for multiple levels of acoustic abstraction. We also design a mixed attention mechanism that learns the alignment between different levels of acoustic abstraction and its corresponding linguistic information simultaneously in a shared embedding space. The ASR experiments on Aishell-1 show that the proposed structure achieves CERs of 4.8% on the dev set and 5.1% on the test set, which are the best reported results on this task to our knowledge.
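The abstract's "mixed attention in a shared space" idea can be illustrated with a minimal sketch. This is not the authors' code: the layer sizes, the concatenation-based shared key/value space, and the omission of the causal mask are all assumptions made for brevity. The sketch shows one decoder layer whose queries come from the text stream while its keys and values span both acoustic and linguistic representations.

```python
# Illustrative sketch (assumed details, not the paper's implementation) of a
# decoder layer combining self-attention over target embeddings with a "mixed"
# attention that attends jointly to acoustic and linguistic representations.
import torch
import torch.nn as nn

class SelfAndMixedAttentionLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # Self-attention over the target (text) stream; the causal mask a real
        # ASR decoder needs is omitted here for brevity.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # "Mixed" attention: queries are the text stream, keys/values are the
        # concatenation of acoustic and text features, so acoustic-linguistic
        # alignment is learned in one shared embedding space (an assumption
        # about how the shared space is realized).
        self.mixed_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, text, acoustic):
        # text: (B, T_txt, d_model), acoustic: (B, T_ac, d_model)
        x, _ = self.self_attn(text, text, text)
        text = self.norm1(text + x)
        shared = torch.cat([acoustic, text], dim=1)  # shared key/value space
        x, _ = self.mixed_attn(text, shared, shared)
        text = self.norm2(text + x)
        return self.norm3(text + self.ffn(text))

layer = SelfAndMixedAttentionLayer()
out = layer(torch.randn(2, 10, 256), torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 10, 256])
```

Stacking several such layers, each drawing its acoustic keys/values from a different encoder depth, would correspond to attending over "multiple levels of acoustic abstraction" as described above.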