The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023
DOI:
10.48550/arxiv.2401.06788
Publication Date:
2024-01-01
AUTHORS (5)
ABSTRACT
This paper describes the visual speech recognition (VSR) system submitted by NPU-ASLP-LiAuto (Team 237) to the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, which took part in the fixed and open tracks of the Single-Speaker VSR Task and one track of the Multi-Speaker VSR Task. For data processing, we use the lip motion extractor from the baseline to produce multi-scale video data. In addition, several augmentation techniques are applied during training: speed perturbation, random rotation, horizontal flipping, and color transformation. The VSR model adopts an end-to-end architecture trained with a joint CTC/attention loss, comprising a ResNet3D visual frontend, an E-Branchformer encoder, and a Transformer decoder. Experiments show that our system achieves a character error rate (CER) of 34.76% on the Single-Speaker Task and 41.06% on the Multi-Speaker Task after multi-system fusion, ranking first in all three tracks in which we participated.
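As a rough illustration of two quantities named in the abstract, the sketch below shows how a joint CTC/attention loss is commonly formed as a weighted sum of the two branch losses, and how CER is usually computed as character-level edit distance divided by reference length. It is not taken from the authors' code: the function names, the `ctc_weight` default of 0.3, and the example sentences are assumptions chosen for illustration only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level CER: total edit errors over total reference characters."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_errors / total_chars


def joint_loss(ctc_loss, att_loss, ctc_weight=0.3):
    """Weighted sum of CTC and attention losses.

    ctc_weight=0.3 is an assumed value, not one reported in the abstract.
    """
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


if __name__ == "__main__":
    # Hypothetical reference/hypothesis pair: one substitution, one deletion.
    print(cer(["今天天气很好"], ["今天天汽很"]))
    print(joint_loss(2.0, 1.0))
```

In a real training setup the two loss terms would be tensors produced by the CTC branch on the encoder output and the cross-entropy branch on the decoder output; the scalar version here only shows the interpolation.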