LiRA: Learning Visual Speech Representations from Audio Through Self-Supervision
DOI: 10.21437/interspeech.2021-1360
Publication Date: 2021-08-27
AUTHORS (5)
ABSTRACT
The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training objective to learn from the other. In this work, we propose Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech. We find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments. We show that our approach significantly outperforms other methods on the Lip Reading in the Wild (LRW) dataset and achieves state-of-the-art performance on Lip Reading Sentences 2 (LRS2) using only a fraction of the total labelled data.
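To make the pretraining objective concrete, below is a minimal sketch (not the authors' code) of a LiRA-style setup: a visual front-end encodes lip-region video, a temporal encoder processes the frame sequence, and a linear head regresses frame-level acoustic features from the unlabelled video. The Conformer is approximated by a vanilla TransformerEncoder, the acoustic targets are a generic 80-dimensional feature (e.g. filterbanks), and all names, dimensions, and the L1 loss are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn


class VisualFrontEnd(nn.Module):
    """3D conv stem followed by a small 2D CNN trunk (ResNet stand-in)."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.stem = nn.Conv3d(1, 64, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        self.trunk = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),              # one vector per frame
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 1, time, height, width) grayscale mouth crops
        x = self.stem(video)                      # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        x = self.trunk(x).flatten(1)              # (B*T, d_model)
        return x.view(b, t, -1)                   # (B, T, d_model)


class LiRAStylePretrainer(nn.Module):
    """Predicts acoustic feature frames from video; trained by L1 regression."""

    def __init__(self, d_model: int = 256, acoustic_dim: int = 80):
        super().__init__()
        self.frontend = VisualFrontEnd(d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=1024,
                                           batch_first=True)
        # Stand-in for the Conformer blocks used in the paper.
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, acoustic_dim)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        feats = self.frontend(video)              # (B, T, d_model)
        feats = self.encoder(feats)
        return self.head(feats)                   # (B, T, acoustic_dim)


if __name__ == "__main__":
    model = LiRAStylePretrainer()
    video = torch.randn(2, 1, 16, 88, 88)         # 16 frames of 88x88 mouth crops
    target = torch.randn(2, 16, 80)               # time-aligned acoustic features
    loss = nn.functional.l1_loss(model(video), target)
    loss.backward()
    print(f"pretraining loss: {loss.item():.3f}")
```

After pretraining on unlabelled video in this fashion, the visual encoder can be reused for word-level or sentence-level lip-reading, either by extracting frozen features or by fine-tuning the whole network on labelled data, as described in the abstract.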