Cross-Attention Fusion of Visual and Geometric Features for Large Vocabulary Arabic Lipreading
KEYWORDS
lipreading, Arabic language, deep learning, graph neural networks, Transformer, LRW-AR
SUBJECTS
Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
FOS: Computer and information sciences; Technology
DOI: 10.48550/arxiv.2402.11520
Publication Date: 2025-01-09
AUTHORS (4)
ABSTRACT
Lipreading involves recognizing spoken words by analyzing the movements of the lips and surrounding area using visual data. It is an emerging research topic with many potential applications, such as human–machine interaction and enhancing audio-based speech recognition. Recent deep learning approaches integrate visual features from the mouth region and lip contours. However, simple methods such as concatenation may not effectively optimize the feature vector. In this article, we propose extracting optimal visual features using 3D convolution blocks followed by a ResNet-18, while employing a graph neural network to extract geometric features from tracked lip landmarks. To fuse these complementary features, we introduce a cross-attention mechanism that combines visual and geometric information to obtain an optimal representation of lip movements for lipreading tasks. To validate our approach for Arabic, we introduce the first large-scale Lipreading in the Wild for Arabic (LRW-AR) dataset, consisting of 20,000 videos across 100 word classes, spoken by 36 speakers. Experimental results on both the LRW-AR and LRW datasets demonstrate the effectiveness of our approach, achieving accuracies of 85.85% and 89.41%, respectively.
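The fusion step described above — visual features from the 3D-conv/ResNet-18 branch attending to geometric features from the landmark GNN branch — can be illustrated with a minimal NumPy sketch of scaled dot-product cross-attention. This is not the authors' implementation: the projection matrices are random stand-ins for learned weights, and the feature dimensions (512 visual, 128 geometric) are assumed for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, geometric, d_k=64, seed=0):
    """Fuse per-frame features via cross-attention.

    visual:    (T, Dv) features from the 3D-conv + ResNet-18 branch (queries).
    geometric: (T, Dg) features from the landmark GNN branch (keys/values).
    Returns a (T, d_k) fused representation.
    """
    rng = np.random.default_rng(seed)
    _, dv = visual.shape
    _, dg = geometric.shape
    # Random projections as placeholders for learned weight matrices.
    w_q = rng.standard_normal((dv, d_k)) / np.sqrt(dv)
    w_k = rng.standard_normal((dg, d_k)) / np.sqrt(dg)
    w_v = rng.standard_normal((dg, d_k)) / np.sqrt(dg)
    q, k, v = visual @ w_q, geometric @ w_k, geometric @ w_v
    # Each visual frame attends over all geometric frames.
    attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
    return attn @ v

# Example: 5 frames, assumed feature sizes.
visual = np.random.default_rng(1).standard_normal((5, 512))
geometric = np.random.default_rng(2).standard_normal((5, 128))
fused = cross_attention(visual, geometric)  # shape (5, 64)
```

In a trained model the projections would be learned end-to-end, typically with multiple heads and a residual connection, before the fused sequence is passed to the word classifier.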