Facial expression recognition with grid-wise attention and visual transformer
DOI: 10.1016/j.ins.2021.08.043
Publication Date: 2021-08-21
AUTHORS (4)
ABSTRACT
Abstract: Facial Expression Recognition (FER) has achieved remarkable progress through the use of Convolutional Neural Networks (CNNs). However, because convolutional filters rely on spatial locality, CNNs fail to learn long-range inductive biases between different facial regions in most neural layers, which limits the performance of CNN-based FER models. To address this problem, this paper introduces a novel FER framework with two attention mechanisms for CNN-based models, applied to low-level feature learning and high-level semantic representation, respectively. In low-level feature learning, a grid-wise attention mechanism captures the dependencies between different regions of a facial expression image, regularizing the parameter updates of the convolutional filters. In high-level semantic representation, a visual transformer attention mechanism uses a sequence of visual semantic tokens, generated from the pyramid features of high convolutional layer blocks, to learn a global representation. Extensive experiments have been conducted on three public facial expression datasets: CK+, FER+, and RAF-DB. The results show that the proposed FER-VT achieves state-of-the-art performance on these datasets, notably 100% accuracy on CK+ without any extra training data.
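The abstract describes attention over facial regions: feature maps are reduced to a sequence of region tokens, and attention weights model dependencies between regions. The paper's exact FER-VT architecture is not given here, so the following is only a minimal NumPy sketch of that general idea: average-pooling a convolutional feature map into grid-region tokens and running scaled dot-product self-attention across them. All function names, grid size, and dimensions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grid_tokens(feature_map, grid=4):
    """Pool an (H, W, C) feature map into grid*grid region tokens (average pooling)."""
    H, W, C = feature_map.shape
    hs, ws = H // grid, W // grid
    return np.stack([
        feature_map[i*hs:(i+1)*hs, j*ws:(j+1)*ws].mean(axis=(0, 1))
        for i in range(grid) for j in range(grid)
    ])  # shape: (grid*grid, C)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product attention across region tokens."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (N, N) region-to-region weights
    return A @ V  # each output token mixes information from all regions

rng = np.random.default_rng(0)
fmap = rng.standard_normal((16, 16, 8))        # stand-in for a conv feature map
tokens = grid_tokens(fmap, grid=4)             # 16 region tokens of dim 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (16, 8)
```

Because the attention matrix `A` relates every region token to every other, each output token can draw on distant facial regions, which is exactly the long-range dependency that local convolutional filters alone cannot capture.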
REFERENCES (48)
CITATIONS (121)