Exploiting the Social-Like Prior in Transformer for Visual Reasoning
DOI:
10.1609/aaai.v38i3.27977
Publication Date:
2024-03-25T09:15:37Z
AUTHORS (6)
ABSTRACT
Benefiting from the instrumental global dependency modeling of self-attention (SA), transformer-based approaches have become the pivotal choice for numerous downstream visual reasoning tasks, such as visual question answering (VQA) and referring expression comprehension (REC). However, recent studies have suggested that SA tends to suffer from rank collapse, which inevitably leads to representation degradation as the transformer layers go deeper. Inspired by social network theory, we make an analogy between social behavior and the regional information interaction in SA, and harness two crucial notions, structural hole and degree centrality, to explore a possible optimization towards SA learning, which naturally deduces two plug-and-play social-like modules. Based on the structural hole, the former module allows the information interaction in SA to be more structured, effectively avoiding redundant aggregation and feature homogenization for a better representation remedy; the latter module comprehensively characterizes and refines the discrimination of regions by considering their transitivity relations. Without bells and whistles, our model outperforms a bunch of baselines by a noticeable margin when the proposed prior is applied on five benchmarks of VQA and REC, and a series of explanatory results are showcased to sufficiently reveal the behaviors of SA.
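The abstract does not spell out how the social-network notions plug into attention, so the following is only a minimal illustrative sketch, assuming the simplest possible reading: the softmax attention matrix is treated as a weighted adjacency graph over visual regions, a degree-centrality score is computed per region from that graph, and the aggregated features are rescaled by it. The function name degree_centrality_reweight, the alpha interpolation, and all tensor shapes are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn.functional as F


def degree_centrality_reweight(attn, value, alpha=0.5):
    """Hypothetical sketch of a centrality-style refinement of SA.

    attn:  (batch, heads, n_regions, n_regions) softmax attention weights,
           read here as a weighted adjacency matrix over regions.
    value: (batch, heads, n_regions, d_head) value vectors.
    alpha: interpolation between plain aggregation and the reweighted one
           (assumed hyperparameter, not from the paper).
    """
    # In-degree centrality: how strongly each region is attended to by the
    # others, i.e. column sums of the attention "adjacency" matrix,
    # normalized across regions.
    centrality = attn.sum(dim=-2)                                # (B, H, N)
    centrality = centrality / centrality.sum(dim=-1, keepdim=True)

    # Standard self-attention aggregation.
    out = attn @ value                                           # (B, H, N, D)

    # Let more central regions contribute proportionally more.
    out = (1 - alpha) * out + alpha * centrality.unsqueeze(-1) * out
    return out


if __name__ == "__main__":
    B, H, N, D = 2, 4, 16, 32
    q, k, v = (torch.randn(B, H, N, D) for _ in range(3))
    attn = F.softmax(q @ k.transpose(-2, -1) / D ** 0.5, dim=-1)
    print(degree_centrality_reweight(attn, v).shape)             # (2, 4, 16, 32)
```

A structural-hole-style module would additionally look at how each region bridges otherwise weakly connected clusters of regions before reweighting; that is omitted here because the abstract gives no operational definition to sketch from.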