Language-guided Residual Graph Attention Network and Data Augmentation for Visual Grounding

DOI: 10.1145/3604557
Publication date: 2023-06-14
ABSTRACT
Visual grounding is an essential task in understanding the semantic relationship between a given text description and the target object in an image. Due to the innate complexity of language and the rich context of an image, it is still a challenging problem to infer the underlying relationships and perform reasoning over the objects in an image referred to by an expression. Although existing visual grounding methods have achieved promising progress, cross-modal mapping across different domains is still not well handled, especially when the expressions are complex and long. To address this issue, we propose a language-guided residual graph attention network for visual grounding (LRGAT-VG), which enables us to apply deeper graph convolution layers with the assistance of residual connections between them. This allows it to handle long and complex expressions better than other graph-based methods. Furthermore, we propose Language-guided Data Augmentation (LGDA), based on copy-paste operations on pairs of source and target images, to increase the diversity of the training data while maintaining the correspondence between the visual and linguistic content. With extensive experiments on three benchmarks, including RefCOCO, RefCOCO+, and RefCOCOg, LRGAT-VG with LGDA achieves competitive performance against state-of-the-art network-based referring expression approaches, demonstrating its effectiveness.
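The residual graph attention idea from the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' released code: the class names, dimensions, and the plain dot-product attention are assumptions, and the language-guided conditioning described in the paper is omitted for brevity. What it shows is the key claim, that a skip connection around each graph attention layer lets deeper stacks of graph layers remain trainable.

```python
# Minimal sketch of stacked residual graph attention layers (illustrative
# assumptions throughout; not the paper's LRGAT-VG implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualGraphAttentionLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (N, dim) object features; adj: (N, N) 0/1 adjacency, assumed
        # to include self-loops so every row has at least one neighbor.
        scores = self.query(nodes) @ self.key(nodes).t() / nodes.size(-1) ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        # Residual connection: the identity path lets gradients bypass the
        # attention block, which is what permits stacking deeper layers.
        return self.norm(nodes + attn @ self.value(nodes))

class ResidualGAT(nn.Module):
    def __init__(self, dim: int, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            ResidualGraphAttentionLayer(dim) for _ in range(num_layers)
        )

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            nodes = layer(nodes, adj)
        return nodes
```

A stack such as `ResidualGAT(dim=256)(nodes, adj)` can then refine detected-object features; without the identity path, stacking several plain graph attention layers tends to degrade and oversmooth node features, which is the failure mode residual connections are meant to avoid.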
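The LGDA step can be sketched in the same spirit. The snippet below is a simplification under stated assumptions, not the paper's procedure: it pastes an annotated object crop from a source image into a target image and carries the referring expression along with it, so the augmented image-text pair stays aligned. The `Grounded` container, the fixed paste offset, and the hard rectangular paste are all hypothetical.

```python
# Minimal copy-paste augmentation sketch (hypothetical names and layout;
# not the paper's LGDA code).
from dataclasses import dataclass
import numpy as np

@dataclass
class Grounded:
    image: np.ndarray   # (H, W, 3) uint8 image
    box: tuple          # (x, y, w, h) of the referred object
    expression: str     # referring expression for that object

def copy_paste(source: Grounded, target: Grounded, offset=(0, 0)) -> Grounded:
    """Paste the source's annotated object into the target image."""
    x, y, w, h = source.box
    px, py = offset
    H, W = target.image.shape[:2]
    # Sketch assumption: the offset keeps the patch fully inside the target.
    assert px + w <= W and py + h <= H, "offset places the patch out of bounds"
    patch = source.image[y:y + h, x:x + w]
    out = target.image.copy()
    out[py:py + h, px:px + w] = patch
    # The expression still describes the pasted object, now at its new box,
    # so the vision-language correspondence is preserved after augmentation.
    return Grounded(image=out, box=(px, py, w, h), expression=source.expression)
```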