A multimodal attention fusion network with a dynamic vocabulary for TextVQA
DOI: 10.1016/j.patcog.2021.108214
Publication Date: 2021-08-19
AUTHORS (9)
ABSTRACT
Visual question answering (VQA) is a well-known problem in computer vision. Recently, text-based VQA tasks have been receiving more and more attention because text information is very important for image understanding. The key to this task is to make good use of the text information in the image. In this work, we propose an attention-based encoder-decoder network that combines multimodal visual, linguistic, and location features. By using the attention mechanism to focus on the features most relevant to the question, our multimodal feature fusion provides more accurate information and improves performance. Furthermore, we present a decoder with an attention map loss, which can not only predict complex answers but also handle a dynamic vocabulary to reduce the decoding space. Compared with a softmax-based cross-entropy loss, which can only handle a fixed-length vocabulary, the attention map loss significantly improves accuracy and efficiency. Our method achieved first place in all three tasks of the ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering (ST-VQA).
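
As a rough illustration of the approach described in the abstract, the following is a minimal PyTorch sketch of question-guided attention fusion over visual and OCR-token (text plus location) features, with answers scored over a dynamic vocabulary: a fixed answer list plus the per-image OCR tokens, the latter scored by pointer-style attention rather than a fixed softmax classifier. All module names, dimensions, and the exact fusion scheme are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of question-guided multimodal attention fusion with a
# dynamic answer vocabulary; not the paper's actual architecture or code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionGuidedAttention(nn.Module):
    """Attend over a set of region/token features using the question as query."""
    def __init__(self, feat_dim, q_dim, hidden_dim=512):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, hidden_dim)
        self.proj_q = nn.Linear(q_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feats, q):
        # feats: (B, N, feat_dim), q: (B, q_dim)
        h = torch.tanh(self.proj_feat(feats) + self.proj_q(q).unsqueeze(1))
        alpha = F.softmax(self.score(h).squeeze(-1), dim=1)    # (B, N)
        attended = (alpha.unsqueeze(-1) * feats).sum(dim=1)    # (B, feat_dim)
        return attended, alpha


class MultimodalFusionVQA(nn.Module):
    """Fuse visual, OCR-text, and location features, then score a dynamic
    vocabulary made of a fixed answer list plus per-image OCR tokens."""
    def __init__(self, vis_dim=2048, ocr_dim=300, loc_dim=4, q_dim=768,
                 hidden_dim=512, fixed_vocab_size=5000):
        super().__init__()
        self.vis_attn = QuestionGuidedAttention(vis_dim, q_dim, hidden_dim)
        self.ocr_attn = QuestionGuidedAttention(ocr_dim + loc_dim, q_dim, hidden_dim)
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + ocr_dim + loc_dim + q_dim, hidden_dim),
            nn.ReLU(),
        )
        # Fixed part of the vocabulary scored by a linear classifier.
        self.fixed_head = nn.Linear(hidden_dim, fixed_vocab_size)
        # Dynamic part: OCR tokens scored by dot-product "pointer" attention.
        self.ocr_key = nn.Linear(ocr_dim + loc_dim, hidden_dim)

    def forward(self, vis_feats, ocr_feats, ocr_boxes, q_feat):
        # vis_feats: (B, Nv, vis_dim)  detected-region features
        # ocr_feats: (B, No, ocr_dim)  OCR-token embeddings
        # ocr_boxes: (B, No, loc_dim)  normalized box coordinates
        # q_feat:    (B, q_dim)        question embedding
        ocr_full = torch.cat([ocr_feats, ocr_boxes], dim=-1)
        vis_ctx, _ = self.vis_attn(vis_feats, q_feat)
        ocr_ctx, _ = self.ocr_attn(ocr_full, q_feat)
        fused = self.fuse(torch.cat([vis_ctx, ocr_ctx, q_feat], dim=-1))   # (B, H)

        fixed_scores = self.fixed_head(fused)                              # (B, V)
        ocr_scores = torch.bmm(self.ocr_key(ocr_full),
                               fused.unsqueeze(-1)).squeeze(-1)            # (B, No)
        # Concatenated scores form this image's dynamic answer vocabulary.
        return torch.cat([fixed_scores, ocr_scores], dim=-1)


if __name__ == "__main__":
    B, Nv, No = 2, 36, 10
    model = MultimodalFusionVQA()
    logits = model(torch.randn(B, Nv, 2048), torch.randn(B, No, 300),
                   torch.rand(B, No, 4), torch.randn(B, 768))
    print(logits.shape)  # (2, 5010) = fixed vocabulary + 10 OCR tokens
```

In this sketch the output logits concatenate fixed-vocabulary scores with per-image OCR scores, so the decoding space grows and shrinks with the number of detected OCR tokens, which is the sense in which the abstract's dynamic vocabulary reduces the decoding space relative to a fixed softmax classifier.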