Positional Attention Guided Transformer-like Architecture for Visual Question Answering
Published in IEEE Transactions on Multimedia, 2022
Abstract: Transformer architectures have recently been introduced into visual question answering (VQA) due to their powerful capabilities for information extraction and fusion. However, existing Transformer-like models, including single-Transformer structures and large-scale pre-trained generic visual-linguistic models, do not fully exploit the positional information of words in questions and of objects in images, which this paper shows to be crucial for VQA tasks. To address this challenge, we propose a novel positional attention guided Transformer-like architecture, which adaptively extracts positional information within and across the visual and language modalities and uses it to guide high-level interactions in inter- and intra-modality information flows. In particular, we design three positional attention modules and assemble them into a single Transformer-like model, MCAN. We show that the positional information introduced in intra-modality interaction can adaptively modulate inter-modality interaction according to different inputs, which plays an important role in visual reasoning. Experimental results demonstrate that our model outperforms state-of-the-art models and is particularly effective at object counting questions. Overall, our model achieves accuracies of 70.10%, 71.27%, and 71.52% on COCO-QA, VQA v1.0 test-std, and VQA v2.0 test-std, respectively. The source code will be publicly available at https://github.com/waizei/PositionalMCAN.
Recommended citation: Aihua Mao, Zhi Yang, Ken Lin, Jun Xuan, Yong-Jin Liu*. Positional Attention Guided Transformer-like Architecture for Visual Question Answering. IEEE Transactions on Multimedia, 2022, doi: 10.1109/TMM.2022.3216770.
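The exact module designs are given in the paper and the linked repository; as a rough illustration of the general idea of guiding attention with positional information, the sketch below adds a learned bias, computed from pairwise positional features (e.g. relative word offsets or relative box geometry), to standard attention logits. Class and parameter names (`PositionalGuidedAttention`, `d_pos`, `rel_pos`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionalGuidedAttention(nn.Module):
    """Minimal single-head attention sketch: standard dot-product logits
    are modulated by a learned bias derived from pairwise positional
    features (hypothetical design, not the paper's exact module)."""

    def __init__(self, d_model: int, d_pos: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Maps pairwise positional features to a scalar attention bias.
        self.pos_bias = nn.Sequential(
            nn.Linear(d_pos, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )
        self.scale = d_model ** 0.5

    def forward(self, x_q, x_kv, rel_pos):
        # x_q: (B, Nq, d_model), x_kv: (B, Nk, d_model)
        # rel_pos: (B, Nq, Nk, d_pos) pairwise positional features
        q, k, v = self.q(x_q), self.k(x_kv), self.v(x_kv)
        logits = torch.matmul(q, k.transpose(-1, -2)) / self.scale  # (B, Nq, Nk)
        logits = logits + self.pos_bias(rel_pos).squeeze(-1)        # positional guidance
        attn = F.softmax(logits, dim=-1)
        return torch.matmul(attn, v)

# Toy usage: 14 question tokens attending to 36 image regions.
x_q = torch.randn(2, 14, 512)
x_kv = torch.randn(2, 36, 512)
rel_pos = torch.randn(2, 14, 36, 4)  # e.g. normalized relative geometry
out = PositionalGuidedAttention(512, 4)(x_q, x_kv, rel_pos)  # (2, 14, 512)
```

The same bias mechanism can serve either intra-modality attention (words over words, objects over objects) or inter-modality attention (words over objects), which is how positional cues can modulate the cross-modal interaction described in the abstract.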