ISCA Archive Interspeech 2023

Enhancing Visual Question Answering via Deconstructing Questions and Explicating Answers

Feilong Chen, Minglun Han, Jing Shi, Shuang Xu, Bo Xu

A compositional question is a question that involves multiple visual objects, together with their attributes and relationships, and that requires compositional reasoning to answer. Existing VQA models can answer compositional questions well, but few can expose the reasoning process or explain why a given answer was produced. In this paper, we propose a novel model, DEEX, that enhances visual question answering by DEconstructing questions and EXplicating answers when answering compositional questions. Specifically, DEEX addresses three sub-tasks: (1) Compositional Question Answering (CQA), (2) Question Deconstructing (QD), and (3) Answer Explicating (AE). We use prompt-based multi-task learning to train DEEX to answer questions and give explanations simultaneously. Experimental results on the GQA dataset demonstrate the effectiveness of our method, which enhances visual question answering by providing the corresponding reasoning processes and explanations.
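As a rough illustration of what prompt-based multi-task learning over the three sub-tasks might look like on the text side, the following minimal sketch turns one annotated sample into three text-to-text training examples, one per sub-task. The abstract does not give the actual prompts, data format, or model interface, so the prompt strings, the `Example` class, the `build_examples` function, and the sample question are all hypothetical, and image features are omitted entirely.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical task prompts: the abstract does not specify the real wording.
TASK_PROMPTS = {
    "CQA": "answer the question:",
    "QD": "deconstruct the question:",
    "AE": "explain the answer:",
}


@dataclass
class Example:
    task: str    # which sub-task this example trains (CQA, QD, or AE)
    source: str  # prompt-prefixed text input to the model
    target: str  # text the model is trained to generate


def build_examples(question: str, sub_questions: List[str],
                   answer: str, explanation: str) -> List[Example]:
    """Turn one annotated sample into one training example per sub-task."""
    return [
        # CQA: predict the answer to the compositional question.
        Example("CQA", f"{TASK_PROMPTS['CQA']} {question}", answer),
        # QD: predict the sequence of simpler sub-questions.
        Example("QD", f"{TASK_PROMPTS['QD']} {question}",
                " ; ".join(sub_questions)),
        # AE: predict an explanation given the question and its answer.
        Example("AE", f"{TASK_PROMPTS['AE']} {question} answer: {answer}",
                explanation),
    ]


# Illustrative sample only; it is not taken from GQA.
examples = build_examples(
    question="What color is the cup to the left of the laptop?",
    sub_questions=["Where is the laptop?",
                   "Which cup is to the left of it?",
                   "What color is that cup?"],
    answer="red",
    explanation="The cup to the left of the laptop is red.",
)
for ex in examples:
    print(ex.task, "|", ex.source, "->", ex.target)
```

Sharing one model across the three prompts is what makes the setup multi-task in this sketch: the prompt prefix alone tells the model which output to produce for the same question.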