Research Area:  Machine Learning
Because images vary widely in appearance and questions are posed in free-form language, the performance of Visual Question Answering (VQA) remains unsatisfactory. Existing approaches mainly infer answers from low-level features and sequential question words, neglecting the syntactic structure of the question sentence and its correlation with the spatial structure of the image. To address these problems, we propose a novel VQA model, the Attention-based Syntactic Structure Tree-LSTM (ASST-LSTM). Specifically, a tree-structured LSTM encodes the syntactic structure of the question sentence. A spatial-semantic attention model is proposed to learn the visual-textual correlation and the alignment between image regions and question words. Within the attention model, a Siamese network is employed to explore the alignment between visual and textual contents. The tree-structured LSTM and the spatial-semantic attention model are then integrated into a joint deep model, which is trained with multi-task learning for answer inference. Experiments on three widely used VQA benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art approaches.
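The abstract's idea of aligning image regions with the question can be illustrated with a minimal sketch. This is not the paper's spatial-semantic attention model: it substitutes plain dot-product attention with a softmax, and the names `softmax`, `attend`, `question_vec`, and `region_feats` are hypothetical, as are the toy feature values. It assumes the question and each image region have already been encoded as vectors of the same length.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(question_vec, region_feats):
    """Weight each image region by its dot-product similarity to the question.

    Assumption: a stand-in for the paper's attention model, which is not
    specified in the abstract. Returns the attention weights and the
    weighted sum of region features.
    """
    scores = [
        sum(q * r for q, r in zip(question_vec, feat))
        for feat in region_feats
    ]
    weights = softmax(scores)
    dim = len(question_vec)
    attended = [
        sum(w * feat[i] for w, feat in zip(weights, region_feats))
        for i in range(dim)
    ]
    return weights, attended


# Toy inputs (hypothetical): one question vector and three region vectors.
question_vec = [1.0, 0.0]
region_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
weights, attended = attend(question_vec, region_feats)
```

In this toy case the first region is most similar to the question, so it receives the largest weight and dominates the attended feature vector.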
Keywords:  
Visual question answering
Attention model
Siamese network
Attention-based Syntactic Structure Tree-LSTM
Deep Learning
Author(s) Name:  Yun Liu, Xiaoming Zhang, Feiran Huang, Xianghong Tang, Zhoujun Li
Journal name:  Applied Soft Computing
Conference name:  
Publisher name:  Elsevier
DOI:  10.1016/j.asoc.2019.105584
Volume Information:  Volume 82, September 2019, 105584
Paper Link:   https://www.sciencedirect.com/science/article/abs/pii/S1568494619303643