Research Area:  Machine Learning
Visual question answering (VQA) automatically answers natural language questions according to the content of an image or video. The task is challenging because it requires understanding the semantic information in both the textual and visual channels, as well as their interplay. A typical solver is composed of three components: feature extraction from each single modality, feature fusion between the visual and textual channels, and answer prediction based on the learnt joint representation. Among these, information fusion plays a key role in the overall accuracy, and various types of approaches have been proposed, such as simple vector operations, deep neural networks, bilinear pooling, attention mechanisms, and memory networks. The primary objective of this survey is to provide a clear organization and comprehensive review of the fusion techniques proposed in the domain of visual question answering. We propose an abstract fusion framework that fits the majority of existing VQA models, making it convenient for readers to quickly grasp their key contributions. Finally, we summarize the effective fusion strategies that have been widely adopted, so as to benefit readers in their own model design.
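The three-stage pipeline described in the abstract can be sketched in a few lines. The snippet below is a toy illustration only: the dimensions, the random projections standing in for CNN/RNN feature extractors, and the element-wise (Hadamard) product used as the fusion operator are all illustrative assumptions, not the specific models surveyed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(x, W):
    """Stage 1: project a raw modality vector into a shared d-dim space
    (stand-in for a learned CNN/RNN feature extractor)."""
    return np.tanh(W @ x)

def fuse(v, q):
    """Stage 2: element-wise (Hadamard) product, one of the simple vector
    operations mentioned in the abstract as a fusion approach."""
    return v * q

def predict(joint, W_ans):
    """Stage 3: score candidate answers from the joint representation
    via a linear layer followed by softmax."""
    logits = W_ans @ joint
    e = np.exp(logits - logits.max())
    return e / e.sum()

d, n_answers = 8, 5
image_feat = rng.standard_normal(16)     # stand-in for visual features
question_feat = rng.standard_normal(12)  # stand-in for question features

Wv = rng.standard_normal((d, 16))
Wq = rng.standard_normal((d, 12))
Wa = rng.standard_normal((n_answers, d))

joint = fuse(extract_features(image_feat, Wv),
             extract_features(question_feat, Wq))
probs = predict(joint, Wa)
print(probs.shape)
```

In real VQA systems the Hadamard product in `fuse` is the simplest choice; the survey covers richer alternatives such as bilinear pooling and attention-weighted fusion that learn the interaction between the two channels.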
Keywords:  
Information fusion
Visual question answering
Feature extraction
Machine Learning
Deep Learning
Author(s) Name:  Dongxiang Zhang, Rui Cao, Sai Wu
Journal name:  Information Fusion
Conference name:  
Publisher name:  Elsevier
DOI:  10.1016/j.inffus.2019.03.005
Volume Information:   Volume 52, December 2019, Pages 268-280
Paper Link:   https://www.sciencedirect.com/science/article/abs/pii/S1566253518308893