Research Area:  Machine Learning
Visual question answering (VQA) automatically answers natural language questions according to the content of an image or video. The task is challenging because it requires understanding the semantic information in both the textual and visual channels, as well as their interplay. A typical solver is composed of three components: feature extraction from each single modality, feature fusion between the visual and textual channels, and answer prediction based on the learnt joint representation. Among these, information fusion plays a key role in the overall accuracy, and various types of approaches have been proposed, such as simple vector operations, deep neural networks, bilinear pooling, attention mechanisms, and memory networks. The primary objective of this survey is to provide a clear organization and comprehensive review of the fusion techniques proposed in the domain of visual question answering. We propose an abstract fusion framework that fits the majority of existing VQA models, making it convenient for readers to quickly grasp their key contributions. Finally, we summarize the effective fusion strategies that have been widely adopted, so as to benefit readers in their own model design.
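The three-stage pipeline described in the abstract can be sketched in a few lines. The snippet below is a toy illustration only: the dimensions, the random projections standing in for CNN/RNN feature extractors, and the element-wise (Hadamard) product used as the fusion operator are all illustrative assumptions, not the specific models surveyed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(x, W):
    """Stage 1: project a raw modality vector into a shared d-dim space
    (stand-in for a learned CNN/RNN feature extractor)."""
    return np.tanh(W @ x)

def fuse(v, q):
    """Stage 2: element-wise (Hadamard) product, one of the simple vector
    operations mentioned in the abstract as a fusion approach."""
    return v * q

def predict(joint, W_ans):
    """Stage 3: score candidate answers from the joint representation
    via a linear layer followed by softmax."""
    logits = W_ans @ joint
    e = np.exp(logits - logits.max())
    return e / e.sum()

d, n_answers = 8, 5
image_feat = rng.standard_normal(16)     # stand-in for visual features
question_feat = rng.standard_normal(12)  # stand-in for question features

Wv = rng.standard_normal((d, 16))
Wq = rng.standard_normal((d, 12))
Wa = rng.standard_normal((n_answers, d))

joint = fuse(extract_features(image_feat, Wv),
             extract_features(question_feat, Wq))
probs = predict(joint, Wa)
print(probs.shape)
```

In real VQA systems the Hadamard product in `fuse` is the simplest choice; the survey covers richer alternatives such as bilinear pooling and attention-weighted fusion that learn the interaction between the two channels.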
Keywords:  
Information fusion
Visual question answering
Feature extraction
Machine Learning
Deep Learning
Author(s) Name:  Dongxiang Zhang, Rui Cao, Sai Wu
Journal name:  Information Fusion
Conference name:  
Publisher name:  Elsevier
DOI:  10.1016/j.inffus.2019.03.005
Volume Information:   Volume 52, December 2019, Pages 268-280
Paper Link:   https://www.sciencedirect.com/science/article/abs/pii/S1566253518308893