Research Area:  Machine Learning
Because information from different modalities complements each other when describing the same content, multimodal information can be used to obtain better feature representations. Thus, how to represent and fuse the relevant information has become an active research topic. Most existing feature fusion methods consider features at different levels of representation, but they ignore the significant relevance between local regions, especially in the high-level semantic representation. In this paper, a general multimodal fusion method based on the co-attention mechanism is proposed, with a structure similar to the transformer. We discuss two main issues: (1) improving the applicability and generality of the transformer to different modal data; (2) making the model more robust by capturing and transmitting the relevant information between local features before fusion. We evaluate our model on a multimodal classification task, and the experiments demonstrate that our model can learn fused feature representations effectively.
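The abstract does not detail the fusion mechanism itself, but the co-attention building block it refers to is typically scaled dot-product cross-attention, where local features of one modality attend over those of another before fusion. A minimal plain-Python sketch (toy vectors and function names are illustrative assumptions, not the authors' implementation) might look like:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    # each local feature of modality A (query) attends over
    # the local features of modality B (keys/values)
    d = len(keys[0])
    out = []
    for q in queries:
        scores = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        fused = [sum(w * v[i] for w, v in zip(scores, values))
                 for i in range(len(values[0]))]
        out.append(fused)
    return out

# toy example: 2 local regions of modality A attend over 3 regions of modality B
A_feats = [[1.0, 0.0], [0.0, 1.0]]
B_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_vals = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused = cross_attention(A_feats, B_keys, B_vals)
```

Co-attention applies this in both directions (A attends to B and B attends to A) so each modality's local features are enriched by the other's before the final fusion.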
Keywords:  
Multimodal feature fusion
Co-attention mechanism
Transformer
Deep neural network
Machine Learning
Author(s) Name:  Pei Li; Xinde Li
Journal name:  
Conference name:  2020 IEEE 23rd International Conference on Information Fusion (FUSION)
Publisher name:  IEEE
DOI:  10.23919/FUSION45008.2020.9190483
Volume Information:  
Paper Link:   https://ieeexplore.ieee.org/abstract/document/9190483