MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering - 2020

Research Area:  Machine Learning

Abstract:

We present MMFT-BERT (MultiModal Fusion Transformer with BERT encodings) to solve Visual Question Answering (VQA), ensuring individual and combined processing of multiple input modalities. Our approach processes multimodal data (video and text) by adopting BERT encodings for each modality individually and fusing them with a novel transformer-based fusion method. Our method decomposes the different sources of modalities into different BERT instances with similar architectures but variable weights. This achieves SOTA results on the TVQA dataset. Additionally, we provide TVQA-Visual, an isolated diagnostic subset of TVQA, which strictly requires knowledge of the visual (V) modality based on a human annotator's judgment. This set of questions helps us study the model's behavior and the challenges TVQA poses that prevent the achievement of superhuman performance. Extensive experiments show the effectiveness and superiority of our method.
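
The idea of encoding each modality separately and then fusing the results can be illustrated with a toy sketch. This is not the authors' implementation: each "encoder" below is a small random linear map standing in for a BERT instance, the fusion step is a single attention head rather than a full transformer, and all names (`make_encoder`, `attention_fuse`, the fusion `query` vector) are hypothetical. It only shows the shape of the computation: same architecture per modality, different weights, one attention-based fusion.

```python
import math
import random


def make_encoder(dim_in, dim_out, seed):
    """Stand-in for one BERT instance: a fixed random linear map plus tanh.

    Every encoder shares this architecture; the seed gives each its own weights.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim_in)]
               for _ in range(dim_out)]

    def encode(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
                for row in weights]

    return encode


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(query, vectors):
    """Single-head scaled dot-product attention of `query` over `vectors`."""
    scale = math.sqrt(len(query))
    scores = [sum(q * v for q, v in zip(query, vec)) / scale
              for vec in vectors]
    attn = softmax(scores)
    fused = [sum(a * vec[i] for a, vec in zip(attn, vectors))
             for i in range(len(query))]
    return fused, attn


DIM_IN, DIM_OUT = 8, 4
MODALITIES = ["video", "text"]  # the two modalities named in the abstract

# One encoder per modality: identical structure, independent weights.
encoders = {name: make_encoder(DIM_IN, DIM_OUT, seed)
            for seed, name in enumerate(MODALITIES)}

# Placeholder input features (random, for illustration only).
rng = random.Random(0)
inputs = {name: [rng.uniform(-1, 1) for _ in range(DIM_IN)]
          for name in MODALITIES}

encoded = [encoders[name](inputs[name]) for name in MODALITIES]

# Hypothetical fusion query; a real model would learn this vector.
query = [0.1] * DIM_OUT
fused, attn = attention_fuse(query, encoded)

print("attention over modalities:", dict(zip(MODALITIES, attn)))
print("fused vector:", fused)
```

The fused vector is a convex combination of the per-modality encodings, so the attention weights show how much each modality contributes to the joint representation.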

Keywords:  
MultiModal Fusion Transformer
BERT
Visual Question Answering
TVQA Dataset
SOTA

Author(s) Name:  Aisha Urooj Khan, Amir Mazaheri, Niels da Vitoria Lobo, Mubarak Shah

Journal name:  Computer Vision and Pattern Recognition

Conference name:

Publisher name:  arXiv:2010.14095

DOI:  10.48550/arXiv.2010.14095

Volume Information: