Research Area:  Machine Learning
This study introduces an innovative Visual Question Answering (VQA) framework, TPL (Teach Prompt Learning), which combines advanced visual encoders and language models through prompt learning to strengthen the integration of visual content understanding and linguistic semantic reasoning. The TPL framework expands the model's semantic space and deepens its understanding of visual concepts by learning continuous vectors from the data to serve as context words. Notably, TPL achieves significant performance gains on the GQA task, which demands precise visual reasoning, demonstrating its advantages in deep visual understanding and reasoning. Experiments on the widely used GQA and VQAv2 datasets show that TPL surpasses existing top-performing methods and highlight the contribution of each module to accuracy improvement. The study further explores the potential of prompt learning to improve the data efficiency and domain generalization of pre-trained vision-language models. Despite open challenges in interpretability and sensitivity to noisy labels, the simplicity of the TPL framework makes it easy to extend in future research. Overall, our work offers a new solution to the adaptability issues of vision-language models and paves the way for future research in this promising field.
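To make "learning continuous vectors as context words" concrete, here is a minimal sketch of CoOp-style continuous prompt learning, the general mechanism the abstract describes. The class name, tensor shapes, and hyperparameters below are illustrative assumptions, not the authors' TPL implementation, which is not reproduced in this listing.

```python
# Sketch only: CoOp-style learnable context vectors prepended to class-name
# token embeddings. All names and shapes are hypothetical, not TPL's code.
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Prepends n_ctx learnable context vectors to class-name token embeddings."""
    def __init__(self, n_ctx: int, embed_dim: int):
        super().__init__()
        # Continuous "context words": optimized end-to-end from data
        # instead of being hand-written as natural-language text.
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

    def forward(self, class_embeds: torch.Tensor) -> torch.Tensor:
        # class_embeds: (num_classes, n_tok, embed_dim) token embeddings
        # of the candidate answer/class names.
        n_cls = class_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)  # share context across classes
        return torch.cat([ctx, class_embeds], dim=1)  # (num_classes, n_ctx + n_tok, dim)

# Hypothetical usage: 10 candidate answers, 4 tokens each, 512-dim embeddings.
prompt = LearnablePrompt(n_ctx=16, embed_dim=512)
dummy_class_embeds = torch.randn(10, 4, 512)
prompted = prompt(dummy_class_embeds)
print(prompted.shape)  # torch.Size([10, 20, 512])
```

In this style of setup the concatenated sequence is fed through a frozen text encoder and only the context vectors are trained with the task loss, which is what makes prompt learning attractive for data efficiency: the pre-trained vision-language backbone stays fixed.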
Keywords:  
Author(s) Name:  Shuaiyu Zhu, Shuo Peng, Shengbo Chen
Journal name:  
Conference name:  ASENS 24: Proceedings of the International Conference on Algorithms, Software Engineering, and Network Security
Publisher name:  ACM
DOI:  10.1145/3677182.3677310
Volume Information:  Volume 6, Pages 713-717, (2024)