Research Area:  Machine Learning
News Image Captioning aims to generate descriptions for images embedded in news articles, which involve plentiful real-world concepts, especially named entities. However, existing methods rely on entity-level templates: not only is crafting the templates labor-intensive, but they are also error-prone due to local entity awareness, which constrains the prediction only at each individual decoding step of the language model and thereby corrupts relationships among entities. To overcome this problem, we investigate a concise and flexible paradigm that achieves global entity awareness by introducing a prompting mechanism while fine-tuning pre-trained models, named Fine-tuning with Multi-modal Entity Prompts for News Image Captioning (NewsMEP). First, we incorporate two pre-trained models: (i) CLIP, which translates the image with open-domain knowledge, and (ii) BART, extended to encode the article and the image simultaneously; the BART architecture also allows training in an end-to-end fashion. Second, we prepend the target caption with two prompts to exploit entity-level lexical cohesion and the inherent coherence of the pre-trained language model: the visual prompts are obtained by mapping CLIP embeddings, and the entity-oriented prompts are constructed automatically from contextual vectors. Third, we provide an entity chain that controls caption generation to focus on entities of interest. Experimental results on two large-scale publicly available datasets, including detailed ablation studies, show that NewsMEP not only outperforms state-of-the-art methods on general captioning metrics but also achieves significant gains in the precision and recall of various named entities.
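To make the prompting mechanism concrete, the sketch below illustrates the general idea described in the abstract: a CLIP image embedding is mapped into a short sequence of "visual prompt" vectors that are prepended, together with entity-oriented prompt vectors, to the target-caption embeddings fed to BART's decoder. This is a minimal illustration, not the authors' released implementation; the module name VisualPromptMapper, the prompt length, and the embedding sizes (512 for CLIP, 768 for BART-base) are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class VisualPromptMapper(nn.Module):
    """Map a single CLIP image embedding to `prompt_len` BART-sized vectors.

    Hypothetical mapping network for illustration only; the paper's actual
    architecture may differ.
    """
    def __init__(self, clip_dim=512, bart_dim=768, prompt_len=4):
        super().__init__()
        self.prompt_len = prompt_len
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, bart_dim * prompt_len),
            nn.Tanh(),
            nn.Linear(bart_dim * prompt_len, bart_dim * prompt_len),
        )

    def forward(self, clip_embed):              # (batch, clip_dim)
        out = self.mlp(clip_embed)              # (batch, bart_dim * prompt_len)
        return out.view(-1, self.prompt_len, out.size(-1) // self.prompt_len)

# Usage: prepend visual and entity prompts to the caption embeddings.
batch, bart_dim = 2, 768
mapper = VisualPromptMapper()
clip_embed = torch.randn(batch, 512)              # stand-in for CLIP image features
visual_prompts = mapper(clip_embed)               # (batch, 4, 768)
entity_prompts = torch.randn(batch, 4, bart_dim)  # stand-in for entity-oriented
                                                  # prompts built from contextual vectors
caption_embeds = torch.randn(batch, 20, bart_dim) # embedded target caption tokens
decoder_inputs = torch.cat(
    [visual_prompts, entity_prompts, caption_embeds], dim=1
)  # would be passed to BART as decoder input embeddings
```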
Keywords:  
News Image Captioning
Flexible Paradigm
Entity-oriented Prompt
NewsMEP
Author(s) Name:  Jingjing Zhang, Shancheng Fang, Zhendong Mao, Zhiwei Zhang, Yongdong Zhang
Journal name:  
Conference name:  MM '22: Proceedings of the 30th ACM International Conference on Multimedia
Publisher name:  ACM
DOI:  10.1145/3503161.3547883
Volume Information:  -
Paper Link:  https://dl.acm.org/doi/abs/10.1145/3503161.3547883