Research Area:  Machine Learning
Image captioning is one of the most challenging tasks in AI because it requires an understanding of both complex visuals and natural language. Because image captioning is essentially a sequential prediction task, recent advances have used reinforcement learning (RL) to better explore the dynamics of word-by-word generation. However, existing RL-based image captioning methods rely primarily on a single policy network and a single reward function, an approach that is not well matched to the multi-level (word and sentence) and multi-modal (vision and language) nature of the task. To solve this problem, we propose a novel multi-level policy and reward RL framework for image captioning that can be easily integrated with RNN-based captioning models, language metrics, or visual-semantic functions for optimization. Specifically, the proposed framework includes two modules: 1) a multi-level policy network that jointly updates the word- and sentence-level policies for word generation; and 2) a multi-level reward function that collaboratively leverages both a vision-language reward and a language-language reward to guide the policy. Furthermore, we propose a guidance term to bridge the policy and the reward for RL optimization. Extensive experiments and analyses on the MSCOCO and Flickr30k datasets show that the proposed framework achieves competitive performance on a variety of evaluation metrics. In addition, we conduct ablation studies on multiple variants of the proposed framework and explore several representative image captioning models and metrics for the word-level policy network and the language-language reward function to evaluate the generalization ability of the proposed framework.
Keywords:  
Author(s) Name:  Ning Xu; Hanwang Zhang; An-An Liu; Weizhi Nie; Yuting Su; Jie Nie; Yongdong Zhang
Journal name:  IEEE Transactions on Multimedia
Conference name:  
Publisher name:  IEEE
DOI:  10.1109/TMM.2019.2941820
Volume Information:  Volume: 22, Issue: 5, May 2020, Page(s): 1372 - 1383
Paper Link:   https://ieeexplore.ieee.org/abstract/document/8844130