Research Area:  Machine Learning
Describing images through natural language is a challenging task in the field of computer vision. Image captioning, the generation of such descriptions, can be accomplished with deep learning architectures that combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, traditional RNNs suffer from exploding and vanishing gradients and often produce non-descriptive sentences. To address these issues, we propose an encoder–decoder model that uses CNNs to extract image features and multimodal gated recurrent units (GRUs) to generate the descriptions. The model incorporates part-of-speech (PoS) tags and a likelihood function for weight generation in the GRU, and it performs knowledge transfer during a validation phase using the k-nearest neighbors (kNN) technique. Experimental results on the Flickr30k and MSCOCO datasets demonstrate that the proposed PoS-based model achieves scores competitive with state-of-the-art models. The system produces more descriptive captions, and both the predicted and kNN-selected captions closely approximate the expected ground-truth captions.
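Below is a minimal sketch, in PyTorch, of the generic encoder–decoder idea the abstract describes: a pretrained CNN encodes the image into a feature vector, which seeds the hidden state of a GRU decoder that emits caption tokens. All names and hyperparameters (CNNEncoder, GRUDecoder, feature_dim, and so on) are illustrative assumptions, and the paper's PoS/likelihood weight generation and kNN-based knowledge transfer are intentionally omitted; this is not the authors' exact architecture.

import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    # Encode an image into a fixed-size feature vector with a pretrained ResNet-50.
    def __init__(self, feature_dim=512):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop classifier head
        self.fc = nn.Linear(resnet.fc.in_features, feature_dim)

    def forward(self, images):                    # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)  # (B, 2048)
        return self.fc(feats)                     # (B, feature_dim)

class GRUDecoder(nn.Module):
    # Generate caption logits token by token; image features seed the hidden state.
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feature_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feature_dim, hidden_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):        # captions: (B, T) token ids
        h0 = torch.tanh(self.init_h(features)).unsqueeze(0)  # (1, B, hidden_dim)
        x = self.embed(captions)                  # (B, T, embed_dim)
        out, _ = self.gru(x, h0)                  # (B, T, hidden_dim)
        return self.out(out)                      # (B, T, vocab_size) logits

# Usage: the logits would be trained against the ground-truth caption
# shifted by one token (teacher forcing), e.g. with cross-entropy loss.
encoder, decoder = CNNEncoder(), GRUDecoder(vocab_size=10000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 12))
logits = decoder(encoder(images), captions)       # (2, 12, 10000)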
Keywords:  
natural language
image captioning
convolutional neural network
recurrent neural networks
captions
Author(s) Name:  Tiago do Carmo Nogueira, Cassio Dener Noronha Vinhal, Gelson da Cruz Junior, Matheus Rudolfo Diedrich Ullmann
Journal name:  Multimedia Tools and Applications
Conference name:  
Publisher name:  Springer
DOI:  10.1007/s11042-020-09539-5
Volume Information:  Volume 79