Image captioning is an inherently challenging task for artificial intelligence because it combines difficulties from both Computer Vision and Natural Language Processing. Owing to the remarkable success of deep learning across diverse fields, it has also been applied effectively to automatic image captioning.
Application fields of image captioning include human-computer interaction, medical image captioning and automatic medical prescription, quality control in industry, traffic data analysis, assistive technologies for the visually impaired, intelligent control systems, IoT devices, biomedicine, commerce, the military, education, digital libraries, web searching, and social media.
Object hallucination, the exploding and vanishing gradient problems, the loss-evaluation mismatch problem, and the exposure bias problem are common issues in deep learning models used for image captioning. Region-based CNNs (R-CNNs), Recurrent Neural Networks (RNNs), Gated Recurrent Units (GRUs), Long Short-Term Memory networks (LSTMs), and Residual Neural Networks (ResNets) are among the most popular deep learning models employed for image captioning.
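For example, exploding gradients in recurrent caption decoders are commonly mitigated by gradient clipping during training. The sketch below shows the clipping step in PyTorch on a hypothetical tiny GRU decoder with dummy data; all sizes, names, and hyperparameters are illustrative assumptions rather than settings from any specific captioning system.

```python
import torch
import torch.nn as nn

# Hypothetical tiny recurrent caption decoder and a dummy batch, only to show the clipping step.
vocab_size = 1000
embed = nn.Embedding(vocab_size, 128)
gru = nn.GRU(128, 256, batch_first=True)
head = nn.Linear(256, vocab_size)
params = list(embed.parameters()) + list(gru.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

tokens = torch.randint(0, vocab_size, (4, 12))        # fake caption token ids
hidden, _ = gru(embed(tokens[:, :-1]))                # predict the next token at each step
loss = nn.CrossEntropyLoss()(head(hidden).reshape(-1, vocab_size),
                             tokens[:, 1:].reshape(-1))
optimizer.zero_grad()
loss.backward()
# Rescale gradients whose global norm exceeds 5.0 so a single batch cannot blow up the update.
torch.nn.utils.clip_grad_norm_(params, max_norm=5.0)
optimizer.step()
```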
Prominent deep learning-based image captioning techniques include visual space- and multimodal space-based image captioning, attention-based image captioning, dense captioning, semantic concept-based image captioning, novel object-based image captioning, stylized captioning, encoder-decoder architecture-based image captioning, and compositional architecture-based image captioning.
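As a concrete illustration of the widely used encoder-decoder formulation, the following is a minimal PyTorch sketch in which a small CNN stands in for a ResNet-style image encoder and an LSTM decodes the caption under teacher forcing; the class names, dimensions, and dummy data are assumptions made for illustration only, not a specific published model.

```python
import torch
import torch.nn as nn

class EncoderCNN(nn.Module):
    """Maps an image to a single feature vector (a stand-in for a ResNet-style backbone)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, images):                     # (batch, 3, H, W)
        x = self.conv(images).flatten(1)           # (batch, 64)
        return self.fc(x)                          # (batch, feat_dim)

class DecoderLSTM(nn.Module):
    """Generates caption logits conditioned on the image feature via the initial hidden state."""
    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, tokens):
        h0 = features.unsqueeze(0)                 # (1, batch, hidden_dim)
        c0 = torch.zeros_like(h0)
        x = self.embed(tokens)                     # (batch, seq, embed_dim)
        h, _ = self.lstm(x, (h0, c0))
        return self.out(h)                         # (batch, seq, vocab_size)

# Forward pass with dummy data (teacher forcing: ground-truth tokens are fed to the decoder).
encoder, decoder = EncoderCNN(), DecoderLSTM()
images = torch.randn(2, 3, 64, 64)
captions = torch.randint(0, 1000, (2, 10))
logits = decoder(encoder(images), captions[:, :-1])    # predict each next caption token
```

In practice the encoder is usually a pretrained backbone such as a ResNet, and at inference time teacher forcing is replaced by greedy or beam search decoding; attention-based variants additionally let the decoder attend over spatial or region-level features at each step.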
Numerous literature surveys and reviews have been conducted on deep learning-enabled image captioning, presenting comprehensive overviews of image captioning approaches, state-of-the-art methods, deep learning-based image captioning techniques, technical innovations, training strategies, performance, strengths and limitations, datasets, evaluation metrics, advantages and disadvantages of different approaches, open problems, and unsolved challenges.