In the era of artificial intelligence, connecting vision and language plays a significant role. Accordingly, a wide range of research efforts has been devoted to image captioning, which describes the visual content of an image in natural language. Image captioning combines a visual understanding system with a language model to generate meaningful and linguistically well-formed sentences, and lies at the intersection of computer vision and natural language processing.
Automatic image captioning is a challenging problem with numerous applications across a variety of research areas. Substantial advances in image captioning techniques have come from the innovation and expansion of deep learning, which has the capability to handle the complexities and challenges of image captioning effectively.
Applications of automatic image captioning include human-computer interaction, medical image captioning and automatic medical prescription, quality control in industry, traffic data analysis, assistive technologies for the visually impaired, biomedicine, commerce, the military, education, digital libraries, web search, and social media. Visual encoding approaches in image captioning include Non-Attentive, Additive Attention, Graph-based Attention, and Self-Attention methods.
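To make the Self-Attention encoding approach concrete, the following is a minimal NumPy sketch of scaled dot-product self-attention over a small set of image region features. All sizes, weights, and variable names here are illustrative assumptions, not part of any specific captioning model.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over region features X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (n, n) attention map
    return weights @ V, weights                          # each region re-encoded
                                                         # as a mixture of all regions

rng = np.random.default_rng(0)
n, d = 5, 8                      # toy sizes: 5 image regions, 8-dim features
X = rng.normal(size=(n, d))      # stand-in for CNN region features
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)
```

Each row of `weights` sums to one, so every region's new encoding is a convex combination of all region values, which is what lets the encoder relate objects across the image.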
Popular language models in image captioning are LSTM-based, CNN-based, Transformer-based, and Image-Text Early Fusion models. Cross-Entropy Loss, Masked Language Modeling, Reinforcement Learning, and VL Pre-Training are common training strategies. Image captioning is constrained by a set of problems such as the Object Hallucination, Exploding Gradient, Vanishing Gradient, Loss-Evaluation Mismatch, and Exposure Bias problems.
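The cross-entropy training strategy mentioned above can be sketched as a token-level loss over a caption: the model's logits for each word position are compared against the ground-truth word ids. This is a minimal NumPy illustration with assumed toy sizes, not any specific model's training code.

```python
import numpy as np

def caption_cross_entropy(logits, targets):
    """Mean token-level cross-entropy.

    logits:  (T, V) unnormalized scores over a vocabulary of V words
    targets: (T,) ground-truth word ids for each of T caption positions
    """
    logits = logits - logits.max(axis=1, keepdims=True)              # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
T, V = 4, 10                              # toy: 4-token caption, 10-word vocab
logits = rng.normal(size=(T, V))          # stand-in for decoder outputs
targets = np.array([1, 3, 3, 7])          # stand-in for ground-truth word ids
loss = caption_cross_entropy(logits, targets)
print(float(loss))
```

Minimizing this loss pushes probability mass onto the ground-truth next word at each step; because training always conditions on ground-truth prefixes while inference conditions on the model's own outputs, this objective is also the source of the Exposure Bias problem noted above.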
Some of the deep learning models frequently used in image captioning methods are Region-based CNNs (R-CNN), Recurrent Neural Networks (RNNs), Gated Recurrent Units (GRUs), Long Short-Term Memory networks (LSTMs), and Residual Neural Networks (ResNets). Trending deep learning-based image captioning techniques are highlighted below:
• Visual space and multimodal space-based image captioning: Deep learning can generate captions from both visual space and multimodal space. Recently, RNN models have been applied to multimodal space-based image captioning.
• Attention-based Image Captioning: Attention-based methods have shown better performance in image captioning when using deep learning models. CNN- and encoder-decoder-based approaches are the most widely used.
• Dense Captioning: Dense captioning generates captions for different regions of the image so that the various objects are each described; recently, Fast R-CNN has been employed as a deep learning model for dense captioning.
• Semantic Concept-Based Image Captioning: Semantic concept-based methods specifically attend to a set of semantic concepts extracted from the image. CNN, RNN, LSTM, CNN-LSTM, Semantic Compositional Network (SCN), Skel-LSTM, and Attr-LSTM are recent deep learning models used for semantic concept-based image captioning.
• Novel Object-based Image Captioning: Novel object-based methods can generate descriptions of objects unseen during training within the image context. Deep Compositional Captioner and Novel Object Captioner are recent deep learning-based novel object captioning methods.
• Stylized Caption: Deep learning-based stylized captioning methods focus on separating the stylized part of the text from other linguistic patterns. Currently, CNN, LSTM, and combined CNN-RNN models are exploited for stylized caption generation. Other deep learning-based architectures that caption the whole scene include the Encoder-Decoder architecture and the Compositional architecture.
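Several of the techniques above share the encoder-decoder skeleton: an image encoder produces a feature vector, and a recurrent decoder emits one word at a time. The following is a toy NumPy sketch of that loop with greedy decoding; the vocabulary, dimensions, and randomly initialized (untrained) weights are all assumptions for illustration, so the "caption" it prints is not meaningful.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["<bos>", "<eos>", "a", "dog", "runs"]   # assumed toy vocabulary
V, d_img, d_h = len(vocab), 6, 8

# "encoder" output: a fixed image feature; in practice this would come
# from a CNN such as ResNet rather than random numbers
img = rng.normal(size=d_img)

# decoder parameters (random, untrained: a sketch, not a real model)
W_ih = rng.normal(size=(d_img, d_h)) * 0.1   # image -> initial hidden state
W_hh = rng.normal(size=(d_h, d_h)) * 0.1     # hidden -> hidden recurrence
W_eh = rng.normal(size=(V, d_h)) * 0.1       # token embedding into hidden space
W_ho = rng.normal(size=(d_h, V)) * 0.1       # hidden -> vocabulary logits

def greedy_caption(max_len=5):
    h = np.tanh(img @ W_ih)                  # condition decoder on the image
    tok, words = 0, []                       # start from <bos>
    for _ in range(max_len):
        h = np.tanh(h @ W_hh + W_eh[tok])    # simple RNN step
        tok = int(np.argmax(h @ W_ho))       # greedy: pick most likely word
        if vocab[tok] == "<eos>":
            break
        words.append(vocab[tok])
    return words

print(greedy_caption())
```

Because each step feeds the previously generated word back in, errors can compound at inference time, which is exactly the Exposure Bias problem listed earlier; LSTM or GRU cells replace the plain `tanh` recurrence here to address vanishing gradients.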