Research Area:  Machine Learning
With the recent development of deep learning, AI models are widely used across various domains. AI models show good performance on well-defined tasks such as image classification and text generation. With the recent development of generative models (e.g., BigGAN, GPT-3), AI models also show impressive results on diverse generation tasks (e.g., photo-realistic image synthesis, paragraph generation). As the performance of individual AI models improves, interest is also growing in comprehensive tasks such as visual language navigation (VLN), in which an agent follows language instructions using an egocentric view. However, integrating models for VLN is challenging because of model complexity, modal heterogeneity, and the shortage of paired data. This study provides a comprehensive survey of VLN with a systematic approach to reviewing recent trends. First, we define a taxonomy of the fundamental techniques needed to perform VLN. We analyze VLN from four perspectives: representation learning, reinforcement learning, components, and evaluation. We investigate the pros and cons of each component and the methodologies proposed in recent work. Unlike other conventional surveys, this survey categorizes the approaches of major research institutes according to the taxonomy defined from these four perspectives. Finally, we discuss current open challenges and conclude our study by suggesting possible future directions.
Keywords:  
Artificial intelligence
Visual language navigation
Representation learning
Reinforcement learning
Machine Learning
Deep Learning
Author(s) Name:  Sang-Min Park & Young-Gab Kim
Journal name:  Artificial Intelligence Review
Conference name:  
Publisher name:  Springer
DOI:  10.1007/s10462-022-10174-9
Volume Information:  volume 56, pages 365–427
Paper Link:   https://link.springer.com/article/10.1007/s10462-022-10174-9