Research Area:  Machine Learning
In the field of satellite imagery, remote sensing image captioning (RSIC) is an active research topic that faces the challenges of overfitting and of aligning images with text. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC that jointly represents vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object Detection In Optical Remote sensing images (DIOR) dataset with manually annotated Chinese and English captions. Second, a Vision-Language aligning model with Cross-modal Attention (VLCA) is presented to generate accurate and rich bilingual descriptions of remote sensing images. Third, a cross-modal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a pre-trained Chinese language model. Experiments are carried out against various baselines to validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm produces more descriptive and informative captions than existing algorithms.
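For readers unfamiliar with the mechanism named in the abstract, the sketch below illustrates cross-modal attention in which caption token embeddings attend to image region features. It is a minimal illustration only: the class name, the 512-dimensional embedding, the head count, and the use of PyTorch's nn.MultiheadAttention are assumptions for exposition, not the published VLCA architecture.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-modal attention block: language tokens query
    visual region features. Hyperparameters are assumed, not taken
    from the paper."""

    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, num_tokens,  embed_dim) -- language queries
        # image_feats: (batch, num_regions, embed_dim) -- visual keys/values
        attended, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        # Residual connection preserves the original language signal.
        return self.norm(text_feats + attended)

# Toy usage: 4 images with 49 region features each, captions of 20 tokens.
vision = torch.randn(4, 49, 512)
text = torch.randn(4, 20, 512)
aligned = CrossModalAttention()(text, vision)
print(aligned.shape)  # torch.Size([4, 20, 512])
```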
Keywords:  
Author(s) Name:  Tingting Wei; Weilin Yuan; Junren Luo; Wanpeng Zhang; Lina Lu
Journal name:  Journal of Systems Engineering and Electronics
Conference name:  
Publisher name:  IEEE
DOI:  10.23919/JSEE.2023.000035
Volume Information:  Volume 34, Pages 9-18 (2023)
Paper Link:   https://ieeexplore.ieee.org/document/10066217