Research Area:  Machine Learning
In the field of satellite imagery, remote sensing image captioning (RSIC) is an active research topic that faces the challenges of overfitting and of aligning images with text. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC that jointly represents vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object Detection In Optical Remote sensing images (DIOR) dataset with manually annotated Chinese and English captions. Second, a Vision-Language aligning model with Cross-modal Attention (VLCA) is presented to generate accurate and rich bilingual descriptions of remote sensing images. Third, a cross-modal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a pre-trained Chinese language model. Experiments are carried out against various baselines to validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm produces more descriptive and informative captions than existing algorithms.
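For readers unfamiliar with the mechanism named in the abstract, the sketch below illustrates cross-modal attention in which caption token embeddings attend to image region features. It is a minimal illustration only: the class name, the 512-dimensional embedding, the head count, and the use of PyTorch's nn.MultiheadAttention are assumptions for exposition, not the published VLCA architecture.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-modal attention block: language tokens query
    visual region features. Hyperparameters are assumed, not taken
    from the paper."""

    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, num_tokens,  embed_dim) -- language queries
        # image_feats: (batch, num_regions, embed_dim) -- visual keys/values
        attended, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        # Residual connection preserves the original language signal.
        return self.norm(text_feats + attended)

# Toy usage: 4 images with 49 region features each, captions of 20 tokens.
vision = torch.randn(4, 49, 512)
text = torch.randn(4, 20, 512)
aligned = CrossModalAttention()(text, vision)
print(aligned.shape)  # torch.Size([4, 20, 512])
```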
Keywords:  
Author(s) Name:  Tingting Wei; Weilin Yuan; Junren Luo; Wanpeng Zhang; Lina Lu
Journal name:  Journal of Systems Engineering and Electronics
Conference name:  
Publisher name:  IEEE
DOI:  10.23919/JSEE.2023.000035
Volume Information:  Volume 34, Pages 9-18 (2023)
Paper Link:   https://ieeexplore.ieee.org/document/10066217