Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning - 2020


Research Area:  Machine Learning

Abstract:

OCR-based image captioning is the task of automatically describing images based on reading and understanding written text contained in images. Compared to conventional image captioning, this task is more challenging, especially when the image contains multiple text tokens and visual objects. The difficulties lie in making full use of the knowledge contained in the textual entities to facilitate sentence generation, and in predicting a text token from the limited information provided by the image. Such problems have not yet been fully investigated in existing research. In this paper, we present a novel design, the Multimodal Attention Captioner with OCR Spatial Relationship (MMA-SR) architecture, which manages information from different modalities with a multimodal attention network and explores spatial relationships between text tokens for OCR-based image captioning. Specifically, the representations of text tokens and objects are fed into a three-layer LSTM captioner. Different attention scores for text tokens and objects are exploited through the multimodal attention network. Based on the attended features and the LSTM states, words are selected from the common vocabulary or from the image text by incorporating the learned spatial relationships between text tokens. Extensive experiments conducted on the TextCaps dataset verify the effectiveness of the proposed MMA-SR method. Most notably, MMA-SR increases the CIDEr-D score from 93.7% to 98.0%.
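
The two ideas the abstract names, attending over features and choosing each word from either the common vocabulary or the text read from the image, can be illustrated with a small sketch. This is not the authors' MMA-SR implementation: the plain dot-product attention, the joint softmax over vocabulary and OCR scores, and every function and variable name below are assumptions made for illustration, and the learned spatial relationships between text tokens are not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Dot-product attention (an assumed stand-in for the paper's attention).

    Returns the attention weights over `features` and their weighted sum.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(dim)
    ]
    return weights, context


def select_word(vocab_scores, ocr_scores, vocab, ocr_tokens):
    """Pick the next word from one distribution over vocabulary + OCR tokens."""
    probs = softmax(list(vocab_scores) + list(ocr_scores))
    candidates = list(vocab) + list(ocr_tokens)
    best = max(range(len(probs)), key=probs.__getitem__)
    source = "vocab" if best < len(vocab) else "ocr"
    return candidates[best], source, probs[best]


if __name__ == "__main__":
    # Hypothetical decoder state and feature vectors, chosen by hand.
    state = [1.0, 0.0]
    object_feats = [[1.0, 0.0], [0.0, 1.0]]
    weights, context = attend(state, object_feats)
    print("attention weights:", weights)

    # Hypothetical scores: the OCR token "STOP" scores highest, so it is copied.
    word, source, prob = select_word(
        [0.1, 0.2], [2.0, 0.5], ["a", "sign"], ["STOP", "AHEAD"]
    )
    print(word, source, round(prob, 3))
```

Scoring vocabulary words and OCR tokens in a single softmax lets the decoder copy a word that appears only in the image, which is the behavior the abstract describes as selecting words "from the common vocabulary or from the image text".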

Keywords:  
image captioning
visual object
information
multimodal attention captioner

Author(s) Name:  Jing Wang, Jinhui Tang, Jiebo Luo

Journal name:  

Conference name:  MM '20: Proceedings of the 28th ACM International Conference on Multimedia

Publisher name:  ACM

DOI:  10.1145/3394171.3413753

Volume Information:  -