Research Area:  Machine Learning
In recent years, multimodal machine translation has become a popular research topic. In this paper, a machine translation model based on the self-attention mechanism is extended for multimodal machine translation. In the model, an image-text attention layer is added at the end of the encoder to capture the relevant semantic information between the image and the text words. With this attention layer, the model can assign different weights to words that are relevant to the image or appear in it, and obtain a better text representation that fuses these weights, so that it can be better used by the decoder. Experiments are carried out on the original English-German sentence pairs of the multimodal machine translation dataset Multi30k, and on Indonesian-Chinese sentence pairs that were manually annotated. The results show that our model performs better than the text-only Transformer-based machine translation model and is comparable to most existing work, which proves the effectiveness of our model.
Keywords:  
Multimodal Machine Translation
Image-text attention
Transformer-based
Self-attention
Machine Learning
Deep Learning
Author(s) Name:   Junteng Ma; Shihao Qin; Lan Su; Xia Li; Lixian Xiao
Journal name:  
Conference name:  2019 International Conference on Asian Language Processing (IALP)
Publisher name:  IEEE
DOI:  10.1109/IALP48816.2019.9037732
Volume Information:  
Paper Link:   https://ieeexplore.ieee.org/abstract/document/9037732
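
The image-text attention described in the abstract can be illustrated with a minimal sketch: text hidden states act as queries over image region features via scaled dot-product attention, and the attended image context is fused back into the text representation. This is a hypothetical, simplified illustration in NumPy, not the authors' implementation; the function name, dimensions, and additive fusion are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def image_text_attention(text_states, image_feats):
    """Sketch of an image-text attention layer (assumed form):
    text hidden states (T, d) attend over image region features (R, d)
    with scaled dot-product attention, and the attended image context
    is added to the text states to give a fused representation."""
    d_k = text_states.shape[-1]
    scores = text_states @ image_feats.T / np.sqrt(d_k)  # (T, R) relevance scores
    weights = softmax(scores, axis=-1)                   # attention over image regions
    context = weights @ image_feats                      # (T, d) image context per word
    return text_states + context                         # fused text representation

# toy example: 4 text tokens, 3 image regions, model dimension 8
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
img = rng.normal(size=(3, 8))
fused = image_text_attention(text, img)
print(fused.shape)  # (4, 8)
```

In the paper's model this fused representation would replace the plain encoder output fed to the decoder; here the residual addition is just one plausible fusion choice.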