On Vision Features in Multimodal Machine Translation

On Vision Features in Multimodal Machine Translation - 2022

Research Area: Machine Learning

Abstract:

Previous work on multimodal machine translation (MMT) has focused on how to incorporate vision features into translation, but little attention has been paid to the quality of vision models. In this work, we investigate the impact of vision models on MMT. Given that the Transformer is becoming popular in computer vision, we experiment with various strong models (such as Vision Transformer) and enhanced features (such as object detection and image captioning). We develop a selective attention model to study the patch-level contribution of an image in MMT. On detailed probing tasks, we find that stronger vision models are helpful for learning translation from the visual modality. Our results also suggest the need to carefully examine MMT models, especially when current benchmarks are small-scale and biased.

Keywords:
Vision Features
Multimodal Machine Translation
Image captioning
Machine Learning

Author(s) Name: Bei Li, Chuanhao Lv, Zefan Zhou, Tao Zhou, Tong Xiao, Anxiang Ma, Jingbo Zhu

Journal name: Computation and Language

Conference name:

Publisher name: arXiv:2203.09173

DOI: 10.48550/arXiv.2203.09173

Volume Information:

Paper Link: https://arxiv.org/abs/2203.09173
