
Deep Learning for Military Image Captioning - 2018


Research Area:  Machine Learning

Abstract:

US Department of Defense (DoD) big data is extensively multimodal and multiple intelligence (multi-INT), where structured sensor data and unstructured audio, video, and textual ISR (Intelligence, Surveillance, and Reconnaissance) data are generated by numerous air-, ground-, and space-borne sensors along with human intelligence. Data fusion at all levels “remains a challenging task.” While there are algorithmic stove-piped systems that work well on individual modalities, there is no system to date that is mission- and source-agnostic and can seamlessly integrate and correlate multi-INT data that includes textual, hyperspectral, and video content. The considerable volume and velocity of big data only compound the aforementioned challenges encountered in fusion. We have developed the concept of “deep fusion,” based on deep learning models adapted to process multiple modalities of big data. Rather than reducing each modality independently and fusing at a higher-level model (feature-level fusion), the deep fusion approach generates a set of multimodal features, thereby maintaining the core properties of the dissimilar signals and resulting in fused models of higher accuracy. We have initiated two deep fusion experiments: one automatically generates image captions to help analysts tag and caption the large volumes of images gathered from collection platforms, and the other is an audio-visual speech classification with potential applications to lip-reading and enhanced object tracking. This paper presents the proof-of-concept demonstration for caption generation. The generative model is based on a deep recurrent architecture combined with the pre-trained image-to-vector model Inception V3, a Convolutional Neural Network (CNN), and the word-to-vector model word2vec, trained via a skip-gram model. We make use of the Flickr8K dataset, extended with some military-specific images to make the demonstration more relevant to the DoD domain.
Detailed results from the image captioning experiment are presented. The captions generated from test images are evaluated subjectively, and BLEU (bilingual evaluation understudy) scores are compared, showing substantial improvements.
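The word-to-vector component of the model is trained via skip-gram, which learns word embeddings from (center word, context word) pairs drawn from a sliding window over the text. As an illustration only (this is not the authors' implementation), the pair-generation step can be sketched as:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for a skip-gram
    word2vec model from a list of tokens, using a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context words lie within `window` positions of the center word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

In a full word2vec pipeline these pairs feed a shallow network that predicts context words from center words; the learned input weights become the word vectors.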
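The BLEU metric used for evaluation scores a candidate caption against a reference by combining modified n-gram precisions with a brevity penalty. A minimal sentence-level sketch, not the authors' exact evaluation code, might look like:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        # Clip each n-gram count by its count in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

In practice, library implementations (e.g. NLTK's `sentence_bleu`) add smoothing and support multiple references, which matters for short captions where higher-order n-gram matches are sparse.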

Keywords:  
Deep Learning
Image Captioning
Machine Learning

Author(s) Name:  Subrata Das; Lalit Jain; Arup Das

Journal name:  

Conference name:  21st International Conference on Information Fusion (FUSION)

Publisher name:  IEEE

DOI:  10.23919/ICIF.2018.8455321

Volume Information: