Research Topics for Multimodal Deep Learning

   Multi-modal deep learning (MMDL) refers to the use of deep neural networks to extract features from multiple data modalities. Its main purpose is to build models that process and represent information drawn from several modalities. The modalities used in MMDL include image, text, video, audio, body gestures, facial expressions, and physiological signals. Advantages of deep multi-modal learning include learning both modality-wise and shared representations, requiring little or no preprocessing of the input data, and supporting immediate fusion; these benefits come at the cost of intense computation.
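    The idea of modality-wise and shared representations can be illustrated with a minimal sketch. It assumes two modalities (image and text) that have already been turned into fixed-length feature vectors, one small encoder per modality, and fusion by simple concatenation followed by a shared layer. The names (dense, init_layer, fuse), the layer sizes, and the random untrained weights are all hypothetical choices made for illustration; this is not a specific published architecture.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer followed by a tanh activation."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for a layer mapping n_in -> n_out."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)

# Hypothetical sizes, chosen only for this illustration.
IMAGE_DIM, TEXT_DIM = 8, 5                       # raw feature sizes per modality
IMAGE_CODE, TEXT_CODE, SHARED_DIM = 4, 3, 2      # representation sizes

image_encoder = init_layer(IMAGE_DIM, IMAGE_CODE, rng)   # modality-wise encoder
text_encoder = init_layer(TEXT_DIM, TEXT_CODE, rng)      # modality-wise encoder
fusion_layer = init_layer(IMAGE_CODE + TEXT_CODE, SHARED_DIM, rng)


def fuse(image_features, text_features):
    """Return the two modality-wise codes and the shared representation."""
    image_code = dense(image_features, *image_encoder)
    text_code = dense(text_features, *text_encoder)
    joint = image_code + text_code            # fusion by concatenation
    shared = dense(joint, *fusion_layer)      # shared representation
    return image_code, text_code, shared


# Stand-in inputs; a real system would use features from actual data.
image_features = [rng.random() for _ in range(IMAGE_DIM)]
text_features = [rng.random() for _ in range(TEXT_DIM)]

image_code, text_code, shared = fuse(image_features, text_features)
print(len(image_code), len(text_code), len(shared))
```

    In this sketch each modality keeps its own encoder, so the modality-wise codes stay separate until the concatenation step; only the final layer sees both. In practice the weights would be trained end to end, and concatenation is just one of several possible fusion choices.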
    Multi-modal deep learning architectures are categorized as follows. Probabilistic graphical models: these include the Restricted Boltzmann Machine (RBM), Deep Belief Networks (DBN), Deep Boltzmann Machines (DBM), and Variational Auto-Encoders. Artificial neural networks: these comprise basic architectures such as the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and You Only Look Once (YOLO). Miscellaneous architectures: these include models such as the Support Vector Machine, Generative Adversarial Network, and Hidden Markov Model. Hybrid models are also used for MMDL and are classified as joint methods, iterative methods, and staged methods.
    Applications of deep multi-modal learning include speech classification, image annotation, content-based medical image retrieval, semantic segmentation, action recognition, emotion recognition, medical diagnosis, sentiment analysis, robotic grasping, and driver activity anticipation, among many others. Recent advances and trends in MMDL range from audio-visual speech recognition (AVSR), multi-modal emotion recognition, multi-modal event detection, image and video captioning, and Visual Question Answering (VQA) to multimedia retrieval.