Research Area:  Machine Learning
When available, multimodal data is key to enhanced emotion recognition in conversation. Text, audio, and video in dialogues can facilitate and complement each other in analyzing speakers' emotions. However, it is very challenging to effectively fuse multimodal features so as to understand the detailed contextual information in conversations. In this work, we focus on the dynamic interactions that occur during information fusion and propose a Dynamic Interactive Multiview Memory Network (DIMMN) that integrates interaction information for emotion recognition. Specifically, information fusion within DIMMN is performed from multiple perspectives (combinations of different modalities). We design multiview layers in the attention networks so that the model can mine cross-modal dynamic dependencies among different modality groups during dynamic modal interaction. To capture long-term dependencies, temporal convolutional networks are introduced to synthesize the contextual information of each individual speaker. Then, gated recurrent units and memory networks model the global conversation to capture contextual dependencies in multi-turn, multi-speaker interactive emotional information. Experimental results on IEMOCAP and MELD demonstrate that DIMMN achieves performance better than or comparable to state-of-the-art methods, with accuracies of 64.7% and 60.6%, respectively.
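To make the pipeline described in the abstract more concrete, the following is a minimal PyTorch sketch of that flow: multi-view cross-modal attention over pairwise modality combinations, a temporal convolution standing in for the per-speaker context summarizer, and a GRU plus a memory-style attention read over the whole conversation. All module names, dimensions, the pairwise-view fusion scheme, and the class count are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a DIMMN-style pipeline; not the published implementation.
import torch
import torch.nn as nn


class MultiViewFusion(nn.Module):
    """Attend over pairwise modality 'views': text-audio, text-video, audio-video."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.ModuleList(nn.MultiheadAttention(dim, heads, batch_first=True)
                                  for _ in range(3))
        self.proj = nn.Linear(3 * dim, dim)

    def forward(self, text, audio, video):
        views = [(text, audio), (text, video), (audio, video)]
        fused = [attn(q, k, k)[0] for attn, (q, k) in zip(self.attn, views)]
        return self.proj(torch.cat(fused, dim=-1))            # (batch, seq, dim)


class DIMMNSketch(nn.Module):
    def __init__(self, dim: int = 128, n_classes: int = 6):   # 6 classes is an assumption
        super().__init__()
        self.fusion = MultiViewFusion(dim)
        # Dilated 1-D convolution as a stand-in for the temporal convolutional network
        # that synthesizes a single speaker's long-term context.
        self.tcn = nn.Conv1d(dim, dim, kernel_size=3, padding=2, dilation=2)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.memory = nn.MultiheadAttention(dim, 4, batch_first=True)  # memory-style read
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, text, audio, video):
        x = self.fusion(text, audio, video)                    # cross-modal multiview fusion
        x = self.tcn(x.transpose(1, 2)).transpose(1, 2)[:, :x.size(1)]
        ctx, _ = self.gru(x)                                   # global conversation context
        mem, _ = self.memory(ctx, ctx, ctx)                    # attention over stored context
        return self.classifier(mem)                            # per-utterance emotion logits


# Toy usage: 2 conversations, 10 utterances each, 128-dim features per modality.
t = a = v = torch.randn(2, 10, 128)
logits = DIMMNSketch()(t, a, v)
print(logits.shape)  # torch.Size([2, 10, 6])
```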
Keywords:  
Emotion recognition in conversation
Multimodal fusion
Dynamic interactive multiview memory network
Author(s) Name:  Jintao Wen, Dazhi Jiang, Geng Tu, Cheng Liu
Journal name:  Information Fusion
Conference name:  
Publisher name:  Elsevier
DOI:  10.1016/j.inffus.2022.10.009
Volume Information:  Volume 91
Paper Link:   https://www.sciencedirect.com/science/article/abs/pii/S1566253522001786