Gkoumas, Dimitrios; Li, Qiuchi; Lioma, Christina; Yu, Yijun and Song, Dawei
(2021).
DOI: https://doi.org/10.1016/j.inffus.2020.09.005
Abstract
Multimodal video sentiment analysis is a rapidly growing area. It combines verbal (i.e., linguistic) and non-verbal (i.e., visual and acoustic) modalities to predict the sentiment of utterances. A recent trend has been towards modality fusion models that utilize various attention, memory and recurrent components. However, there has been no systematic investigation of how these components contribute to solving the problem, or of their limitations. This paper aims to fill that gap. We present the first large-scale and comprehensive empirical comparison of eleven state-of-the-art (SOTA) modality fusion approaches in two video sentiment analysis tasks, with three SOTA benchmark corpora. An in-depth analysis of the results yields four findings. First, attention mechanisms are the most effective for modelling crossmodal interactions, yet they are computationally expensive. Second, additional levels of crossmodal interaction decrease performance. Third, positive sentiment utterances are the most challenging cases for all approaches. Finally, integrating context and utilizing the linguistic modality as a pivot for non-verbal modalities improve performance. We expect these findings to provide helpful insights and guidance for the development of more effective modality fusion models.
About
- Item ORO ID
- 72395
- Item Type
- Journal Item
- ISSN
- 1566-2535
- Project Funding Details
- Funded Project Name: Quantum Information Access and Retrieval Theory (QUARTZ); Project ID: 721321; Funding Body: European Union Horizon 2020
- Funded Project Name: Not Set; Project ID: U1636203; Funding Body: Natural Science Foundation of China
- Keywords
- Multimodal human language understanding; Video sentiment analysis; Emotion recognition; Reproducibility in multimodal machine learning
- Academic Unit or School
- Faculty of Science, Technology, Engineering and Mathematics (STEM) > Computing and Communications
- Faculty of Science, Technology, Engineering and Mathematics (STEM)
- Copyright Holders
- © 2020 Elsevier B.V.
- Depositing User
- ORO Import