ISCA Archive Interspeech 2023

Audio-Visual Praise Estimation for Conversational Video based on Synchronization-Guided Multimodal Transformer

Nobukatsu Hojo, Saki Mizuno, Satoshi Kobashikawa, Ryo Masumura, Mana Ihori, Hiroshi Sato, Tomohiro Tanaka

This study investigates praise estimation, the task of estimating whether a speaker exhibits preferable behaviors in a conversational video. To estimate praise from multimodal information, it is important to consider synchronized behavior across modalities. Such cross-modal synchronization can be modeled by the conventional multimodal Transformer with a time-axis concatenation architecture, because its attention matrices model the relevance between all time steps of all input modalities. However, these attention matrices are so high-dimensional that training can be difficult with a limited amount of data. To alleviate this problem, we propose introducing a loss function that encodes the prior knowledge that attention should concentrate around synchronized time steps across the input modalities. Our experiments on a business negotiation conversation corpus showed that the proposed method improved the macro F1 of praise estimation.
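The abstract does not give implementation details, but the core idea, a loss that penalizes cross-modal attention mass placed far from synchronized time steps, can be sketched as follows. This is a minimal PyTorch sketch under assumptions of my own: the function name sync_attention_loss, the Gaussian window, and the sigma hyperparameter are illustrative, not the authors' formulation.

import torch

def sync_attention_loss(attn: torch.Tensor, sigma: float = 2.0) -> torch.Tensor:
    """attn: (batch, T_a, T_v) cross-modal attention weights (rows sum to 1).

    Assumes the two modalities cover the same time span, so the synchronized
    audio step for visual step j is roughly i = j * (T_a - 1) / (T_v - 1).
    """
    _, t_a, t_v = attn.shape
    i = torch.arange(t_a, device=attn.device).unsqueeze(1)  # (T_a, 1)
    j = torch.arange(t_v, device=attn.device).unsqueeze(0)  # (1, T_v)
    # Temporal distance of each attention cell from the synchronized diagonal.
    dist = (i - j * (t_a - 1) / max(t_v - 1, 1)).abs()
    # Gaussian window: cells near the diagonal incur almost no penalty.
    penalty = 1.0 - torch.exp(-(dist ** 2) / (2.0 * sigma ** 2))  # (T_a, T_v)
    # Weight the attention by its distance penalty and average over all cells.
    return (attn * penalty).mean()

# Usage sketch: add the term to the task loss with an assumed weight
# lambda_sync, so attention is guided toward cross-modal synchrony.
# total_loss = task_loss + lambda_sync * sync_attention_loss(attn_weights)

The Gaussian window here is one plausible way to realize "attention should concentrate around synchronized time steps"; the paper may use a different distance measure or window shape.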