ISCA Archive Interspeech 2024

Participant-Pair-Wise Bottleneck Transformer for Engagement Estimation from Video Conversation

Keita Suzuki, Nobukatsu Hojo, Kazutoshi Shinoda, Saki Mizuno, Ryo Masumura

This study investigates the task of estimating the engagement of a target participant from video and audio during a multi-person conversation. For this task, interaction should be modeled efficiently, considering the redundancy of video and audio across frames and among multiple participants. Conventional Transformer-based methods in multimodal sentiment analysis have achieved such efficient modeling by constraining the attention across multimodal data streams to pass through only a small set of latent fusion units ("global tokens") that form an attention bottleneck. However, performance can be limited in the multi-person setting, because the model must capture interaction among a larger number of data streams using only a single global token sequence. To address this problem, we propose a participant-pair-wise bottleneck transformer (PPBT) that uses multiple global token sequences, each dedicated to a particular pair of participants, and we demonstrate its effectiveness.
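The pair-wise bottleneck idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how pairs are enumerated, how many global tokens each pair has, or how attention is parameterized. The sketch assumes unordered pairs over all participants, zero-initialized global tokens, and single-head dot-product attention with no learned projections. All function names (`init_pair_tokens`, `attend`, `pairwise_bottleneck_step`) are hypothetical.

```python
import math
from itertools import combinations


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def attend(queries, context):
    """Dot-product attention without learned projections (illustrative only).

    Each output vector is a convex combination of the context vectors.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        w = softmax(scores)
        out.append([sum(wi * k[j] for wi, k in zip(w, context)) for j in range(d)])
    return out


def init_pair_tokens(participants, n_tokens, dim):
    """One global token sequence per unordered participant pair (assumption)."""
    return {
        pair: [[0.0] * dim for _ in range(n_tokens)]
        for pair in combinations(participants, 2)
    }


def pairwise_bottleneck_step(streams, global_tokens):
    """One fusion step in which participants interact only via pair-wise bottlenecks.

    streams:       dict participant -> list of token vectors
    global_tokens: dict (a, b)      -> list of bottleneck vectors
    """
    # 1) Each pair's global tokens gather information from its two participants only.
    new_globals = {
        (a, b): attend(g, streams[a] + streams[b])
        for (a, b), g in global_tokens.items()
    }
    # 2) Each participant attends to its own tokens plus the global tokens
    #    of the pairs it belongs to, never directly to another participant.
    new_streams = {}
    for p, toks in streams.items():
        context = list(toks)
        for pair, g in new_globals.items():
            if p in pair:
                context += g
        new_streams[p] = attend(toks, context)
    return new_streams, new_globals


# Usage: three participants, each with a short sequence of 4-dimensional tokens.
streams = {
    "A": [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    "B": [[0.0, 0.0, 1.0, 0.0]],
    "C": [[0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]],
}
pair_tokens = init_pair_tokens(list(streams), n_tokens=2, dim=4)
new_streams, new_pair_tokens = pairwise_bottleneck_step(streams, pair_tokens)
```

With N participants this layout yields N(N-1)/2 global token sequences, in contrast to the single shared sequence of the conventional bottleneck design described above.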