This study investigates the task of estimating the engagement of a target participant from video and audio during a multi-person conversation. This task requires modeling interaction effectively while accounting for the redundancy of video and audio across frames and across participants. Conventional Transformer-based methods in multimodal sentiment analysis achieved such efficient modeling by constraining the attention across multimodal data streams to go through only a small set of latent fusion units ("global tokens") that form an attention bottleneck. However, performance can be limited in the multi-person setting because interaction among a larger number of data streams must be modeled through only a single global token sequence. To address this problem, we propose a participant-pair-wise bottleneck transformer (PPBT) that introduces multiple global token sequences, each dedicated to a particular pair of participants, and demonstrate its effectiveness.
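To make the pair-wise bottleneck idea concrete, the following is a minimal sketch (not the paper's implementation) of how pair-specific global token sequences could mediate all cross-participant attention in PyTorch; the class name PairwiseBottleneckFusion, the shared attention modules, the residual update, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of pair-wise bottleneck fusion (hypothetical, simplified).
# Each participant pair (i, j) owns a sequence of learnable global tokens;
# the global tokens attend only to that pair's streams, and each stream reads
# back only from the global sequences of pairs it belongs to (the bottleneck).
import itertools
import torch
import torch.nn as nn


class PairwiseBottleneckFusion(nn.Module):
    """One fusion layer where pair-specific global tokens mediate cross-stream attention."""

    def __init__(self, num_participants: int, dim: int, num_global: int = 4, heads: int = 4):
        super().__init__()
        self.pairs = list(itertools.combinations(range(num_participants), 2))
        # One learnable global-token sequence per participant pair.
        self.global_tokens = nn.Parameter(torch.randn(len(self.pairs), num_global, dim) * 0.02)
        self.to_global = nn.MultiheadAttention(dim, heads, batch_first=True)  # globals <- pair streams
        self.to_stream = nn.MultiheadAttention(dim, heads, batch_first=True)  # streams <- their globals

    def forward(self, streams: list[torch.Tensor]) -> list[torch.Tensor]:
        # streams[p]: (batch, time, dim) features of participant p.
        batch = streams[0].shape[0]
        updated_globals = []
        for k, (i, j) in enumerate(self.pairs):
            g = self.global_tokens[k].expand(batch, -1, -1)
            ctx = torch.cat([streams[i], streams[j]], dim=1)  # only this pair's streams
            g, _ = self.to_global(g, ctx, ctx)                # globals absorb the pair's context
            updated_globals.append(g)
        out = []
        for p, x in enumerate(streams):
            # Each participant stream reads back from the global sequences of its own pairs.
            own = [updated_globals[k] for k, pair in enumerate(self.pairs) if p in pair]
            mem = torch.cat(own, dim=1)
            y, _ = self.to_stream(x, mem, mem)
            out.append(x + y)                                 # residual update of the stream
        return out


if __name__ == "__main__":
    layer = PairwiseBottleneckFusion(num_participants=3, dim=64)
    feats = [torch.randn(2, 10, 64) for _ in range(3)]        # 3 participants, 10 frames each
    fused = layer(feats)
    print([f.shape for f in fused])                           # each (2, 10, 64)
```

In this sketch the attention modules are shared across pairs and normalization layers are omitted for brevity; the key point is only that cross-participant information must pass through the small pair-specific global token sequences rather than through direct stream-to-stream attention.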