ISCA Archive Interspeech 2025

Unified Audio-Visual Modeling for Recognizing Which Face Spoke When and What in Multi-Talker Overlapped Speech and Video

Naoki Makishima, Naotaka Kawata, Taiga Yamane, Mana Ihori, Tomohiro Tanaka, Satoshi Suzuki, Shota Orihashi, Ryo Masumura

We have developed a model that jointly recognizes which face spoke "when" and "what" from multi-talker overlapped speech and video of multiple speakers. To understand videos in which multiple speakers talk simultaneously, it is important to recognize which face spoke "when" and "what" from the overlapped speech and the speakers' videos. Conventional approaches address this task by combining speech separation, active speaker detection, and audio-visual speech recognition; however, this combination makes the overall system complex and suboptimal. To address this problem, our idea is to serialize "which face spoke when and what" of all speakers into a single token sequence, which the proposed unified model estimates recursively from the multiple speakers' videos and the overlapped speech. Experimental results demonstrate the effectiveness of the proposed method.
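As a rough illustration of the serialization idea, the following Python sketch flattens per-speaker labels ("which face", "when", "what") into a single target token sequence that an autoregressive model could be trained to emit. The token names (<face:k>, <t:...>, <sc>, <eos>) and the onset-ordered layout are assumptions for exposition only, not the paper's exact scheme.

from dataclasses import dataclass

@dataclass
class Utterance:
    face_id: int       # index of the speaker's face track in the video ("which face")
    start_time: float  # utterance onset in seconds ("when")
    words: list[str]   # transcription ("what")

def serialize(utterances: list[Utterance]) -> list[str]:
    """Flatten multi-talker labels into one token sequence.

    Utterances are ordered by onset so overlap is resolved into a single
    stream; each segment is prefixed with a face token ("which face") and
    a quantized onset token ("when"), followed by the words ("what").
    <sc> marks a speaker change (hypothetical token names).
    """
    tokens: list[str] = []
    for i, utt in enumerate(sorted(utterances, key=lambda u: u.start_time)):
        if i > 0:
            tokens.append("<sc>")                     # speaker-change boundary
        tokens.append(f"<face:{utt.face_id}>")        # "which face"
        tokens.append(f"<t:{utt.start_time:.1f}>")    # "when" (quantized onset)
        tokens.extend(utt.words)                      # "what"
    tokens.append("<eos>")
    return tokens

# Example: two overlapping utterances from face tracks 0 and 1.
overlapped = [
    Utterance(face_id=1, start_time=0.8, words=["hello", "there"]),
    Utterance(face_id=0, start_time=0.2, words=["good", "morning"]),
]
print(serialize(overlapped))
# ['<face:0>', '<t:0.2>', 'good', 'morning', '<sc>',
#  '<face:1>', '<t:0.8>', 'hello', 'there', '<eos>']

Under this kind of serialization, a single audio-visual encoder-decoder can be trained end to end to emit the whole sequence recursively, avoiding a cascade of separately optimized separation, detection, and recognition modules.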