ISCA Archive CHiME 2024

The SEUEE System for the CHiME-8 MMCSG Challenge

Cong Pang, Feifei Xiong, Ye Ni, Lin Zhou, Jinwei Feng

In this paper, we describe our proposed system, SEUEE, and the extension work designed for Task 3 of the CHiME-8 Challenge: Multi-modal Conversations in Smart Glasses (MMCSG). To reduce the word error rate (WER) of speaker-attributed transcriptions in a streaming setting, we propose a causal multichannel directional speech extraction (DSE) framework that separates the speech of the wearer and the conversational partner from the mixed audio, with each output fed into a separately adapted automatic speech recognition (ASR) engine. Since the deep-learning-based framework can introduce distortions, we improve the training mechanism by incorporating a pre-trained universal model guided by target-speaker voice activity detection, as well as a composite loss that better preserves the speech component. Moreover, we extend our investigation of the DSE system by exploiting both the explicit spatial information derived from the microphone array geometry and the implicit spatial information learned from a dedicated narrow-band network. In addition to the signal-based loss functions, we further introduce a loss inspired by ASR phoneme mismatch to guide the framework training towards distortionless target speech signals. Evaluated on the MMCSG evaluation set, our submitted SEUEE system and the further proposed DSE system outperform the baseline by absolute WER reductions of 3.0% and 0.2% for the wearer speech, and 4.3% and 2.0% for the partner speech, respectively.
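The abstract does not specify the exact form of the composite loss. As a purely illustrative sketch, assuming it pairs a scale-invariant signal-fidelity term (here SI-SDR, a common choice for speech extraction) with a weighted ASR phoneme-mismatch penalty, the combination might look like the following; the function names, the SI-SDR choice, and the weight `lam` are all assumptions, not details from the paper:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better).

    Hypothetical stand-in for the paper's signal-based loss term.
    """
    est, ref = np.asarray(est, float), np.asarray(ref, float)
    # Project the estimate onto the reference to obtain the target component.
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    noise = est - target
    return 10.0 * np.log10((np.dot(target, target) + eps)
                           / (np.dot(noise, noise) + eps))

def composite_loss(est, ref, phoneme_mismatch, lam=0.1):
    """Composite loss: negative SI-SDR plus a weighted phoneme-mismatch term.

    `phoneme_mismatch` would come from an ASR-derived comparison of the
    extracted and clean speech (assumed scalar here); `lam` is an assumed
    trade-off weight.
    """
    return -si_sdr(est, ref) + lam * phoneme_mismatch
```

A loss of this shape lets gradient descent trade signal fidelity against recognition-relevant distortion: lowering `lam` favors raw separation quality, while raising it pushes the extractor towards outputs the ASR engine transcribes consistently.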