ISCA Archive CHiME 2023

Multi-stage diarization refinement for the CHiME-7 DASR scenario

Christoph Boeddeker, Tobias Cord-Landwehr, Thilo von Neumann, Reinhold Haeb-Umbach

This submission for the CHiME-7 DASR challenge consists of a TS-VAD system for diarization followed by a GSS system for source extraction. A segment-level refinement is then applied to the enhanced audio segments, before the baseline ASR system is used to transcribe the audio. To initialize the TS-VAD, the baseline diarization system was used to identify single-speaker regions, from which enrollment embeddings are extracted for each speaker in a meeting. The TS-VAD system is applied to each microphone channel independently, and the soft estimates at the TS-VAD output are averaged across the microphones before they are converted to hard estimates, i.e., the diarization estimates. Additionally, we analyzed the estimates and found many speaker swaps and suboptimal segments. To address them, we propose a simple post-processing step that compares speaker embeddings from the baseline diarization, i.e., the enrollment embeddings, with speaker embeddings derived from the enhanced data. Through the use of TS-VAD, we improve upon the baseline word error rate on the CHiME-6 dataset by 3.6 percentage points, whereas the post-processing yields an additional, consistent word error rate improvement of 2 to 4 percentage points (absolute).
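The two combination steps described in the abstract, channel averaging of the soft TS-VAD outputs and the embedding comparison used in post-processing, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 0.5 decision threshold, the nested-list data layout, and the choice of cosine similarity as the comparison measure are all assumptions made for illustration only.

```python
import math


def average_channels(posteriors, threshold=0.5):
    """Average soft speaker activities over channels, then binarize.

    posteriors[c][s][t] is the soft activity of speaker s at frame t
    as estimated on microphone channel c (hypothetical layout).
    The threshold of 0.5 is an assumption, not taken from the paper.
    """
    num_channels = len(posteriors)
    num_speakers = len(posteriors[0])
    num_frames = len(posteriors[0][0])
    hard = []
    for s in range(num_speakers):
        row = []
        for t in range(num_frames):
            mean = sum(posteriors[c][s][t] for c in range(num_channels)) / num_channels
            row.append(mean > threshold)
        hard.append(row)
    return hard


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def reassign_segment(segment_embedding, enrollment_embeddings):
    """Return the speaker whose enrollment embedding best matches the segment.

    Assumption: the comparison is a plain arg-max over cosine similarity.
    The abstract only states that the embeddings are compared.
    """
    return max(
        enrollment_embeddings,
        key=lambda spk: cosine_similarity(segment_embedding, enrollment_embeddings[spk]),
    )


# Two channels, two speakers, two frames of made-up soft estimates.
posteriors = [
    [[0.9, 0.2], [0.1, 0.6]],  # channel 0
    [[0.7, 0.4], [0.3, 0.8]],  # channel 1
]
hard_estimates = average_channels(posteriors)

# Made-up two-dimensional embeddings with hypothetical speaker labels.
enrollment = {"P01": [1.0, 0.0], "P02": [0.0, 1.0]}
best_speaker = reassign_segment([0.2, 0.9], enrollment)
```

In this sketch, averaging before thresholding lets a channel with a confident estimate outweigh a channel with a weak one, which a majority vote over per-channel hard decisions would not do. A segment whose best-matching speaker differs from its diarization label would be a candidate for a speaker swap correction.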