ISCA Archive Interspeech 2023

Multi-channel separation of dynamic speech and sound events

Takuya Fujimura, Robin Scheibler

We propose a multi-channel separation method for moving sound sources. We build upon a recent beamformer for a moving speaker that uses attention-based tracking. This method uses an attention mechanism to compute time-varying spatial statistics, which enables tracking of the moving source. While this prior work aimed to extract a single target source, we simultaneously estimate multiple sources. Our main technical contribution is to introduce attention-based tracking into the iterative source steering algorithm for independent vector analysis (IVA), enabling joint estimation of multiple sources. We experimentally show that the proposed method greatly improves the separation performance for moving speakers, including an absolute reduction of 27.2% in word error rate compared to time-invariant IVA. In addition, we demonstrate that the proposed method is effective as a pre-processing step for sound event detection, showing an improvement in F1 scores of up to 4.7% on real recordings.
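To make the core idea concrete, here is a minimal sketch of how attention-weighted statistics can be combined with an iterative source steering (ISS) update. It assumes a single frequency bin, a hypothetical dot-product attention, and illustrative function names (`attention_weights`, `iss_update_time_varying`); it is a simplified illustration of the mechanism described above, not the authors' exact algorithm.

```python
import numpy as np

def attention_weights(queries, keys, scale):
    """Softmax attention over frames: A[t, tau] is the weight given to frame tau
    when forming the statistics used at frame t (hypothetical dot-product form)."""
    logits = (queries @ keys.conj().T).real / scale   # (T, T)
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def iss_update_time_varying(Y, A, eps=1e-6):
    """One ISS sweep on estimated sources Y (sources K x frames T, one bin),
    using attention weights A (T x T) to form time-varying weighted statistics
    instead of a single time-invariant average."""
    K, T = Y.shape
    # Per-source, per-frame inverse variance (Laplacian-like source model).
    phi = 1.0 / np.maximum(np.abs(Y), eps)            # (K, T)
    for k in range(K):
        yk = Y[k]                                     # (T,)
        # Attention-weighted cross- and auto-statistics, one value per frame t.
        num = (phi * Y * yk.conj()) @ A.T             # (K, T)
        den = (phi * np.abs(yk) ** 2) @ A.T           # (K, T)
        v = num / np.maximum(den, eps)                # steering coefficients
        v[k] = 1.0 - 1.0 / np.sqrt(np.maximum(den[k], eps))
        Y = Y - v * yk                                # rank-1 source steering update
    return Y
```

With uniform attention weights (every row of A equal to 1/T), the statistics collapse to global time averages and the sweep reduces to standard time-invariant ISS; time-varying attention is what lets the update follow a moving source.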