ISCA Archive Odyssey 2022

Investigation on Deep Speaker Embedding Extraction Methods for Multi-Genre Speaker Verification

Woo Hyun Kang, Jahangir Alam

In this paper, we describe the systems we evaluated on the CNCeleb dataset. CNCeleb provides a difficult set of trials collected from multiple genres of speech, and these trials include real-world adverse conditions such as noise, overlapping background speakers, cross-channel conditions, and short-duration test samples. To extract a reliable speaker embedding vector under such harsh conditions, we trained multiple systems with different training strategies and architectures. More specifically, we experimented not only with the conventional ECAPA-TDNN and ResNet architectures, but also with the recently proposed multi-stream hybrid neural network. Furthermore, we trained the systems with speaker-discriminative losses, along with a domain generalization training strategy. Our experimental results show that the hybrid architectures can effectively improve speaker verification performance in a multi-genre scenario. Moreover, fusing different types of hybrid systems further improved performance, which indicates that different hybrid architectures can learn mutually complementary speaker-dependent information.