ISCA Archive — Odyssey 2022

Hybrid Neural Network-Based Deep Embedding Extractors for Text-Independent Speaker Verification

Jahangir Alam, Woo Hyun Kang, Abderrahim Fathan

In this contribution, we propose a multi-stream hybrid neural network for extracting speaker-discriminant utterance-level embedding vectors. In this approach, input acoustic feature frames are processed in multiple parallel pipelines, where each stream has a unique dilation rate so that diverse temporal resolutions are incorporated into the embedding extraction. To aggregate speaker information over both short time spans and the utterance-level context, the proposed extractor employs multi-level global-local statistics pooling. In addition, we propose an ensemble embedding extractor that combines the hybrid neural network (HNN) with an extended time delay neural network - long short-term memory (ETDNN-LSTM) hybrid module, further diversifying the temporal resolution and capturing complementary information. To evaluate the proposed systems, a set of experiments was conducted on the CNCeleb corpus; the proposed multi-stream hybrid network outperformed conventional approaches trained on the same dataset, and the ensemble approach provided the best results on all considered evaluation metrics.
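To make the multi-stream idea concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: parallel 1-D convolutional streams, each with a distinct dilation rate, followed by simple mean/standard-deviation statistics pooling over frames. The number of streams, the dilation rates, and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MultiStreamBlock(nn.Module):
    """Illustrative multi-stream block: one Conv1d pipeline per dilation rate."""

    def __init__(self, feat_dim=80, channels=256, dilations=(1, 2, 3)):
        super().__init__()
        # Padding equal to the dilation keeps the frame count unchanged.
        self.streams = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(feat_dim, channels, kernel_size=3,
                          dilation=d, padding=d),
                nn.ReLU(),
                nn.BatchNorm1d(channels),
            )
            for d in dilations
        )

    def forward(self, x):  # x: (batch, feat_dim, frames)
        # Concatenate the outputs of all temporal-resolution streams.
        return torch.cat([stream(x) for stream in self.streams], dim=1)


class StatsPooling(nn.Module):
    """Utterance-level aggregation via per-channel mean and standard deviation."""

    def forward(self, x):  # x: (batch, channels, frames)
        return torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)


if __name__ == "__main__":
    frames = torch.randn(4, 80, 300)        # 4 utterances, 80-dim features, 300 frames
    block = MultiStreamBlock()
    pooled = StatsPooling()(block(frames))  # shape: (4, 3 * 256 * 2)
    print(pooled.shape)
```

A full system along the lines described in the abstract would stack several such blocks, apply the pooling at multiple levels (global-local), and feed the pooled statistics into fully connected layers that produce the fixed-dimensional speaker embedding.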