ISCA Archive Interspeech 2021
ISCA Archive Interspeech 2021

Adaptive Convolutional Neural Network for Text-Independent Speaker Recognition

Seong-Hu Kim, Yong-Hwa Park

In text-independent speaker recognition, each speech is composed of different phonemes depending on spoken text. The conventional neural networks for speaker recognition are static models, so they do not reflect this phoneme-varying characteristic well. To tackle this limitation, we propose an adaptive convolutional neural network (ACNN) for text-independent speaker recognition. The utterance is divided along the time axis into short segments with small fluctuating phonemes. Frame-level features are extracted by applying input-dependent kernels adaptive to each segment. By applying time average pooling and linear layers, utterance-level embeddings extraction and speaker recognition are performed. Adaptive VGG-M using 0.356 seconds segmentation shows better speaker recognition performance than baseline models, with a Top-1 of 86.51% and an EER of 5.68%. It extracts more accurate frame-level embeddings for vowel and nasal phonemes compared to the conventional method without overfitting and large parameters. This framework for text-independent speaker recognition effectively utilizes phonemes and text-varying characteristic of speech.