ISCA Archive Odyssey 2018

How to train your speaker embeddings extractor

Mitchell Mclaren, Diego Castán, Mahesh Kumar Nandwana, Luciana Ferrer, Emre Yilmaz

With the recent introduction of speaker embeddings for text-independent speaker recognition, many fundamental questions require addressing in order to fast-track the development of this new era of technology. Of particular interest is the ability of the speaker embeddings network to leverage artificially degraded data to a far greater extent than prior technologies, even in the evaluation of naturally degraded data. In this study, we aim to explore some of the fundamental requirements for building a good speaker embeddings extractor. We analyze the impact of voice activity detection, types of degradation, the amount of degraded data, and the number of speakers required for a good network. These aspects are analyzed over a large set of 11 conditions from 7 evaluation datasets. We lay out a set of recommendations for training the network based on the observed trends. By applying these recommendations to enhance the default recipe provided in the Kaldi toolkit, a significant gain of 13-21% on the Speakers in the Wild and NIST SRE’16 datasets is achieved.

doi: 10.21437/Odyssey.2018-46

Cite as: Mclaren, M., Castán, D., Nandwana, M.K., Ferrer, L., Yilmaz, E. (2018) How to train your speaker embeddings extractor. Proc. The Speaker and Language Recognition Workshop (Odyssey 2018), 327-334, doi: 10.21437/Odyssey.2018-46

  author={Mitchell Mclaren and Diego Castán and Mahesh Kumar Nandwana and Luciana Ferrer and Emre Yilmaz},
  title={{How to train your speaker embeddings extractor}},
  booktitle={Proc. The Speaker and Language Recognition Workshop (Odyssey 2018)},