The objective of this paper is ‘open-set’ speaker recognition
of unseen speakers, where ideal embeddings should condense
information into a compact utterance-level representation that has
small intra-speaker and large inter-speaker distances.
A popular belief in speaker recognition is that networks trained with
classification objectives
outperform metric learning methods. In this paper, we present an extensive
evaluation of the most popular loss functions for speaker recognition on
the VoxCeleb dataset. We demonstrate that the vanilla triplet loss
shows competitive performance compared to classification-based losses,
and that networks trained with our proposed metric learning objective
outperform state-of-the-art methods.