ISCA Archive Interspeech 2022

Investigating the contribution of speaker attributes to speaker separability using disentangled speaker representations

Chau Luu, Steve Renals, Peter Bell

Deep speaker embeddings have been shown to encode a wide variety of attributes relating to a speaker. The aim of this work is to separate out some of these attributes in the embedding space, disentangling these sources of speaker variation into subsets of the embedding dimensions. This is achieved by modifying the training procedure of a typical speaker embedding network, which is usually trained only to classify speakers. This work instead adds pairs of attribute-specific task heads that operate on complementary subsets of the speaker embedding dimensions. While specific dimensions are encouraged to encode an attribute, for example gender, the other dimensions are penalized for containing this information using an adversarial loss. We show that this method is effective in factorizing out multiple attributes in the embedding space, successfully disentangling gender, nationality and age. Using the disentangled representations, we investigate how much removing this information impacts speaker verification and diarization performance, showing that gender is a significant source of separation in the deep speaker embedding space, with nationality and age also contributing to a lesser degree.
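
The paired task heads described above can be illustrated with a minimal, forward-only sketch in plain Python. It is not the authors' implementation: the abstract does not specify the form of the adversarial loss, so the minimax-style objective below (subtracting the adversary's cross-entropy, scaled by a hypothetical `adv_weight`) is an assumption, as are the linear heads, the embedding size, the choice of dimensions, and every function name.

```python
import math
import random


def linear(x, weights, bias):
    """Apply a linear layer: one output logit per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def make_head(rng, n_out, n_in):
    """Hypothetical linear task head with small random weights and zero bias."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def disentangling_losses(embedding, attr_dims, heads, speaker_id, attr_label,
                         adv_weight=1.0):
    """Compute the per-head losses for one embedding and one attribute.

    The attribute head sees only `attr_dims`; the adversary head sees only
    the complementary dimensions. Assumed objective (not from the paper):
    the adversary minimizes its own loss, while the embedding network
    minimizes speaker + attribute - adv_weight * adversary.
    """
    attr_set = set(attr_dims)
    attr_part = [embedding[i] for i in attr_dims]
    rest_part = [v for i, v in enumerate(embedding) if i not in attr_set]

    spk = cross_entropy(linear(embedding, *heads["speaker"]), speaker_id)
    attr = cross_entropy(linear(attr_part, *heads["attribute"]), attr_label)
    adv = cross_entropy(linear(rest_part, *heads["adversary"]), attr_label)
    return {
        "speaker": spk,
        "attribute": attr,
        "adversary": adv,
        "embedding_objective": spk + attr - adv_weight * adv,
    }


# Toy usage: an 8-dimensional embedding, with dimensions 0-1 reserved for a
# binary attribute and the remaining six dimensions fed to the adversary.
rng = random.Random(0)
embedding = [rng.gauss(0.0, 1.0) for _ in range(8)]
heads = {
    "speaker": make_head(rng, n_out=5, n_in=8),
    "attribute": make_head(rng, n_out=2, n_in=2),
    "adversary": make_head(rng, n_out=2, n_in=6),
}
losses = disentangling_losses(embedding, attr_dims=[0, 1], heads=heads,
                              speaker_id=3, attr_label=1)
```

In an actual training loop the gradients of these losses would update the heads and the embedding network. One common way to realize such an adversarial penalty is a gradient reversal layer, though whether that matches the method used in this work is not stated in the abstract.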