ISCA Archive Interspeech 2024

Extraction of interpretable and shared speaker-specific speech attributes through binary auto-encoder

Imen Ben-Amor, Jean-Francois Bonastre, Salima Mdhaffar

In speaker recognition systems, embeddings lack explicit speaker-related information, which makes them difficult to interpret. Recently, a binary representation of speech extracts, in which each coefficient indicates the presence or absence of a given voice attribute, has been proposed to address this limitation. It consists of an adaptation of the x-vector extractor followed by a binarization step. This approach has proved its worth in terms of explainability, but it has two shortcomings. Firstly, the objective of modeling shared attributes is only indirectly taken into account. Secondly, binarization is not integrated into the modeling but added as an afterthought. In this paper, we follow the same principle but propose a new approach that addresses both limitations. Our proposal is based on a binary auto-encoder that restructures conventional embeddings. The expected attribute-based behavior of the binary representation is then explicitly introduced in a new cost function. Experiments on the VoxCeleb databases show the effectiveness of our proposal, with a relative reduction in equal error rate (EER) of 47% compared to the original approach (from 3.7% to 1.96%), while offering the same level of explainability.
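
The reported relative reduction can be checked directly from the two EER values quoted in the abstract. The short sketch below is only an arithmetic illustration: the helper name `relative_reduction` and the variable names are assumptions made for this example and do not come from the paper.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Return the relative reduction of `proposed` w.r.t. `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - proposed) / baseline


# EER values (in %) as quoted in the abstract.
baseline_eer = 3.7   # original binary-representation approach
proposed_eer = 1.96  # binary auto-encoder approach

reduction = relative_reduction(baseline_eer, proposed_eer)
print(f"Relative EER reduction: {reduction:.1f}%")
```

The absolute gain is 1.74 percentage points of EER; dividing by the 3.7% baseline gives the relative figure of about 47%.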