ISCA Archive Interspeech 2024

Attention-augmented X-vectors for the Evaluation of Mimicked Speech Using Sparse Autoencoder-LSTM framework

Bhasi K. C., Rajeev Rajan, Noumida A

This paper evaluates the quality of mimicked speech using speaker embeddings. We propose an attention-augmented encoded speaker embedding for evaluating mimicking speakers. X-vector embeddings extracted from spectral features are passed through a 1-D convolutional neural network (CNN) with an attention module, and the resulting output is fed into a sparse autoencoder. The encoded vector is then fed to a long short-term memory (LSTM)-based scoring mechanism. The best mimicking artist is first identified by a perception test. We then investigate whether the LSTM-based self-attention model predicts the same artist. When the model assigns the highest probability (rank-1) to the artist identified by the mean opinion score (MOS), we count one hit. Performance is evaluated on a mimicry dataset using top-X criteria. The experiments demonstrate the efficacy of the proposed vector representation for evaluating competency in voice mimicking.
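The hit counting and top-X criterion described above can be illustrated with a minimal sketch. It assumes each trial yields one model score per mimicking artist and one MOS-identified best artist; the function name `top_x_hit_rate`, the data layout, and the numbers are hypothetical and chosen only for illustration, not taken from the paper's experiments.

```python
from typing import Sequence


def top_x_hit_rate(
    scores: Sequence[Sequence[float]],
    mos_best: Sequence[int],
    x: int = 1,
) -> float:
    """Fraction of trials whose MOS-identified artist is among the
    x artists that the model scores highest (x=1 is the rank-1 case)."""
    if not scores or len(scores) != len(mos_best):
        raise ValueError("scores and mos_best must be non-empty and equal in length")
    hits = 0
    for trial_scores, best in zip(scores, mos_best):
        # Artist indices ordered from highest to lowest model score.
        ranked = sorted(
            range(len(trial_scores)),
            key=lambda i: trial_scores[i],
            reverse=True,
        )
        if best in ranked[:x]:
            hits += 1  # one hit: the model agrees with the perception test
    return hits / len(scores)


# Made-up scores for three trials with three artists each (assumption).
example_scores = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.1, 0.3, 0.6],
]
example_mos_best = [0, 0, 2]  # index of the MOS-identified artist per trial

top1 = top_x_hit_rate(example_scores, example_mos_best, x=1)
top2 = top_x_hit_rate(example_scores, example_mos_best, x=2)
```

In this toy data the second trial misses at rank-1 but is recovered at top-2, which is why a top-X measure is more lenient than counting rank-1 hits alone.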