ISCA Archive SPSC 2022

Why Eli Roth should not use TTS-Systems for anonymization

Yamini Sinha, Jan Hintz, Matthias Busch, Tim Polzehl, Matthias Haase, Andreas Wendemuth, Ingo Siegert

This paper evaluates the impact of TTS-based speaker anonymization using objective and subjective methods. For the objective evaluation, a pretrained VGGVox automatic speaker verification (ASV) model (95.66% recognition rate on VoxCeleb1), enrolled with human voices, is tested on anonymized voices generated with the eSpeak TTS system. We use VoxCeleb1, a benchmark dataset for speaker verification that contains spontaneous speech from 1,251 speakers in over 100,000 utterances. After anonymizing 40 speakers from the VoxCeleb1 test set, the objective evaluation shows that ASV systems are vulnerable to false acceptance when presented with synthetic speech samples: approximately 6% of the TTS speaker samples were falsely accepted as the original human speaker. This confusion of a TTS speaker with a human speaker may stem from the accuracy of the ASV model and the similarity metric used. Furthermore, we compared these confused speaker pairs with non-confused speaker pairs using a subjective measure (listener ratings) with 200 test subjects. The subjective evaluation, conducted on a crowdsourcing platform, yielded no significant results, as human raters were unsure whether the voices were similar.
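The false-acceptance measurement described above can be illustrated with a minimal sketch. It assumes that each trial pairs the embedding of an enrolled human speaker with the embedding of the corresponding TTS sample, that similarity is scored with cosine similarity, and that a trial counts as accepted when the score reaches a fixed threshold. The abstract does not name the similarity metric or the threshold, so `cosine_similarity`, `false_acceptance_rate`, the threshold of 0.8, and the toy vectors are all hypothetical and do not reproduce the paper's setup or its 6% figure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def false_acceptance_rate(trials, threshold):
    """Fraction of impostor trials whose score reaches the threshold.

    trials: list of (enrolled_embedding, tts_embedding) pairs. Every pair
    is an impostor trial, because the TTS voice is not the enrolled human.
    threshold: hypothetical decision threshold on the similarity score.
    """
    if not trials:
        raise ValueError("need at least one trial")
    accepted = sum(
        1
        for enrolled, probe in trials
        if cosine_similarity(enrolled, probe) >= threshold
    )
    return accepted / len(trials)


# Toy two-dimensional embeddings, for illustration only (assumed data).
toy_trials = [
    ([1.0, 0.0], [0.9, 0.1]),   # nearly parallel vectors: accepted
    ([1.0, 0.0], [0.0, 1.0]),   # orthogonal vectors: rejected
    ([0.0, 1.0], [1.0, 0.0]),   # orthogonal vectors: rejected
    ([1.0, 1.0], [-1.0, 1.0]),  # orthogonal vectors: rejected
]

far = false_acceptance_rate(toy_trials, threshold=0.8)
print(f"False acceptance rate: {far:.2%}")
```

In this toy setting one of the four impostor trials is accepted. Raising the threshold lowers the false acceptance rate at the cost of rejecting more genuine trials, which is one way the similarity metric and decision rule can influence the reported confusion.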