ISCA Archive Interspeech 2023

Using speech synthesis to explain automatic speaker recognition: a new application of synthetic speech

Georgina Brown, Christin Kirchhübel, Ramiz Cuthbert

Some speech synthesis systems make use of zero-shot adaptation to generate speech based on a target speaker. These systems produce speaker embeddings in the same way that speaker embeddings (often called 'x-vectors') are produced in automatic speaker recognition systems. This commonality between the two technologies could lower barriers that constrain the use of automatic speaker recognition systems in forensic speech analysis casework. A key barrier to the use of automatic speaker recognition in the forensic context is the issue of explainability, including what information about the voice a system uses in order to arrive at conclusions. This paper sets out a new approach that could be used to effectively communicate this type of information to audiences in the legal setting. Specifically, it is proposed that exposing listeners to synthetic speech produced by a zero-shot adaptation system could illustrate what aspects of the voice an automatic speaker recognition system captures.