ISCA Archive Interspeech 2024

A comparison of voice similarity through acoustics, human perception and deep neural network (DNN) speaker verification systems

Suyuan Liu, Molly Babel, Jian Zhu

Voice similarity can be assessed through acoustic analysis, perceptual judgments by human listeners, and, more recently, automatic speaker verification systems. However, the similarity judgments derived from acoustics, listener perception, and deep neural network (DNN) based speaker verification systems have not yet been compared directly. This project fills this gap by comparing acoustic similarity scores computed over 24 acoustic dimensions, and verification scores generated by seven pretrained speaker verification models from the Wespeaker toolkit, with perceptual similarity assessed by human listeners in an AX discrimination task and a (dis)similarity rating task. Results suggest that verification similarities correlate with acoustic similarities, but not with human perceptual similarities when controlling for talker pair, indicating that the correspondence between listeners and speaker verification models holds at a gross phonetic level rather than a fine-grained phonetic level.
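
As a rough illustration of the verification-score side of this comparison, the sketch below scores each talker pair with the cosine similarity between speaker embeddings, the standard scoring used by DNN speaker verification systems. The embeddings, talker names, and embedding dimension here are placeholders; in the study itself the embeddings would come from the seven pretrained Wespeaker models, and the acoustic and perceptual similarity measures are not reproduced.

```python
# Minimal sketch of a verification-style similarity computation.
# Assumes speaker embeddings have already been extracted for each recording
# with a pretrained speaker verification model (e.g. via the Wespeaker toolkit);
# the talker names and 256-dimensional random vectors below are placeholders.
from itertools import combinations

import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (verification score)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Placeholder embeddings: one vector per talker recording.
rng = np.random.default_rng(0)
embeddings = {f"talker_{i}": rng.normal(size=256) for i in range(4)}

# Pairwise verification similarities for every talker pair, analogous to the
# scores that the paper compares against acoustic and perceptual similarity.
for (name_a, emb_a), (name_b, emb_b) in combinations(embeddings.items(), 2):
    print(name_a, name_b, round(cosine_similarity(emb_a, emb_b), 3))
```

Correlating these pairwise verification scores with the corresponding acoustic-distance and listener-judgment scores for the same talker pairs would be one plausible way to set up the comparisons summarized above.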