Voice Conversion With Just Nearest Neighbors

Baas, Matthew; van Niekerk, Benjamin; Kamper, Herman

doi:10.21437/Interspeech.2023-419

Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference. Recent methods produce convincing conversions, but at the cost of increased complexity, which makes results difficult to reproduce and build on. Instead, we keep it simple. We propose k-nearest neighbors voice conversion (kNN-VC): a straightforward yet effective method for any-to-any conversion. First, we extract self-supervised representations of the source and reference speech. To convert to the target speaker, we replace each frame of the source representation with its nearest neighbor in the reference. Finally, a pretrained vocoder synthesizes audio from the converted representation. Objective and subjective evaluations show that kNN-VC improves speaker similarity while achieving intelligibility scores similar to those of existing methods. Code, samples, and trained models are available at https://bshall.github.io/knn-vc.
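
The frame-replacement step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `knn_convert`, the use of cosine similarity as the distance measure, the `k` parameter with averaging over neighbors, and the random arrays standing in for self-supervised features are all assumptions made for illustration. Feature extraction and vocoding are omitted.

```python
import numpy as np


def knn_convert(source, reference, k=1):
    """Replace each source frame with the mean of its k nearest reference frames.

    source:    array of shape (n_source_frames, dim)
    reference: array of shape (n_reference_frames, dim)
    Hypothetical helper; cosine similarity is an assumed choice of metric.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    src = source / np.linalg.norm(source, axis=1, keepdims=True)
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    similarity = src @ ref.T  # shape: (n_source_frames, n_reference_frames)

    # Indices of the k most similar reference frames for every source frame.
    nearest = np.argsort(-similarity, axis=1)[:, :k]  # shape: (n_source_frames, k)

    # Gather the selected reference frames and average over the k neighbors.
    return reference[nearest].mean(axis=1)  # shape: (n_source_frames, dim)


# Random placeholders for frame-level features (assumption: real features
# would come from a self-supervised speech model).
rng = np.random.default_rng(0)
source = rng.normal(size=(50, 16))
reference = rng.normal(size=(200, 16))

converted = knn_convert(source, reference, k=1)
```

With `k=1`, every row of `converted` is copied directly from `reference`, so the output contains only frames from the target speaker while following the frame order of the source. In a full pipeline, `converted` would then be passed to a pretrained vocoder.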

Cite as: Baas, M., van Niekerk, B., Kamper, H. (2023) Voice Conversion With Just Nearest Neighbors. Proc. INTERSPEECH 2023, 2053-2057, doi: 10.21437/Interspeech.2023-419

@inproceedings{baas23_interspeech,
  author={Matthew Baas and Benjamin {van Niekerk} and Herman Kamper},
  title={{Voice Conversion With Just Nearest Neighbors}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={2053--2057},
  doi={10.21437/Interspeech.2023-419},
  issn={2308-457X}
}