ISCA Archive Interspeech 2024

PitchFlow: adding pitch control to a Flow-matching based TTS model

Tasnima Sadekova, Mikhail Kudinov, Vadim Popov, Assel Yermekova, Artem Khrapov

In recent years, there have been various attempts to improve denoising diffusion probabilistic models and make them more suitable for real-world applications. One recent advance in this direction is the flow-matching framework, which has already shown good results in image and speech generation tasks. Despite their high quality and generation speed, flow-matching text-to-speech models still have problems with stability and control. To mitigate these issues, we propose two techniques, speaker scoring and pitch guidance, which allow control over the timbre and pitch contour of the generated speech. We show that the optimal choice of the prior leads to a considerable improvement in speaker similarity and that a specific design of classifier guidance allows for fine-grained pitch control with high naturalness. Moreover, these techniques may be used to implement a voice conversion system of competitive quality.