ISCA Archive Interspeech 2025

Bridging the Training–Inference Gap in TTS: Training Strategies for Robust Generative Postprocessing for Low-Resource Speakers

Frank Zalkow, Paolo Sani, Kishor Kayyar Lakshminarayana, Emanuël A. P. Habets, Nicola Pia, Christian Dittmar

Modern text-to-speech synthesis systems usually consist of an acoustic model generating speech features, e.g., mel spectrograms, and a vocoder converting them into speech waveforms. The vocoder is typically trained with ground-truth features but receives features from the acoustic model during inference, leading to a mismatch between training and inference. To address this issue, previous work proposed employing generative postprocessing models to make the synthetic features appear more natural. While such systems can produce speech nearly indistinguishable from real speech when sufficient training data is available, their performance degrades with limited data. To mitigate this limitation, we propose a training data generation procedure using a subsampling strategy and multiple acoustic models. We evaluate it through listening tests, demonstrating consistent improvements in the naturalness of the synthetic speech across different postprocessing models and low-resource target speakers.
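The data-generation idea described above can be sketched roughly as follows: train several acoustic models on different random subsamples of the target speaker's data, then pool their synthetic features, paired with the ground-truth features, as training material for the postprocessing model, so it sees a variety of acoustic-model artifacts. This is an illustrative sketch, not the paper's implementation; the function names and the toy stand-in "models" are assumptions.

```python
import random

def subsample(utterances, fraction, seed):
    """Draw a random subset of the training utterances."""
    rng = random.Random(seed)
    k = max(1, int(len(utterances) * fraction))
    return rng.sample(utterances, k)

def train_acoustic_model(subset):
    """Stand-in: a real system would train a TTS acoustic model
    on this subset and return a text-to-mel predictor."""
    return lambda text: f"synthetic-mel({text};trained-on={len(subset)})"

def build_postprocessing_data(gt_features, texts, n_models=3, fraction=0.8):
    """Pair synthetic features from several subsample-trained acoustic
    models with the ground-truth features, exposing the postprocessing
    model to varied synthesis artifacts (hypothetical setup)."""
    pairs = []
    for seed in range(n_models):
        model = train_acoustic_model(subsample(gt_features, fraction, seed))
        for text, target in zip(texts, gt_features):
            pairs.append((model(text), target))
    return pairs

pairs = build_postprocessing_data(["mel_a", "mel_b"], ["hello", "world"])
print(len(pairs))  # 3 models x 2 utterances = 6 training pairs
```

Each (synthetic, ground-truth) pair trains the postprocessor to map imperfect acoustic-model output toward natural features, which is what narrows the training-inference gap at vocoding time.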