ISCA Archive Interspeech 2023

Adapter-Based Extension of Multi-Speaker Text-To-Speech Model for New Speakers

Cheng-Ping Hsieh, Subhankar Ghosh, Boris Ginsburg

Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers. However, this approach has some challenges: it typically requires several hours of high-quality speech per speaker, and it can degrade the quality of speech synthesis for previously learned speakers. In this paper, we propose an alternative approach for TTS adaptation based on parameter-efficient adapter modules. In the proposed approach, a few small adapter modules are added between the layers of the pretrained network; the pretrained model is frozen, and only the adapters are fine-tuned on the speech of a new speaker. This yields a new model that shares almost all of its parameters with the original one. Our experiments on the LibriTTS, HiFi-TTS, and VCTK datasets validate the adapter-based method through objective and subjective metrics. The code is open-sourced, and audio samples are available on our demo page.
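To make the idea concrete, the sketch below shows one common way a residual bottleneck adapter can be inserted after each frozen layer of a pretrained backbone, with only the adapter weights left trainable. It is a minimal illustration under assumed names (`BottleneckAdapter`, `adapt_for_new_speaker`, the toy backbone and dimensions), not the paper's exact TTS architecture.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Down-project -> non-linearity -> up-project, added residually."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen pretrained representation intact.
        return x + self.up(self.act(self.down(x)))


class AdaptedLayer(nn.Module):
    """Wraps one frozen pretrained layer and inserts an adapter after it."""

    def __init__(self, layer: nn.Module, dim: int):
        super().__init__()
        self.layer = layer
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.layer(x))


def adapt_for_new_speaker(model: nn.Sequential, dim: int) -> nn.Sequential:
    # Freeze every pretrained weight; only adapter parameters remain trainable,
    # so the new-speaker model shares almost all parameters with the original.
    for p in model.parameters():
        p.requires_grad = False
    return nn.Sequential(*[AdaptedLayer(layer, dim) for layer in model])


# Usage: a toy two-layer stack stands in for the pretrained TTS backbone.
pretrained = nn.Sequential(nn.Linear(80, 80), nn.Linear(80, 80))
adapted = adapt_for_new_speaker(pretrained, dim=80)
trainable = [p for p in adapted.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```

Because only the small adapter matrices are optimized, storing a new speaker amounts to saving the adapter weights on top of the shared frozen backbone.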