ISCA Archive Interspeech 2024

Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems

Zhengyang Chen, Xuechen Liu, Erica Cooper, Junichi Yamagishi, Yanmin Qian

This paper proposes a speech synthesis system that allows users to specify and control the acoustic characteristics of a speaker by means of prompts describing the speaker traits of the synthesized speech. Unlike previous approaches, our method utilizes listener impressions to construct prompts, which are easier to collect and align more naturally with everyday descriptions of speaker traits. We adopt the Low-rank Adaptation (LoRA) technique to swiftly tailor a pre-trained language model to our needs, facilitating the extraction of speaker-related traits from the prompt text. In addition, unlike other prompt-driven text-to-speech (TTS) systems, we separate the prompt-to-speaker module from the multi-speaker TTS system, enhancing system flexibility and compatibility with various pre-trained multi-speaker TTS systems. Moreover, for the prompt-to-speaker-characteristic module, we compared a discriminative method with a flow-matching-based generative method and found that combining the two helps the system both capture speaker-related information from prompts more accurately and generate speech with higher fidelity.
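As a rough illustration of the prompt-to-speaker idea summarized above, the sketch below shows how an impression prompt could be encoded with a LoRA-adapted pre-trained language model and projected to a speaker embedding that conditions a pre-trained multi-speaker TTS system. The model name, embedding dimension, pooling strategy, and class/function names here are illustrative assumptions, not the authors' implementation; the flow-matching generative variant discussed in the paper is not shown.

```python
# Hypothetical sketch (not the authors' released code): encode a listener-impression
# prompt with a LoRA-adapted pre-trained LM and map it to a speaker embedding
# compatible with an existing multi-speaker TTS system.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model


class PromptToSpeaker(nn.Module):
    def __init__(self, lm_name="bert-base-uncased", spk_dim=256):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(lm_name)
        lm = AutoModel.from_pretrained(lm_name)
        # LoRA: only the low-rank adapter weights are trained; the base LM stays frozen.
        lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                              target_modules=["query", "value"])
        self.lm = get_peft_model(lm, lora_cfg)
        # Discriminative head: project the pooled prompt representation into the
        # speaker-embedding space expected by the pre-trained TTS system.
        self.proj = nn.Linear(lm.config.hidden_size, spk_dim)

    def forward(self, prompts):
        batch = self.tokenizer(prompts, return_tensors="pt",
                               padding=True, truncation=True)
        hidden = self.lm(**batch).last_hidden_state      # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)     # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)    # masked mean pooling
        return self.proj(pooled)                         # (B, spk_dim)


# Usage: the resulting embedding replaces a reference-speaker embedding when
# conditioning a pre-trained multi-speaker TTS model.
model = PromptToSpeaker()
spk_emb = model(["a calm, low-pitched male voice with a gentle, warm tone"])
```

Because the prompt-to-speaker module is decoupled from the TTS backbone, a mapping of this kind can in principle be attached to different pre-trained multi-speaker TTS systems without retraining the synthesizer itself.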