ISCA Archive Interspeech 2023
ISCA Archive Interspeech 2023

Estimation of Listening Response Timing by Generative Model and Parameter Control of Response Substantialness Using Dynamic-Prompt-Tune

Toshiki Muromachi, Yoshinobu Kano

A spoken dialogue system is required to continuously listen to a human user for smooth conversation. We propose a method that simultaneously performs response generation and listener response timing estimation. Our proposed method estimates listener response timing by adding pseudo-samples where listener response should be irrelevant, which allows using text-only conversation dataset without audio information. Furthermore, our proposed method can control substantialness of responses by user-specified parameter integrated with the Dynamic-Prompt-Tune method, which uses prompt token embedding dynamically generated from the parameter. Our automatic and manual evaluation showed that the proposed method can generate responses with more natural timing and more in line with the response substantialness parameter compared to the baseline model.