ISCA Archive Interspeech 2023

Controlling Multi-Class Human Vocalization Generation via a Simple Segment-based Labeling Scheme

Hieu-Thi Luong, Junichi Yamagishi

As prompt-based generative models have received much attention, many studies have proposed similar models for sound generation. While prompt-based generative models offer an intuitive interface for non-professional users to experiment with, they lack the ability to control the generated sounds through more direct means. In this work, we investigate a simple segment-based labeling scheme for human vocalization generation, a specific subset of sound generation. Conditioning the generative model on a label sequence that marks the vocalization class of each segment allows the generated sound to be controlled in a more detailed manner while maintaining a simple and intuitive input interface. Our experiments showed that simply switching the labeling scheme from global to segment-based does not degrade the quality of the generated samples in any way and provides a new method of controlling the generation process.
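To make the difference between the two labeling schemes concrete, the following is a minimal sketch of how a conditioning label sequence might be built under each one. The class inventory (`CLASSES`), the helper names `global_labels` and `segment_labels`, and the use of a fixed number of segments per utterance are all assumptions made for illustration; the abstract does not specify the classes, the segment length, or how the labels are encoded.

```python
from typing import List, Tuple

# Hypothetical class inventory (assumption): the abstract does not list the classes.
CLASSES = ["silence", "laughter", "cough", "sigh"]
CLASS_TO_ID = {name: i for i, name in enumerate(CLASSES)}


def global_labels(class_name: str, num_segments: int) -> List[int]:
    """Global scheme: a single class label repeated for every segment."""
    return [CLASS_TO_ID[class_name]] * num_segments


def segment_labels(spans: List[Tuple[str, int]]) -> List[int]:
    """Segment-based scheme: each (class_name, num_segments) span
    contributes its own label, so the class can change over time."""
    labels: List[int] = []
    for class_name, count in spans:
        labels.extend([CLASS_TO_ID[class_name]] * count)
    return labels


# One label for the whole sample versus a sequence that changes per segment.
whole_sample = global_labels("laughter", 6)
per_segment = segment_labels([("silence", 2), ("laughter", 3), ("cough", 1)])
```

Under this sketch both schemes yield a sequence of the same length, so the same conditioning input can carry either one; the segment-based sequence simply lets the user say where each vocalization class occurs.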