ISCA Archive Interspeech 2025

A Simple-Yet-Effective Data Augmentation Method for Speaker Identification in Novels

Wenjie Zhong, Jason Naradowsky, Yusuke Miyao

Speaker identification in novels is crucial for speech synthesis systems, which must assign appropriate voices in audiobook production. The task attributes each utterance to a speaker by analyzing its context. Traditional approaches rely heavily on human-annotated datasets, which are costly and scarce, and this limits model performance. To overcome this limitation, we propose a simple-yet-effective data augmentation method that uses large language models (LLMs) to generate synthetic dialogues and then post-processes these dialogues into augmented training instances. Our experiments show that this method achieves a state-of-the-art accuracy of 82.6%, surpassing the previous baseline by 2.4%. Performance gains are especially notable in the Implicit (hard) category, where our method exceeds the previous baseline by 3.5%. Our analysis suggests that the method enhances the ability to capture long-term dependencies and that there is a mutually reinforcing effect between the Implicit and Anaphoric (medium) categories.
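
The abstract does not specify how synthetic dialogues are post-processed, so the following is only an illustrative sketch of what turning one speaker-labeled dialogue into training instances might look like. The `Instance` fields, the `dialogue_to_instances` helper, the `window` parameter, and the example dialogue are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instance:
    """One hypothetical training instance (field names are assumptions)."""
    context: List[str]     # preceding utterances, with speaker names withheld
    utterance: str         # the quote whose speaker must be identified
    candidates: List[str]  # all speakers appearing in the dialogue
    speaker: str           # gold label, known because the dialogue is synthetic


def dialogue_to_instances(turns: List[Tuple[str, str]], window: int = 2) -> List[Instance]:
    """Turn a list of (speaker, utterance) pairs into one instance per utterance.

    `window` is an assumed hyperparameter: how many preceding utterances
    are kept as context for each target utterance.
    """
    candidates = sorted({speaker for speaker, _ in turns})
    instances = []
    for i, (speaker, utterance) in enumerate(turns):
        # Keep only the text of the preceding turns, not their speaker labels.
        context = [text for _, text in turns[max(0, i - window):i]]
        instances.append(Instance(context, utterance, candidates, speaker))
    return instances


# A made-up dialogue standing in for LLM output.
turns = [
    ("Alice", "Did you lock the door?"),
    ("Bob", "I thought you did."),
    ("Alice", "Then we should go back."),
]
instances = dialogue_to_instances(turns)
```

In this sketch the gold speaker label comes for free from the generated dialogue, which is the property that would make synthetic data a substitute for costly human annotation.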