ISCA Archive Interspeech 2025

Towards Robust Speaker Recognition against Intrinsic Variation with Foundation Model Few-shot Tuning and Effective Speech Synthesis

Zhiyong Chen, Shuhang Wu, Xinnuo Li, Zhiqi Ai, Shugong Xu

Speaker recognition is essential for secure authentication and personalized voice assistants in smart home settings, but it is challenged by intrinsic speaker variability such as aging and emotional fluctuation. Existing methods often rely on pretraining and require extensive data. To address these challenges, we propose a framework for time-varying and emotion-robust open-set identification (OSI) in smart home environments that combines few-shot, enrollment-time tuning of a foundation model with style-rich zero-shot text-to-speech (TTS) synthesis. We explore best practices for synthetic data selection and suitable open-set, outlier-focused loss functions. The proposed method improves the handling of emotional and aging variations in target speakers, enhancing robustness to intrinsic variability while maintaining resilience to unknown outliers. Experiments demonstrate strong generalization across multiple time-varying and emotionally rich benchmarks.
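
For readers unfamiliar with open-set identification, the following is a minimal sketch of the generic decision rule: score a test embedding against every enrolled speaker and reject it as an unknown outlier when the best score falls below a threshold. It is an illustration only, not the method proposed in this paper. The function names, the toy 3-dimensional embeddings, and the threshold of 0.5 are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(test_emb, enrolled, threshold=0.5):
    """Open-set identification over a dict of enrolled speaker embeddings.

    Returns (speaker_name, score) for the best match, or (None, score)
    when the best score is below the (hypothetical) rejection threshold.
    """
    best_name, best_score = None, -1.0
    for name, emb in enrolled.items():
        score = cosine(test_emb, emb)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score  # treated as an unknown (outlier) speaker
    return best_name, best_score


# Toy enrollment set; real systems would use embeddings from a speaker encoder.
enrolled = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}

print(identify([0.9, 0.1, 0.0], enrolled))  # close to "alice"
print(identify([0.0, 0.0, 1.0], enrolled))  # far from everyone, rejected
```

In this framing, intrinsic variation (aging, emotion) lowers the score of a genuine target speaker, while the threshold must still reject unknown speakers, which is the trade-off the abstract describes.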