ISCA Archive Interspeech 2025

TargetVoice: Single Channel Low-Latency Target Speaker Extraction

Arun Kumar Pallala, Nivedita Chennupati, Balaji Padmanaban, Rakesh Pogula, Uma Subhashini Ravuri, Naveen Ellanki, Harish Rajamani, Naveen Ambati

We present TargetVoice, a lightweight, low-latency target speaker extraction (TSE) model optimized for edge devices. It isolates a target speaker’s voice in multi-speaker and noisy environments, which makes it well suited to call centers, conference calls, hands-free communication, and smart speakers. Because it streams only the enrolled speaker’s voice, TargetVoice also improves speech recognition accuracy in real-world conditions. Unlike existing models, which struggle with speakers of the same gender or with varying acoustic environments, TargetVoice relies on a robust in-house data strategy and a specialized speaker embedding extraction system. A compact 10 MB speaker encoder generates a reliable embedding from a single 3-second enrollment. This embedding is fused with the input mixture in a 12 MB extraction block requiring 6 GMACs to isolate the target voice efficiently, enabling real-time performance on resource-constrained platforms.
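
The two-stage data flow described above (enrollment to embedding, then embedding plus mixture to extracted speech) can be illustrated with a toy sketch. This is not the TargetVoice architecture: the abstract does not specify the sample rate, frame size, embedding dimension, or fusion method, so `SAMPLE_RATE`, `FRAME`, `EMB_DIM`, the multiplicative fusion, the time-domain mask, and all function names below are assumptions chosen for illustration, and the weights are random stand-ins for trained parameters.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumed; not stated in the abstract
ENROLL_SECONDS = 3     # from the abstract: a single 3-second enrollment
FRAME = 256            # hypothetical frame length in samples
EMB_DIM = 64           # hypothetical embedding size

rng = np.random.default_rng(0)
# Random stand-ins for trained weights (illustration only).
W_enc = rng.standard_normal((FRAME, EMB_DIM)) * 0.01
W_mix = rng.standard_normal((FRAME, EMB_DIM)) * 0.01
W_out = rng.standard_normal((EMB_DIM, FRAME)) * 0.01


def frames(signal):
    """Split a 1-D signal into non-overlapping frames, dropping the tail."""
    n = len(signal) // FRAME
    return signal[: n * FRAME].reshape(n, FRAME)


def speaker_embedding(enrollment):
    """Toy speaker encoder: project frames, average over time, L2-normalize."""
    e = np.tanh(frames(enrollment) @ W_enc).mean(axis=0)
    return e / (np.linalg.norm(e) + 1e-8)


def extract_frame(mix_frame, emb):
    """Toy extraction step: multiplicative fusion (an assumption), then a mask."""
    hidden = np.tanh(mix_frame @ W_mix) * emb          # fuse mixture features with embedding
    mask = 1.0 / (1.0 + np.exp(-(hidden @ W_out)))     # sigmoid mask, values in (0, 1)
    return mask * mix_frame


def stream_extract(mixture, emb):
    """Process the mixture one frame at a time, as a streaming system would."""
    return np.concatenate([extract_frame(f, emb) for f in frames(mixture)])


# Synthetic noise stands in for real audio.
enrollment = rng.standard_normal(SAMPLE_RATE * ENROLL_SECONDS)
mixture = rng.standard_normal(SAMPLE_RATE)

emb = speaker_embedding(enrollment)    # computed once per enrolled speaker
output = stream_extract(mixture, emb)  # reused for every incoming frame
```

The point of the sketch is the split in cost: the embedding is computed once from the enrollment audio, while the per-frame extraction step is the only part that runs continuously, which is what makes frame-by-frame streaming feasible on constrained hardware.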