Say Who You Want to Hear: Leveraging TTS Style Embeddings for Text-Guided Speech Extraction
Akam Rahimi, Triantafyllos Afouras, Andrew Zisserman
We introduce TextSep, a novel single-channel speech separation framework that leverages free-form textual descriptions of a speaker's voice to guide separation from noisy multi-speaker audio mixtures, without relying on enrolment audio, images, or video. Building on advances in text-to-speech (TTS), we invert the Parler-TTS pipeline to extract rich style embeddings from the earliest cross-modal layer, enabling speech separation directly from natural language descriptions. Our main contributions are: (1) curating a large dataset of paired text descriptions and clean audio; (2) identifying and utilizing the projected key vectors of Parler-TTS as effective style embeddings via a lightweight wrapper; (3) integrating these embeddings into a transformer-based architecture as prefix tokens and through FiLM modulation of encoder activations; and (4) demonstrating that TextSep achieves competitive performance on synthetic benchmarks, without requiring any reference audio or visual cues.
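To make the conditioning scheme in contribution (3) concrete, the following is a minimal PyTorch sketch of injecting a style embedding into a transformer separator both as a prefix token and via FiLM modulation of encoder activations. All module and variable names (`FiLM`, `ConditionedSeparator`, dimensions, layer counts) are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn


class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift activations
    with parameters predicted from a conditioning (style) embedding."""

    def __init__(self, style_dim: int, channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(style_dim, channels)
        self.to_beta = nn.Linear(style_dim, channels)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels); style: (batch, style_dim)
        gamma = self.to_gamma(style).unsqueeze(1)  # (batch, 1, channels)
        beta = self.to_beta(style).unsqueeze(1)
        return gamma * x + beta


class ConditionedSeparator(nn.Module):
    """Toy transformer separator conditioned on a style embedding,
    both as a prepended prefix token and via FiLM on encoder inputs."""

    def __init__(self, style_dim: int = 64, d_model: int = 128, n_layers: int = 2):
        super().__init__()
        self.proj_style = nn.Linear(style_dim, d_model)  # style -> prefix token
        self.film = FiLM(style_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mask_head = nn.Linear(d_model, d_model)     # predicts a soft mask

    def forward(self, feats: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, d_model) mixture features; style: (batch, style_dim)
        feats = self.film(feats, style)                  # FiLM-modulate activations
        prefix = self.proj_style(style).unsqueeze(1)     # (batch, 1, d_model)
        h = self.encoder(torch.cat([prefix, feats], dim=1))
        return torch.sigmoid(self.mask_head(h[:, 1:]))   # drop prefix; target mask


# Usage with random tensors standing in for mixture features and a
# text-derived style embedding:
model = ConditionedSeparator()
mix = torch.randn(2, 200, 128)   # (batch, time, d_model) mixture features
style = torch.randn(2, 64)       # style embedding from the text description
mask = model(mix, style)         # (2, 200, 128) mask for the described speaker
```

Combining both injection routes is a common design choice for this kind of conditioning: the prefix token lets every self-attention layer attend to the style globally, while FiLM biases the encoder's feature statistics toward the described speaker from the very first layer.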