ISCA Archive Interspeech 2024

wTIMIT2mix: A Cocktail Party Mixtures Database to Study Target Speaker Extraction for Normal and Whispered Speech

Marvin Borsdorf, Zexu Pan, Haizhou Li, Tanja Schultz

Target speaker extraction (TSE) seeks to single out a target speaker's voice from a given speech mixture signal with the help of a target reference signal. This technique enables novel speech applications such as smart hearing aids. A TSE system has to work reliably in any everyday conversational situation, including those in which speakers switch naturally between normal and whispered speech modes. This work represents the first attempt to perform TSE for whispered speech. To this end, we construct a new, first-of-its-kind database called wTIMIT2mix, which comprises two-speaker speech mixtures and target speaker reference signals in both normal and whispered speech modes. Our TSE results show that if these speech-mode conditions are included in training, a model can be equipped to work under all closed-set conditions.