ISCA Archive Interspeech 2019

Target Speaker Recovery and Recognition Network with Average x-Vector and Global Training

Wenjie Li, Pengyuan Zhang, Yonghong Yan

Multi-talker automatic speech recognition (ASR) remains very challenging. Several speaker-aware selective methods have been proposed to recover the speech of a target speaker, relying on auxiliary speaker information provided by an anchor (a clean audio sample of the target speaker). However, their performance is unstable because it depends on the quality of the provided anchor. To address this limitation, we propose to use average speaker embeddings to build the target speaker recovery network (TRnet). The TRnet takes the mixed speech and the stable average speaker embedding as input and produces time-frequency (TF) masks for the target speech. When training the TRnet, we summarize each speaker's embeddings over the whole training dataset instead of extracting an embedding from a randomly picked anchor. At test time, one or very few anchors are enough to obtain decent recovery results. The TRnet trained with average speaker embeddings shows relative improvements of 13% in WER and 12.5% in SDR compared with the model trained on short anchors. Moreover, to mitigate the mismatch between the TRnet and the acoustic model (AM), we adopt two strategies: fine-tuning the AM and training a global TRnet. Both bring considerable reductions in WER, and the results show that the globally trained framework achieves superior performance.
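As an illustration of the per-speaker summarization described in the abstract, the following is a minimal sketch that assumes the summary is a plain arithmetic mean of a speaker's utterance-level embeddings. The abstract does not specify the exact pooling or any normalization, so this is an assumption, and the function name `average_speaker_embeddings`, the speaker labels, and the toy three-dimensional vectors are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def average_speaker_embeddings(
    utterances: Sequence[Tuple[str, Sequence[float]]],
) -> Dict[str, List[float]]:
    """Return one mean embedding per speaker (assumed pooling: arithmetic mean)."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for speaker, emb in utterances:
        if speaker not in sums:
            # First utterance of this speaker: start a zero accumulator.
            sums[speaker] = [0.0] * len(emb)
            counts[speaker] = 0
        if len(emb) != len(sums[speaker]):
            raise ValueError("embedding dimension mismatch for " + speaker)
        for i, value in enumerate(emb):
            sums[speaker][i] += value
        counts[speaker] += 1
    # Divide each accumulated sum by the number of utterances of that speaker.
    return {s: [v / counts[s] for v in total] for s, total in sums.items()}


# Hypothetical toy data: (speaker id, utterance-level embedding).
toy = [
    ("spk_a", [1.0, 0.0, 2.0]),
    ("spk_a", [3.0, 2.0, 0.0]),
    ("spk_b", [0.5, 0.5, 0.5]),
]
avg = average_speaker_embeddings(toy)
```

In this sketch, every training mixture containing a given speaker would be paired with the same averaged vector, which is what makes the conditioning input stable across utterances. At test time the same function could be applied to the one or few available anchors.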