ISCA Archive Interspeech 2024

A Transformer-Based Voice Activity Detector

Biswajit Karan, Joshua Jansen van Vüren, Febe de Wet, Thomas Niesler

Voice activity detection (VAD) is the task of distinguishing speech from other types of audio signals, such as music or background noise. We introduce a novel end-to-end VAD architecture which incorporates a pre-trained transformer model (Wav2Vec2-XLS-R). We evaluate the proposed architecture on an established VAD dataset, AVA-Speech, and on a manually-segmented corpus of under-resourced multilingual speech. As benchmarks, we include a hybrid CNN-BiLSTM system and an off-the-shelf enterprise VAD. On the AVA-Speech test set, our proposed VAD achieves an area under the curve (AUC) of 96.2%, compared with 94.8% and 81.9% for the two benchmarks. On the multilingual dataset, the gap widens: the transformer-based VAD achieves 92.2%, while the two baselines achieve 80.8% and 74.6% respectively. The proposed VAD therefore offers improved performance in all cases, with an absolute increase of more than 11% in our target domain. We conclude that the proposed end-to-end architecture improves VAD performance.
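The abstract does not specify the exact design of the proposed system, but the general idea of building a VAD on top of a pre-trained Wav2Vec2-XLS-R encoder can be illustrated as follows. This is a minimal sketch under assumed choices (the `facebook/wav2vec2-xls-r-300m` checkpoint, a small linear classification head, per-frame sigmoid outputs), not the authors' actual architecture.

```python
# Minimal sketch of a transformer-based frame-level VAD (illustrative only).
# Checkpoint name and head design are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class TransformerVAD(nn.Module):
    def __init__(self, encoder_name: str = "facebook/wav2vec2-xls-r-300m"):
        super().__init__()
        # Pre-trained cross-lingual speech encoder.
        self.encoder = Wav2Vec2Model.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Per-frame binary classifier: speech vs. non-speech.
        self.head = nn.Sequential(
            nn.Linear(hidden, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz audio.
        frames = self.encoder(waveform).last_hidden_state  # (batch, frames, hidden)
        return torch.sigmoid(self.head(frames)).squeeze(-1)  # (batch, frames)

# Usage example: per-frame speech probabilities for one second of audio.
vad = TransformerVAD()
probs = vad(torch.randn(1, 16000))
```

In this sketch the encoder produces one feature vector roughly every 20 ms, and the head maps each vector to a speech probability; thresholding these probabilities yields the speech/non-speech segmentation that a VAD is evaluated on (e.g. via AUC, as reported above).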