ISCA Archive Interspeech 2024

Speaker Conditional Sinc-Extractor for Personal VAD

En-Lun Yu, Kuan-Hsun Ho, Jeih-weih Hung, Shih-Chieh Huang, Berlin Chen

This study explores a novel application of Sinc-convolution to Personal Voice Activity Detection (PVAD). The Sinc-Extractor (SE) network, developed for PVAD, learns the cutoff frequencies and band gains of sinc functions to extract acoustic features. In addition, the speaker conditional SE (SCSE) module incorporates speaker information from high-dimensional d-vectors into low-dimensional acoustic features. SE-PVAD and Vanilla PVAD have similar model sizes and computational loads, while SCSE-PVAD is more compact and has a shorter inference time because it excludes the speaker embedding. Evaluated on concatenated utterances from the LibriSpeech corpus, SE-PVAD significantly outperforms Vanilla PVAD. SCSE-PVAD matches Vanilla PVAD's performance while reducing input feature dimensionality and network complexity. SCSE-PVAD can therefore function like a typical VAD, accepting only acoustic features, which makes it suitable for low-resource wearable devices.
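To illustrate what learning "cutoff frequencies and band gains of sinc functions" can mean in practice, the following is a minimal sketch of a single band-pass filter built as the difference of two windowed sinc low-pass filters and scaled by a gain. It assumes a SincNet-style parameterization, which the abstract does not confirm as the authors' exact design. The function names (`bandpass_kernel`, `magnitude_response`), the Hamming window, the kernel size, and the example band edges are all hypothetical choices made for illustration; the 16 kHz sample rate matches LibriSpeech audio. In a trained model the three parameters `f_low`, `f_high`, and `gain` would be updated by gradient descent, whereas here they are fixed numbers.

```python
import math


def sinc(x):
    """Unnormalized sinc: sin(x) / x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def bandpass_kernel(f_low, f_high, gain, kernel_size=101, sample_rate=16000):
    """Build one band-pass FIR kernel from two cutoffs (Hz) and a band gain.

    Assumed parameterization (not taken from the paper): the ideal band-pass
    response is the difference of two ideal low-pass sinc filters, smoothed
    with a Hamming window and scaled by `gain`.
    """
    assert kernel_size % 2 == 1, "an odd length keeps the kernel symmetric"
    half = kernel_size // 2
    f1 = f_low / sample_rate   # normalized lower cutoff (cycles per sample)
    f2 = f_high / sample_rate  # normalized upper cutoff (cycles per sample)
    kernel = []
    for i in range(kernel_size):
        n = i - half  # sample index centered on zero
        ideal = (2 * f2 * sinc(2 * math.pi * f2 * n)
                 - 2 * f1 * sinc(2 * math.pi * f1 * n))
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        kernel.append(gain * ideal * window)
    return kernel


def magnitude_response(kernel, freq_hz, sample_rate=16000):
    """Magnitude of the kernel's frequency response at a single frequency."""
    w = 2 * math.pi * freq_hz / sample_rate
    re = sum(h * math.cos(w * k) for k, h in enumerate(kernel))
    im = -sum(h * math.sin(w * k) for k, h in enumerate(kernel))
    return math.hypot(re, im)


# Example: one band covering 1-3 kHz with a band gain of 2.
kernel = bandpass_kernel(f_low=1000.0, f_high=3000.0, gain=2.0)
in_band = magnitude_response(kernel, 2000.0)   # inside the pass band
out_band = magnitude_response(kernel, 6000.0)  # well outside the pass band
```

Under these assumptions each filter is described by only three numbers regardless of kernel length, which is the usual argument for why sinc-based front ends stay compact: a bank of such kernels convolved with the waveform yields one feature channel per band.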