This paper demonstrates how feedback from a speech recognizer can be leveraged to improve Voice Activity Detection (VAD) for online speech recognition. First, reliably transcribed segments of audio are fed back by the recognizer as supervision for VAD model adaptation. This allows the much stronger LVCSR acoustic models to be harnessed without adding computation. Second, when to make a VAD decision is dictated by the recognizer, not the VAD module, providing an implicit dynamic look-ahead for VAD. This improves robustness, and the look-ahead can be gracefully reduced to meet latency requirements, if necessary, without retraining or retuning the VAD module. Experiments on telephone conversations yielded a 6.7% abs. reduction in frame classification error rate when feedback was applied to HMM-based VAD, and a 4.2% abs. reduction over the best baseline system. Furthermore, a 3.0% abs. WER reduction over the best baseline was achieved in speech recognition experiments.
Index Terms: voice activity detection (VAD), speech segmentation, speech recognition