Visual Voice Activity Detection (VVAD) refers to the detection of speech in a video sequence by means of visual cues. VVAD is a useful complement to auditory voice activity detection, in particular in cases involving multiple speakers or background noise. This paper focusses explicitly on the measurement of facial movements at different speeds to determine which rates of movement contribute to VVAD. Facial movements in video sequences of talking faces are measured using a spatiotemporal Gabor transform. VVAD performance based on these measurements is determined for different speeds and compared with that of simple frame differencing. In addition, performance is assessed for the entire frame, the head region, and the mouth region. The results reveal higher VVAD performance at high speeds than at low speeds. In addition, frame differencing performs at a level comparable to that of the spatiotemporal Gabor method at the optimal speeds.
Index Terms: visual active speech, frame differencing, Gabor transform, spatiotemporal Gabor transform
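
To make the frame-differencing baseline concrete, the following is a minimal sketch, not the implementation evaluated in this paper. It assumes grayscale frames stored as a NumPy array of shape (T, H, W); the function names, the optional `region` argument (standing in for a frame, head, or mouth crop), and the fixed `threshold` are hypothetical choices made for illustration only.

```python
import numpy as np


def frame_difference_energy(frames, region=None):
    """Mean absolute intensity change between consecutive frames.

    frames: array-like of shape (T, H, W), grayscale (assumed layout).
    region: optional (top, bottom, left, right) pixel bounds, a hypothetical
            way to restrict the measurement to e.g. the head or mouth region.
    Returns an array of length T - 1, one value per frame transition.
    """
    frames = np.asarray(frames, dtype=np.float64)
    if region is not None:
        top, bottom, left, right = region
        frames = frames[:, top:bottom, left:right]
    # Absolute difference between each frame and its predecessor.
    diffs = np.abs(np.diff(frames, axis=0))
    # Average over the spatial dimensions to get one score per transition.
    return diffs.mean(axis=(1, 2))


def detect_speech(frames, threshold, region=None):
    """Label each frame transition as speech when its energy exceeds threshold.

    The fixed threshold is an assumption of this sketch; any classifier
    could be applied to the energy values instead.
    """
    return frame_difference_energy(frames, region) > threshold
```

Calling `detect_speech(frames, threshold, region=(top, bottom, left, right))` with bounds around the mouth restricts the measurement to that region, while omitting `region` uses the entire frame.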