ISCA Archive Interspeech 2023

Application for Real-time Audio-Visual Speech Enhancement

Mandar Gogate, Kia Dashtipour, Amir Hussain

This short paper demonstrates a first-of-its-kind audio-visual (AV) speech enhancement (SE) desktop application that isolates, in real time, the voice of a target speaker from noisy audio input. The deep neural network model integrated in this application exploits the AV nature of speech from the target speaker to suppress all speech and non-speech background sounds. In the context of a growing need for video conferencing solutions, AV SE enables the practical deployment of such technology in challenging acoustic environments with multiple competing background noise sources. In these scenarios, classical audio-only SE models typically fail, as they are usually trained to isolate speech from non-speech noises and therefore cannot separate the target speaker from competing talkers. The application comprises a graphical user interface and modules for real-time AV speech acquisition, preprocessing, and enhancement. Participants will experience a significant improvement in the speech quality and intelligibility of a target speaker who will be physically situated in a real noisy environment with a range of real-world noises. Moreover, participants can evaluate the performance of the application with their own voice by recording videos in challenging multi-talker conversational environments.
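
The acquisition, preprocessing, and enhancement modules named above could be chained roughly as in the following minimal sketch. It is an illustration only and not the authors' implementation: the names `AVChunk`, `acquire`, `preprocess`, `run_pipeline`, and `dummy_enhance` are hypothetical, the peak normalisation merely stands in for real feature extraction, and the placeholder enhancer gates audio with a scalar visual cue where the actual application uses a deep neural network.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence

# All names here are hypothetical; the paper does not publish its module interfaces.


@dataclass
class AVChunk:
    audio: List[float]       # mono samples covering one video frame
    frame: Sequence[float]   # visual features of the time-aligned video frame


def acquire(audio: Sequence[float],
            frames: Sequence[Sequence[float]],
            samples_per_frame: int) -> Iterator[AVChunk]:
    """Pair each video frame with the audio samples captured during it."""
    for i, frame in enumerate(frames):
        start = i * samples_per_frame
        window = list(audio[start:start + samples_per_frame])
        if len(window) < samples_per_frame:
            break  # drop a trailing partial window
        yield AVChunk(window, frame)


def preprocess(chunk: AVChunk) -> AVChunk:
    """Peak-normalise the audio window (stand-in for real feature extraction)."""
    peak = max((abs(s) for s in chunk.audio), default=0.0)
    scale = 1.0 / peak if peak > 0 else 1.0
    return AVChunk([s * scale for s in chunk.audio], chunk.frame)


Enhancer = Callable[[AVChunk], List[float]]


def run_pipeline(audio: Sequence[float],
                 frames: Sequence[Sequence[float]],
                 samples_per_frame: int,
                 enhance: Enhancer) -> List[float]:
    """Process the stream chunk by chunk, as a real-time loop would."""
    out: List[float] = []
    for chunk in acquire(audio, frames, samples_per_frame):
        out.extend(enhance(preprocess(chunk)))
    return out


def dummy_enhance(chunk: AVChunk) -> List[float]:
    """Placeholder for the DNN: mute audio when the visual cue shows no activity."""
    gain = 1.0 if sum(chunk.frame) > 0 else 0.0
    return [s * gain for s in chunk.audio]
```

Processing one video frame's worth of audio at a time keeps latency bounded by the frame duration, which is the property a real-time pipeline needs; a trained AV model would take the place of `dummy_enhance` without changing the surrounding loop.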