ISCA Archive CHiME 2023

Multimodal and Large-Scale Generative Models for Enhancement

Wei-Ning Hsu

What is the goal of speech enhancement, and what is the definition of a perfectly enhanced speech sample? Conventional speech enhancement usually concerns additive noise and treats the source speech as the single oracle that an enhancement model should reconstruct exactly. Performance is often measured by signal-level metrics like SDR and PESQ. This paradigm has two main issues. First, why should we consider the source speech as the oracle? Those references might also contain some noise and might not have been recorded with the best-quality microphone. Should a model be penalized for generating enhanced speech that “sounds better” than the reference speech? Second, even when the reference speech is of superior quality, there could still be multiple samples that sound identical to the reference to humans yet differ greatly from it in waveform space (e.g., through a time shift or phase shift). Should a model be penalized if it generates one of those samples that sounds just as good as the reference?
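The second issue can be made concrete with a toy calculation. The sketch below is an illustration added here, not material from the talk: it assumes a plain (non-scale-invariant) SDR defined as the ratio of reference energy to error energy, and it uses a 440 Hz tone as a stand-in for speech. The helper name `sdr_db` and all parameter values are hypothetical choices for the example.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Plain signal-to-distortion ratio in dB (assumed definition:
    reference energy divided by the energy of reference - estimate)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    # eps guards against log(0) and division by zero for identical inputs
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))

# One second of a 440 Hz tone at 16 kHz (toy stand-in for clean speech)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
reference = np.sin(2 * np.pi * 440.0 * t)

# Two estimates a listener would be unlikely to distinguish from the reference
shifted = np.roll(reference, 18)   # circular delay of 18 samples (about 1.1 ms)
inverted = -reference              # 180-degree phase shift

# An estimate with added white noise
rng = np.random.default_rng(0)
noisy = reference + 0.05 * rng.standard_normal(reference.shape)

sdr_shifted = sdr_db(reference, shifted)
sdr_inverted = sdr_db(reference, inverted)
sdr_noisy = sdr_db(reference, noisy)

print(f"time-shifted:   {sdr_shifted:.1f} dB")
print(f"phase-inverted: {sdr_inverted:.1f} dB")
print(f"noisy:          {sdr_noisy:.1f} dB")
```

Under these assumptions, the shifted and inverted tones score below 0 dB while the noisy tone scores well above them, even though only the noisy one is degraded. Real speech and perceptual metrics such as PESQ behave differently in detail, so this is only a rough picture of why waveform-level comparison can penalize perceptually equivalent outputs.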

In this talk, I will present two recent studies on generative modeling with applications to the “generalized speech enhancement” problem. The goal of generalized speech enhancement is to ensure that the desired factors, such as content and voice, are preserved or enhanced, rather than to reconstruct the source speech exactly. The first study is ReVISE, which leverages AV-HuBERT and HiFi-GAN to build a universal model for lip-to-speech synthesis, audio-visual speech inpainting, enhancement, and separation. By using a pre-trained model, ReVISE can operate effectively in very challenging (egocentric, low-resolution, low-SNR) and low-resource (2 hr) regimes. The second study is Voicebox, a DALL-E- and LLM-like speech generative model that can perform in-context learning and generalize to monolingual/cross-lingual style transfer, speech editing, and unconditional diverse speech sampling. In particular, we demonstrate one of its applications: transient noise removal through in-context infilling.