ISCA Archive Interspeech 2025

Investigating continuous autoregressive generative speech enhancement

Haici Yang, Gordon Wichern, Ryo Aihara, Yoshiki Masuyama, Sameer Khurana, François G. Germain, Jonathan Le Roux

Following the success of autoregressive (AR) language models in predicting discrete tokens, it has become common practice for AR audio and speech models to use discrete tokens generated by a neural audio codec. However, recent work has demonstrated that replacing discrete token probability modeling in an AR model with a continuous diffusion procedure can improve both model performance and efficiency for image generation. In this paper, we explore applying such a diffusion loss to replace discrete token modeling in an AR generative speech enhancement model. We investigate several important design choices, including comparing standard AR models with masked AR models, and mel spectrograms with learned latents as the continuous feature representation. Our results demonstrate the potential of continuous AR speech enhancement, particularly under severe noise.
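To make the core idea concrete, the sketch below illustrates (in a hedged, illustrative way, not as the paper's implementation) how a per-token diffusion loss on a continuous feature can stand in for discrete-token cross-entropy: given the AR model's hidden state as conditioning, a small denoiser is trained to predict the noise added to the clean continuous token. The noise schedule, the toy linear denoiser, and all variable names here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative DDPM-style noise schedule (assumed values, not from the paper)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

def diffusion_loss(x0, cond, W):
    """MSE between true and predicted noise for one continuous token.

    x0:   clean continuous feature (e.g. one mel frame or latent vector)
    cond: AR model's hidden state conditioning the denoiser
    W:    weights of a toy linear denoiser (stands in for a small MLP head)
    """
    t = rng.integers(0, T)                      # random diffusion timestep
    eps = rng.standard_normal(x0.shape)         # Gaussian noise to recover
    # Forward noising: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    # Toy denoiser: predicts eps from the noisy token and AR conditioning
    eps_pred = W @ np.concatenate([x_t, cond])
    return float(np.mean((eps_pred - eps) ** 2))

d = 8
x0 = rng.standard_normal(d)      # one continuous token
cond = rng.standard_normal(d)    # AR hidden state at this position
W = rng.standard_normal((d, 2 * d)) * 0.1
loss = diffusion_loss(x0, cond, W)
print(loss)
```

In practice the linear denoiser would be a learned network, and the AR (or masked AR) backbone would supply `cond` for each token position; the point is only that the training target becomes a continuous regression-style diffusion loss rather than a softmax over codec tokens.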