ISCA Archive Interspeech 2025

VoiceNoNG: Robust High-Quality Speech Editing Model without Hallucinations

Sung-Feng Huang, Heng-Cheng Kuo, Zhehuai Chen, Xuesong Yang, Pin-Jui Ku, Ante Jukic, Huck Yang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee, Szu-Wei Fu

Voicebox and VoiceCraft are currently the most representative models for non-autoregressive and autoregressive speech editing, respectively. Although both can generate high-quality speech edits, we identify their limitations: Voicebox struggles to edit speech with background audio, while VoiceCraft suffers from a hallucination-like problem. To maintain speech quality across varying audio scenarios and address the hallucination issue, we introduce VoiceNoNG, which combines the strengths of both model frameworks. VoiceNoNG utilizes a latent flow-matching framework to model the pre-quantization features of a neural codec. The vector quantizer in the neural codec provides additional robustness against minor prediction errors from the editing model, enabling VoiceNoNG to achieve state-of-the-art performance in both objective and subjective evaluations under diverse audio conditions. Audio examples and ablations can be found on the demo page1 and in the supplementary material2.
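The robustness argument above can be illustrated with a minimal sketch (our own illustration, not the paper's implementation): a nearest-neighbour vector quantizer, as used in neural codecs, maps a continuous latent to the closest codebook entry, so a small prediction error in the pre-quantization feature is absorbed as long as the perturbed latent remains closest to the same codeword. The codebook size, latent dimension, and noise scale below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical codebook: 256 codewords of dimension 8.
codebook = rng.normal(size=(256, 8))

def quantize(z):
    """Map each latent vector (rows of z) to the index of its nearest codeword."""
    dists = np.linalg.norm(codebook[None, :, :] - z[:, None, :], axis=-1)
    return dists.argmin(axis=1)

# Ground-truth pre-quantization features (exact codewords here, for clarity).
clean = codebook[[3, 17, 42]]
# Small continuous error, standing in for an imperfect flow-matching prediction.
noisy = clean + 0.01 * rng.normal(size=clean.shape)

# The quantizer snaps the perturbed latents back to the same discrete codes.
assert (quantize(clean) == [3, 17, 42]).all()
assert (quantize(noisy) == quantize(clean)).all()
```

A sufficiently large prediction error would of course cross into a neighbouring codeword's cell, which is why this only guards against minor errors, as the abstract states.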