ISCA Archive Interspeech 2025

SpeechSEC: A Unified Multi-Task Framework for Speech Synthesis, Editing, and Continuation

Liming Liang, Dongchao Yang, Xianwei Zhuang, Yuxin Xie, Luo Chen, Yuehan Jin, Yuexian Zou

Recent advances in non-autoregressive single-task speech synthesis have attracted significant attention. However, traditional single-task speech synthesis methods focus primarily on mapping semantic tokens to acoustic tokens, overlooking the internal relationships within acoustic features. To address this gap, we propose SpeechSEC, a unified multi-task framework that handles Speech Synthesis, Editing, and Continuation by dynamically adjusting its input conditions. By acquiring knowledge shared across tasks, SpeechSEC surpasses the previous state-of-the-art method on the synthesis task in both audio quality (4.20 vs. 4.00) and voice preservation (0.72 vs. 0.58); it also executes the editing and continuation tasks efficiently and with good performance via non-autoregressive techniques. Additionally, SpeechSEC adapts well to current speech discretization methods such as HuBERT, Descript-Audio-Codec, and SpeechTokenizer, which demonstrates the robustness of our approach. Audio samples are available.