ISCA Archive Interspeech 2024

Audio Editing with Non-Rigid Text Prompts

Francesco Paissan, Luca Della Libera, Zhepei Wang, Paris Smaragdis, Mirco Ravanelli, Cem Subakan

In this paper, we explore audio editing with non-rigid text prompts via Latent Diffusion Models. Our methodology is based on a fine-tuning step applied to the latent diffusion model, which increases the overall faithfulness of the generated edits to the input audio. We show quantitatively and qualitatively that our pipeline outperforms current state-of-the-art neural audio editing pipelines for addition, style transfer, and inpainting. Through a user study, we show that users prefer our method over several baselines. We also show that the produced edits achieve a better trade-off between fidelity to the text prompt and fidelity to the input audio than the baselines. Finally, we benchmark the use of LoRA to improve editing speed while maintaining edit quality.
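The abstract does not say how LoRA is applied inside the pipeline, so the following is only a minimal sketch of the generic LoRA idea (a frozen weight plus a trainable low-rank update), not the authors' implementation. The function name `lora_linear`, the layer sizes, the rank, and the scaling `alpha / r` are illustrative assumptions.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=1.0):
    """Linear layer with frozen weight W and low-rank update (alpha / r) * B @ A."""
    r = A.shape[0]  # rank of the update
    return x @ (W + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4                # hypothetical sizes, not from the paper

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight (stand-in)
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-initialised
x = rng.normal(size=(2, d_in))            # a batch of two input vectors

y = lora_linear(x, W, A, B)

full_params = W.size                      # parameters updated by full fine-tuning
lora_params = A.size + B.size             # parameters updated by LoRA
```

With these assumed sizes, only `A` and `B` would be trained (512 values instead of 4096), which is the general reason a low-rank update can make a per-edit fine-tuning step cheaper. Because `B` starts at zero, the layer initially reproduces the frozen model's output.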