ISCA Archive Interspeech 2025

NAM-to-Speech Conversion with Multitask-Enhanced Autoregressive Models

Neil Shah, Shirish Karande, Vineet Gandhi

We propose an alignment-free, end-to-end Non-Audible Murmur (NAM)-to-Speech conversion model. Existing methods rely on large NAM-text pairs per speaker to generate high-quality alignments for training non-autoregressive models. However, alignment quality deteriorates when trained on multi-speaker data, limiting their ability to generalize and effectively utilize the available training data. To address this, we introduce a streamlined autoregressive approach that eliminates the need for explicit alignment learning. By leveraging multi-speaker samples, synthetic training pairs, and multitask character recognition training, our method reduces the word error rate (WER) by 59.19% compared to the state-of-the-art (SOTA) on two public datasets. We demonstrate the model’s zero-shot capability and validate the effectiveness of multitask training through ablation studies.
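The multitask objective described above (speech generation plus auxiliary character recognition) can be sketched as a weighted loss combination. This is a minimal illustrative sketch, not the paper's implementation: the function names, the teacher-forced cross-entropy formulation, and the `char_weight` value are all assumptions for exposition.

```python
import math

def cross_entropy(probs, target):
    # Negative log-likelihood of the target index under a probability vector.
    return -math.log(probs[target])

def ar_generation_loss(step_probs, target_ids):
    # Autoregressive loss: mean cross-entropy over decoder steps,
    # with teacher forcing (each step conditioned on the gold prefix).
    losses = [cross_entropy(p, t) for p, t in zip(step_probs, target_ids)]
    return sum(losses) / len(losses)

def multitask_loss(speech_probs, speech_targets,
                   char_probs, char_targets, char_weight=0.3):
    # Joint objective: speech-generation loss plus a weighted auxiliary
    # character-recognition loss (the multitask term). The weight 0.3 is
    # a hypothetical choice, not a value from the paper.
    speech_loss = ar_generation_loss(speech_probs, speech_targets)
    char_loss = ar_generation_loss(char_probs, char_targets)
    return speech_loss + char_weight * char_loss
```

Training the decoder jointly on character targets gives the shared encoder an explicit linguistic supervision signal, which is one plausible reason an auxiliary recognition task can lower WER without any alignment-learning stage.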