ISCA Archive Interspeech 2023

Text Only Domain Adaptation with Phoneme Guided Data Splicing for End-to-End Speech Recognition

Wei Wang, Xun Gong, Hang Shao, Dongning Yang, Yanmin Qian

Adaptation of end-to-end (E2E) automatic speech recognition (ASR) models to unseen domains remains a challenge due to their monolithic construction, which typically necessitates paired data for customization. While neural text-to-speech (TTS) approaches have shown effectiveness for domain adaptation, they incur high computational costs during training and inference. In this paper, we propose a model-free audio synthesis pipeline for domain adaptation, which synthesizes audio from target-domain text and source-domain audio pieces, allowing ASR models to be adapted with audio synthesized on the fly. Additionally, we apply layer-wise regularization between the speech encodings generated by the adapted and unadapted models to prevent overfitting. Our experiments adapt from LibriSpeech to various domains in GigaSpeech. The results show a 15-30% relative improvement in target domains compared to shallow fusion, with almost no degradation in the source domain.
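To make the idea of model-free synthesis concrete, the following is a minimal illustrative sketch of phoneme-guided splicing, not the implementation used in this paper. It assumes that phoneme-level alignments of the source-domain audio and a pronunciation lexicon are already available; the function names (`build_phoneme_bank`, `splice_utterance`), the toy lexicon, and the toy sample values are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def build_phoneme_bank(aligned_utterances):
    """Collect source-domain audio pieces, indexed by phoneme.

    aligned_utterances: iterable of (samples, alignment) pairs, where
    alignment is a list of (phoneme, start, end) sample indices.
    The alignment format is an assumption made for this sketch.
    """
    bank = defaultdict(list)
    for samples, alignment in aligned_utterances:
        for phoneme, start, end in alignment:
            bank[phoneme].append(samples[start:end])
    return bank


def splice_utterance(text, lexicon, bank, rng):
    """Synthesize audio for target-domain text by concatenating pieces.

    Each word is mapped to phonemes through the (hypothetical) lexicon,
    and one stored piece per phoneme is drawn at random from the bank.
    Words missing from the lexicon raise KeyError in this sketch.
    """
    spliced = []
    for word in text.lower().split():
        for phoneme in lexicon[word]:
            spliced.extend(rng.choice(bank[phoneme]))
    return spliced


# Toy source-domain data: two "utterances" with phoneme alignments.
source = [
    ([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [("K", 0, 2), ("AE", 2, 4), ("T", 4, 6)]),
    ([0.7, 0.8, 0.9, 1.0], [("B", 0, 2), ("AE", 2, 4)]),
]
# Toy lexicon: "tab" never occurs in the source data, only its phonemes do.
lexicon = {"cat": ["K", "AE", "T"], "bat": ["B", "AE"], "tab": ["T", "AE", "B"]}

bank = build_phoneme_bank(source)
audio = splice_utterance("tab", lexicon, bank, random.Random(0))
```

Because the sketch only indexes and concatenates stored pieces, generating an utterance requires no neural inference, which is the property that makes on-the-fly synthesis during adaptation inexpensive. Drawing pieces at random means the same target text can yield different audio across training steps.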