ISCA Archive Interspeech 2025

Effect of Loudspeaker Emitted Speech on ASR performance

Vikram C M, Sanjoy Pal, Nidhi Mantri, Gopal Kumar Agrawal

Speech played back through a loudspeaker is referred to as loudspeaker-emitted speech or loudspeaker speech. Most automatic speech recognition (ASR) systems are trained on natural speech recorded directly from human speakers, and they yield a higher word error rate (WER) on loudspeaker speech. In this paper, we first analyze the performance of the whisper-medium ASR model on loudspeaker-emitted speech. Five equalizer modes (normal, pop, rock, jazz, and classic) and three distances (0 m, 3 m, and 5 m) are considered in the study. Then, based on the spectral differences between natural and loudspeaker speech, we propose an algorithm that generates loudspeaker-quality speech from natural speech recordings. This algorithm is used to augment the Librispeech data, and the augmented data is used to fine-tune whisper-medium. The ASR model fine-tuned on simulated loudspeaker-quality speech shows a significant improvement over the baseline system.
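Since WER is the metric used for comparison throughout, a minimal sketch of its standard definition may help: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper does not specify its scoring tool or its text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Hypothetical helper: tokenization is a plain whitespace split, with no
    case folding or punctuation stripping (an assumption, not the paper's setup).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (row-by-row Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference. Scores are usually computed after normalizing case and punctuation, which this sketch leaves to the caller.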