ISCA Archive Interspeech 2022

FiLM Conditioning with Enhanced Feature to the Transformer-based End-to-End Noisy Speech Recognition

Da-Hee Yang, Joon-Hyuk Chang

Robustness against environmental noise is an important concern in the design of automatic speech recognition (ASR) systems. It is typically addressed by adding a speech enhancement (SE) network to the ASR system, either as a front-end or by retraining the ASR system on enhanced speech. Although SE networks are effective, they do not always improve ASR performance, because enhancement can introduce artifacts. To address this problem, we propose using the enhanced speech from an SE network as a conditioning feature rather than as the direct input feature of the ASR system. This is achieved by stacking a feature-wise linear modulation (FiLM) layer on each transformer layer of the end-to-end ASR encoder, which combines the input and conditioning features. The results indicate that the proposed FiLM training method exhibits greater robustness against noise because enhanced speech is used as conditioning information rather than as direct ASR input.
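
As a rough illustration of the conditioning idea, a FiLM layer scales and shifts each input feature using parameters computed from a conditioning signal: output = gamma(c) * x + beta(c). The NumPy sketch below is a minimal example under stated assumptions, not the authors' implementation. The abstract does not say how the scale and shift are produced, so plain linear projections of the enhanced feature are assumed here, and the function name `film_modulate`, the weight names, and all dimensions are hypothetical.

```python
import numpy as np


def film_modulate(x, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply feature-wise linear modulation to x, conditioned on cond.

    x:    (frames, feat_dim) input features, e.g. noisy-speech encoder states
    cond: (frames, cond_dim) conditioning features, e.g. enhanced speech
    w_*:  (cond_dim, feat_dim) projection weights (assumed linear generators)
    b_*:  (feat_dim,) projection biases
    """
    gamma = cond @ w_gamma + b_gamma  # per-frame, per-feature scale
    beta = cond @ w_beta + b_beta     # per-frame, per-feature shift
    return gamma * x + beta


# Hypothetical sizes chosen only for illustration.
frames, feat_dim, cond_dim = 5, 8, 4
rng = np.random.default_rng(0)

noisy_states = rng.normal(size=(frames, feat_dim))
enhanced_feats = rng.normal(size=(frames, cond_dim))

# Initialise so that gamma starts near 1 and beta near 0, which makes the
# layer start close to an identity mapping on the noisy input.
w_gamma = rng.normal(scale=0.02, size=(cond_dim, feat_dim))
b_gamma = np.ones(feat_dim)
w_beta = rng.normal(scale=0.02, size=(cond_dim, feat_dim))
b_beta = np.zeros(feat_dim)

modulated = film_modulate(
    noisy_states, enhanced_feats, w_gamma, b_gamma, w_beta, b_beta
)
print(modulated.shape)
```

In an encoder of the kind the abstract describes, one such modulation would follow each transformer layer, so the enhanced speech influences the representation without replacing the noisy input.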