ISCA Archive Interspeech 2023

Multi-View Frequency-Attention Alternative to CNN Frontends for Automatic Speech Recognition

Belen Alastruey, Lukas Drude, Jahn Heymann, Simon Wiesler

Convolutional frontends are a typical choice for Transformer-based ASR to preprocess the spectrogram, reduce its sequence length, and combine local information across time and frequency in the same way. However, the width and height of an audio spectrogram carry different kinds of information: the time axis has a clear left-to-right dependency, e.g., due to reverberation as well as the articulatory system, whereas vowels and consonants exhibit very different patterns and occupy almost disjoint frequency ranges. We therefore hypothesize that global attention over frequencies is more beneficial than local convolution. Replacing the CNN frontend of a production-scale Conformer transducer with the proposed F-Attention module, we obtain 2.4 % rWERR on Alexa traffic. To demonstrate generalizability, we validate this on public LibriSpeech data with an LSTM-based LAS architecture, obtaining 4.6 % rWERR, and demonstrate robustness to (simulated) noisy conditions.
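
To illustrate the core idea, below is a minimal sketch of global self-attention along the frequency axis in place of a local 2-D convolution. The module name, dimensions, positional embedding, and pooling are assumptions for illustration only; the paper's actual multi-view F-Attention design, and the sequence-length reduction that the CNN frontend it replaces also performs, are not specified in this abstract.

# A minimal sketch of a frequency-attention frontend, assuming a log-mel
# spectrogram input of shape (batch, time, n_mels). This is NOT the paper's
# exact module; it only illustrates global self-attention across frequency
# bins instead of local 2-D convolution. All names and sizes are illustrative.
import torch
import torch.nn as nn


class FrequencyAttentionFrontend(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # Embed each scalar frequency bin into a d_model-dimensional token.
        self.bin_proj = nn.Linear(1, d_model)
        # Learned positional embedding so the model knows which bin is which.
        self.freq_pos = nn.Parameter(torch.randn(n_mels, d_model) * 0.02)
        # Global self-attention across all frequency bins of one frame.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Collapse the attended frequency axis into one per-frame embedding.
        self.out_proj = nn.Linear(n_mels * d_model, d_model)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, time, n_mels); each frame is processed independently.
        b, t, f = spec.shape
        x = spec.reshape(b * t, f, 1)
        x = self.bin_proj(x) + self.freq_pos   # (b*t, n_mels, d_model)
        attended, _ = self.attn(x, x, x)       # every bin attends to all bins
        x = self.norm(x + attended)
        x = self.out_proj(x.reshape(b * t, -1))
        return x.reshape(b, t, -1)             # (batch, time, d_model)


# Usage: two utterances of 100 frames with an 80-bin log-mel spectrogram.
frontend = FrequencyAttentionFrontend()
features = frontend(torch.randn(2, 100, 80))
print(features.shape)  # torch.Size([2, 100, 64])

Because the attention here is global over frequency, a vowel-dominated bin can directly weight evidence from any other bin in the same frame, whereas a convolutional kernel only sees a fixed local neighborhood.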