ISCA Archive Interspeech 2022

Deep Sparse Conformer for Speech Recognition

Xianchao Wu

Conformer has achieved impressive results in automatic speech recognition (ASR) by combining the transformer's ability to capture content-based global interactions with the convolutional neural network's ability to exploit local features. In a Conformer block, two macaron-like feed-forward layers with half-step residual connections sandwich the multi-head self-attention and convolution modules, followed by a post layer normalization. We improve Conformer's long-sequence representation ability in two directions: sparser and deeper. We adapt a sparse self-attention mechanism with O(L log L) time complexity and memory usage, where L is the input sequence length. A deep normalization strategy is applied to the residual connections to enable training of Conformers with on the order of one hundred blocks. On the Japanese CSJ-500h dataset, this deep sparse Conformer achieves character error rates (CERs) of 5.52%, 4.03%, and 4.50% on the three evaluation sets, respectively, and 4.16%, 2.84%, and 3.20% when ensembling five deep sparse Conformer variants with 12, 16, 17, 50, and 100 encoder layers.
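
To make the idea of normalizing residual connections in very deep stacks concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a DeepNet-style rule in which the residual input is scaled by a constant before the sublayer output is added and the sum is layer-normalized, with the constant set to (2N)^(1/4) for an encoder-only stack of N layers. The abstract does not state the exact formula used, so this constant, the function names (`layer_norm`, `deepnorm_alpha`, `scaled_residual`), and the `step` parameter standing in for the half-step feed-forward residual are all illustrative assumptions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance.

    Learned gain and bias are omitted to keep the sketch short.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def deepnorm_alpha(num_layers):
    """Residual scale for an encoder-only stack of num_layers layers.

    ASSUMPTION: the DeepNet-style constant (2N)^(1/4); the abstract does
    not give the value used for Conformer.
    """
    return (2 * num_layers) ** 0.25


def scaled_residual(x, sublayer, alpha, step=1.0):
    """Compute LayerNorm(alpha * x + step * sublayer(x)).

    step=0.5 mimics a half-step (macaron feed-forward) residual; how the
    paper combines the two is not specified in the abstract.
    """
    fx = sublayer(x)
    return layer_norm([alpha * a + step * b for a, b in zip(x, fx)])


if __name__ == "__main__":
    # The residual scale grows slowly with depth under this assumed rule.
    for n in (12, 16, 17, 50, 100):
        print(n, round(deepnorm_alpha(n), 3))

    # Toy sublayer standing in for a feed-forward module (hypothetical).
    toy_ffn = lambda v: [2.0 * t for t in v]
    print(scaled_residual([1.0, 2.0, 3.0, 4.0], toy_ffn,
                          deepnorm_alpha(100), step=0.5))
```

Under this assumed rule the scale grows only with the fourth root of the depth, so the residual branch is up-weighted moderately even at 100 layers, which is the intuition behind keeping updates bounded in deep post-norm stacks.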