Convolutional neural networks (CNNs) have shown promising results for
end-to-end speech recognition, although they still lag behind RNN- and
transformer-based models in performance. In this paper, we study how to bridge
this gap and go beyond with a novel CNN-RNN-transducer architecture,
which we call ContextNet. ContextNet features a fully convolutional
encoder that incorporates global context information into convolution
layers by adding squeeze-and-excitation modules. In addition, we propose
a simple scaling method that adjusts the width of ContextNet to achieve
a good trade-off between computation and accuracy.
We demonstrate that,
on the widely used LibriSpeech benchmark, ContextNet achieves a word
error rate (WER) of 2.1%/4.6% without an external language model (LM),
1.9%/4.1% with an LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy
test sets. For comparison, the best previously published
results are 2.0%/4.6% with an LM and 3.9%/11.3% with 20M parameters. The
superiority of the proposed ContextNet model is also verified on a
much larger internal dataset.
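
To make the role of the squeeze-and-excitation modules concrete, the following is a minimal sketch of the generic squeeze-and-excitation operation applied to a sequence of convolutional features. It is not the ContextNet implementation: the function name `squeeze_excite_1d`, the ReLU activation, the bottleneck size, and the toy weights are all illustrative assumptions that do not come from this abstract. The sketch only shows how a global average over time can be turned into per-channel gates that rescale every frame.

```python
import math


def squeeze_excite_1d(x, w1, b1, w2, b2):
    """Gate a feature map x (list of T frames, each a list of C channels).

    Hypothetical helper for illustration; weights are plain nested lists.
    """
    num_frames, num_channels = len(x), len(x[0])
    bottleneck = len(b1)
    # Squeeze: average each channel over all frames -> one global context vector.
    context = [sum(frame[c] for frame in x) / num_frames for c in range(num_channels)]
    # Excitation, step 1: project to a small bottleneck and apply ReLU (assumed activation).
    hidden = [
        max(0.0, sum(context[c] * w1[c][h] for c in range(num_channels)) + b1[h])
        for h in range(bottleneck)
    ]
    # Excitation, step 2: project back to C channels and squash each gate into (0, 1).
    gates = [
        1.0 / (1.0 + math.exp(-(sum(hidden[h] * w2[h][c] for h in range(bottleneck)) + b2[c])))
        for c in range(num_channels)
    ]
    # Scale: multiply every frame by the same per-channel gates.
    return [[frame[c] * gates[c] for c in range(num_channels)] for frame in x]


# Toy input: 3 frames, 2 channels, bottleneck of size 1 (all values hypothetical).
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w1 = [[0.0], [0.0]]   # shape: channels x bottleneck
b1 = [0.0]
w2 = [[0.0, 0.0]]     # shape: bottleneck x channels
b2 = [0.0, 0.0]
y = squeeze_excite_1d(x, w1, b1, w2, b2)
```

With all-zero weights every gate equals sigmoid(0) = 0.5, so each feature is simply halved; with learned weights the gates depend on the utterance-level average, which is how global context can reach layers whose convolution kernels only see a local window.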