Jointly optimised attention-based encoder-decoder models have yielded
impressive speech recognition results. The recurrent neural network
(RNN) encoder is a key component of such models because it learns the
hidden representations of the inputs. However, RNNs have difficulty
modelling the long input sequences characteristic of speech recognition.
To address this, subsampling between stacked recurrent layers of the
encoder is commonly employed. This method shortens the sequence passed
to the higher layers and leads to gains in accuracy. However, because
static subsampling drops frames at a fixed rate regardless of their
content, it may both retain redundant information and miss relevant
information.
We propose using a dynamic subsampling RNN (dsRNN) encoder. Unlike
a statically subsampled RNN encoder, the dsRNN encoder can learn to
skip redundant frames. Furthermore, the skip ratio may vary at different
stages of training, thus allowing the encoder to learn the most relevant
information for each epoch. Although the dsRNN is unidirectional, it
yields lower phone error rates (PERs) than a bidirectional RNN on TIMIT.
The dsRNN encoder achieves a 16.8% PER on the TIMIT test set, a considerable
improvement over static subsampling with unidirectional and bidirectional
RNN encoders (23.5% and 20.4% PER, respectively).
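
To make the contrast between static and content-dependent subsampling concrete, the following is a minimal sketch and not the dsRNN itself. It assumes that static subsampling keeps every `factor`-th frame between layers, which is one common scheme. The function names, the `keep` mask, and the toy frames are all hypothetical; in particular, the abstract does not say how the dsRNN decides which frames to skip, so the mask here is supplied by hand purely for illustration.

```python
from typing import List, Sequence

Frame = List[float]  # one hidden-state vector per time step


def static_subsample(frames: Sequence[Frame], factor: int = 2) -> List[Frame]:
    """Keep every `factor`-th frame, whatever the frames contain."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return list(frames[::factor])


def masked_subsample(frames: Sequence[Frame], keep: Sequence[bool]) -> List[Frame]:
    """Keep only the frames flagged in `keep`.

    The mask stands in for per-frame skip decisions; how such decisions
    would be learned is outside the scope of this sketch.
    """
    if len(frames) != len(keep):
        raise ValueError("frames and keep must have the same length")
    return [frame for frame, flag in zip(frames, keep) if flag]


def stacked_lengths(num_frames: int, factor: int, num_subsampled_layers: int) -> List[int]:
    """Sequence length at the input and after each statically subsampled layer."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    lengths = [num_frames]
    for _ in range(num_subsampled_layers):
        lengths.append(-(-lengths[-1] // factor))  # ceiling division
    return lengths


if __name__ == "__main__":
    # Six toy one-dimensional "frames" indexed by time step.
    frames = [[float(t)] for t in range(6)]
    print(static_subsample(frames, factor=2))
    print(masked_subsample(frames, [True, False, False, True, True, False]))
    print(stacked_lengths(100, factor=2, num_subsampled_layers=2))
```

With a factor of 2 applied between two pairs of layers, a 100-frame utterance shrinks to 50 and then 25 frames, whichever frames happen to matter. The masked variant keeps a number of frames that depends on the mask, which is the property a dynamically subsampled encoder would exploit.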