Convolutional and bidirectional recurrent neural networks have achieved
considerable performance gains as acoustic models in automatic speech
recognition in recent years. The latest architectures combine long short-term
memory, gated recurrent unit, and convolutional neural networks by stacking
these network types on top of each other and providing short-
and long-term features to different depth levels of the network.
For the first time, we propose a unified layer for acoustic modeling
which is simultaneously recurrent and convolutional, and which operates
only on short-term features. Our unified model introduces a bidirectional
gated recurrent unit that uses convolutional operations for the gating
units. We analyze the performance of the proposed layer, and compare
and combine it with bidirectional gated recurrent units, deep neural
networks, and frequency-domain convolutional neural networks on a
50-hour English broadcast news task. The analysis indicates that the proposed
layer in combination with stacked bidirectional gated recurrent units
outperforms other architectures.
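
As a rough illustration of what a layer that is "simultaneously recurrent and
convolutional" could look like, the sketch below implements a bidirectional
gated recurrent unit whose gates use 1-D convolutions over the feature axis
in place of full matrix products. It is not the authors' implementation. The
standard GRU gate equations, the single channel, the kernel size of 3, the
omitted bias terms, and every name in the code (conv_same, conv_gru_step,
bidirectional_conv_gru, the Wz/Uz/... keys) are assumptions made for this
example only.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def conv_same(v, kernel):
    """Zero-padded 1-D cross-correlation over the feature axis.

    Assumes an odd kernel length; the output has the same length as v.
    """
    half = len(kernel) // 2
    return [
        sum(w * v[i + j - half]
            for j, w in enumerate(kernel)
            if 0 <= i + j - half < len(v))
        for i in range(len(v))
    ]


def conv_gru_step(x, h, p):
    """One GRU time step with convolutions replacing the matrix products."""
    # Update gate z and reset gate r (bias terms omitted for brevity).
    z = [sigmoid(a + b)
         for a, b in zip(conv_same(x, p["Wz"]), conv_same(h, p["Uz"]))]
    r = [sigmoid(a + b)
         for a, b in zip(conv_same(x, p["Wr"]), conv_same(h, p["Ur"]))]
    # Candidate state computed from the input and the reset-gated state.
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    c = [math.tanh(a + b)
         for a, b in zip(conv_same(x, p["Wc"]), conv_same(gated_h, p["Uc"]))]
    # Interpolate between the previous state and the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]


def run_direction(frames, p):
    """Run the recurrence over a sequence of frames, starting from zeros."""
    h = [0.0] * len(frames[0])
    outputs = []
    for x in frames:
        h = conv_gru_step(x, h, p)
        outputs.append(h)
    return outputs


def bidirectional_conv_gru(frames, p_fwd, p_bwd):
    """Concatenate forward and time-reversed backward states per frame."""
    fwd = run_direction(frames, p_fwd)
    bwd = run_direction(frames[::-1], p_bwd)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


def init_params(rng, kernel_size=3):
    """Small random kernels; names and scale are illustrative choices."""
    return {name: [rng.gauss(0.0, 0.1) for _ in range(kernel_size)]
            for name in ("Wz", "Uz", "Wr", "Ur", "Wc", "Uc")}


# Toy input: 20 frames of 40 short-term features each (hypothetical sizes).
rng = random.Random(0)
frames = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(20)]
states = bidirectional_conv_gru(frames, init_params(rng), init_params(rng))
```

Each output frame concatenates the forward and backward states, so it is twice
as wide as the input frame. Sharing a small kernel across the feature axis is
what distinguishes this sketch from a conventional GRU, in which every gate
uses a dense weight matrix.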