Attention-based models have recently made tremendous progress in end-to-end
automatic speech recognition (ASR). However, conventional
transformer-based approaches usually generate output sequences
token by token, from left to right, leaving right-to-left contexts
unexploited. In this work, we introduce a bidirectional speech transformer
that exploits contexts from both directions simultaneously. Specifically,
our proposed transformer produces two outputs: a left-to-right target
and a right-to-left target. At the inference stage, we use the proposed
bidirectional beam search method, which generates both left-to-right
and right-to-left candidates and selects the best hypothesis by score.
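To make the decision rule concrete, the following is a minimal sketch of the final selection step, assuming the search keeps an n-best list per decoding direction; the names (select_best, l2r_beam, r2l_beam) and the length normalization are illustrative assumptions, not the paper's actual implementation:

    # Hypothetical sketch: each decoder direction proposes its own n-best
    # list, and the hypothesis with the highest length-normalized
    # log-probability wins.
    def select_best(l2r_beam, r2l_beam):
        """Merge candidates from both directions and pick the top scorer.

        Each beam is a list of (token_ids, log_prob) pairs. Right-to-left
        hypotheses are reversed so all candidates share one reading order.
        """
        candidates = list(l2r_beam)
        candidates += [(list(reversed(toks)), lp) for toks, lp in r2l_beam]
        # Length normalization avoids biasing toward short hypotheses.
        return max(candidates, key=lambda c: c[1] / max(len(c[0]), 1))

    # Toy usage: two candidates per direction.
    l2r = [([5, 12, 7], -2.1), ([5, 12, 9], -2.6)]
    r2l = [([7, 12, 5], -1.8), ([9, 12, 5], -3.0)]  # stored right-to-left
    best_tokens, best_score = select_best(l2r, r2l)
    print(best_tokens, best_score)  # [5, 12, 7] from the r2l beam, reversed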
To evaluate our
proposed speech transformer with a bidirectional decoder (STBD), we
conduct extensive experiments on the AISHELL-1 dataset. The results
show that STBD achieves a 3.6% relative character error rate (CER)
reduction (CERR) over a unidirectional speech transformer baseline.
Moreover, the strongest model in this paper, STBD-Big, achieves 6.64%
CER on the test set without language model rescoring or any extra
data augmentation strategies.