Sequence-to-sequence models with an implicit alignment mechanism (e.g.
attention) are closing the performance gap to traditional hybrid
hidden Markov model (HMM) systems for the task of automatic speech recognition.
An important factor in improving word error rate in both cases is the
use of an external language model (LM) trained on large text-only corpora.
Language model integration is straightforward with the clear separation
of acoustic model and language model in classical HMM-based modeling.
In contrast, multiple integration schemes have been proposed for attention
models.
In this work, we present a novel method for language model integration
into implicit-alignment-based sequence-to-sequence models. Log-linear
combination of the acoustic and language models is performed with
a per-token renormalization. This allows us to compute the full normalization
term efficiently both in training and in testing.
We compare this to a global renormalization scheme, which is equivalent
to applying shallow fusion in training.
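
As an illustration of the per-token renormalization described above, the following is a minimal sketch and not the authors' implementation. It assumes that the acoustic (attention) model and the LM each provide per-step log-probabilities over a shared vocabulary; the function names and the scale `lm_scale` are hypothetical and chosen only for this example. The combined scores are renormalized over the vocabulary at every output step, so the normalization term is a sum over the vocabulary only.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over the last (vocabulary) axis.
    x = x - np.max(x, axis=-1, keepdims=True)
    return x - np.log(np.sum(np.exp(x), axis=-1, keepdims=True))

def shallow_fusion_scores(am_log_probs, lm_log_probs, lm_scale=0.3):
    # Unnormalized log-linear combination, as used by shallow fusion
    # at decoding time. lm_scale is a hypothetical value.
    return am_log_probs + lm_scale * lm_log_probs

def combine_per_token(am_log_probs, lm_log_probs, lm_scale=0.3):
    # Log-linear combination followed by renormalization over the
    # vocabulary at each step, so each row is again a log-distribution.
    scores = shallow_fusion_scores(am_log_probs, lm_log_probs, lm_scale)
    return log_softmax(scores)

# Toy example: 4 output steps, vocabulary of 6 tokens (random values).
rng = np.random.default_rng(0)
am = log_softmax(rng.normal(size=(4, 6)))
lm = log_softmax(rng.normal(size=(4, 6)))
combined = combine_per_token(am, lm, lm_scale=0.5)
```

Because the renormalization subtracts one constant per step, it leaves the ranking of tokens within a step unchanged; what it changes is the training criterion, since the model is optimized on the combined, normalized distribution.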
The proposed methods show good improvements over standard model
combination (shallow fusion) on our state-of-the-art LibriSpeech system.
Furthermore, the improvements persist even if the LM is exchanged for a
more powerful one after training.