Recently, BERT and Transformer-XL based architectures have achieved
strong results in a range of NLP applications. In this paper, we explore
two Transformer architectures, BERT and Transformer-XL, as
language models for a Finnish ASR task with different rescoring schemes.
With Transformer-XL, we achieve strong results on both an intrinsic and
an extrinsic task: 29% better perplexity and 3% better
WER than our previous best LSTM-based approach. We also introduce a
novel three-pass decoding scheme that improves ASR performance
by 8%. To the best of our knowledge, this is also the first work (i)
to formulate an alpha smoothing framework to use the non-autoregressive
BERT language model for an ASR task, and (ii) to explore sub-word units
with Transformer-XL for an agglutinative language like Finnish.