The STC ASR System for the VOiCES from a Distance Challenge 2019
Ivan Medennikov, Yuri Khokhlov, Aleksei Romanenko, Ivan Sorokin, Anton Mitrofanov, Vladimir Bataev, Andrei Andrusenko, Tatiana Prisyach, Mariya Korenevskaya, Oleg Petrov, Alexander Zatvornitskiy
This paper describes the Speech Technology Center (STC) automatic
speech recognition (ASR) system for the “VOiCES from a Distance
Challenge 2019”. We participated in the Fixed condition of the
ASR task, which means that the only training data available was an
80-hour subset of the LibriSpeech corpus. The main difficulty of the
challenge is a mismatch between clean training data and distant noisy
development/evaluation data. To address this, we applied room
acoustics simulation and weighted prediction error (WPE) dereverberation.
We also utilized well-known speaker adaptation using x-vector speaker
embeddings, as well as novel room acoustics adaptation with R-vector
room impulse response (RIR) embeddings. The system used a lattice-level
combination of 6 acoustic models based on different pronunciation dictionaries
and input features. N-best hypotheses were rescored with 3 neural network
language models (NNLMs) trained on both words and sub-word units. NNLMs
were also explored for handling out-of-vocabulary (OOV) words by
generating artificial texts. The final system achieved a word error
rate (WER) of 14.7% on the evaluation data, which is the best result
in the challenge.
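
For readers unfamiliar with the metric, the WER quoted above is conventionally the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal illustrative sketch of that standard definition. It is not the challenge's official scoring tool, and the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: official scoring tools may normalise the text
    (case, punctuation) before comparing, which this sketch does not do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a WER of 14.7% corresponds to roughly 15 word errors per 100 reference words.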
This paper also appears in session Wed-SS-7-3.