Distant speech recognition is an important problem that remains far from
solved. Reverberation and noise are among the main challenges
in this area. The most popular methods of dealing with them are data
augmentation and speech enhancement. In this paper, we propose a novel
approach, inspired by modern methods of speaker adaptation.
First, a feed-forward
network is trained to classify room impulse responses (RIRs) from speech
recordings. This network is then used to extract embeddings, which
we call R-vectors. These R-vectors are appended to the input features of
the acoustic model. Because labeled data for the RIR classification
task is lacking, we propose a self-supervised method of training the network
that uses artificial audio generated by a room simulator.
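
The data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the toy `synthetic_rir` function (exponentially decaying noise) stands in for a real room simulator, the class label is taken to be the index of the simulated RIR, and the R-vector is assumed to be a single utterance-level vector repeated on every frame. All function names, dimensions, and RT60 values are hypothetical.

```python
import numpy as np


def synthetic_rir(rt60_s, sample_rate=16000, length_s=0.5, seed=0):
    """Toy RIR: white noise under an exponential envelope (assumption,
    used here in place of a real room simulator)."""
    rng = np.random.default_rng(seed)
    n = int(length_s * sample_rate)
    t = np.arange(n) / sample_rate
    # Envelope drops by 60 dB (a factor of 1000 in amplitude) at t = rt60_s.
    envelope = np.exp(-np.log(1000.0) * t / rt60_s)
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct-path impulse
    return rir / np.max(np.abs(rir))


def make_labeled_examples(clean_utterances, rirs):
    """Convolve each clean utterance with each RIR. The RIR index serves
    as the class label, so no manual annotation is needed."""
    examples = []
    for utt in clean_utterances:
        for label, rir in enumerate(rirs):
            reverberant = np.convolve(utt, rir)[: len(utt)]
            examples.append((reverberant, label))
    return examples


def append_r_vector(frame_features, r_vector):
    """Repeat an utterance-level embedding on every frame and concatenate
    it with the frame-level acoustic features."""
    num_frames = frame_features.shape[0]
    tiled = np.tile(r_vector, (num_frames, 1))
    return np.concatenate([frame_features, tiled], axis=1)


# Hypothetical setup: three simulated rooms, two "utterances" (noise stand-ins).
rirs = [synthetic_rir(rt60, seed=i) for i, rt60 in enumerate([0.2, 0.5, 0.8])]
rng = np.random.default_rng(42)
clean = [rng.standard_normal(16000) for _ in range(2)]
examples = make_labeled_examples(clean, rirs)

# Hypothetical dimensions: 100 frames of 40-dim features, 8-dim R-vector.
features = np.zeros((100, 40))
r_vec = np.ones(8)
adapted = append_r_vector(features, r_vec)
```

In this sketch the classifier itself is omitted; the R-vector would be taken from a hidden layer of the trained network rather than set by hand as it is here.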
Experimental evaluation was conducted on the VOiCES19 and AMI single-channel
tasks as well as the CHiME5 multi-channel task. The R-vector-adapted
ASR systems achieve up to 14% relative WER reduction. Furthermore,
this gain is additive with gains from state-of-the-art dereverberation (WPE)
and speaker adaptation (x-vector) techniques.
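
For readers unfamiliar with the metric, relative WER reduction is conventionally computed as the drop in WER divided by the baseline WER. The short sketch below shows this arithmetic; the function name and the numbers are hypothetical and are not results from this work.

```python
def relative_wer_reduction(baseline_wer, adapted_wer):
    """Relative WER reduction in percent: 100 * (baseline - adapted) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical example: a baseline WER of 20.0% lowered to 18.0%.
example_reduction = relative_wer_reduction(20.0, 18.0)
print(example_reduction)  # 10.0
```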