ISCA Archive Interspeech 2011

Very large vocabulary ASR for spoken Russian with syntactic and morphemic analysis

Alexey Karpov, Irina Kipyatkova, Andrey Ronzhin

In this paper, we present a word-based very large vocabulary automatic speech recognition system for Russian. We propose novel methods for organizing the lexicon and the language model. A two-level morpho-phonemic prefix graph, which uses information on the morphemic structure of lexical units, is suggested for a compact representation of the pronunciation vocabulary and the search space. This model is more compact than a lexical tree or a linear vocabulary and speeds up the recognition process. Syntactic analysis of the training text corpus, combined with statistical analysis, is suggested for generating N-gram language models. The syntax-based Russian language model takes into account long-distance syntactic dependencies between word pairs. The results show that the syntactic-statistical language model gives a 5% relative improvement in word and letter error rates with respect to the baseline models.
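
The idea of combining syntactic and statistical analysis when collecting N-grams can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `(head_index, dependent_index)` dependency format, and the English toy sentence are all assumptions made for illustration, and the dependencies are assumed to come from an external syntactic parser. The sketch counts ordinary adjacent bigrams and adds word pairs that are syntactically linked but not adjacent in the text.

```python
from collections import Counter


def adjacent_bigrams(tokens):
    """Ordinary statistical bigrams: pairs of neighbouring words."""
    return list(zip(tokens, tokens[1:]))


def syntactic_bigrams(tokens, dependencies):
    """Word pairs linked by a syntactic dependency but not adjacent in the text.

    `dependencies` is a list of (head_index, dependent_index) pairs; this
    input format is hypothetical and stands in for real parser output.
    Pairs are emitted in surface order (left word first).
    """
    pairs = []
    for head, dependent in dependencies:
        left, right = sorted((head, dependent))
        if right - left > 1:  # adjacent pairs are already covered statistically
            pairs.append((tokens[left], tokens[right]))
    return pairs


def count_bigrams(corpus):
    """Accumulate counts of adjacent and long-distance syntactic bigrams.

    `corpus` is an iterable of (tokens, dependencies) pairs, one per sentence.
    """
    counts = Counter()
    for tokens, dependencies in corpus:
        counts.update(adjacent_bigrams(tokens))
        counts.update(syntactic_bigrams(tokens, dependencies))
    return counts


# Toy sentence with hand-written (assumed) dependencies.
tokens = ["the", "dog", "that", "barked", "ran"]
dependencies = [(1, 0), (4, 1), (1, 3), (3, 2)]
counts = count_bigrams([(tokens, dependencies)])
print(counts[("dog", "ran")])  # 1: a pair that adjacent bigrams alone would miss
```

In this toy sentence the pair ("dog", "ran") spans three intervening positions, so a purely adjacent bigram count never sees it; adding dependency-linked pairs is one simple way such long-distance relations could enter the N-gram statistics.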