Effectively exploiting the resources available on modern multicore and manycore processors for tasks such as large vocabulary continuous speech recognition (LVCSR) is far from trivial. While prior works have demonstrated the effectiveness of manycore graphic processing units (GPU) for high-throughput, limited vocabulary speech recognition, they are unsuitable for recognition with large acoustic and language models due to the limited 1-6GB of memory on GPUs. To overcome this limitation, we introduce a novel architecture for WFST-based LVCSR that jointly leverages manycore graphic processing units (GPU) and multicore processors (CPU) to efficiently perform recognition even when large acoustic and language models are applied. In the proposed approach, recognition is performed on the GPU using an H-level WFST, composed using a unigram language model. During decoding partial hypotheses generated over this network are rescored on-the-fly using a large language model stored on the CPU. By maintaining N-best hypotheses during decoding our proposed architecture obtains comparable accuracy to a standard CPU-based WFST decoder while improving decoding speed by over a factor of 10.9x.
Index Terms: Large Vocabulary Continuous Speech Recognition, WFST, On-The-Fly Rescoring, Graphics Processing Units