ISCA Archive Interspeech 2015

Bayesian integration of sound source separation and speech recognition: a new approach to simultaneous speech recognition

Kousuke Itakura, Izaya Nishimuta, Yoshiaki Bando, Katsutoshi Itoyama, Kazuyoshi Yoshii

This paper presents a novel Bayesian method that can directly recognize overlapping utterances without explicitly separating mixture signals into their independent components in advance of speech recognition. The conventional approach to contaminated speech recognition in real environments uniquely extracts the clean isolated signals of individual sources (e.g., by noise reduction, dereverberation, and source separation). One of the main limitations of this cascading approach is that the accuracy of speech recognition is upper bounded by the accuracy of preprocessing. To overcome this limitation, our method marginalizes out uncertain isolated speech signals by integrating source separation and speech recognition in a Bayesian manner. A sufficient number of samples are drawn from the posterior distribution of isolated speech signals by using a Markov chain Monte Carlo method, and then the posterior distributions of uttered texts for those samples are integrated. Under a certain condition, this Monte Carlo integration is shown to reduce to the well-known method called ROVER that integrates recognized texts obtained from sampled speech signals. Results of simultaneous speech recognition experiments showed that in terms of word accuracy the proposed method significantly outperformed conventional cascading methods.
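The Monte Carlo integration step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`integrate_posteriors`, `best_text`, `one_best_vote`) are hypothetical, and the per-sample text posteriors are assumed to be supplied already as dictionaries, standing in for the output of a recognizer run on each sampled speech signal. The sketch averages those posteriors, which approximates the marginal posterior of the uttered text. It also shows one plausible reading of the "certain condition": if each sample contributes only its single best hypothesis, the average becomes a plain vote over recognized texts. Actual ROVER aligns hypotheses and votes word by word; the sentence-level vote here is a simplification.

```python
from collections import defaultdict


def integrate_posteriors(sample_posteriors):
    """Average per-sample text posteriors.

    Approximates p(text | mixture) by (1/N) * sum_n p(text | s_n), where each
    s_n is one sampled isolated speech signal. Each element of
    sample_posteriors is a dict mapping a candidate text to its probability
    (assumed to be given; no recognizer is run here).
    """
    n = len(sample_posteriors)
    if n == 0:
        raise ValueError("need at least one sample")
    total = defaultdict(float)
    for posterior in sample_posteriors:
        for text, prob in posterior.items():
            total[text] += prob / n
    return dict(total)


def best_text(integrated):
    """Return the text with the highest integrated posterior."""
    return max(integrated, key=integrated.get)


def one_best_vote(hypotheses):
    """Special case: each sample contributes only its 1-best text.

    Treating every per-sample posterior as a point mass turns the average
    into a majority vote over whole recognized texts (a sentence-level
    simplification of ROVER-style voting, assumed for illustration).
    """
    return best_text(integrate_posteriors([{h: 1.0} for h in hypotheses]))
```

For example, calling `integrate_posteriors` on three per-sample posteriors and then `best_text` on the result returns the text with the largest averaged probability, while `one_best_vote(["a b", "a c", "a b"])` picks the hypothesis that most samples agree on.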