Training hidden Markov models for large-vocabulary, continuous speech recognition requires large amounts of data. Books on tape are an easily available, very large source of orthographically transcribed speech data. This source is problematic for current training algorithms, however, because they require the speech to first be segmented into isolated sentences. In this paper we present a training algorithm that finds the maximum likelihood sequence of states in a phonetic HMM for an unsegmented speech utterance of unlimited length.
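In standard HMM notation (our notation, following the usual conventions; the abstract itself gives no symbols), the quantity being maximized is the joint likelihood of a state sequence and the observations:

$$\hat{q}_1^{\,T} \;=\; \arg\max_{q_1^{T}} \;\; \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t),$$

where $o_1^T$ is the (unsegmented) observation sequence, $\pi$ the initial state distribution, $a_{ij}$ the transition probabilities, and $b_j(\cdot)$ the emission densities. Full Viterbi decoding solves this exactly, but its back-pointer storage grows linearly with $T$, which is what makes unlimited-length utterances a problem.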
The proposed algorithm has computation time proportional to utterance length but requires only a fixed amount of memory, independent of utterance length, so the speech input does not have to be segmented into sentences. It considers successive windows on the speech observations. A full Viterbi search is carried out to the end of each window; then only N paths are retained as starting paths for the next window. These N survivor paths are chosen not by their current likelihood but by looking ahead at their short-time future likelihoods. In practice, we show that N can be reduced to one while keeping the search optimal, given a limited amount of look-ahead.
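To make the windowed search concrete, here is a minimal Python/NumPy sketch of one way to implement it. The function name, the window and lookahead parameters, and the choice to keep a single survivor (N = 1, which the paper reports is sufficient in practice) are our illustration, not the paper's code; for simplicity the sketch takes the emission scores as a precomputed matrix, which a genuinely fixed-memory implementation would instead compute frame by frame.

```python
import numpy as np

def windowed_viterbi(log_pi, log_A, log_B, window=100, lookahead=20):
    """Approximate maximum-likelihood state sequence with bounded memory.

    log_pi : (S,)   initial log state probabilities
    log_A  : (S, S) log transition matrix, log_A[i, j] = log P(j | i)
    log_B  : (T, S) per-frame log emission scores; a true streaming
             system would compute these frame by frame rather than
             store the whole matrix, keeping memory independent of T.
    """
    T, S = log_B.shape
    full_path = []
    delta = log_pi + log_B[0]                  # scores of the active partial paths
    start = 1
    while start < T:
        stop = min(start + window, T)

        # Full Viterbi search to the end of the current window.
        backptr = np.empty((stop - start, S), dtype=int)
        for t in range(start, stop):
            scores = delta[:, None] + log_A    # (previous state, current state)
            backptr[t - start] = np.argmax(scores, axis=0)
            delta = np.max(scores, axis=0) + log_B[t]

        # Rank window-end states by short-time *future* likelihood:
        # cont[j, k] = best continuation leaving state j at the window
        # boundary and sitting in state k after the look-ahead frames.
        la_stop = min(stop + lookahead, T)
        if la_stop > stop:
            cont = log_A + log_B[stop]
            for t in range(stop + 1, la_stop):
                cont = np.max(cont[:, :, None] + (log_A + log_B[t])[None], axis=1)
            la_score = delta + cont.max(axis=1)
        else:                                  # utterance over: nothing to look ahead at
            la_score = delta

        # Keep a single survivor (N = 1) and back-trace its window segment.
        surv = int(np.argmax(la_score))
        seg = [surv]                           # state at time stop-1
        for t in range(stop - 1, start - 1, -1):
            seg.append(int(backptr[t - start][seg[-1]]))
        seg.reverse()                          # states at times start-1 .. stop-1
        full_path.extend(seg if not full_path else seg[1:])

        # Only the survivor's path is carried into the next window.
        new_delta = np.full(S, -np.inf)
        new_delta[surv] = delta[surv]
        delta, start = new_delta, stop

    if not full_path:                          # degenerate one-frame utterance
        full_path = [int(np.argmax(delta))]
    return full_path
```

In this sketch, memory is dominated by the per-window back-pointer table, O(window × S), independent of utterance length; whether pruning to a single survivor preserves the exact Viterbi path depends on the look-ahead being long enough for competing paths to resolve, which is precisely the empirical claim made above.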