ISCA Archive Interspeech 2014

Audio-to-text alignment for speech recognition with very limited resources

Xavier Anguera, Jordi Luque, Ciro Gracia

In this paper we present our efforts in building a speech recognizer under very limited resources. We assume that neither proper training databases nor initial acoustic models are available for the target language. Moreover, for the experiments shown here, we use grapheme-based speech recognizers. Most prior work in the area uses initial acoustic models, trained on the target or a similar language, to force-align new data and then retrain the models with it. In the proposed approach, a speech recognizer is trained from scratch using audio recordings aligned with (sometimes approximate) text transcripts. All training data has been harvested online (e.g. audiobooks, parliamentary speeches). First, the audio is decoded into a phoneme sequence by an off-the-shelf Hungarian phonetic recognizer. The phoneme sequences are then aligned to the normalized text transcripts through dynamic programming. The correspondence between phonemes and graphemes is established through a matrix of approximate sound-to-grapheme matching. Finally, the aligned data is split into short audio/text segments and the speech recognizer is trained using the Kaldi toolkit. Alignment experiments performed for Catalan and Spanish show that it is feasible to obtain accurate alignments that can be used to successfully train a speech recognizer.
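
The alignment step can be illustrated with a minimal sketch. The abstract does not specify the exact dynamic-programming formulation, so the code below substitutes a standard global (Needleman-Wunsch-style) alignment. The `MATCH` table stands in for the sound-to-grapheme matrix; its entries, the `MISMATCH` and `GAP` penalties, and the function names `score` and `align` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical sound-to-grapheme match scores (illustrative values only,
# not taken from the paper). Keys are (phoneme, grapheme) pairs.
MATCH: Dict[Tuple[str, str], float] = {
    ("k", "c"): 1.0, ("k", "k"): 1.0, ("k", "q"): 0.5,
    ("a", "a"): 1.0, ("o", "o"): 1.0, ("l", "l"): 1.0,
    ("s", "s"): 1.0, ("s", "z"): 0.5,
}
MISMATCH = -1.0  # assumed score for pairs missing from the table
GAP = -0.5       # assumed penalty for skipping a phoneme or a grapheme

Pair = Tuple[Optional[str], Optional[str]]


def score(phoneme: str, grapheme: str) -> float:
    """Look up the approximate match score for one phoneme/grapheme pair."""
    return MATCH.get((phoneme, grapheme), MISMATCH)


def align(phonemes: List[str], graphemes: List[str]) -> Tuple[float, List[Pair]]:
    """Globally align two sequences; None marks a gap on that side."""
    n, m = len(phonemes), len(graphemes)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * GAP
    for j in range(1, m + 1):
        dp[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(phonemes[i - 1], graphemes[j - 1]),
                dp[i - 1][j] + GAP,  # phoneme with no grapheme
                dp[i][j - 1] + GAP,  # grapheme with no phoneme
            )

    # Trace back from the bottom-right cell to recover the aligned pairs.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + score(phonemes[i - 1], graphemes[j - 1])):
            pairs.append((phonemes[i - 1], graphemes[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or dp[i][j] == dp[i - 1][j] + GAP):
            pairs.append((phonemes[i - 1], None))
            i -= 1
        else:
            pairs.append((None, graphemes[j - 1]))
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs


if __name__ == "__main__":
    # Spanish "hola": the written "h" has no corresponding phoneme.
    total, pairs = align(["o", "l", "a"], list("hola"))
    print(total, pairs)
```

In this toy example the silent "h" is absorbed as a gap on the phoneme side, which is the kind of approximate correspondence an alignment of this type has to tolerate. A real system would also need time stamps from the phonetic recognizer to cut the audio at the aligned positions; that part is not shown here.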