ISCA Archive Interspeech 2020

Recognize Mispronunciations to Improve Non-Native Acoustic Modeling Through a Phone Decoder Built from One Edit Distance Finite State Automaton

Wei Chu, Yang Liu, Jianwei Zhou

This paper proposes a procedure for detecting and recognizing mispronunciations in training data, and improves non-native acoustic modeling by training with the corrected phone alignments. First, an initial phone sequence for an utterance is derived from its word-level transcription and a dictionary of canonical pronunciations. Next, regions of mispronunciation are detected by examining phone-level goodness-of-pronunciation (GOP) scores. Over each detected region, a constrained phone decoder then recognizes the most likely pronounced phone sequence from all phone sequences within one phone edit distance of the initial sequence. After the phone alignments and GOP scores are updated, this detection and recognition procedure is repeated until no more mispronunciations are detected. Experiments on a 300-hour non-native spontaneous speech dataset showed that the acoustic model trained with the proposed procedure reduced WER by 6% compared to a well-optimized context-dependent factorized-TDNN HMM baseline system with the same neural network topology. This work also offers a data-driven approach for generating a list of common mispronunciation patterns of non-native English learners that may be useful for speech assessment purposes.
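To make the one-edit-distance constraint concrete, the sketch below enumerates the candidate space the paper's decoder searches: all phone sequences reachable from the canonical sequence by at most one substitution, deletion, or insertion. This is only an illustrative enumeration under assumed names (the `PHONE_INVENTORY` placeholder and `one_edit_candidates` helper are not from the paper); the authors encode this space as a finite state automaton that constrains the phone decoder rather than listing candidates explicitly.

```python
# Minimal sketch (not the authors' implementation): enumerate all phone
# sequences within one edit distance (substitution, deletion, or insertion)
# of a canonical phone sequence. The paper constrains a phone decoder with an
# equivalent finite state automaton; enumeration here just shows the space.

from typing import List, Set, Tuple

# Hypothetical subset of a phone inventory, for illustration only.
PHONE_INVENTORY = ["AA", "AE", "AH", "IY", "IH", "T", "D", "S", "Z"]

def one_edit_candidates(canonical: List[str],
                        inventory: List[str]) -> Set[Tuple[str, ...]]:
    """Return all phone sequences at edit distance <= 1 from `canonical`."""
    candidates = {tuple(canonical)}  # distance 0: the canonical sequence itself
    n = len(canonical)
    # Substitutions: replace one phone with a different phone.
    for i in range(n):
        for p in inventory:
            if p != canonical[i]:
                candidates.add(tuple(canonical[:i] + [p] + canonical[i + 1:]))
    # Deletions: drop one phone.
    for i in range(n):
        candidates.add(tuple(canonical[:i] + canonical[i + 1:]))
    # Insertions: insert one phone at any position.
    for i in range(n + 1):
        for p in inventory:
            candidates.add(tuple(canonical[:i] + [p] + canonical[i:]))
    return candidates

if __name__ == "__main__":
    canonical = ["S", "IH", "T"]  # e.g. canonical phones for "sit"
    cands = one_edit_candidates(canonical, PHONE_INVENTORY)
    print(f"{len(cands)} candidate sequences within one edit of {canonical}")
```

In the paper's pipeline, the decoder scores these candidates with the acoustic model over the detected mispronunciation region and keeps the most likely one, so the constraint keeps the search space small while still allowing a single phone substitution, insertion, or deletion per detected region.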