In this paper, we focus on the modeling of coarticulation and pronunciation variation in automatic speech recognition (ASR) systems. Most ASR systems describe these production phenomena explicitly, through context-dependent phoneme models and multiple-pronunciation lexicons.
Here, we explore whether feature spaces covering longer time segments can instead model coarticulation and pronunciation variants implicitly.
The study is based on a phonetic-level analysis of the performance of context-independent and context-dependent acoustic models, and in particular on the impact of modeling temporal contexts ranging from 70 ms to 310 ms on typical cases of pronunciation variation.
The results, confirmed by a word recognition experiment, show that generic acoustic models have some ability to handle pronunciation variation implicitly.