Statistical language models are used in many speech processing applications, e.g., automatic speech recognition (ASR). Such models are trained on text corpora, but many Romanian text corpora are unreliable in their use of diacritic marks: diacritics are partially or completely missing, which results in low-quality language models. We present a methodology for restoring diacritic marks in an unreliable text corpus that requires no text resources apart from the corpus itself. The proposed methodology (i) identifies the sections of the input corpus that use diacritics correctly, (ii) uses these sections to train a diacritics restoration system (DRS), and (iii) applies the DRS to correct the remaining sections of the corpus. We compare the DRS trained in step (ii) with state-of-the-art systems and observe an improvement of up to 12% in the correctness of diacritic restoration. Furthermore, we use our methodology to create improved language models for the ASR system developed by the SpeeD laboratory, and demonstrate a 14% decrease in perplexity and a 20% reduction in the out-of-vocabulary rate.
Index Terms: Diacritics, speech recognition
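
As an illustration only, steps (i) to (iii) of the methodology could be sketched in Python as follows. This is not the system evaluated in this work: the reliability heuristic (the share of the letters a, i, s, t that carry a diacritic), the threshold value of 0.1, the word-level most-frequent-form model, and all function names are assumptions chosen to keep the sketch short.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Remove all combining marks, e.g. 'față' -> 'fata'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def diacritic_ratio(sentence):
    """Combining marks per ambiguous base letter (a, i, s, t); assumed heuristic."""
    decomposed = unicodedata.normalize("NFD", sentence.lower())
    marked = sum(1 for ch in decomposed if unicodedata.combining(ch))
    ambiguous = sum(1 for ch in decomposed if ch in "aist")
    return marked / ambiguous if ambiguous else 0.0


def split_corpus(sentences, threshold=0.1):
    """Step (i): separate sentences that appear to use diacritics from the rest.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    reliable, unreliable = [], []
    for sentence in sentences:
        if diacritic_ratio(sentence) >= threshold:
            reliable.append(sentence)
        else:
            unreliable.append(sentence)
    return reliable, unreliable


def train_drs(reliable):
    """Step (ii): map each stripped word to its most frequent diacritized form."""
    counts = defaultdict(Counter)
    for sentence in reliable:
        for word in sentence.split():
            counts[strip_diacritics(word)][word] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def restore(sentence, model):
    """Step (iii): replace each known word by its learned diacritized form."""
    return " ".join(model.get(strip_diacritics(w), w) for w in sentence.split())
```

In this sketch, a corpus would be passed through `split_corpus`, the reliable part given to `train_drs`, and each unreliable sentence rewritten with `restore`. Words unseen in the reliable part are left unchanged, and a word-level lookup cannot resolve forms whose correct diacritization depends on context.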