ISCA Archive Eurospeech 2003

Morpheme-based lexical modeling for Korean broadcast news transcription

Young-Hee Park, Dong-Hoon Ahn, Minhwa Chung

In this paper, we describe our large-vocabulary continuous speech recognition (LVCSR) system for Korean broadcast news transcription. The main focus is to find the most suitable morpheme-based lexical model for Korean broadcast news recognition, one that can cope with the rich inflection of Korean. Because there are trade-offs between lexicon size and lexical coverage, and between the length of the lexical unit and word error rate (WER), we analyzed the training corpus to obtain a compact 24k-morpheme lexicon with 98.8% coverage. The lexicon is then optimized by combining morphemes, using statistics from the training corpus, under either a monosyllable constraint or a maximum-length constraint. In experiments, our system reduced the share of monosyllabic morphemes, which are the most error-prone units, from 52% to 29% of the lexicon, and obtained a WER of 13.24% for anchor speech and 24.97% for reporter speech.
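The abstract does not give the combination procedure in detail, so the following is only a minimal sketch of one way such a step could work, not the authors' method. It assumes a greedy merge of the most frequent adjacent morpheme pair, a monosyllable constraint read as "at least one morpheme in the pair is monosyllabic", and a maximum-length constraint counted in syllables. The function names, the `min_count` threshold, and the hyphen-separated romanized toy corpus are all hypothetical.

```python
from collections import Counter


def syllables(morpheme):
    # Assumption: syllables inside a morpheme are separated by hyphens.
    return morpheme.count("-") + 1


def coverage(lexicon, corpus):
    # Fraction of running morpheme tokens that are found in the lexicon.
    tokens = [m for sentence in corpus for m in sentence]
    return sum(m in lexicon for m in tokens) / len(tokens)


def mono_ratio(corpus):
    # Share of monosyllabic entries among the distinct units in the corpus.
    lexicon = {m for sentence in corpus for m in sentence}
    return sum(syllables(m) == 1 for m in lexicon) / len(lexicon)


def best_pair(corpus, max_len, mono_only):
    # Count adjacent pairs that satisfy the (assumed) constraints.
    counts = Counter()
    for sentence in corpus:
        for a, b in zip(sentence, sentence[1:]):
            if syllables(a) + syllables(b) > max_len:
                continue  # maximum-length constraint
            if mono_only and syllables(a) != 1 and syllables(b) != 1:
                continue  # monosyllable constraint
            counts[(a, b)] += 1
    return counts.most_common(1)[0] if counts else None


def merge_pair(sentence, pair):
    # Replace each left-to-right occurrence of the pair with one merged unit.
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) == pair:
            out.append(sentence[i] + "-" + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out


def combine(corpus, n_merges, max_len=4, mono_only=True, min_count=2):
    # Greedily merge the most frequent eligible pair, up to n_merges times.
    for _ in range(n_merges):
        best = best_pair(corpus, max_len, mono_only)
        if best is None or best[1] < min_count:
            break
        corpus = [merge_pair(s, best[0]) for s in corpus]
    return corpus


# Toy, invented corpus: three morpheme-segmented phrases.
corpus = [
    ["hak-gyo", "e", "seo"],
    ["jip", "e", "seo"],
    ["hoe-sa", "e", "seo"],
]
merged = combine(corpus, n_merges=10)
print(merged)
print(mono_ratio(corpus), mono_ratio(merged))
```

On this toy input the only pair frequent enough to merge is ("e", "seo"), so the two monosyllabic particles collapse into the single unit "e-seo" and the share of monosyllabic lexicon entries drops. A real system would also have to re-estimate the language model and pronunciation lexicon over the merged units, which this sketch leaves out.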