ISCA Archive Eurospeech 2003

Compound decomposition in Dutch large vocabulary speech recognition

Roeland Ordelman, Arjan van Hessen, Franciska de Jong

This paper addresses compound splitting for Dutch in the context of broadcast news transcription. Language models were built from the original texts and from versions of the same texts decomposed with a data-driven compound splitting algorithm. The models were compared in terms of out-of-vocabulary rates and word error rates on a real-world broadcast news transcription task. Compound splitting was found to improve ASR performance, with the best results obtained when frequent compounds were left intact.
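
To make the comparison concrete, the following is a minimal sketch of how an out-of-vocabulary rate could be measured before and after compound splitting. It is not the algorithm used in the paper, which the abstract does not specify. The splitting rule (split a word in two when both parts occur in the training text), the function names `split_compound`, `decompose` and `oov_rate`, the parameters `min_part_len` and `keep_freq`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def split_compound(word, counts, min_part_len=3, keep_freq=5):
    """Split `word` into two parts seen in training, or return it unchanged.

    Assumed rule, for illustration only: a word seen at least `keep_freq`
    times is treated as a frequent compound and left intact.
    """
    if counts.get(word, 0) >= keep_freq:
        return [word]
    best = None
    for i in range(min_part_len, len(word) - min_part_len + 1):
        head, tail = word[:i], word[i:]
        if counts.get(head, 0) > 0 and counts.get(tail, 0) > 0:
            # Prefer the split whose rarer part is the most frequent.
            score = min(counts[head], counts[tail])
            if best is None or score > best[0]:
                best = (score, [head, tail])
    return best[1] if best else [word]


def decompose(tokens, counts, **kwargs):
    """Apply split_compound to every token and flatten the result."""
    out = []
    for tok in tokens:
        out.extend(split_compound(tok, counts, **kwargs))
    return out


def oov_rate(test_tokens, vocabulary):
    """Fraction of test tokens that are not in the vocabulary."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for tok in test_tokens if tok not in vocabulary)
    return missing / len(test_tokens)


# Toy data (hypothetical): training and test text as token lists.
train = ["de", "de", "nieuws", "uitzending", "voetbal", "wedstrijd",
         "nieuwsuitzending"]
test = ["de", "voetbalwedstrijd", "nieuwsuitzending"]
counts = Counter(train)

# Baseline: vocabulary taken from the original training text.
baseline_oov = oov_rate(test, set(train))

# Decomposed: split both training and test text with the same rule.
split_train = decompose(train, counts)
split_test = decompose(test, counts)
decomposed_oov = oov_rate(split_test, set(split_train))

print("baseline OOV rate:", baseline_oov)
print("decomposed OOV rate:", decomposed_oov)
```

On this toy data the unseen compound `voetbalwedstrijd` is covered once it is split into `voetbal` and `wedstrijd`, so the OOV rate drops. Lowering `keep_freq` keeps more compounds whole, which mirrors the idea of not decomposing frequent compounds. The numbers say nothing about the results reported in the paper.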