ISCA Archive SSPR 2003

Spontaneous speech in the Spoken Dutch Corpus

Lou Boves, Nelleke Oostdijk

This paper presents the Spoken Dutch Corpus project, a joint Flemish-Dutch undertaking aimed at compiling and annotating a corpus of 1,000 hours of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in (computational) linguistics and in language and speech technology. Although the corpus will contain a fair amount of read speech (mainly to train initial acoustic models for speech recognizers), the lion’s share of the data will consist of spontaneous speech, ranging from lectures to unobtrusively recorded conversations. The corpus is unique in that all speech recordings will be made available together with several levels of high-quality annotation, from verbatim orthographic transcriptions to syntactic analyses and prosodic labeling.