ISCA Archive Interspeech 2012

Rethinking the corpus: moving towards dynamic linguistic resources

Andrew Rosenberg

The corpus is an invaluable resource in Spoken and Natural Language Processing. Consistent data sets have allowed for empirical evaluation of competing algorithms. The sharing of high-quality annotated linguistic data has enabled participation and experimentation by a wide range of researchers. However, although these annotations are dubbed "gold-standard", many corpora contain labeling errors and idiosyncrasies. The current view of the corpus as a static resource makes correction of errors and other modifications prohibitively difficult. In this paper, a perspective of the corpus as dynamically changing is advanced. Version control software can provide a mechanism to facilitate this. We highlight the problems of the static view of the corpus through case studies of the Penn Treebank, Switchboard, Hub-4, and the Boston University Radio News Corpus.

Index Terms: Linguistic Resources, Opinion paper