ISCA Archive Interspeech 2015

Data collection and annotation for state-of-the-art NER using unmanaged crowds

Spencer Rothwell, Steele Carter, Ahmad Elshenawy, Vladislavs Dovgalecs, Safiyyah Saleem, Daniela Braga, Bob Kennewick

This paper presents strategies for generating entity-level annotated text utterances using unmanaged crowds. These utterances are then used to build state-of-the-art Named Entity Recognition (NER) models, a required component for building dialogue systems. First, a wide variety of raw utterances is collected through a variant elicitation task. We ensure that these utterances are relevant by feeding them back to the crowd for a domain validation task. We also flag utterances with potential spelling errors and verify these errors with the crowd before discarding them. These strategies, combined with a periodic CAPTCHA to prevent automated responses, allow us to collect high-quality text utterances despite the inability to use the traditional gold test question approach for spam filtering. The utterances are then tagged with the appropriate NER labels by unmanaged crowds. The resulting crowd annotation was 23% more accurate and 29% more consistent than in-house annotation.
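To make the spelling-error step in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of how utterances with out-of-vocabulary tokens could be flagged for a later crowd verification task before anything is discarded; the function names, vocabulary source, and flag structure are assumptions for illustration only.

```python
# Hypothetical sketch: flag utterances whose tokens fall outside a reference
# vocabulary so a crowd task can confirm whether they are true misspellings
# before any utterance is discarded. Not the paper's actual pipeline code.

def flag_possible_misspellings(utterances, vocabulary):
    """Split utterances into (clean, flagged); flagged items carry suspect tokens."""
    clean, flagged = [], []
    for text in utterances:
        suspects = [tok for tok in text.lower().split()
                    if tok.isalpha() and tok not in vocabulary]
        if suspects:
            flagged.append({"utterance": text, "suspect_tokens": suspects})
        else:
            clean.append(text)
    return clean, flagged


if __name__ == "__main__":
    # Toy vocabulary and utterances purely for demonstration.
    vocab = {"play", "some", "jazz", "music", "tonight"}
    clean, flagged = flag_possible_misspellings(
        ["play some jazz music", "play some jaz musik"], vocab
    )
    print(clean)    # passes straight through to annotation
    print(flagged)  # routed to a crowd verification task
```

In this sketch the flagged items would be posted back to the crowd, mirroring the abstract's strategy of verifying suspected errors with workers rather than discarding utterances automatically.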