This paper evaluates and compares different approaches to collecting judgments about the pronunciation accuracy of non-native speech. We compare the common approach, in which expert linguists provide a detailed phonetic transcription of non-native English speech, with word-level judgments collected from multiple naïve listeners on a crowd-sourcing platform. In both cases we found low agreement among annotators on which words should be marked as errors. We then compare the error-detection task to a simple transcription task in which annotators were asked to transcribe the same fragments using standard English spelling. We argue that the transcription task is a simpler and more practical way of collecting annotations, and that it also yields more valid data for training an automatic scoring system.