ISCA Archive Interspeech 2020

LAIX Corpus of Chinese Learner English: Towards a Benchmark for L2 English ASR

Yanhong Wang, Huan Luan, Jiahong Yuan, Bin Wang, Hui Lin

This paper introduces a corpus of Chinese Learner English containing 82 hours of L2 English speech by Chinese learners from all major dialect regions, collected through mobile apps developed by LAIX Inc. The LAIX corpus was created to serve as a benchmark dataset for evaluating Automatic Speech Recognition (ASR) performance on L2 English and is, to our knowledge, the first of its kind. The paper describes how the corpus was built, including corpus design, data selection, and transcription. Multiple rounds of quality checks were conducted during transcription. Transcription errors are analyzed by error type, review round, and learner proficiency level. Word error rates of state-of-the-art ASR systems on the benchmark corpus are also reported.
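
For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring pipeline used for the LAIX corpus; the function name `word_error_rate` and the lowercase, whitespace-split normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.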