This paper presents an overview of the SpeechRater™ system of Educational Testing Service (ETS), a fully operational automated scoring system for non-native spontaneous speech employed in a practice context. This novel system stands in contrast to most prior speech scoring systems, which focus on fairly predictable, low-entropy speech such as read-aloud passages or short, predictable responses.
We motivate our approach by grounding our work in the TOEFL® iBT speaking construct ("what constitutes a speaker's ability to speak comprehensibly, coherently and appropriately?") and rubrics ("what levels of proficiency do we expect to observe for different score levels in different aspects or dimensions of speech?").
SpeechRater consists of three main components: the speech recognizer, trained on about 30 hours of non-native speech; the feature computation module, which computes about 40 features, predominantly in the fluency dimension; and the scoring model, which combines a selected set of speech features to predict a speaking score using multiple regression. On the task of estimating the total score for a set of three responses, our best model achieves a correlation of 0.67 with human scores and a quadratically weighted kappa of 0.61, compared to an inter-human correlation of 0.94 and an inter-human weighted kappa of 0.93.
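To make the agreement metric concrete, quadratically weighted kappa penalizes rater disagreements in proportion to the squared distance between the assigned score levels. The following is a minimal illustrative sketch, not the paper's implementation; the 1–4 integer score range is an assumption for the example, not stated above:

```python
from collections import Counter

def quadratic_weighted_kappa(human, machine, min_score=1, max_score=4):
    """Quadratically weighted kappa between two raters' integer scores.

    Weights grow with the squared distance between score levels, so a
    machine score two levels off costs four times as much as one level off.
    """
    n = len(human)
    k = max_score - min_score + 1  # number of score levels

    # Observed (dis)agreement matrix: counts of (human, machine) score pairs.
    obs = [[0.0] * k for _ in range(k)]
    for h, m in zip(human, machine):
        obs[h - min_score][m - min_score] += 1

    # Marginal score distributions for the chance-expected matrix.
    hist_h = Counter(human)
    hist_m = Counter(machine)

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2          # quadratic weight
            e = hist_h[i + min_score] * hist_m[j + min_score] / n
            num += w * obs[i][j]                      # observed weighted disagreement
            den += w * e                              # chance-expected weighted disagreement
    return 1.0 - num / den
```

Perfect agreement yields a kappa of 1.0, and systematic disagreement drives it toward (and below) zero, which is why the 0.61 machine-human value is read against the 0.93 human-human ceiling.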