The Speech Accessibility Project: Best Practices for Collection and Curation of Disordered Speech
Chris Zwilling, Mark Hasegawa-Johnson, Heather Hodges, Lorraine Ramig, Adina Bradshaw, Clarion Mendes, Heejin Kim, Alexandria Barkhimer, Laura Mattie, Meg Dickinson, Shawnise Carter, Marie Moore Channell
This paper draws on the collection and curation of disordered speech samples for the Speech Accessibility Project, a large open-source database that includes dysarthria, apraxia, dysphonia, and otherwise atypical speech, to provide best practices that guide the collection of diverse speech samples and enhance reproducibility. It describes data collection partnerships; the information technology architecture required to collect a massively distributed dataset spanning multiple countries; the data collection process from the participant's point of view; and the speech prompts and annotation process used for diverse speech. The open-source dataset resulting from these best practices will provide a high-quality testbed for researchers around the world to train automatic speech recognition algorithms for disordered speech.