ISCA Archive Interspeech 2024

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch

Thomas Graave, Zhengyang Li, Timo Lohrenz, Tim Fingscheidt

Today’s end-to-end (E2E) ASR models achieve strong performance on adult speech, but their performance deteriorates on children’s speech. Most E2E ASR models are pre-trained on adult speech, which introduces an age mismatch that can be addressed by fine-tuning on child data. However, due to the limited availability of child datasets, fine-tuning on children’s speech may introduce new domain shifts such as a speaking style mismatch. In this work, we explore mixed fine-tuning on partially matched data, namely read adult speech and spontaneous children’s speech, to improve the performance of E2E ASR on read children’s speech. We isolate the individual impact of age mismatch and speaking style mismatch and investigate the use of childrenization of read adult speech. Our proposed method reduces the WER by up to 5% absolute (21% relative) compared to the pre-trained E2E ASR model and by roughly 3% absolute (15% relative) compared to individual fine-tuning on partially matched datasets.
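
Since the results above are reported as both absolute and relative WER reductions, the following minimal sketch illustrates how the two quantities relate. It assumes the standard word-level edit-distance definition of WER; the function names `word_error_rate` and `wer_reduction` are hypothetical, and the numbers in the usage lines are illustrative only and are not taken from this work.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(baseline_wer: float, improved_wer: float) -> tuple[float, float]:
    """Return (absolute, relative) WER reduction with respect to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    absolute = baseline_wer - improved_wer
    relative = absolute / baseline_wer
    return absolute, relative


# Illustrative values only (assumed, not results from the paper).
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
absolute, relative = wer_reduction(baseline_wer=0.20, improved_wer=0.15)
print(f"WER: {wer:.3f}, absolute: {absolute:.3f}, relative: {relative:.3f}")
```

An absolute reduction is a difference in WER percentage points, whereas a relative reduction normalizes that difference by the baseline WER, so the same absolute gain corresponds to a larger relative gain when the baseline WER is lower.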