ISCA Archive Interspeech 2022

kidsTALC: A Corpus of 3- to 11-year-old German Children’s Connected Natural Speech

Lars Rumberg, Christopher Gebauer, Hanna Ehlert, Maren Wallbaum, Lena Bornholt, Jörn Ostermann, Ulrike Lüdtke

In this paper, we present kidsTALC, an audio dataset with orthographic and phonetic transcriptions of German children's speech, collected to facilitate the development of speech-based technological solutions. The dataset is part of a larger project that aims to develop machine-learning applications supporting automation in child speech and language assessment for research and clinical purposes. The interdisciplinary project was also established to make corpora of continuous child speech more accessible, in Germany and globally, for training accurate automatic speech recognition tools for children. In the first stage, we collected and transcribed 25 hours of continuous speech from typically developing children aged 3½ to 11 years. Here, we discuss the key features of the dataset, the data collection, the transcription protocol, and future datasets planned in the project. We also present important statistics of the dataset and demonstrate the speech recognition performance of one baseline model on it.