ISCA Archive Interspeech 2022

KSC2: An Industrial-Scale Open-Source Kazakh Speech Corpus

Saida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan Varol

We present the first industrial-scale open-source Kazakh speech corpus for automatic speech recognition research and development. Our corpus subsumes two previously presented corpora: 1) Kazakh speech corpus (KSC) and 2) Kazakh text-to-speech 2 (KazakhTTS2). We also provide additional data from other sources, including television news, television and radio programs, parliament speeches, and podcasts. Our corpus, named KSC2, contains over a thousand hours of high-quality transcribed data, triple the size of KSC. KSC2 was manually transcribed with the help of native Kazakh speakers and validated via preliminary speech recognition experiments on various evaluation sets. Moreover, it contains utterances with Kazakh-Russian code-switching, a conversational practice common among Kazakh speakers. We believe that our corpus will facilitate speech processing research for Kazakh, which is widely considered an under-resourced language. To ensure the reproducibility of experiments, we share the KSC2 corpus, training recipes, and pretrained models.