ISCA Archive Interspeech 2023

Thai Dialect Corpus and Transfer-based Curriculum Learning Investigation for Dialect Automatic Speech Recognition

Artit Suwanbandit, Burin Naowarat, Orathai Sangpetch, Ekapol Chuangsuwanich

We release 840 hours of read-speech multi-dialect ASR corpora consisting of 700 hours of the main Thai dialect, named Thai-central, and 40 hours for each local dialect, named Thai-dialect, with transcripts and their translations into Thai. The dialects, selected to represent different regions of Thailand, are Khummuang, Korat, and Pattani. We also release baseline dialectal ASR models trained using a curriculum learning approach. We found that pre-training with the high-resource main dialect and the target dialect generally yields the best performance. We believe that the availability of our corpora will help address the problem of low-resource Thai dialects. The corpus data will be available on GitHub.