ISCA Archive Interspeech 2024

All Ears: Building Self-Supervised Learning based ASR models for Indian Languages at scale

Vasista Sai Lodagala, Abhishek Biswas, Shoutrik Das, Jordan F, S Umesh

The abundance of unlabeled speech and the ease with which it can be collected call for the development of self-supervised learning (SSL) based speech foundation models, which have proven effective across several downstream speech tasks. As part of this work, we curate 29.5K hours of raw speech data across 24 Indian languages and multiple domains to pre-train SSL models over 5 different architectures. We then fine-tune these models for the downstream Automatic Speech Recognition (ASR) task on 13 Indian languages and evaluate them over diverse benchmarks. In addition, we measure the efficacy of these models by evaluating them on the SUPERB benchmark. Our work highlights the need for careful choice of SSL objectives while emphasizing the benefits of multilingual pre-training. Our pre-trained models outperform baseline models such as MMS-300M and IndicWav2Vec with relative WER improvements of 17.3% and 36.0%, respectively, on Indian language ASR.
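For concreteness, the relative WER improvements quoted above are assumed to follow the standard definition, (WER_baseline - WER_ours) / WER_baseline x 100. The short Python sketch below illustrates this computation; the absolute WER values in it are hypothetical placeholders chosen for illustration, not figures reported in the paper.

    def relative_wer_improvement(wer_baseline: float, wer_ours: float) -> float:
        # Standard definition of relative WER improvement, in percent:
        # (baseline WER - our WER) / baseline WER * 100.
        return (wer_baseline - wer_ours) / wer_baseline * 100.0

    # Hypothetical WERs for illustration only; the abstract reports the
    # relative improvements, not these absolute values.
    wer_mms_300m = 30.0  # assumed baseline WER (%)
    wer_ours = 24.8      # assumed WER (%) of the pre-trained model
    print(f"{relative_wer_improvement(wer_mms_300m, wer_ours):.1f}% relative WER improvement")
    # prints: 17.3% relative WER improvement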