ISCA Archive SIGUL 2023

Collecting Speech Data for Endangered and Under-resourced Indian Languages

Ritesh Kumar, Meiraba Takhellambam, Bornini Lahiri, Amalesh Gope, Shyam Ratan, Neerav Mathur, Siddharth Singh

The preparation of speech corpora for languages un(der)represented on the web depends largely on manual methods of collecting and processing data from different sources. The methods used in field linguistics and documentary linguistics for collecting data from speech communities provide a valuable set of resources and methodologies for such data collection, but these methods were not developed or optimised for large-scale data collection. This limitation could be overcome by combining linguistic field methods with crowdsourcing. In this paper, we discuss two such ongoing projects, SpeeD-TB and SpeeD-IA, in which we are experimenting with different methods and developing software and other infrastructure to rapidly collect speech data in six Tibeto-Burman languages (Toto, Chokri, Nyishi, Kok Borok, Bodo and Meitei) and four Indo-Aryan languages (Awadhi, Bhojpuri, Braj and Magahi) of India. So far we have collected over 40 hours of speech data in these languages, and over the next year we plan to collect a total of approximately 1,200 hours.


doi: 10.21437/SIGUL.2023-4

Cite as: Kumar, R., Takhellambam, M., Lahiri, B., Gope, A., Ratan, S., Mathur, N., Singh, S. (2023) Collecting Speech Data for Endangered and Under-resourced Indian Languages. Proc. 2nd Annual Meeting of the ELRA/ISCA SIG on Under-resourced Languages (SIGUL 2023), 14-18, doi: 10.21437/SIGUL.2023-4

@inproceedings{kumar23_sigul,
  author={Ritesh Kumar and Meiraba Takhellambam and Bornini Lahiri and Amalesh Gope and Shyam Ratan and Neerav Mathur and Siddharth Singh},
  title={{Collecting Speech Data for Endangered and Under-resourced Indian Languages}},
  year=2023,
  booktitle={Proc. 2nd Annual Meeting of the ELRA/ISCA SIG on Under-resourced Languages (SIGUL 2023)},
  pages={14--18},
  doi={10.21437/SIGUL.2023-4}
}