ISCA Archive SLTU 2012

Developments of Swahili resources for an automatic speech recognition system

Hadrien Gelas, Laurent Besacier, François Pellegrino

This article describes our efforts to provide ASR resources for Swahili, a Bantu language spoken across a wide area of East Africa. We start with an overview of the language's situation, at both the linguistic and the digital level. We then report the strategies selected to develop a text corpus, a pronunciation dictionary and a speech corpus for this under-resourced language. We explore methodologies such as crowdsourcing and a collaborative transcription process. In addition, we take advantage of some linguistic characteristics of the language, such as its rich morphology and its vocabulary shared with English, to improve the performance of our baseline Swahili ASR system on a broadcast speech transcription task.

Index Terms: Swahili, under-resourced languages, automatic speech recognition, speech resources