ISCA Archive Interspeech 2023

The Role of Formant and Excitation Source Features in Perceived Naturalness of Low Resource Tribal Language TTS: An Empirical Study

Ashwini Dasare, Pradyoth Hegde, Supritha Shetty, Deepak K T

Text-to-speech synthesis is a prominent area in the speech processing domain, with significant use in reading out digital content in a given language. In this work, we study two tribal languages of India, Lambani and Soliga, both of which are zero-resource languages. First, a dataset was collected for each language. Second, a separate Text-To-Speech (TTS) system was built for each language using a transfer learning approach. To validate the voice quality of the TTS-generated speech, both subjective and objective evaluations were performed. As part of the objective analysis, the voice source and vocal tract filter properties of the synthetic speech were explored. An extensive study of several aspects of speech, namely the LP residual, the F0 contour, and the formants (F1 and F2), yielded results that can be correlated with the subjective listening test results. The link to the original and synthetic speech can be found online.
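To make the objective measures named above concrete, the following is a minimal sketch of how an LP residual (excitation source) and formant estimates (vocal tract filter) can be obtained from a single speech frame with linear prediction. It is an illustration under stated assumptions, not the analysis pipeline used in this study: the function names, the autocorrelation method, the Hamming window, and the LP order are all choices made here for the example.

```python
import numpy as np


def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC (assumed here); returns [1, a1, ..., ap]."""
    windowed = frame * np.hamming(len(frame))
    full = np.correlate(windowed, windowed, mode="full")
    # Keep autocorrelation lags 0..order.
    r = full[len(windowed) - 1 : len(windowed) + order]
    # Build the Toeplitz autocorrelation matrix and solve R a = -r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1 : order + 1])
    return np.concatenate(([1.0], a))


def lp_residual(frame, a):
    """Inverse-filter the frame with A(z) to obtain the prediction error."""
    return np.convolve(frame, a)[: len(frame)]


def formants_from_lpc(a, sample_rate):
    """Estimate resonance frequencies (Hz) from the angles of the LPC poles."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]  # keep one pole of each conjugate pair
    freqs = np.angle(roots) * sample_rate / (2.0 * np.pi)
    return np.sort(freqs)
```

For a voiced frame, the two lowest values returned by `formants_from_lpc` would serve as rough F1 and F2 estimates, and the output of `lp_residual` approximates the excitation signal. In practice the LP order is usually chosen relative to the sampling rate, and poles with very wide bandwidths are often discarded before reading off formants; neither refinement is shown here.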