Building text-to-speech (TTS) synthesisers for Indian languages is
a difficult task owing to the large number of active languages. Indian
languages can be classified into a finite set of families, prominent
among them being Indo-Aryan and Dravidian. The proposed work exploits this
property to build a generic TTS system using multiple languages from
the same family in an end-to-end framework. Generic systems are quite
robust as they are capable of capturing a variety of phonotactics across
languages. These systems are then adapted to a new language in the
same family using small amounts of adaptation data. Experiments indicate
that good quality TTS systems can be built using only 7 minutes of
adaptation data. An average degradation mean opinion score of 3.98
is obtained for the adapted TTSes. An extensive analysis
of systematic interactions between languages in the generic TTSes is
carried out. x-vectors are included as speaker embeddings to synthesise
text in a particular speaker’s voice. An interesting observation
is that the prosody of the target speaker’s voice is preserved.
These results are quite promising as they indicate the capability of
generic TTSes to handle speaker and language switching seamlessly,
along with the ease of adaptation to a new language.