ISCA Archive Interspeech 2022

Text aware Emotional Text-to-speech with BERT

Arijit Mukherjee, Shubham Bansal, Sandeepkumar Satpal, Rupesh Mehta

Emotional text-to-speech is the task of synthesizing emotionally expressive audio with a text-to-speech model. With neural text-to-speech, sentence-level naturalness has improved considerably and is now almost on par with human speech. However, current emotional text-to-speech models rely heavily on the user to supply the expected emotion along with the text in order to synthesize the desired speech. In this work, we propose a novel text-aware emotional text-to-speech system that leverages a pre-trained BERT model to obtain a deep representation of the emotional context of the text, both during training and at inference. We show that our proposed method synthesizes audio whose emotion depends on the emotional context of the input text. We also show that our method outperforms baseline systems in varying the emotional intensity according to the text.