ISCA Archive SpeechProsody 2024

Can OpenAI’s TTS model convey information status using intonation like humans?

Na Hu, Jiseung Kim, Riccardo Orrico, Stella Gryllia, Amalia Arvaniti

Chatbots powered by Large Language Models (LLMs) such as OpenAI’s ChatGPT have demonstrated impressive capabilities in understanding and generating text, and their potential applications in humanities research have been extensively explored. Recently, OpenAI launched its first Text-To-Speech (TTS) model, which can convert text into highly realistic speech, opening up various potential applications for prosodic research. Before such applications materialize, however, a systematic evaluation is needed to determine the extent to which the synthesized speech resembles human speech in terms of prosody. This study contributes to this endeavor by comparing how intonation conveys information status in British English speech synthesized with OpenAI’s TTS model and in speech produced by native speakers of the same English variety. Using Functional Principal Component Analysis (FPCA) and statistical modelling, we found that OpenAI’s TTS model can generate F0 contours with a variety of shapes. However, the F0 contours it generates to convey information status differ from those produced by the human speakers. This indicates that speech generated by OpenAI’s TTS model may not yet be ready for use in prosody research.
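The abstract names FPCA as the analysis method. As a rough illustration only (this is not the authors' pipeline, and the data below are synthetic): when F0 contours are sampled on a common normalized time grid, FPCA reduces to PCA on the sampled curves, yielding principal component curves and per-contour scores that can then enter statistical models.

```python
import numpy as np

# Hypothetical illustration of FPCA on F0 contours, assuming each
# contour has been time-normalized and sampled on a shared grid.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)            # normalized time grid
n = 40                               # number of contours

# Synthetic contours: a falling trend plus a peak of variable height
contours = (200 - 40 * t
            + rng.normal(0, 10, (n, 1)) * np.exp(-((t - 0.5) ** 2) / 0.02)
            + rng.normal(0, 2, (n, t.size)))

mean_curve = contours.mean(axis=0)   # mean F0 contour
centered = contours - mean_curve

# SVD of the centered curves: rows of Vt are the principal
# component curves; U * s gives each contour's PC scores.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
scores = U * s
var_explained = s**2 / np.sum(s**2)  # proportion of variance per PC

print(f"PC1 explains {var_explained[0]:.0%} of contour variance")
```

In such analyses the PC scores (one value per contour per component) serve as the dependent variables in subsequent statistical modelling, e.g. to test whether contour shape differs by information status condition.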