ISCA Archive Interspeech 2023

Prosody Modeling with 3D Visual Information for Expressive Video Dubbing

Zhihan Yang, Shansong Liu, Xu Li, Haozhe Wu, Zhiyong Wu, Ying Shan, Jia Jia

The automatic video dubbing task has been proposed to meet personal and industrial demands for dubbing. Current methods mostly focus on duration matching and overlook the synchronization of prosody, so the generated speech lacks expressiveness. In this paper, we introduce visual prosody modeling to improve expressiveness in video dubbing. We define visual prosody as the expression and head pose in 3D space, a representation with three advantages: 1) high relevance to the tone and stress of utterances; 2) higher accuracy than 2D images; 3) disentanglement from irrelevant factors such as speaker identity. We propose 3D-VD (3D Video Dubber), a system that incorporates visual prosody and uses a visual-text step-wise aligner to control the generated prosody. Experiments demonstrate that the proposed method outperforms previous methods that consider only 2D face images in terms of naturalness, lip-speech alignment, and synchronization of visual and auditory prosody. A case study demonstrates the correlation between expression and pitch.