ISCA Archive Interspeech 2023

Emotion-Aware Audio-Driven Face Animation via Contrastive Feature Disentanglement

Xin Ren, Juan Luo, Xionghu Zhong, Minjie Cai

In this paper, we tackle the problem of audio-driven face animation, which aims to synthesize a realistic talking face given a piece of driving speech. Directly modeling the mapping from audio features to facial expressions is challenging, since people tend to have different talking styles shaped by momentary emotional states as well as identity-dependent vocal characteristics. To address this challenge, we propose a contrastive feature disentanglement method for emotion-aware face animation. The key idea is to disentangle the features for speech content, momentary emotion, and identity-dependent vocal characteristics from the audio features with a contrastive learning strategy. Experiments on public datasets show that our method generates more realistic facial expressions and enables the synthesis of diversified face animations with different emotions.
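To make the contrastive disentanglement idea concrete, the sketch below shows a generic InfoNCE-style contrastive loss in plain NumPy. This is an illustrative assumption, not the authors' implementation: an anchor embedding (e.g. the emotion feature of one clip) is pulled toward a positive (a clip sharing the same emotion) and pushed away from negatives (clips with different emotions); the function name, temperature value, and pairing scheme are all hypothetical.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for a single anchor embedding.

    anchor:    (d,)   embedding to be disentangled (e.g. emotion feature)
    positive:  (d,)   embedding that should match the anchor
    negatives: (k, d) embeddings that should differ from the anchor
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a = normalize(anchor)
    p = normalize(positive)
    n = normalize(negatives)

    pos_sim = (a @ p) / temperature        # similarity to the positive
    neg_sim = (n @ a) / temperature        # (k,) similarities to negatives
    logits = np.concatenate([[pos_sim], neg_sim])
    # Cross-entropy with the positive treated as the correct class:
    # loss is low when the anchor is close to the positive and far
    # from all negatives.
    return -(pos_sim - np.log(np.sum(np.exp(logits))))
```

Applied per factor (content, emotion, vocal identity), such a loss encourages each branch of the encoder to capture only its own factor, since positives for one factor can be negatives for another.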