In this paper, we tackle the problem of audio-driven face animation, which aims to synthesize a realistic talking face given a piece of driving speech. Directly modeling the mapping from audio features to facial expressions is challenging, since people tend to have different talking styles shaped by momentary emotional states as well as identity-dependent vocal characteristics. To address this challenge, we propose a contrastive feature disentanglement method for emotion-aware face animation. The key idea is to disentangle features for speech content, momentary emotion, and identity-dependent vocal characteristics from the audio signal with a contrastive learning strategy. Experiments on public datasets show that our method generates more realistic facial expressions and enables the synthesis of diverse face animations with different emotions.
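To make the contrastive disentanglement idea more concrete, the following is a minimal sketch, not the paper's actual implementation: a standard InfoNCE-style contrastive loss (in PyTorch, with hypothetical tensor shapes and variable names) that pulls together embeddings of clips sharing an attribute such as emotion while pushing apart embeddings of clips with different emotions. The same form of loss could, under this assumption, be applied separately to content and identity embeddings.

```python
# Illustrative sketch only (not the authors' code): an InfoNCE-style contrastive
# loss for disentangling one factor (e.g. emotion) in audio embeddings.
import torch
import torch.nn.functional as F


def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    """Contrastive loss: the anchor should be closer to its positive than to negatives.

    anchor:    (B, D) embeddings, e.g. emotion features of audio clips
    positive:  (B, D) embeddings of clips assumed to share the same emotion
    negatives: (B, K, D) embeddings of clips assumed to have different emotions
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Cosine similarity of the anchor to its positive and to each negative.
    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)      # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives)      # (B, K)

    # The positive occupies index 0 of the logits, so the target class is 0.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1 + K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)


# Toy usage with random tensors standing in for encoder outputs.
if __name__ == "__main__":
    B, K, D = 8, 16, 128
    loss = info_nce_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, K, D))
    print(loss.item())
```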