Emotion transfer aims to extract the emotional state from reference speech and use it to synthesize speech from text with the same emotional state. Previous methods usually use a reference encoder, consisting of convolution layers and recurrent neural networks, to extract emotion features from the reference mel-spectrogram. However, these methods fail to extract robust emotion features because the extracted features are highly entangled with other components, such as speaker identity. This may lead to an emotion mismatch between the reference speech and the synthesized target speech. In this paper, we propose WET, a Wav2vec 2.0-based emotion transfer model that improves emotion feature extraction. By adding auxiliary classifiers, we encourage the extracted features to be strongly correlated with the emotion category of the reference speech. In addition, we use the relative attributes method to control the emotion intensity of the synthesized speech, which makes the results sound more natural. Finally, we evaluate our method with several objective and subjective metrics, and the experimental results show that our proposal achieves higher emotion transfer accuracy while maintaining speech quality.
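The abstract gives no implementation details, but the emotion encoder it describes can be sketched roughly as follows: a pretrained wav2vec 2.0 model produces frame-level features that are pooled into an utterance-level emotion embedding, and an auxiliary emotion classifier on top of that embedding encourages it to capture the emotion category rather than other factors. The sketch below assumes the Hugging Face transformers implementation and the facebook/wav2vec2-base checkpoint; the class name, mean pooling, and embedding dimension are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # assumed: Hugging Face checkpoint, not the authors' setup


class EmotionEncoder(nn.Module):
    """Pools wav2vec 2.0 frame features into an utterance-level emotion embedding.

    An auxiliary emotion classifier on the embedding encourages it to encode
    the reference emotion category rather than, e.g., speaker identity.
    """

    def __init__(self, num_emotions=4, emb_dim=256,
                 pretrained="facebook/wav2vec2-base"):
        super().__init__()
        self.wav2vec = Wav2Vec2Model.from_pretrained(pretrained)
        hidden = self.wav2vec.config.hidden_size              # 768 for the base model
        self.proj = nn.Linear(hidden, emb_dim)                # emotion embedding fed to the TTS model
        self.emotion_head = nn.Linear(emb_dim, num_emotions)  # auxiliary classifier

    def forward(self, waveform):
        # waveform: (batch, samples) raw 16 kHz audio of the reference utterance
        frames = self.wav2vec(waveform).last_hidden_state     # (batch, time, hidden)
        pooled = frames.mean(dim=1)                           # average pooling over time
        emotion_emb = self.proj(pooled)
        logits = self.emotion_head(emotion_emb)               # used only for the auxiliary loss
        return emotion_emb, logits


# usage: the auxiliary cross-entropy loss is added to the TTS training objective
encoder = EmotionEncoder()
wav = torch.randn(2, 16000)                                   # two 1-second dummy references
emb, logits = encoder(wav)
aux_loss = nn.functional.cross_entropy(logits, torch.tensor([0, 2]))
```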
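For intensity control, the relative attributes method (in the sense of Parikh and Grauman) learns a ranking function so that more emotional samples score higher than neutral ones, and the normalized score serves as an intensity value at synthesis time. The minimal sketch below substitutes a margin ranking loss for the original ranking SVM and operates on the utterance-level embeddings from the encoder above; all names and dimensions are assumptions rather than the paper's setup.

```python
import torch
import torch.nn as nn

emb_dim = 256
w = nn.Parameter(torch.randn(emb_dim) * 0.01)   # linear ranking direction over emotion embeddings
opt = torch.optim.Adam([w], lr=1e-2)
rank_loss = nn.MarginRankingLoss(margin=1.0)


def train_step(emotional_embs, neutral_embs):
    # Each row is an utterance-level emotion embedding; enforce that emotional
    # references of a given category rank above neutral ones.
    s_pos = emotional_embs @ w
    s_neg = neutral_embs @ w
    loss = rank_loss(s_pos, s_neg, torch.ones_like(s_pos))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


def intensity(embs, lo, hi):
    # Map raw ranking scores to [0, 1] using corpus-level min/max scores (lo, hi);
    # the result can scale the emotion embedding or condition the TTS model.
    return ((embs @ w) - lo) / (hi - lo)
```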