This paper proposes a nonparallel emotional speech conversion (ESC) method based on a Variational AutoEncoder-Generative Adversarial Network (VAE-GAN). Emotional speech conversion aims to transform speech from a source emotion to a target emotion without changing the speaker’s identity or linguistic content. In this work, an encoder is trained to extract content-related representations from acoustic features, while emotion-related representations are extracted in a supervised manner. The transformation between emotion-related representations from different domains is then learned using an improved cycle-consistent Generative Adversarial Network (CycleGAN). Finally, emotion conversion is performed by extracting the content-related representations of the source speech and recombining them with the emotion-related representations of the target emotion. Subjective evaluation experiments show that the proposed method outperforms the baseline in terms of both voice quality and emotion conversion ability.
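The conversion pipeline described above (content encoding, emotion extraction, CycleGAN-based emotion mapping, and recombination) can be sketched schematically as follows. This is a minimal illustrative sketch, not the paper's implementation: all function names and dimensions are hypothetical, and random linear maps stand in for the trained encoder, emotion extractor, CycleGAN generator, and decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, not from the paper).
DIM_FEAT, DIM_CONTENT, DIM_EMO = 36, 16, 8

# Random linear maps as stand-ins for the trained networks.
W_enc = rng.standard_normal((DIM_FEAT, DIM_CONTENT))
W_emo = rng.standard_normal((DIM_FEAT, DIM_EMO))
W_gen = rng.standard_normal((DIM_EMO, DIM_EMO))           # CycleGAN generator stub
W_dec = rng.standard_normal((DIM_CONTENT + DIM_EMO, DIM_FEAT))

def encode_content(x):
    """Content encoder: acoustic features -> content-related representation."""
    return x @ W_enc

def extract_emotion(x):
    """Supervised emotion extractor: features -> emotion-related representation."""
    return x @ W_emo

def map_emotion(z_emo):
    """Generator stub: source-emotion representation -> target-emotion domain."""
    return z_emo @ W_gen

def decode(z_content, z_emo):
    """Decoder: recombine content and emotion representations into features."""
    return np.concatenate([z_content, z_emo], axis=-1) @ W_dec

# Convert one utterance of 100 frames: keep content, swap emotion domain.
src = rng.standard_normal((100, DIM_FEAT))
converted = decode(encode_content(src), map_emotion(extract_emotion(src)))
assert converted.shape == src.shape
```

The sketch only illustrates the data flow: the source speech contributes the content representation, while the emotion representation is mapped into the target-emotion domain before recombination.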