Voice conversion (VC) transforms an utterance to sound like another person without changing the linguistic content. StarGANv2-VC, a recently proposed generative adversarial network-based VC method, is very successful at generating natural-sounding conversions. However, the method fails to preserve the emotion of the source speaker in the converted samples, and emotion preservation is necessary for natural human-computer interaction. In this paper, we show that StarGANv2-VC fails to disentangle the speaker and emotion representations, a disentanglement that is pertinent to preserving emotion. Specifically, emotion leaks from the reference audio used to capture the speaker embeddings during training. To counter this problem, we propose novel emotion-aware losses and an unsupervised method that exploits emotion supervision through latent emotion representations. Objective and subjective evaluations demonstrate the efficacy of the proposed strategy across diverse datasets, emotions, and genders.
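
To make the idea of an emotion-aware loss concrete, here is a minimal sketch of one plausible form: a pretrained emotion classifier scores both the source utterance and the converted output, and the loss penalizes divergence between the two emotion posteriors. This is an illustration only, not the paper's actual formulation; the names `EmotionClassifier` and `emotion_consistency_loss`, the toy network, and the KL-divergence choice are all assumptions introduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmotionClassifier(nn.Module):
    """Toy stand-in for a pretrained emotion recognizer over mel-spectrograms.

    Hypothetical placeholder; the paper's classifier architecture is not
    specified in this abstract.
    """

    def __init__(self, n_mels: int = 80, n_emotions: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over the time axis
            nn.Flatten(),
            nn.Linear(64, n_emotions),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> (batch, n_emotions) logits
        return self.net(mel)


def emotion_consistency_loss(classifier: nn.Module,
                             source_mel: torch.Tensor,
                             converted_mel: torch.Tensor) -> torch.Tensor:
    """KL divergence between the emotion posterior of the source utterance
    (treated as a fixed target) and that of the converted utterance."""
    with torch.no_grad():  # the source posterior is a target, not a trainable path
        target = F.softmax(classifier(source_mel), dim=-1)
    log_pred = F.log_softmax(classifier(converted_mel), dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")


if __name__ == "__main__":
    clf = EmotionClassifier()
    src = torch.randn(4, 80, 120)   # batch of source mel-spectrograms
    conv = torch.randn(4, 80, 120)  # batch of converted mel-spectrograms
    print(f"emotion consistency loss: {emotion_consistency_loss(clf, src, conv).item():.4f}")
```

In a VC training loop, a term like this would be added to the generator objective so that the converted sample is pulled toward the source's emotion posterior, directly discouraging the emotion leakage from the reference audio described above.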