The goal of voice conversion is to convert the input voice to match the target speaker’s voice while keeping text and prosody intact. Voice conversion is commonly used in entertainment and speaking-aid systems, as well as for speech data generation and augmentation. The development of any-to-any voice conversion systems, which are capable of generating voices unseen during training, is of particular interest to both researchers and the industry. Despite recent progress, any-to-any conversion quality is still inferior to natural speech. In this work, we propose a new any-to-any voice conversion pipeline. To the best of our knowledge, this is the first use of an ASR encoder with a GAN training objective in a voice conversion system. We also implement a joint conditional decoder-vocoder model, which simplifies training and improves performance. According to multiple subjective and objective evaluations, our method outperforms modern systems in terms of voice quality, similarity, and consistency.