In this paper, we propose a zero-shot voice conversion algorithm that adds a number of conditioning signals to explicitly transfer prosody, linguistic content, and dynamics to the converted speech. We show that the proposed approach improves overall conversion quality and generalization to out-of-domain samples relative to a baseline implementation of AutoVC, since the explicit conditioning signals reduce the burden on the model’s encoder to implicitly learn all of the different aspects involved in speech production. An ablation analysis illustrates the effectiveness of the proposed method.
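The paper's exact conditioning scheme is not spelled out in this summary, but the general idea can be sketched as follows. This is a minimal, illustrative sketch only, assuming an AutoVC-style content bottleneck plus a target-speaker embedding, with hypothetical frame-level conditioning features (e.g. pitch and loudness) concatenated before decoding; all names and dimensions are placeholders, not the paper's implementation.

```python
import numpy as np

def build_decoder_input(content_emb, speaker_emb, conditioning):
    """Concatenate per-frame content codes, a broadcast speaker embedding,
    and explicit conditioning features along the channel axis, so the decoder
    receives prosody/dynamics directly instead of relying on the encoder
    to capture them implicitly.

    content_emb:  (T, d_content) bottleneck codes from the content encoder
    speaker_emb:  (d_speaker,)   utterance-level target-speaker embedding
    conditioning: (T, d_cond)    explicit per-frame signals, e.g. log-F0, loudness
    """
    T = content_emb.shape[0]
    # Repeat the utterance-level speaker embedding for every frame.
    spk = np.broadcast_to(speaker_emb, (T, speaker_emb.shape[0]))
    return np.concatenate([content_emb, spk, conditioning], axis=1)

# Toy shapes: 128 frames, 32-dim content codes, 64-dim speaker embedding,
# 2 conditioning channels (e.g. pitch + energy).
x = build_decoder_input(
    np.zeros((128, 32)),
    np.ones(64),
    np.zeros((128, 2)),
)
print(x.shape)  # (128, 98)
```

In this view, the decoder sees a (T, 98)-dimensional input per utterance, and the ablation in the paper corresponds to dropping individual conditioning channels from the concatenation.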