Within the domain of multi-talker recordings, many speech technologies rely on an initial segmentation step of finding when each person was talking. One common approach to this task is Target-Speaker Voice Activity Detection (TS-VAD), in which a model is supplied with a representation of a particular speaker and then identifies the temporal regions in which that person was talking. The increased complexity of this task over regular Voice Activity Detection (VAD) imposes stronger constraints on the data needed to train such a model. In this work, we explore converting a pre-trained VAD model into a TS-VAD model via an implicitly-trained separation front end. This decouples the need for speaker-discriminative training data from the basic speech/non-speech data used to train VAD models, and can improve model robustness and speech recall in domains present only in the VAD's training data.
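
To make the composition concrete, the following is a minimal sketch, assuming a PyTorch implementation: a speaker-conditioned separation front end whose output feeds an unmodified, frozen pre-trained VAD, so that gradients from the VAD objective train the front end implicitly. All module names, shapes, and the masking design here are illustrative assumptions, not the actual architecture described above.

```python
import torch
import torch.nn as nn


class SeparationFrontEnd(nn.Module):
    """Maps mixture features plus a target-speaker embedding to an estimate
    of that speaker's features. Trained only through the downstream VAD loss
    ("implicitly"), so no separation targets are required."""

    def __init__(self, feat_dim: int = 80, spk_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim + spk_dim, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, feat_dim), nn.Sigmoid())

    def forward(self, mix_feats: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # mix_feats: (batch, time, feat_dim); spk_emb: (batch, spk_dim)
        cond = spk_emb.unsqueeze(1).expand(-1, mix_feats.size(1), -1)
        h, _ = self.rnn(torch.cat([mix_feats, cond], dim=-1))
        # Apply a per-frame mask to suppress non-target speech.
        return mix_feats * self.mask(h)


class TSVAD(nn.Module):
    """TS-VAD as separation front end + frozen pre-trained VAD."""

    def __init__(self, front_end: nn.Module, pretrained_vad: nn.Module):
        super().__init__()
        self.front_end = front_end
        self.vad = pretrained_vad
        # Freeze the VAD so its learned speech/non-speech behavior is preserved.
        for p in self.vad.parameters():
            p.requires_grad = False

    def forward(self, mix_feats: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # Returns per-frame speech activity estimates for the target speaker.
        return self.vad(self.front_end(mix_feats, spk_emb))
```

Because only the front end receives gradient updates, the speaker-discriminative data requirement attaches to the front end alone, while the VAD keeps whatever domains it was originally trained on.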