Voice activity detection (VAD) is the task of distinguishing speech from other types of audio signals, such as music or background noise. We introduce a novel end-to-end VAD architecture that incorporates a pre-trained transformer model (Wav2Vec2-XLS-R). We evaluate the proposed architecture on an established VAD dataset, AVA-Speech, and on a manually segmented corpus of under-resourced multilingual speech. As benchmarks, we include a hybrid CNN-BiLSTM system and an off-the-shelf enterprise VAD. On the AVA-Speech test set, the proposed VAD achieves an area under the curve (AUC) of 96.2%, while the benchmarks achieve 94.8% and 81.9%, respectively. On the multilingual dataset, the gap widens: the transformer-based VAD achieves an AUC of 92.2%, against 80.8% and 74.6% for the two baselines. We conclude that the proposed end-to-end architecture offers improved VAD performance in all cases, with an absolute AUC increase of more than 11% on our target domain.
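To make the general shape of such an architecture concrete, the following is a minimal sketch of how a frame-level speech/non-speech head can be attached to a pre-trained Wav2Vec2-XLS-R encoder. It assumes the HuggingFace Transformers API; the checkpoint name, the single linear classification head, and the training setup are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class TransformerVAD(nn.Module):
    """Illustrative frame-level VAD: a pre-trained Wav2Vec2-XLS-R
    encoder followed by a linear speech/non-speech classifier.
    This is a sketch, not the paper's exact architecture."""

    def __init__(self, checkpoint: str = "facebook/wav2vec2-xls-r-300m"):
        super().__init__()
        # Pre-trained multilingual encoder (assumed checkpoint name).
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        # One logit per encoder frame: speech vs. non-speech.
        self.head = nn.Linear(hidden, 1)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz audio.
        frames = self.encoder(waveform).last_hidden_state  # (batch, T, hidden)
        return self.head(frames).squeeze(-1)               # (batch, T) logits


# Usage: per-frame speech posteriors for one second of (random) audio.
vad = TransformerVAD()
with torch.no_grad():
    probs = torch.sigmoid(vad(torch.randn(1, 16000)))  # ~49 frames (20 ms hop)
```

Under this setup, the model would be trained end-to-end with frame-level speech labels (e.g., via `nn.BCEWithLogitsLoss`), and thresholding the sigmoid posteriors yields the speech/non-speech decision whose threshold sweep produces the AUC figures reported above.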