The objective of this paper is to advance speaker recognition and speaker identification technology by introducing VoxTube, a large labeled audio database collected from open-source media. We propose a fully automated unsupervised approach to audio labeling that can be run with any pre-trained speaker recognition model. Collected with this approach from videos under a CC BY license, the VoxTube dataset contains more than 5,000 speakers and more than 4 million utterances spoken in more than 10 languages. We show VoxTube's high generalization ability across multiple domains by evaluating accuracy metrics on various speaker recognition benchmarks. We also show how well this dataset complements the existing VoxCeleb2 dataset.