The success of speaker recognition systems depends heavily on large training datasets collected under real-world conditions. While widely spoken languages such as English and Chinese have abundant datasets, resources for low-resource languages such as Vietnamese remain limited. This paper presents a large-scale dataset of spontaneous speech collected in noisy environments, with over 87,000 utterances from 1,000 Vietnamese speakers of many professions, covering the three main Vietnamese dialects. To build the dataset, we propose a construction pipeline that can also be applied to other languages and that uses efficient visual-aided processing techniques to improve data precision. With the state-of-the-art x-vector model, training on the proposed dataset yields average absolute and relative equal error rate (EER) improvements of 5.48% and 41.61%, respectively, compared with the same model trained on VLSP 2021, a publicly available Vietnamese speaker dataset.
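
To make the distinction between the two reported figures concrete, the following minimal sketch shows how absolute and relative EER improvement are conventionally computed for a single test condition. The function name `eer_improvement` and the input values are hypothetical and are not taken from the paper's results; the assumption that the reported figures are averages of these per-condition quantities is ours.

```python
def eer_improvement(baseline_eer: float, new_eer: float) -> tuple[float, float]:
    """Return (absolute, relative) EER improvement.

    Both EERs are given in percent. The absolute improvement is in
    percentage points; the relative improvement is a percentage of
    the baseline EER.
    """
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    absolute = baseline_eer - new_eer
    relative = 100.0 * absolute / baseline_eer
    return absolute, relative


# Hypothetical values for illustration only (not results from the paper):
# a baseline EER of 10.0% reduced to 6.0%.
absolute, relative = eer_improvement(10.0, 6.0)
print(f"absolute: {absolute:.2f} points, relative: {relative:.2f}%")
```

Under this convention, an absolute improvement is a difference in percentage points, whereas a relative improvement expresses that difference as a fraction of the baseline error, which is why the two numbers can differ widely.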