In task-oriented dialogue systems, dialogue state tracking (DST) is a critical component because it identifies the specific information required to fulfill the user's goal. However, since annotating DST data demands substantial human effort, leveraging raw, unlabeled dialogues is crucial. To this end, we propose a new self-training (ST) framework with a verification model. Unlike previous ST methods, which rely on extensive hyper-parameter search to filter out inaccurate pseudo-labels, our verification method ensures the accuracy and validity of the training data without a fixed threshold. Furthermore, to mitigate overfitting, we augment the dataset by generating diverse user utterances. Even when using only 10% of the labeled data, our approach achieves results comparable to training on the fully labeled MultiWOZ2.0 dataset. A scalability evaluation further demonstrates improved robustness in predicting unseen values.
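To make the described pipeline concrete, the following is a minimal sketch of a self-training loop with a verification step; it is an illustrative assumption, not the authors' implementation, and the `dst_model`, `verifier`, and `augment` objects are hypothetical placeholders for the components named in the abstract.

```python
def self_train(dst_model, verifier, labeled_data, raw_dialogues, augment, rounds=3):
    """Hypothetical self-training loop: pseudo-label raw dialogues, keep only
    verifier-accepted labels, augment with generated user utterances, retrain."""
    train_set = list(labeled_data)
    for _ in range(rounds):
        dst_model.fit(train_set)                    # train DST model on current data
        for dialogue in raw_dialogues:
            state = dst_model.predict(dialogue)     # pseudo-labeled dialogue state
            # The verifier accepts or rejects each pseudo-label directly,
            # rather than thresholding a tuned confidence score.
            if verifier.accepts(dialogue, state):
                train_set.append((dialogue, state))
                # Add paraphrased user utterances to reduce overfitting.
                train_set.extend(augment(dialogue, state))
    return dst_model
```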