Subjective evaluations such as mean opinion scores (MOS) are essential for evaluating synthetic speech, including for automatic speech quality assessment (SQA) models. In this paper, we evaluate the confidence intervals of MOS in a listening test and the number of samples required to achieve a given confidence interval, using several methods for evaluating tail probabilities. Here, the tail probability is the probability that the sample mean deviates from the true mean by more than a given margin. We consider two families of tail probability evaluations: asymptotic approaches and upper-bound-based approaches. In experiments on toy data and on actual listening test data, we show that achieving small confidence intervals requires very large numbers of samples, and that the MOS corpus for SQA has large confidence intervals because of its limited number of samples. As future directions, we suggest adopting comparative scoring and online learning to obtain more reliable subjective evaluations under limited budgets.
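As a rough illustration of how the two families of tail probability evaluations translate into sample requirements, the sketch below compares one upper-bound-based estimate (Hoeffding's inequality) with one asymptotic estimate (the normal approximation from the central limit theorem). These two methods are chosen here only as familiar representatives and are not necessarily the ones used in the paper. The function names, the 1 to 5 rating scale, the margin `eps`, the failure probability `delta`, and the standard deviation `sigma = 1.0` are all assumptions made for the example.

```python
import math
from statistics import NormalDist


def n_hoeffding(eps, delta, lo=1.0, hi=5.0):
    """Samples needed so that P(|sample mean - true mean| >= eps) <= delta,
    using Hoeffding's inequality for scores bounded in [lo, hi].
    Bound: 2 * exp(-2 * n * eps**2 / (hi - lo)**2) <= delta."""
    return math.ceil((hi - lo) ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))


def n_normal(eps, delta, sigma):
    """Samples needed under the normal (CLT) approximation, assuming the
    per-rating standard deviation sigma is known (an assumption here)."""
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)  # two-sided quantile
    return math.ceil((z * sigma / eps) ** 2)


if __name__ == "__main__":
    eps, delta = 0.1, 0.05   # hypothetical: +/-0.1 MOS at 95% confidence
    sigma = 1.0              # hypothetical rating standard deviation
    print("Hoeffding bound:     ", n_hoeffding(eps, delta))
    print("Normal approximation:", n_normal(eps, delta, sigma))
```

Under these assumed settings, the distribution-free bound asks for several times more ratings than the asymptotic estimate, because it relies only on the bounded rating scale and not on the variance. Halving `eps` roughly quadruples both counts, which is consistent with the observation that small confidence intervals require very large numbers of samples.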