In data-driven forensic voice comparison (FVC), empirical testing of a system is an essential step in demonstrating its validity and reliability. Numerous studies have focused on improving system validity, whereas studies of reliability remain comparatively limited. In the present study, simulated scores were generated from i-vector and GMM-UBM automatic speaker recognition systems using real speech data in order to demonstrate the variability in system reliability as a function of score skewness, sample size, and calibration method (logistic regression or a Bayesian model). Using logistic regression with small samples of skewed scores, the Cllr range is 1.3 for the i-vector system and 0.69 for the GMM-UBM system. When scores follow a normal distribution, the Cllr range reduces to 0.49 for the i-vector system, while remaining 0.69 for the GMM-UBM system. Using the Bayesian model, the Cllr ranges are 0.31 and 0.60 for the i-vector and GMM-UBM systems respectively when scores are skewed, and the Cllr range remains stable irrespective of sample size when scores follow a normal distribution. The results suggest that score skewness has a substantial effect on system reliability. With this in mind, in FVC it may be preferable to use an older-generation system that produces less variable results, albeit with slightly weaker discrimination, especially when sample size is small.
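For reference, Cllr denotes the log-likelihood-ratio cost, the standard validity metric in likelihood ratio-based FVC. The abstract does not define it; the conventional formulation, written here with assumed notation in which N_ss and N_ds are the numbers of same-speaker and different-speaker test comparisons and LR_i, LR_j the corresponding likelihood ratios, is

\[
C_{\mathrm{llr}} \;=\; \frac{1}{2}\left[\frac{1}{N_{ss}}\sum_{i=1}^{N_{ss}}\log_{2}\!\left(1+\frac{1}{LR_{i}}\right)\;+\;\frac{1}{N_{ds}}\sum_{j=1}^{N_{ds}}\log_{2}\!\left(1+LR_{j}\right)\right],
\]

where lower values indicate better performance. The "Cllr range" reported above is, presumably, the spread of Cllr values obtained across repeated samples, used as an index of reliability.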