The field of automatic speaker recognition (ASR) has seen a series of generational changes in speaker modelling approaches over the last three decades. Adoption of new approaches has mainly been driven by improvements in overall system-level performance metrics on common datasets. There is now considerable debate within the field about why systems perform better for some speakers than for others. In this study, we compare the performance of four generations of ASR systems using the same set of forensically relevant test and calibration data. At both the system level and the individual-speaker level, we observe improvements from GMM-UBM to i-vector to x-vector, but not for ECAPA-TDNN. We find that certain individuals remain difficult to recognise across all systems. Our findings show that both file-level and speaker-level factors contribute to the performance of individual speakers and of systems overall, which supports calls for more detailed exploration of system performance.