AI algorithms designed to clone human speaker identity are reportedly capable of replicating human-specific vocal confidence. However, whether listeners can accurately identify a single speaker expressing varying emotive states as the same individual remains unclear, particularly for AI-to-AI pairings, which have never been examined. In this study, thirty-six Chinese participants judged whether pairs of Chinese sentences, spoken with congruent or incongruent prosody, were produced by the same speaker in human-only and AI-only scenarios. Accuracy in identifying the same speaker decreased markedly under incongruent prosody compared with congruent prosody, a pattern evident in both human-to-human and AI-to-AI pairs. Meanwhile, correctly distinguishing between two speakers was more challenging than identifying a single speaker, and performance was notably poorer for AI pairs than for human-human pairs. Listeners responded more slowly to incongruent prosody in the one-speaker scenario but more quickly in the two-speaker scenario, and reaction times did not differ between human and AI trials. Our findings suggest that vocal prosody introduces within-speaker identity variation around an average-based representation, variation that listeners can overcome to recognise the same speaker across prosodies. Our results on speaker discrimination in AI voices also provide supporting evidence for the “out-group homogeneity effect”.