Speaker identification (SI) systems based on deep neural networks (DNNs) have been widely applied in practical tasks. However, DNNs are vulnerable to imperceptible adversarial attacks, which typically cause misjudgments and raise security concerns. Research on adversarial attacks has therefore become crucial for verifying the robustness of SI systems. Although existing works have shown that white-box attacks can break current SI systems, few have studied the more practical black-box attacks. Moreover, existing transfer attacks on SI systems are migrated from computer vision; they are speaker-unrelated and are not adapted to speech data. In this work, we propose a new black-box attack method for SI systems, called speaker-specific utterance ensemble based transfer attack (SUETA). SUETA is the first work to generate an ensemble of multiple adversarial utterances per speaker, exploiting a unique characteristic of speech data: different utterances of one specific speaker share the same voiceprint. Experimental results on three representative SI models show that SUETA achieves a higher transfer success rate (TSR) than speaker-unrelated baselines. Furthermore, SUETA can even improve the attack success rate (ASR) of local white-box attacks.