Deep embedding learning methods have shown state-of-the-art performance on text-independent speaker verification (SV) tasks compared to traditional i-vectors. Existing methods mainly focus on designing frame-level feature extraction structures, utterance-level aggregation methods, and loss functions to learn effective speaker embeddings. However, due to the locality of frame-level extraction, the resulting embeddings change if the sequential order of the input utterance is shuffled. In contrast, conventional i-vector methods are order-insensitive. In this paper, we propose an acoustic feature shuffling network that learns order-insensitive speaker embeddings via a joint learning method. Specifically, the input utterance is first organized into multi-scale segments. These segments are then randomly shuffled to form the input of the deep embedding learning architecture. A symmetric Kullback-Leibler (KL) divergence loss, in addition to the cross-entropy (CE) loss, is used to force the learned architecture to be order-insensitive. Experimental results on the benchmark VoxCeleb corpus demonstrate the effectiveness of the proposed acoustic feature shuffling network.
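To make the training objective concrete, the sketch below illustrates one plausible realization of the shuffling and joint loss described above. It is a minimal single-scale simplification (the paper uses multi-scale segments), written in PyTorch; the function names `shuffle_segments` and `joint_loss`, the segment length, and the weighting factor `alpha` are hypothetical choices, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F


def shuffle_segments(feats: torch.Tensor, seg_len: int) -> torch.Tensor:
    """Split a (T, D) feature sequence into fixed-length segments and
    randomly permute their order (any trailing partial segment is dropped).
    A single-scale stand-in for the paper's multi-scale segmentation."""
    num_segs = feats.size(0) // seg_len
    segs = feats[: num_segs * seg_len].view(num_segs, seg_len, -1)
    perm = torch.randperm(num_segs)
    return segs[perm].reshape(num_segs * seg_len, -1)


def joint_loss(logits_orig: torch.Tensor,
               logits_shuf: torch.Tensor,
               labels: torch.Tensor,
               alpha: float = 1.0) -> torch.Tensor:
    """Speaker-classification CE on both the original and shuffled views,
    plus a symmetric KL term that pulls their posteriors together so the
    embedding network becomes insensitive to segment order."""
    ce = F.cross_entropy(logits_orig, labels) + F.cross_entropy(logits_shuf, labels)
    log_p = F.log_softmax(logits_orig, dim=-1)
    log_q = F.log_softmax(logits_shuf, dim=-1)
    sym_kl = (F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
              + F.kl_div(log_p, log_q, log_target=True, reduction="batchmean"))
    return ce + alpha * sym_kl
```

In a training step, the same network would process both the original features and `shuffle_segments(feats, seg_len)`, and the two resulting speaker logits would be passed to `joint_loss`; the symmetric KL term is what encourages order-insensitive embeddings beyond plain CE training.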