Speech signals contain considerable side information (content, stress, etc.) besides the voiceprint statistics. These session variabilities pose a huge challenge for modeling speaker characteristics. To alleviate this problem, we propose a novel frame-constrained training (FCT) strategy in this paper. It enhances the speaker information in the frame-level layers for better embedding extraction. More precisely, a similarity matrix is calculated over the frame-level features within each batch of training samples, and an FCT loss is derived from this similarity matrix. Finally, the speaker embedding network is trained with a combination of the FCT loss and the speaker classification loss. Experiments are performed on the VoxCeleb1 and VOiCES databases. The results demonstrate that the proposed training strategy boosts system performance.
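The abstract describes the FCT loss only at a high level: a similarity matrix over frame-level features within a batch, turned into a loss that is added to the classification loss. The following is a minimal, hypothetical sketch of one plausible reading (cosine similarity, with same-speaker frame pairs pulled together); the function name, the use of cosine similarity, and the exact penalty are assumptions, not the paper's definition.

```python
import numpy as np

def fct_loss(frames, speaker_ids):
    """Hypothetical sketch of a frame-constrained training (FCT) loss.

    frames: (N, D) array of frame-level features from one batch.
    speaker_ids: (N,) speaker label per frame.
    This sketch encourages frames from the same speaker to be
    similar; the actual loss is defined in the paper body.
    """
    # L2-normalise so that dot products are cosine similarities.
    norm = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    sim = norm @ norm.T                       # (N, N) similarity matrix
    # Mask of same-speaker frame pairs, excluding self-pairs.
    same = speaker_ids[:, None] == speaker_ids[None, :]
    np.fill_diagonal(same, False)
    # Penalise dissimilarity between same-speaker frame pairs.
    return float(np.mean(1.0 - sim[same]))
```

In training, such a term would be combined with the speaker classification loss, e.g. `total = cls_loss + lam * fct` for some weight `lam` (the weighting scheme here is also an assumption).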