This paper presents a novel multi-task learning framework that introduces self-supervised phonetic information for deep speaker embedding extraction. The primary task remains speaker classification, but we add an auxiliary task that identifies phoneme boundaries in speech signals, following the Noise-Contrastive Estimation (NCE) principle. To further exploit this self-supervised information for speaker feature learning, the features of intermediate layers in the main task are refined by the features of the corresponding layers in the auxiliary task through masking and biasing operations. Experiments on the VoxCeleb1 and CN-Celeb datasets consistently verify the efficacy of the proposed method.
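The masking-and-biasing refinement described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the auxiliary-layer feature produces an element-wise sigmoid mask that gates the main-task feature, plus an additive bias, with per-dimension weights (`mask_w`, `bias_w`) standing in for whatever learned projections the actual model uses.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def refine(main_feat, aux_feat, mask_w, bias_w):
    """Refine a main-task feature vector with an auxiliary-task feature.

    Hypothetical form: refined = main * sigmoid(mask(aux)) + bias(aux),
    where mask() and bias() are modeled here as simple per-dimension
    scalings of the auxiliary feature.
    """
    mask = [sigmoid(w * a) for w, a in zip(mask_w, aux_feat)]  # gating in (0, 1)
    bias = [w * a for w, a in zip(bias_w, aux_feat)]           # additive shift
    return [m * g + b for m, g, b in zip(main_feat, mask, bias)]

# With a zero auxiliary feature, the mask is sigmoid(0) = 0.5 and the
# bias vanishes, so the main feature is simply halved:
print(refine([1.0, 2.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]))  # → [0.5, 1.0]
```

In a real network the mask and bias would come from small learned layers applied to the auxiliary feature maps, and the operation would run on tensors rather than Python lists; the sketch only shows the element-wise gate-and-shift structure.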