Target speaker extraction (TSE) separates a target speaker's speech from mixed speech using a speaker embedding. However, in addition to speaker information, speaker embeddings may contain text-dependent information and environmental information such as noise, microphone characteristics, and reverberation. This can degrade TSE performance, especially when the enrollment and target utterances are recorded in different environments. To address this issue, we propose a Transformer-based embedder for centroid estimation and a role division training method that enhances the training stability of the TSE separator. The embedder estimates the speaker centroid from the enrollment utterance, which helps the separator extract the target speaker. The proposed methods considerably improve speech quality and speech recognition performance compared to the baseline.