Speaker verification is the task of confirming a speaker's claimed identity from a speech sample. Research has shown that emotional variability in speech degrades the performance of speaker verification systems. Prior approaches to emotion-robust verification are computationally expensive and do not build on state-of-the-art speaker representations. In this paper, we propose a novel framework for constructing emotional speaker embeddings. Our framework uses pre-trained state-of-the-art feature extractors for speaker and emotion recognition, incorporating both speaker and emotional information into the final embeddings. We report speaker verification results on emotional speech datasets and show that fusing ECAPA2 speaker representations with emotional features from emotion2vec through a cross-attention module improves the equal error rate (EER) by 8.29 percentage points over the baseline.
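To make the fusion step concrete, the sketch below shows one plausible form of the cross-attention module in PyTorch. The dimensions (192-dim utterance-level speaker embeddings, 768-dim frame-level emotion features), the attention direction (speaker embedding as query attending over emotion features), and all module and variable names are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of cross-attention fusion of a speaker embedding with
# frame-level emotion features. All dimensions and names are assumptions.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, spk_dim=192, emo_dim=768, fused_dim=192, num_heads=4):
        super().__init__()
        # Project both streams into a shared attention dimension.
        self.q_proj = nn.Linear(spk_dim, fused_dim)
        self.kv_proj = nn.Linear(emo_dim, fused_dim)
        self.attn = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(fused_dim)

    def forward(self, spk_emb, emo_feats):
        # spk_emb:   (batch, spk_dim)         utterance-level speaker embedding
        # emo_feats: (batch, frames, emo_dim) frame-level emotion features
        q = self.q_proj(spk_emb).unsqueeze(1)   # (batch, 1, fused_dim)
        kv = self.kv_proj(emo_feats)            # (batch, frames, fused_dim)
        fused, _ = self.attn(q, kv, kv)         # speaker query attends to emotion
        # Residual connection keeps the speaker identity dominant in the output.
        return self.norm(fused.squeeze(1) + q.squeeze(1))

# Usage: fuse a 192-dim speaker embedding with 768-dim frame-level features.
fusion = CrossAttentionFusion()
spk = torch.randn(8, 192)      # e.g. ECAPA2-style speaker embeddings
emo = torch.randn(8, 50, 768)  # e.g. emotion2vec-style frame features
emb = fusion(spk, emo)         # (8, 192) emotion-aware speaker embedding
print(emb.shape)
```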