Recognizing emotions in speech is essential for improving human-computer interaction, which requires understanding and responding to users' emotional states. Integrating multiple modalities, such as speech and text, enhances the performance of speech emotion recognition systems by providing complementary sources of emotional information. In this context, we propose a model that enhances cross-modal transformer fusion by applying focus attention mechanisms to align and combine the salient features of two different modalities, namely speech and text. An analysis of the disentanglement of the emotional representations across multiple embedding spaces using deep metric learning confirmed that our method yields improved emotion recognition performance. Furthermore, the proposed approach was evaluated on the IEMOCAP dataset. Experimental results demonstrated that our model achieves the best performance compared with other relevant multimodal speech emotion recognition systems.
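To make the cross-modal fusion idea concrete, the following is a minimal illustrative sketch in PyTorch of a cross-modal attention block in which text features attend over speech features; it is not the authors' implementation, and all names and hyperparameters (CrossModalBlock, d_model, n_heads) are assumptions introduced for illustration.

```python
# Illustrative sketch only: a generic cross-modal attention block where the
# text modality queries the speech modality. Names and sizes are assumptions.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """Text queries attend over speech keys/values, followed by a feed-forward layer."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, text_feats: torch.Tensor, speech_feats: torch.Tensor) -> torch.Tensor:
        # Queries come from text; keys/values come from speech, so salient
        # acoustic frames are aligned to each text token before fusion.
        attended, _ = self.attn(text_feats, speech_feats, speech_feats)
        x = self.norm1(text_feats + attended)
        return self.norm2(x + self.ff(x))


# Usage with dummy batched sequences of shape (batch, seq_len, d_model):
speech = torch.randn(2, 120, 256)  # e.g., frame-level acoustic embeddings
text = torch.randn(2, 30, 256)     # e.g., token-level text embeddings
fused = CrossModalBlock()(text, speech)
print(fused.shape)  # torch.Size([2, 30, 256])
```

In such a design, a symmetric block with speech querying text can be added, and the focus attention and deep-metric-learning components described in the paper would further shape which cross-modal features are emphasized in the fused representation.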