This study investigates fine-tuning self-supervised learning (SSL) models with multi-task learning (MTL) to enhance speech emotion recognition (SER). The framework jointly handles four related tasks: emotion recognition, gender recognition, speaker verification, and automatic speech recognition. A co-attention module is introduced to dynamically capture the interactions between features from the primary emotion classification task and the auxiliary tasks, enabling context-aware fusion. We also introduce the Sample Weighted Focal Contrastive (SWFC) loss, which addresses class imbalance and semantic confusion by up-weighting difficult and minority-class samples. The method is validated on the Categorical Emotion Recognition task of the Speech Emotion Recognition in Naturalistic Conditions Challenge, showing significant performance improvements.
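The abstract does not give the exact form of the SWFC loss. As a rough illustration of the idea, the sketch below combines a supervised contrastive objective with a focal-style modulating factor (down-weighting easy pairs) and per-class sample weights for minority classes; the function name, hyperparameters (`gamma`, `temperature`), and weighting scheme are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def swfc_loss(embeddings, labels, class_weights, gamma=2.0, temperature=0.1):
    """Illustrative sketch of a sample-weighted focal contrastive loss.

    Assumed form (not from the paper): a supervised contrastive loss
    whose positive-pair terms are scaled by a focal factor (1 - p)^gamma,
    with each anchor further weighted by its class weight so that
    minority-class samples contribute more to the objective.
    """
    z = F.normalize(embeddings, dim=1)            # unit-norm embeddings
    sim = z @ z.T / temperature                   # pairwise similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye

    # Log-probability of each pair under a softmax over all other samples
    # (self-similarity excluded).
    sim = sim.masked_fill(eye, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    p = log_prob.exp()                            # pair probability
    focal = (1.0 - p) ** gamma                    # hard pairs weigh more

    # Average the focal-weighted positive-pair losses per anchor.
    pair_loss = (-(focal * log_prob)).masked_fill(~pos_mask, 0.0)
    per_sample = pair_loss.sum(1) / pos_mask.sum(1).clamp(min=1)

    w = class_weights[labels]                     # minority-class up-weighting
    return (w * per_sample).sum() / w.sum()
```

In this sketch, the focal factor plays the "difficult sample" role and `class_weights` the "minority sample" role described in the abstract; the actual SWFC formulation may differ.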