This work introduces a multi-encoder, joint classification and regression training framework for speech emotion recognition. We present our solution for the Interspeech 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge, which leverages a multi-modal, multi-encoder architecture with a fusion module. Our results demonstrate the effectiveness of the multi-task approach on both the classification and regression tasks: among competing teams, we placed in the top 10 for categorical emotion classification and 2nd for emotional attribute prediction. Furthermore, an ablation study shows that multi-task learning outperforms separate task-specific training. These findings highlight the potential of multi-task, multi-encoder systems for speech emotion recognition.
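The joint classification and regression objective mentioned above can be sketched in a minimal form as follows. This is an illustrative assumption, not the challenge system's actual implementation: the encoders, fusion module, loss weighting (`alpha`), and all dimensions are hypothetical, and the two heads are reduced to plain linear maps over a fused embedding.

```python
import numpy as np

# Hypothetical sketch of multi-task training for speech emotion
# recognition: a shared fused representation feeds two heads, one for
# categorical emotion classification (cross-entropy loss) and one for
# emotional attribute regression (MSE loss). The total loss is a
# weighted sum of the two. All shapes and the weighting are assumptions.

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multitask_loss(fused, W_cls, W_reg, y_cls, y_reg, alpha=0.5):
    """Joint loss: alpha * cross-entropy + (1 - alpha) * MSE."""
    logits = fused @ W_cls                        # (batch, n_classes)
    probs = softmax(logits)
    ce = -np.log(probs[np.arange(len(y_cls)), y_cls] + 1e-12).mean()
    preds = fused @ W_reg                         # (batch, n_attributes)
    mse = ((preds - y_reg) ** 2).mean()
    return alpha * ce + (1.0 - alpha) * mse

# Toy batch: 4 fused embeddings of dim 8, 5 emotion classes,
# 3 continuous attributes (e.g. valence, arousal, dominance).
fused = rng.standard_normal((4, 8))
W_cls = rng.standard_normal((8, 5))
W_reg = rng.standard_normal((8, 3))
y_cls = np.array([0, 2, 1, 4])
y_reg = rng.standard_normal((4, 3))

loss = multitask_loss(fused, W_cls, W_reg, y_cls, y_reg)
print(loss)
```

Because both heads share the fused representation, gradients from the classification and regression losses both shape the shared features, which is the mechanism the ablation study credits for outperforming separate task-specific training.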