Speech Emotion Recognition (SER) in naturalistic conditions remains challenging due to the variability of emotional expression and the class imbalance found in real-world data. As part of the Interspeech-25 SER challenge, we benchmark state-of-the-art large-scale self-supervised speech models on the MSP-Podcast corpus. To extract rich and expressive representations, we systematically investigate fine-tuning strategies, loss functions tailored to mitigate class imbalance, and techniques for freezing layers of the pre-trained encoder. Our findings highlight the impact of these design choices on model robustness and generalization, offering practical guidance for developing SER systems that perform well in real-world scenarios.
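The abstract mentions loss functions tailored to class imbalance without naming them. One widely used option in this setting is the focal loss with per-class weights, which down-weights easy, well-classified examples and up-weights rare classes. The sketch below is illustrative only, assuming softmax outputs and inverse-frequency class weights; it is not the specific loss used in the paper.

```python
import numpy as np

def focal_loss(probs, labels, alpha, gamma=2.0):
    """Class-weighted focal loss (illustrative sketch, not the paper's loss).

    probs:  (N, C) array of softmax probabilities per utterance.
    labels: (N,) integer emotion-class ids.
    alpha:  (C,) per-class weights, e.g. inverse class frequency,
            to counteract class imbalance.
    gamma:  focusing parameter; gamma=0 reduces to weighted cross-entropy.
    """
    # Probability assigned to the true class of each example.
    pt = probs[np.arange(len(labels)), labels]
    # (1 - pt)^gamma shrinks the loss of confident, easy examples.
    per_example = -alpha[labels] * (1.0 - pt) ** gamma * np.log(pt)
    return float(np.mean(per_example))
```

With uniform weights, a confidently correct prediction contributes far less loss than an uncertain one, so training gradients concentrate on hard and minority-class examples.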