Multimodal emotion recognition (MER) is a rapidly evolving field that integrates information from multiple modalities, such as speech and text, to deepen our understanding of emotions. However, challenges in feature extraction and fusion continue to limit MER performance. To address these challenges, we propose an MER method that combines self-supervised representations with handcrafted, music theory-inspired representations across modalities to capture emotional information more comprehensively. We further introduce a novel multimodal fusion method that models both modality-specific and modality-invariant relationships, thereby reducing the distribution gaps between modalities. Extensive experiments validate the effectiveness of our approach, which achieves state-of-the-art results with a 3.55% improvement over the baseline, demonstrating a notable enhancement in MER performance.
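To make the modality-specific/modality-invariant fusion idea concrete, the following is a minimal sketch of one common way such a fusion module can be structured: each modality's features are projected into a shared (modality-invariant) subspace with tied weights and into a private (modality-specific) subspace, and the resulting views are concatenated for classification. The abstract does not specify the paper's actual architecture, so all module names, dimensions, and design choices below (e.g., `SharedPrivateFusion`, `d_hidden`, the tied shared encoder) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the paper's exact fusion architecture is not given
# in the abstract. All names and dimensions here are hypothetical placeholders.
import torch
import torch.nn as nn


class SharedPrivateFusion(nn.Module):
    """Project each modality into a modality-invariant (shared) subspace and a
    modality-specific (private) subspace, then fuse the resulting views."""

    def __init__(self, d_audio: int, d_text: int, d_hidden: int, n_classes: int):
        super().__init__()
        # Per-modality projections into a common dimension.
        self.proj_audio = nn.Linear(d_audio, d_hidden)
        self.proj_text = nn.Linear(d_text, d_hidden)
        # Shared encoder with tied weights across modalities, encouraging
        # modality-invariant features and a smaller distribution gap.
        self.shared = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU())
        # Private encoders capture modality-specific information.
        self.private_audio = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU())
        self.private_text = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU())
        self.classifier = nn.Linear(4 * d_hidden, n_classes)

    def forward(self, audio_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        a = self.proj_audio(audio_feat)  # (batch, d_hidden)
        t = self.proj_text(text_feat)    # (batch, d_hidden)
        shared_a, shared_t = self.shared(a), self.shared(t)                   # modality-invariant
        private_a, private_t = self.private_audio(a), self.private_text(t)    # modality-specific
        fused = torch.cat([shared_a, shared_t, private_a, private_t], dim=-1)
        return self.classifier(fused)


# Example usage with pooled utterance-level features (e.g., self-supervised
# speech embeddings and text embeddings); dimensions are arbitrary here.
model = SharedPrivateFusion(d_audio=768, d_text=768, d_hidden=256, n_classes=4)
logits = model(torch.randn(8, 768), torch.randn(8, 768))
print(logits.shape)  # torch.Size([8, 4])
```

In practice, methods of this family often add an auxiliary similarity loss between the shared projections of the two modalities to explicitly shrink the cross-modal distribution gap; whether and how this paper does so is not stated in the abstract.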