Multimodal emotion recognition (MER) integrates data from speech, facial expressions, and text to enhance emotion prediction, with applications across human-computer interaction scenarios. While recent end-to-end models have improved performance, they typically lack interpretability regarding the role each modality plays in predicting emotions. This paper introduces a contribution-aware MER framework (CAMER) with a novel adaptive weighting mechanism that dynamically adjusts the contribution of each modality based on the emotional content, using cross-attention between modality features and emotion embeddings. This approach not only improves MER performance by optimizing modality integration but also makes the model's predictions substantially more interpretable. Evaluations on the IEMOCAP and CMU-MOSEI datasets show that our method outperforms existing approaches. Additionally, we provide an interactive demo that allows users to explore and visualize the model's decision-making process.
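To make the adaptive weighting idea concrete, the sketch below shows one plausible reading of the mechanism described above: learned emotion embeddings attend over per-modality features, and the resulting attention mass yields a per-modality contribution weight used to fuse the modalities before classification. The class name `ContributionAwareFusion`, the feature dimension, the number of attention heads, and all layer choices are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn


class ContributionAwareFusion(nn.Module):
    """Hypothetical sketch of contribution-aware fusion via cross-attention.

    Emotion embeddings act as queries; speech, face, and text features act as
    keys/values. Attention weights averaged over the emotion queries serve as
    interpretable per-modality contribution scores (assumed design, not the
    paper's exact architecture).
    """

    def __init__(self, feat_dim: int = 256, num_emotions: int = 4):
        super().__init__()
        # Learned emotion embeddings used as cross-attention queries (assumption).
        self.emotion_embed = nn.Embedding(num_emotions, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_emotions)

    def forward(self, speech, face, text):
        # Stack modality features: (batch, 3 modalities, feat_dim).
        modalities = torch.stack([speech, face, text], dim=1)
        batch = modalities.size(0)

        # Emotion embeddings as queries, modality features as keys/values.
        queries = self.emotion_embed.weight.unsqueeze(0).expand(batch, -1, -1)
        _, attn_weights = self.attn(queries, modalities, modalities)

        # Average attention over emotion queries -> (batch, 3) contribution
        # weights; each row already sums to 1 over the three modalities.
        contrib = attn_weights.mean(dim=1)

        # Contribution-weighted fusion, then emotion classification.
        fused = (contrib.unsqueeze(-1) * modalities).sum(dim=1)
        return self.classifier(fused), contrib


if __name__ == "__main__":
    # Toy usage with random modality features for a batch of 2 utterances.
    model = ContributionAwareFusion()
    s, f, t = (torch.randn(2, 256) for _ in range(3))
    logits, weights = model(s, f, t)
    print(logits.shape, weights)  # torch.Size([2, 4]) and per-modality weights
```

The returned `contrib` tensor is what would support the interpretability claim: inspecting it per sample indicates how much each modality drove the prediction.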