Even with modern advanced machine learning techniques, Speech Emotion Recognition (SER) remains a challenging task. Speech signals alone may not provide enough information to build robust emotion recognition models. The widespread adoption of wearable devices provides multiple signal streams containing physiological and contextual cues, which could be highly beneficial for improving an SER system. However, research on multimodal emotion recognition combining wearable and speech signals is limited. Moreover, the scarcity of annotated data for such scenarios limits the applicability of deep learning techniques. This paper presents a self-supervised fusion method for speech and wearable signals and evaluates its use in the SER context. We further discuss three different fusion techniques in the context of multimodal emotion recognition. Our evaluations show that pretraining in the fusion stage significantly impacts the downstream emotion recognition task. Our method achieves F1 scores of 82.59% (arousal), 83.05% (valence), and 72.95% (emotion categories) on the K-EmoCon dataset.
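
To make the idea of self-supervised fusion pretraining concrete, the sketch below shows one common way such a scheme can be set up in PyTorch: two modality encoders (speech and wearable features) are pretrained with a contrastive, InfoNCE-style objective on unlabeled paired data, and their fused embeddings are then fed to a small classifier for emotion recognition. This is an illustrative example only; the encoder architectures, feature dimensions (40-dim acoustic, 8-dim wearable), loss, and class count are assumptions, not the paper's actual method.

```python
# Hypothetical sketch (not the paper's exact architecture): self-supervised
# fusion pretraining on paired speech/wearable features, then fine-tuning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Maps per-modality feature vectors into a shared embedding space."""
    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_fusion_loss(speech_emb, wearable_emb, temperature=0.07):
    """InfoNCE-style loss: matching speech/wearable pairs are pulled together,
    mismatched pairs within the batch are pushed apart."""
    logits = speech_emb @ wearable_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# --- Self-supervised pretraining on unlabeled paired data (assumed dims) ---
speech_enc, wear_enc = ModalityEncoder(40), ModalityEncoder(8)
params = list(speech_enc.parameters()) + list(wear_enc.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

speech_feats = torch.randn(32, 40)   # e.g. frame-level acoustic features
wear_feats = torch.randn(32, 8)      # e.g. EDA / heart-rate / accelerometer features
loss = contrastive_fusion_loss(speech_enc(speech_feats), wear_enc(wear_feats))
opt.zero_grad(); loss.backward(); opt.step()

# --- Fine-tuning: concatenate (late-fuse) embeddings and train a classifier ---
classifier = nn.Linear(128 * 2, 3)   # e.g. low / neutral / high arousal
fused = torch.cat([speech_enc(speech_feats), wear_enc(wear_feats)], dim=-1)
emotion_logits = classifier(fused)
```

Other fusion variants (e.g., early fusion of raw features or attention-based fusion of the two embeddings) would change only the fusion step, while the self-supervised pretraining stage would remain the same in spirit.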