Deep learning has been widely used in multi-modal Speech Emotion Recognition (SER) to learn sentiment-related features by aggregating representations from multiple modalities. However, most state-of-the-art methods rely on attentive fusion or late fusion, which overlooks long-term dependencies across modalities. In this study, we propose a transformer-based SER architecture that fuses modality representations through explicit memory modules, where information from the current inputs is integrated with historical information, allowing the model to weigh the relative importance of each modality over time. We use Wav2Vec2 and BERT to extract audio and text features, respectively, which are then fused by aggregating the individual modality features with the information stored in memory, followed by downstream classification. Following state-of-the-art methods, we evaluate the proposed approach on the IEMOCAP dataset, and the results indicate that memory-based fusion can achieve substantial improvements.
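
To make the described fusion step concrete, the following is a minimal PyTorch sketch (not the authors' released code) of memory-based fusion of pre-extracted Wav2Vec2 audio embeddings and BERT text embeddings. The class name, dimensions, slot count, and the attention-based read/write update are illustrative assumptions; only the overall idea of fusing current modality features with an explicit memory before classification follows the abstract.

```python
import torch
import torch.nn as nn


class MemoryFusionSER(nn.Module):
    """Hypothetical memory-based fusion head for multi-modal SER."""

    def __init__(self, audio_dim=768, text_dim=768, d_model=256,
                 memory_slots=8, num_classes=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)  # project Wav2Vec2 features
        self.text_proj = nn.Linear(text_dim, d_model)    # project BERT features
        # Explicit memory: learnable slots that accumulate historical context.
        self.memory = nn.Parameter(torch.randn(memory_slots, d_model))
        self.read = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.write = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, Ta, audio_dim), text_feats: (B, Tt, text_dim)
        fused_in = torch.cat([self.audio_proj(audio_feats),
                              self.text_proj(text_feats)], dim=1)  # (B, Ta+Tt, d)
        mem = self.memory.unsqueeze(0).expand(fused_in.size(0), -1, -1)
        # Read: current multimodal tokens attend to the memory slots.
        read_out, _ = self.read(fused_in, mem, mem)
        fused = fused_in + read_out
        # Write: memory slots attend to the fused tokens; the updated memory
        # can be carried across utterances to retain historical information.
        new_mem, _ = self.write(mem, fused, fused)
        logits = self.classifier(fused.mean(dim=1))  # pool, then classify emotion
        return logits, new_mem


if __name__ == "__main__":
    model = MemoryFusionSER()
    audio = torch.randn(2, 50, 768)  # e.g. Wav2Vec2 frame embeddings
    text = torch.randn(2, 20, 768)   # e.g. BERT token embeddings
    logits, memory = model(audio, text)
    print(logits.shape)              # torch.Size([2, 4])
```

In this sketch the read step integrates stored history into the current audio and text tokens, while the write step updates the memory from the fused representation; the actual architecture in the paper may differ in how the memory is parameterized and updated.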