Speech-based multimodal affective computing has recently attracted significant research attention. Previous experimental results have shown that audio-only approaches perform worse than text-only approaches on sentiment analysis and emotion recognition tasks. In this paper, we propose a new strategy to improve the performance of uni-modal and bi-modal affective computing systems by fine-tuning two pre-trained self-supervised learning models (Text-RoBERTa and Speech-RoBERTa). We fine-tune the models on sentiment analysis and emotion recognition tasks using a shallow architecture, and then apply crossmodal attention fusion over their representations for final prediction or classification. We evaluate the proposed method on the CMU-MOSI, CMU-MOSEI, and IEMOCAP datasets. The experimental results demonstrate that our approach outperforms existing state-of-the-art results on all benchmarks, establishing the effectiveness of the proposed method.
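To make the fusion step concrete, the following is a minimal PyTorch sketch of bidirectional crossmodal attention between text and speech features, in the spirit of the architecture described above. The class name `CrossModalAttentionFusion`, the 768-dimensional features, the mean pooling, and the single linear head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Fuses text and speech sequence features with bidirectional cross-attention.

    Hypothetical sketch: dimensions, pooling, and the single-layer fusion head
    are illustrative choices, not the paper's exact implementation.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8, num_classes: int = 1):
        super().__init__()
        # Text queries attend to speech keys/values, and vice versa.
        self.text_to_speech = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.speech_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, text_feats: torch.Tensor, speech_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:   (batch, text_len, dim)   from the fine-tuned text encoder
        # speech_feats: (batch, speech_len, dim) from the fine-tuned speech encoder
        text_attended, _ = self.text_to_speech(text_feats, speech_feats, speech_feats)
        speech_attended, _ = self.speech_to_text(speech_feats, text_feats, text_feats)
        # Mean-pool each attended sequence and concatenate for the prediction head.
        fused = torch.cat([text_attended.mean(dim=1), speech_attended.mean(dim=1)], dim=-1)
        return self.classifier(fused)  # regression score (sentiment) or class logits

# Example: a batch of 4 utterances with 50 text tokens and 200 speech frames.
fusion = CrossModalAttentionFusion(dim=768, num_heads=8, num_classes=1)
text = torch.randn(4, 50, 768)
speech = torch.randn(4, 200, 768)
print(fusion(text, speech).shape)  # torch.Size([4, 1])
```

In this sketch, a single output unit corresponds to a continuous sentiment score (as in CMU-MOSI/CMU-MOSEI regression); setting `num_classes` to the number of emotion categories would adapt the head to classification (as in IEMOCAP).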